pstore was using mutex locking to protect read/write access to the
backend plug-ins. This causes problems when pstore is executed in
an NMI context through panic() -> kmsg_dump().
This patch changes the mutex to a spin_lock_irqsave then also checks to
see if we are in an NMI context. If we are in an NMI and can't get the
lock, just print a message stating that and blow by the locking.
All this is probably a hack around the bigger locking problem but it
solves my current situation of trying to sleep in an NMI context.
Tested by loading the lkdtm module and executing a HARDLOCKUP which
will cause the machine to panic inside the nmi handler.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Acked-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Life is simple for all the kernel terminating types of kmsg_dump
call backs - pstore just saves the tail end of the console log. But
for "oops" the situation is more complex - the kernel may carry on
running (possibly for ever). So we'd like to make the logged copy
of the oops appear in the pstore filesystem - so that the user has
a handle to clear the entry from the persistent backing store (if
we don't, the store may fill with "oops" entries (that are also
safely stashed in /var/log/messages) leaving no space for real
errors.
Current code calls pstore_mkfile() immediately. But this may
not be safe. The oops could have happened with arbitrary locks
held, or in interrupt or NMI context. So allocating memory and
calling into generic filesystem code seems unwise.
This patch defers making the entry appear. At the time
of the oops, we merely set a flag "pstore_new_entry" noting that
a new entry has been added. A periodic timer checks once a minute
to see if the flag is set - if so, it schedules a work queue to
rescan the backing store and make all new entries appear in the
pstore filesystem.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Running the cthon tests on a recent kernel caused this message to pop
occasionally:
CIFS VFS: did not end path lookup where expected namelen is 0
Some added debugging showed that namelen and dfsplen were both 0 when
this occurred. That means that the read_seqretry returned true.
Assuming that the comment inside the if statement is true, this should
be harmless and just means that we raced with a rename. If that is the
case, then there's no need for alarm and we can demote this to cFYI.
While we're at it, print the dfsplen too so that we can see what
happened here if the message pops during debugging.
Cc: stable@kernel.org
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
A 'path' consists of a starting ino and relative component. Encode even
when there is no relative component. This is primarily needed by the
NFS reexport code.
Signed-off-by: Sage Weil <sage@newdream.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: Do not set cifs/ntfs acl using a file handle (try #4)
[CIFS] Cleanup use of CONFIG_CIFS_STATS2 ifdef to make transport routines more readable
Bug discovered by Jan Kara:
Finally, commit 1449032be1 returned back
the old IO submission code but apparently it forgot to return the old
handling of uninitialized buffers so we unconditionnaly call
block_write_full_page() without specifying end_io function. So AFAICS
we never convert unwritten extents to written in some cases. For
example when I mount the fs as: mount -t ext4 -o
nomblk_io_submit,dioread_nolock /dev/ubdb /mnt and do
int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0600);
char buf[1024];
memset(buf, 'a', sizeof(buf));
fallocate(fd, 0, 0, 16384);
write(fd, buf, sizeof(buf));
I get a file full of zeros (after remounting the filesystem so that
pagecache is dropped) instead of seeing the first KB contain 'a's.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten
should be done simultaneously since ext4_end_io_nolock always clear
the flag and decrease the counter in the same time.
We don't increase i_aiodio_unwritten when setting
EXT4_IO_END_UNWRITTEN so it will go nagative and causes some process
to wait forever.
Part of the patch came from Eric in his e-mail, but it doesn't fix the
problem met by Michael actually.
http://marc.info/?l=linux-ext4&m=131316851417460&w=2
Reported-and-Tested-by: Michael Tokarev<mjt@tls.msk.ru>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Flush inode's i_completed_io_list before calling ext4_io_wait to
prevent the following deadlock scenario: A page fault happens while
some process is writing inode A. During page fault,
shrink_icache_memory is called that in turn evicts another inode
B. Inode B has some pending io_end work so it calls ext4_ioend_wait()
that waits for inode B's i_ioend_count to become zero. However, inode
B's ioend work was queued behind some of inode A's ioend work on the
same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
thread on that cpu is processing inode A's ioend work, it tries to
grab inode A's i_mutex lock. Since the i_mutex lock of inode A is
still hold before the page fault happened, we enter a deadlock.
Also moves ext4_flush_completed_IO and ext4_ioend_wait from
ext4_destroy_inode() to ext4_evict_inode(). During inode deleteion,
ext4_evict_inode() is called before ext4_destroy_inode() and in
ext4_evict_inode(), we may call ext4_truncate() without holding
i_mutex lock. As a result, there is a race between flush_completed_IO
that is called from ext4_ext_truncate() and ext4_end_io_work, which
may cause corruption on an io_end structure. This change moves
ext4_flush_completed_IO and ext4_ioend_wait from ext4_destroy_inode()
to ext4_evict_inode() to resolve the race between ext4_truncate() and
ext4_end_io_work during inode deletion.
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
ext4_should_writeback_data() had an incorrect sequence of
tests to determine if it should return 0 or 1: in
particular, even in no-journal mode, 0 was being returned
for a non-regular-file inode.
This meant that, in non-journal mode, we would use
ext4_journalled_aops for directories, symlinks, and other
non-regular files. However, calling journalled aop
callbacks when there is no valid handle, can cause problems.
This would cause a kernel crash with Jan Kara's commit
2d859db3e4 ("ext4: fix data corruption in inodes with
journalled data"), because we now dereference 'handle' in
ext4_journalled_write_end().
I also added BUG_ONs to check for a valid handle in the
obviously journal-only aops callbacks.
I tested this running xfstests with a scratch device in
these modes:
- no-journal
- data=ordered
- data=writeback
- data=journal
All work fine; the data=journal run has many failures and a
crash in xfstests 074, but this is no different from a
vanilla kernel.
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: replace xfs_buf_geterror() with bp->b_error
xfs: Check the return value of xfs_buf_read() for NULL
"xfs: fix error handling for synchronous writes" revisited
xfs: set cursor in xfs_ail_splice() even when AIL was empty
xfs: Remove the macro XFS_BUFTARG_NAME
xfs: Remove the macro XFS_BUF_TARGET
xfs: Remove the macro XFS_BUF_SET_TARGET
Replace the macro XFS_BUF_ISPINNED with helper xfs_buf_ispinned
xfs: Remove the macro XFS_BUF_SET_PTR
xfs: Remove the macro XFS_BUF_PTR
xfs: Remove macro XFS_BUF_SET_START
xfs: Remove macro XFS_BUF_HOLD
xfs: Remove macro XFS_BUF_BUSY and family
xfs: Remove the macro XFS_BUF_ERROR and family
xfs: Remove the macro XFS_BUF_BFLAGS
Use the move from Linux 2.6 to Linux 3.x as an excuse to kill the
annoying subdirectories in the XFS source code. Besides the large
amount of file rename the only changes are to the Makefile, a few
files including headers with the subdirectory prefix, and the binary
sysctl compat code that includes a header under fs/xfs/ from
kernel/.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Fix up some #include directives in preparation for moving a few
header files out of xfs source subdirectories.
Note that "xfs_linux.h" also got its quoting convention for included
files switched.
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Since we just checked bp for NULL, it is ok to replace
xfs_buf_geterror() with bp->b_error in these places.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Check the return value of xfs_buf_read() for NULL and return ENOMEM
if it is NULL. This is necessary in a few spots to avoid subsequent
code blindly dereferencing the null buffer pointer.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
e1000e: increase driver version number
e1000e: alternate MAC address update
e1000e: do not disable receiver on 82574/82583
e1000e: alternate MAC address does not work on device id 0x1060
PCnet: Fix section mismatch
bnx2x: disable dcb on 578xx since not supported yet
bnx2x: properly clean indirect addresses
bnx2x: prevent race between undi_unload and load flows
bnx2x: fix select_queue when FCoE is disabled
bnx2x: init FCOE FP only once
ipv4: some rt_iif -> rt_route_iif conversions
net/bridge/netfilter/ebtables.c: use available error handling code
net/netlabel/netlabel_kapi.c: add missing cleanup code
net/irda: sh_sir: tidyup compile warning
net/irda: sh_sir: add missing header
net/irda: sh_irda: add missing header
slcan: ldisc generated skbs are received in softirq context
scm: Capture the full credentials of the scm sender
tcp: initialize variable ecn_ok in syncookies path
drivers/net/wireless/wl1251: add missing kfree
...
Local XATTR_TRUSTED_PREFIX_LEN and XATTR_SECURITY_PREFIX_LEN definitions
redefined ones in 'linux/xattr.h'. This was caused by commit 9d8f13ba3f
("security: new security_inode_init_security API adds function callback")
including 'linux/xattr.h' in 'linux/security.h'.
In file included from include/linux/security.h:39,
from include/net/sock.h:54,
from fs/cifs/cifspdu.h:25,
from fs/cifs/xattr.c:26:
This patch removes the local definitions.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
Just like files-layout, blocks & objects layouts are part of the
NFS 4.1 protocol and should be automatically selected if NFS_4_1
is selected. The small problem is that these depend on other
Kernel support being present, while files only depends on NFS
itself.
This patch removes from the user choice the presence of objects
and blocks layout. But makes sure these are selected only if
the depended subsystems are present in the Kernel.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Acked-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit df5e622340 ("ext4: fix deadlock in ext4_symlink() in ENOSPC
conditions") recalculated the number of credits needed for a long
symlink, in the process of splitting it into two transactions. However,
the first credit calculation under-counted because if selinux is
enabled, credits are needed to create the selinux xattr as well.
Overrunning the reservation will result in an OOPS in
jbd2_journal_dirty_metadata() due to this assert:
J_ASSERT_JH(jh, handle->h_buffer_credits > 0);
Fix this by increasing the reservation size.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit ae54870a1d ("ext3: Fix lock inversion in ext3_symlink()")
recalculated the number of credits needed for a long symlink, in the
process of splitting it into two transactions. However, the first
credit calculation under-counted because if selinux is enabled, credits
are needed to create the selinux xattr as well.
Overrunning the reservation will result in an OOPS in
journal_dirty_metadata() due to this assert:
J_ASSERT_JH(jh, handle->h_buffer_credits > 0);
Fix this by increasing the reservation size.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC
check in set_user() to check for NPROC exceeding via setuid() and
similar functions.
Before the check there was a possibility to greatly exceed the allowed
number of processes by an unprivileged user if the program relied on
rlimit only. But the check created new security threat: many poorly
written programs simply don't check setuid() return code and believe it
cannot fail if executed with root privileges. So, the check is removed
in this patch because of too often privilege escalations related to
buggy programs.
The NPROC can still be enforced in the common code flow of daemons
spawning user processes. Most of daemons do fork()+setuid()+execve().
The check introduced in execve() (1) enforces the same limit as in
setuid() and (2) doesn't create similar security issues.
Neil Brown suggested to track what specific process has exceeded the
limit by setting PF_NPROC_EXCEEDED process flag. With the change only
this process would fail on execve(), and other processes' execve()
behaviour is not changed.
Solar Designer suggested to re-check whether NPROC limit is still
exceeded at the moment of execve(). If the process was sleeping for
days between set*uid() and execve(), and the NPROC counter step down
under the limit, the defered execve() failure because NPROC limit was
exceeded days ago would be unexpected. If the limit is not exceeded
anymore, we clear the flag on successful calls to execve() and fork().
The flag is also cleared on successful calls to set_user() as the limit
was exceeded for the previous user, not the current one.
Similar check was introduced in -ow patches (without the process flag).
v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Set security descriptor using path name instead of a file handle.
We can't be sure that the file handle has adequate permission to
set a security descriptor (to modify DACL).
Function set_cifs_acl_by_fid() has been removed since we can't be
sure how a file was opened for writing, a valid request can fail
if the file was not opened with two above mentioned permissions.
We could have opted to add on WRITE_DAC and WRITE_OWNER permissions
to file opens and then use that file handle but adding addtional
permissions such as WRITE_DAC and WRITE_OWNER could cause an
any open to fail.
And it was incorrect to look for read file handle to set a
security descriptor anyway.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Christoph had requested that the stats related code (in
CONFIG_CIFS_STATS2) be moved into helpers to make code flow more
readable. This patch should help. For example the following
section from transport.c
spin_unlock(&GlobalMid_Lock);
atomic_inc(&ses->server->num_waiters);
wait_event(ses->server->request_q,
atomic_read(&ses->server->inFlight)
< cifs_max_pending);
atomic_dec(&ses->server->num_waiters);
spin_lock(&GlobalMid_Lock);
becomes simpler (with the patch below):
spin_unlock(&GlobalMid_Lock);
cifs_num_waiters_inc(server);
wait_event(server->request_q,
atomic_read(&server->inFlight)
< cifs_max_pending);
cifs_num_waiters_dec(server);
spin_lock(&GlobalMid_Lock);
Reviewed-by: Jeff Layton <jlayton@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
PNFS_BLOCK needs BLK_DEV_DM/MD, which is not a dependency for other
pnfs layout drivers. Seperate it out so others can still build when
BLK_DEV_DM/MD is not enabled.
Also change select to depends on to avoid build failures.
Reported-and-tested-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Acked-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xfs: fix for hang during synchronous buffer write error
If removed storage while synchronous buffer write underway,
"xfslogd" hangs.
Detailed log http://oss.sgi.com/archives/xfs/2011-07/msg00740.html
Related work bfc60177f8
"xfs: fix error handling for synchronous writes"
Given that xfs_bwrite actually does the shutdown already after
waiting for the b_iodone completion and given that we actually
found that calling xfs_force_shutdown from inside
xfs_buf_iodone_callbacks was a major contributor the problem
it better to drop this call.
Signed-off-by: Ajeet Yadav <ajeet.yadav.77@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Close a TOCTOU race for mounts done via ecryptfs-mount-private. The mount
source (device) can be raced when the ownership test is done in userspace.
Provide Ecryptfs a means to force the uid check at mount time.
Signed-off-by: John Johansen <john.johansen@canonical.com>
Cc: <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
In xfs_ail_splice(), if a cursor is provided it is updated to
point to the last item on the list being spliced into the AIL.
But if the AIL was found to be empty, the cursor (if provided)
is just initialized instead.
There is no reason the empty AIL case needs to be treated any
differently. And treating it the same way allows this code
to be rearranged a bit, with a somewhat tidier result.
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
fs/ecryptfs/keystore.c: In function ‘ecryptfs_generate_key_packet_set’:
fs/ecryptfs/keystore.c:1991:28: warning: ‘payload_len’ may be used uninitialized in this function [-Wuninitialized]
fs/ecryptfs/keystore.c:1976:9: note: ‘payload_len’ was declared here
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
This patch fixes the compile error reported at the address:
https://bugzilla.kernel.org/show_bug.cgi?id=40292
The problem arises when compiling eCryptfs as built-in and the 'encrypted'
key type as a module. The patch prevents this combination from being set in
the kernel configuration, by fixing the eCryptfs dependencies.
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Reported-by: David Hill <hilld@binarystorm.net>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
When an eCryptfs inode's lower file has been closed, and the pointer has
been set to NULL, return an error when trying to do a lower read or
write rather than calling BUG().
https://bugzilla.kernel.org/show_bug.cgi?id=37292
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
The previous comit made the autofs4 debug printouts check types against
the printout format, and uncovered this bug:
fs/autofs4/waitq.c:106:2: warning: format ‘%08lx’ expects type ‘long unsigned int’, but argument 4 has type ‘autofs_wqt_t’
which is due to the insane type for wait_queue_token. That thing should
be some fixed well-defined size (preferably just 'unsigned int' or
'u32') but for unexplained reasons it is randomly either 'unsigned long'
or 'unsigned int' depending on the architecture.
For now, cast it to 'unsigned long' for printing, the way we do
elsewhere. Somebody else can try to explain the typedef mess.
(There's a reason we don't support excessive use of typedefs in the
kernel: it's usually just a good way of confusing yourself).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use 'pr_debug()' for DPRINTK, which will do the proper type checking on
the arguments (without generating code) even when DEBUG isn't #defined.
Also, use the standard __VA_ARGS__ for the macros, and stop the
pointless abuse of 'do { xyz } while (0)' when the macro is already a
perfectly well-formed single statement.
Reported-by: David Howells <dhowells@redhat.com>
Suggested-by: Joe Perches <joe@perches.com>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As fuse does not use the page cache library functions when userspace
writes to a file, it did not benefit from 'c8236db mm: mark page
accessed before we write_end()' that made sure pages are properly
marked accessed when written to.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Ever since 'ea9b990 fuse: implement perform_write', the .write_begin
and .write_end aops have been dead code.
Their task - acquiring a page from the page cache, sending out a write
request and releasing the page again - is now done batch-wise to
maximize the number of pages send per userspace request.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Commit a9ff4f87 "fuse: support BSD locking semantics" overlooked a
number of issues with supporing flock locks over existing POSIX
locking infrastructure:
- it's not backward compatible, passing flock(2) calls to userspace
unconditionally (if userspace sets FUSE_POSIX_LOCKS)
- it doesn't cater for the fact that flock locks are automatically
unlocked on file release
- it doesn't take into account the fact that flock exclusive locks
(write locks) don't need an fd opened for write.
The last one invalidates the original premise of the patch that flock
locks can be emulated with POSIX locks.
This patch fixes the first two issues. The last one needs to be fixed
in userspace if the filesystem assumed that a write lock will happen
only on a file operned for write (as in the case of the current fuse
library).
Reported-by: Sebastian Pipping <webmaster@hartwork.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
fixes following error seen on x86_64 kernel:
ioctl32(openl2tpd:7480): Unknown cmd fd(14) cmd(80487436){t:'t';sz:72} arg(ffa7e6c0) on socket:[105094]
The argument (struct pppol2tp_ioc_stats) uses "aligned_u64" and thus doesn't need
fixups.
Cc: James Chapman <jchapman@katalix.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Al points out that the do_follow_link() helper function really is
misnamed - it's about whether we should try to follow a symlink or not,
not about actually doing the following.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit 3567866bf2: "RCUify freeing acls, let check_acl() go ahead in
RCU mode if acl is cached" posix_acl_permission is being called with an
unsupported flag and the permission check fails. This patch fixes the issue.
Signed-off-by: Ari Savolainen <ari.m.savolainen@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
ore: Make ore its own module
exofs: Rename raid engine from exofs/ios.c => ore
exofs: ios: Move to a per inode components & device-table
exofs: Move exofs specific osd operations out of ios.c
exofs: Add offset/length to exofs_get_io_state
exofs: Fix truncate for the raid-groups case
exofs: Small cleanup of exofs_fill_super
exofs: BUG: Avoid sbi realloc
exofs: Remove pnfs-osd private definitions
nfs_xdr: Move nfs4_string definition out of #ifdef CONFIG_NFS_V4
The inode structure layout is largely random, and some of the vfs paths
really do care. The path lookup in particular is already quite D$
intensive, and profiles show that accessing the 'inode->i_op->xyz'
fields is quite costly.
We already optimized the dcache to not unnecessarily load the d_op
structure for members that are often NULL using the DCACHE_OP_xyz bits
in dentry->d_flags, and this does something very similar for the inode
ops that are used during pathname lookup.
It also re-orders the fields so that the fields accessed by 'stat' are
together at the beginning of the inode structure, and roughly in the
order accessed.
The effect of this seems to be in the 1-2% range for an empty kernel
"make -j" run (which is fairly kernel-intensive, mostly in filename
lookup), so it's visible. The numbers are fairly noisy, though, and
likely depend a lot on exact microarchitecture. So there's more tuning
to be done.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gcc tends to generate better code with small integers, including the
DCACHE_xyz flag tests - so move the common ones to be first in the list.
Also just remove the unused DCACHE_INOTIFY_PARENT_WATCHED and
DCACHE_AUTOFS_PENDING values, their users no longer exists in the source
tree.
And add a "unlikely()" to the DCACHE_OP_COMPARE test, since we want the
common case to be a nice straight-line fall-through.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Export everything from ore need exporting. Change Kbuild and Kconfig
to build ore.ko as an independent module. Import ore from exofs
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
ORE stands for "Objects Raid Engine"
This patch is a mechanical rename of everything that was in ios.c
and its API declaration to an ore.c and an osd_ore.h header. The ore
engine will later be used by the pnfs objects layout driver.
* File ios.c => ore.c
* Declaration of types and API are moved from exofs.h to a new
osd_ore.h
* All used types are prefixed by ore_ from their exofs_ name.
* Shift includes from exofs.h to osd_ore.h so osd_ore.h is
independent, include it from exofs.h.
Other than a pure rename there are no other changes. Next patch
will move the ore into it's own module and will export the API
to be used by exofs and later the layout driver
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Exofs raid engine was saving on memory space by having a single layout-info,
single pid, and a single device-table, global to the filesystem. Then passing
a credential and object_id info at the io_state level, private for each
inode. It would also devise this contraption of rotating the device table
view for each inode->ino to spread out the device usage.
This is not compatible with the pnfs-objects standard, demanding that
each inode can have it's own layout-info, device-table, and each object
component it's own pid, oid and creds.
So: Bring exofs raid engine to be usable for generic pnfs-objects use by:
* Define an exofs_comp structure that holds obj_id and credential info.
* Break up exofs_layout struct to an exofs_components structure that holds a
possible array of exofs_comp and the array of devices + the size of the
arrays.
* Add a "comps" parameter to get_io_state() that specifies the ids creds
and device array to use for each IO.
This enables to keep the layout global, but the device-table view, creds
and IDs at the inode level. It only adds two 64bit to each inode, since
some of these members already existed in another form.
* ios raid engine now access layout-info and comps-info through the passed
pointers. Everything is pre-prepared by caller for generic access of
these structures and arrays.
At the exofs Level:
* Super block holds an exofs_components struct that holds the device
array, previously in layout. The devices there are in device-table
order. The device-array is twice bigger and repeats the device-table
twice so now each inode's device array can point to a random device
and have a round-robin view of the table, making it compatible to
previous exofs versions.
* Each inode has an exofs_components struct that is initialized at
load time, with it's own view of the device table IDs and creds.
When doing IO this gets passed to the io_state together with the
layout.
While preforming this change. Bugs where found where credentials with the
wrong IDs where used to access the different SB objects (super.c). As well
as some dead code. It was never noticed because the target we use does not
check the credentials.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
ios.c will be moving to an external library, for use by the
objects-layout-driver. Remove from it some exofs specific functions.
Also g_attr_logical_length is used both by inode.c and ios.c
move definition to the later, to keep it independent
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
In future raid code we will need to know the IO offset/length
and if it's a read or write to determine some of the array
sizes we'll need.
So add a new exofs_get_rw_state() API for use when
writeing/reading. All other simple cases are left using the
old way.
The major change to this is that now we need to call
exofs_get_io_state later at inode.c::read_exec and
inode.c::write_exec when we actually know these things. So this
patch is kept separate so I can test things apart from other
changes.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: cope with negative dentries in cifs_get_root
cifs: convert prefixpath delimiters in cifs_build_path_to_root
CIFS: Fix missing a decrement of inFlight value
cifs: demote DFS referral lookup errors to cFYI
Revert "cifs: advertise the right receive buffer size to the server"
The CLOEXE bit is magical, and for performance (and semantic) reasons we
don't actually maintain it in the file descriptor itself, but in a
separate bit array. Which means that when we show f_flags, the CLOEXE
status is shown incorrectly: we show the status not as it is now, but as
it was when the file was opened.
Fix that by looking up the bit properly in the 'fdt->close_on_exec' bit
array.
Uli needs this in order to re-implement the pfiles program:
"For normal file descriptors (not sockets) this was the last piece of
information which wasn't available. This is all part of my 'give
Solaris users no reason to not switch' effort. I intend to offer the
code to the util-linux-ng maintainers."
Requested-by: Ulrich Drepper <drepper@akkadia.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
WARN_ONCE() is very annoying, in that it shows the stack trace that we
don't care about at all, and also triggers various user-level "kernel
oopsed" logic that we really don't care about. And it's not like the
user can do anything about the applications (sshd) in question, it's a
distro issue.
Requested-by: Andi Kleen <andi@firstfloor.org> (and many others)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Btrfs does bio submissions from a worker thread, and each device
has a list of high priority bios and regular priority bios.
Synchronous writes go to the high priority thread while async writes
go to regular list. This commit brings back an explicit unplug
any time we switch from high to regular priority, which makes it
easier for the block layer to give us low latencies.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The loop around lookup_one_len doesn't handle the case where it might
return a negative dentry, which can cause an oops on the next pass
through the loop. Check for that and break out of the loop with an
error of -ENOENT if there is one.
Fixes the panic reported here:
https://bugzilla.redhat.com/show_bug.cgi?id=727927
Reported-by: TR Bentley <home@trarbentley.net>
Reported-by: Iain Arnell <iarnell@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Regression from 2.6.39...
The delimiters in the prefixpath are not being converted based on
whether posix paths are in effect. Fixes:
https://bugzilla.redhat.com/show_bug.cgi?id=727834
Reported-and-Tested-by: Iain Arnell <iarnell@gmail.com>
Reported-by: Patrick Oltmann <patrick.oltmann@gmx.net>
Cc: Pavel Shilovsky <piastryyy@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
RCUify freeing acls, let check_acl() go ahead in RCU mode if acl is cached
get rid of boilerplate switches in posix_acl.h
fix block device fallout from ->fsync() changes
In the general raid-group case the truncate was wrong in that
it did not also fix the object length of the neighboring groups.
There are two bad cases in the old code:
1. Space that should be freed was not.
2. If a file That was big is truncated small, then made bigger
again, the holes would not contain zeros but could expose old data.
(If the growing of the file expands to more than a full
groups cycle + group size (> S + T))
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Since the beginning we realloced the sbi structure when a bigger
then one device table was specified. (I know that was really stupid).
Then much later when "register bdi" was added (By Jens) it was
registering the pointer to sbi->bdi before the realloc.
We never saw this problem because up till now the realloc did not
do anything since the device table was small enough to fit in the
original allocation. But once we starting testing with large device
tables (Bigger then 28) we noticed the crash of writeback operating
on a deallocated pointer.
* Avoid the all mess by allocating the device-table as a second array
and get rid of the variable-sized structure and the rest of this
mess.
* Take the chance to clean near by structures and comments.
* Add a needed dprint on startup to indicate the loaded layout.
* Also move the bdi registration to the very end because it will
only fail in a low memory, which will probably fail before hand.
There are many more likely causes to not load before that. This
way the error handling is made simpler. (Just doing this would be
enough to fix the BUG)
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
If the client is in the process of resetting the session when it receives
a callback, then returning NFS4ERR_DELAY may cause a deadlock with the
DESTROY_SESSION call.
Basically, if the client returns NFS4ERR_DELAY in response to the
CB_SEQUENCE call, then the server is entitled to believe that the
client is busy because it is already processing that call. In that
case, the server is perfectly entitled to respond with a
NFS4ERR_BACK_CHAN_BUSY to any DESTROY_SESSION call.
Fix this by having the client reply with a NFS4ERR_BADSESSION in
response to the callback if it is resetting the session.
Cc: stable@kernel.org [2.6.38+]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, there is no guarantee that we will call nfs4_cb_take_slot() even
though nfs4_callback_compound() will consistently call
nfs4_cb_free_slot() provided the cb_process_state has set the 'clp' field.
The result is that we can trigger the BUG_ON() upon the next call to
nfs4_cb_take_slot().
This patch fixes the above problem by using the slot id that was taken in
the CB_SEQUENCE operation as a flag for whether or not we need to call
nfs4_cb_free_slot().
It also fixes an atomicity problem: we need to set tbl->highest_used_slotid
atomically with the check for NFS4_SESSION_DRAINING, otherwise we end up
racing with the various tests in nfs4_begin_drain_session().
Cc: stable@kernel.org [2.6.38+]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There were bugs in the case of partial layout where olo_comp_index
is not zero. This used to work and was tested but one of the later
cleanup SQUASHMEs broke it and was not tested since.
Also add a dprint that specify those received layout parameters.
Everything else was already printed.
[Needed in v3.0]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When we have a situation that the number of pages we want
to encode is bigger then the size of the bio. (Which can
currently happen only when all IO is going to a single device
.e.g group_width==1) then the IO is submitted short and we
report back only the amount of bytes we actually wrote/read
and all is fine. BUT ...
There was a bug that the current length counter was advanced
before the fail to add the extra page, and we come to a situation
that the CDB length was one-page longer then the actual bio size,
which is of course rejected by the osd-target.
While here also fix the bio size calculation, in the case
that we received more then one group of devices.
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix this compile error on s390:
CC [M] fs/nfs/blocklayout/blocklayout.o
fs/nfs/blocklayout/blocklayout.c: In function 'bl_end_io_read':
fs/nfs/blocklayout/blocklayout.c:201:4: error: implicit declaration of function 'prefetchw'
Introduced with 9549ec01 "pnfsblock: bl_read_pagelist".
Cc: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Expand the fs/Kconfig "help" info to clarify why it's a bad idea to
deselect the TMPFS_POSIX_ACL config variable.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove PageSwapBacked (!page_is_file_cache) cases from
add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now
go through shmem_add_to_page_cache().
Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(),
and add a comment on swap entries to invalidate_mapping_pages().
And mincore_page() uses find_get_page() on what might be shmem or a
tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to
find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef).
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix new kernel-doc warning in fs/dcache.c:
Warning(fs/dcache.c:797): No description found for parameter 'sb'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
if we failed on getting mid entry in cifs_call_async.
Cc: stable@kernel.org
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Commit 9933fc0i (ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and
ext4_kvfree()) intruduced wrappers around k*alloc/vmalloc but introduced
a typo for ext4_kzalloc() by not using kzalloc() but kmalloc().
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (31 commits)
Btrfs: don't call writepages from within write_full_page
Btrfs: Remove unused variable 'last_index' in file.c
Btrfs: clean up for find_first_extent_bit()
Btrfs: clean up for wait_extent_bit()
Btrfs: clean up for insert_state()
Btrfs: remove unused members from struct extent_state
Btrfs: clean up code for merging extent maps
Btrfs: clean up code for extent_map lookup
Btrfs: clean up search_extent_mapping()
Btrfs: remove redundant code for dir item lookup
Btrfs: make acl functions really no-op if acl is not enabled
Btrfs: remove remaining ref-cache code
Btrfs: remove a BUG_ON() in btrfs_commit_transaction()
Btrfs: use wait_event()
Btrfs: check the nodatasum flag when writing compressed files
Btrfs: copy string correctly in INO_LOOKUP ioctl
Btrfs: don't print the leaf if we had an error
btrfs: make btrfs_set_root_node void
Btrfs: fix oops while writing data to SSD partitions
Btrfs: Protect the readonly flag of block group
...
Fix up trivial conflicts (due to acl and writeback cleanups) in
- fs/btrfs/acl.c
- fs/btrfs/ctree.h
- fs/btrfs/extent_io.c
cifs: demote DFS referral lookup errors to cFYI
Now that we call into this routine on every mount, anyone who doesn't
have the upcall configured will get multiple printks about failed lookups.
Reported-and-Tested-by: Martijn Uffing <mp3project@sarijopen.student.utwente.nl>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (60 commits)
ext4: prevent memory leaks from ext4_mb_init_backend() on error path
ext4: use EXT4_BAD_INO for buddy cache to avoid colliding with valid inode #
ext4: use ext4_msg() instead of printk in mballoc
ext4: use ext4_kvzalloc()/ext4_kvmalloc() for s_group_desc and s_group_info
ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and ext4_kvfree()
ext4: use the correct error exit path in ext4_init_inode_table()
ext4: add missing kfree() on error return path in add_new_gdb()
ext4: change umode_t in tracepoint headers to be an explicit __u16
ext4: fix races in ext4_sync_parent()
ext4: Fix overflow caused by missing cast in ext4_fallocate()
ext4: add action of moving index in ext4_ext_rm_idx for Punch Hole
ext4: simplify parameters of reserve_backup_gdb()
ext4: simplify parameters of add_new_gdb()
ext4: remove lock_buffer in bclean() and setup_new_group_blocks()
ext4: simplify journal handling in setup_new_group_blocks()
ext4: let setup_new_group_blocks() set multiple bits at a time
ext4: fix a typo in ext4_group_extend()
ext4: let ext4_group_add_blocks() handle 0 blocks quickly
ext4: let ext4_group_add_blocks() return an error code
ext4: rename ext4_add_groupblocks() to ext4_group_add_blocks()
...
Fix up conflict in fs/ext4/inode.c: commit aacfc19c62 ("fs: simplify
the blockdev_direct_IO prototype") had changed the ext4_ind_direct_IO()
function for the new simplified calling convention, while commit
dae1e52cb1 ("ext4: move ext4_ind_* functions from inode.c to
indirect.c") moved the function to another file.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
xfs: Fix build breakage in xfs_iops.c when CONFIG_FS_POSIX_ACL is not set
VFS: Reorganise shrink_dcache_for_umount_subtree() after demise of dcache_lock
VFS: Remove dentry->d_lock locking from shrink_dcache_for_umount_subtree()
VFS: Remove detached-dentry counter from shrink_dcache_for_umount_subtree()
switch posix_acl_chmod() to umode_t
switch posix_acl_from_mode() to umode_t
switch posix_acl_equiv_mode() to umode_t *
switch posix_acl_create() to umode_t *
block: initialise bd_super in bdget()
vfs: avoid call to inode_lru_list_del() if possible
vfs: avoid taking inode_hash_lock on pipes and sockets
vfs: conditionally call inode_wb_list_del()
VFS: Fix automount for negative autofs dentries
Btrfs: load the key from the dir item in readdir into a fake dentry
devtmpfs: missing initialialization in never-hit case
hppfs: missing include
* 'pstore-efi' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
efivars: Introduce PSTORE_EFI_ATTRIBUTES
efivars: Use string functions in pstore_write
efivars: introduce utf16_strncmp
efivars: String functions
efi: Add support for using efivars as a pstore backend
pstore: Allow the user to explicitly choose a backend
pstore: Make "part" unsigned
pstore: Add extra context for writes and erases
pstore: Extend API for more flexibility in new backends
In ext4_mb_init(), if the s_locality_group allocation fails it will
currently cause the allocations made in ext4_mb_init_backend() to
be leaked. Moving the ext4_mb_init_backend() allocation after the
s_locality_group allocation avoids that problem.
Signed-off-by: Yu Jian <yujian@whamcloud.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When doing a writepage we call writepages to try and write out any other dirty
pages in the area. This could cause problems where we commit a transaction and
then have somebody else dirtying metadata in the area as we could end up writing
out a lot more than we care about, which could cause latency on anybody who is
waiting for the transaction to completely finish committing. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The variable 'last_index' is calculated in the __btrfs_buffered_write
function and passed as a parameter to the prepare_pages function,
but is not used anywhere in the prepare_pages function.
Remove instances of 'last_index' in these functions.
Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
find_first_extent_bit() and find_first_extent_bit_state() share
most of the code, and we can just make the former call the latter.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can just use cond_resched_lock().
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These members are not used at all.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
unpin_extent_cache() and add_extent_mapping() shares the same code
that merges extent maps.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
lookup_extent_map() and search_extent_map() can share most of code.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
rb_node returned by __tree_search() can be a valid pointer or NULL,
but won't be some errno.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we search a dir item with a specific hash code, we can
just return NULL without further checking if btrfs_search_slot()
returns 1.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since commit f2a97a9dbd
("btrfs: remove all unused functions"), there's no extern functions
at all in ref-cache.c, so just remove the remaining dead code.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use wait_event() when possible to avoid code duplication.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If mounting with nodatasum option, we won't csum file data for
general write or direct-io write, and this rule should also be
applied when writing compressed files.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Memory areas [ptr, ptr+total_len] and [name, name+total_len]
may overlap, so it's wrong to use memcpy().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In __btrfs_free_extent we will print the leaf if we fail to find the extent we
wanted, but the problem is if we get an error we won't have a leaf so often this
leads to a NULL pointer dereference and we lose the error that actually
occurred. So only print the leaf if ret > 0, which means we didn't find the
item we were looking for but we didn't error either. This way the error is
preserved.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This is fairly trivial - btrfs_set_root_node() - always returns zero so we
can just make it void. All callers ignore the return code now anyway. I
also made sure to check that none of the functions that
btrfs_set_root_node() calls returns an error that we might have needed to
catch and pass back.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Here I have a two SSD-partitions btrfs, and they are defaultly set to
"data=raid0, metadata=raid1", then I try to fill my btrfs partition
till "No space left on device", via "dd if=/dev/zero of=/mnt/btrfs/tmp".
I get an oops panic from kernel BUG at fs/btrfs/extent-tree.c:5199!, which
refers to find_free_extent's
BUG_ON(index != get_block_group_index(block_group));
In SSD mode, in order to find enough space to alloc, we may check the
block_group cache which has been checked sometime before, but the index is not
updated, where it hits the BUG_ON.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Acked-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The access for ro in btrfs_block_group_cache should be protected
because of the racy lock in relocation.
Signed-off-by: Wu Bo <wu.bo@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The set/clear bit and the extent split/merge hooks only ever return 0.
Changing them to return void simplifies the error handling cases later.
This patch changes the hook prototypes, the single implementation of each,
and the functions that call them to return void instead.
Since all four of these hooks execute under a spinlock, they're necessarily
simple.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We passed the wrong value to btrfs_force_ra(). Fix this by changing
the argument of btrfs_force_ra() from last_index to nr_page.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs_unlink_inode() and btrfs_orphan_add() in btrfs_unlink()
are error, the error code is returned to the caller instead of
BUG_ON().
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Don't need to check the return value of __btrfs_add_inode_defrag(),
since it will always return 0.
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This fixes a race during unmount. We need to not only make sure that
the journal is completely written, but that the metadata changes make
it to disk before releasing ipimap and ipbmap.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
CIFS: Cleanup demupltiplex thread exiting code
CIFS: Move mid search to a separate function
CIFS: Move RFC1002 check to a separate function
CIFS: Simplify socket reading in demultiplex thread
CIFS: Move buffer allocation to a separate function
cifs: remove unneeded variable initialization in cifs_reconnect_tcon
cifs: simplify refcounting for oplock breaks
cifs: fix compiler warning in CIFSSMBQAllEAs
cifs: fix name parsing in CIFSSMBQAllEAs
cifs: don't start signing too early
cifs: trivial: goto out here is unnecessary
cifs: advertise the right receive buffer size to the server
Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Move reading to separate function and remove csocket variable.
Also change semantic in a little: goto incomplete_rcv only when
we get -EAGAIN (or a familiar error) while reading rfc1002 header.
In this case we don't check for echo timeout when we don't get whole
header at once, as it was before.
Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Introduce new helper functions which try kmalloc, and then fall back
to vmalloc if necessary, and use them for allocating and deallocating
s_flex_groups.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This patch lets ext4_init_inode_table() handle errors right.
ext4_init_inode_table() should down_write() alloc_sem which
has been up_write()ed and stop the started journal handle.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
commit 4e34e719e4, that takes the ACL checks to common code,
accidentely broke the build when CONFIG_FS_POSIX_ACL is not set:
CC fs/xfs/linux-2.6/xfs_iops.o
fs/xfs/linux-2.6/xfs_iops.c:1025:14: error: ‘xfs_get_acl’ undeclared here (not in a function)
Fix this by declaring xfs_get_acl a static inline function.
Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reorganise shrink_dcache_for_umount_subtree() in light of the demise of
dcache_lock. Without that dcache_lock, there is no need for the batching of
removal of dentries from the system under it (we wanted to make intensive use
of the locked data whilst we held it, but didn't want to hold it for long at a
time).
This works, provided the preceding patch is correct in its removal of locking
on dentry->d_lock on the basis that no one should be locking these dentries any
more as the whole superblock is defunct.
With this patch, the calls to dentry_lru_del() and __d_shrink() are placed at
the point where each dentry is detached handled.
It is possible that, as an alternative, the batching should still be done -
but only for dentry_lru_del() of all a dentry's children in one go. In such a
case, the batching would be done under dcache_lru_lock.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Locks of the dcache_lock were replaced by locks of dentry->d_lock in commits
such as:
23044507832fd6b7f507
as part of the RCU-based pathwalk changes, despite the fact that the caller
(shrink_dcache_for_umount()) notes in the banner comment the reasons that
d_lock is not necessary in these functions:
/*
* destroy the dentries attached to a superblock on unmounting
* - we don't need to use dentry->d_lock because:
* - the superblock is detached from all mountings and open files, so the
* dentry trees will not be rearranged by the VFS
* - s_umount is write-locked, so the memory pressure shrinker will ignore
* any dentries belonging to this superblock that it comes across
* - the filesystem itself is no longer permitted to rearrange the dentries
* in this superblock
*/
So remove these locks. If the locks are actually necessary, then this banner
comment should be altered instead.
The hash table chains are protected by 1-bit locks in the hash table heads, so
those shouldn't be a problem.
Note that to make this work, __d_drop() has to be split so that the RCUwalk
barrier can be avoided. This causes problems otherwise as it has an assertion
that dentry->d_lock is locked - but there is no need for that as no one else
can be trying to access this dentry, except to step over it (and that should
be handled by d_free(), I think).
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove the detached-dentry counter from shrink_dcache_for_umount_subtree() as
the value it computes is no longer used as of commit
312d3ca856 which made the nr_dentry counters
summed per-CPU rather than global atomic.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
bd_super is currently reset to NULL in kill_block_super() so we rely on previous
users of the block_device object to initialise this value for the next user.
This quirk was exposed on RHEL5 when a third party filesystem did not always use
kill_block_super() and therefore bd_super wasn't being reset when a block_device
object was recycled within the cache. This may not be a problem upstream but
makes sense to be defensive.
Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
inode_lru_list_del() is expensive because of per superblock lru locking,
while some inodes are not in lru list.
Adding a check in iput_final() can speedup pipe/sockets workloads on
SMP.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Some inodes (pipes, sockets, ...) are not hashed, no need to take
contended inode_hash_lock at dismantle time.
nice speedup on SMP machines on socket intensive workloads.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Some inodes (pipes, sockets, ...) are not in bdi writeback list.
evict() can avoid calling inode_wb_list_del() and its expensive spinlock
by checking inode i_wb_list being empty or not.
At this point, no other cpu/user can concurrently manipulate this inode
i_wb_list
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Autofs may set the DCACHE_NEED_AUTOMOUNT flag on negative dentries. These
need attention from the automounter daemon regardless of the LOOKUP_FOLLOW flag.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In btrfs we have 2 indexes for inodes. One is for readdir, it's in this nice
sequential order and works out brilliantly for readdir. However if you use ls,
it usually stat's each file it gets from readdir. This is where the second
index comes in, which is based on a hash of the name of the file. So then the
lookup has to lookup this index, and then lookup the inode. The index lookup is
going to be in random order (since its based on the name hash), which gives us
less than stellar performance. Since we know the inode location from the
readdir index, I create a dummy dentry and copy the location key into
dentry->d_fsdata. Then on lookup if we have d_fsdata we use that location to
lookup the inode, avoiding looking up the other directory index. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-and-acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Currently, we take a sb->s_active reference and a cifsFileInfo reference
when an oplock break workqueue job is queued. This is unnecessary and
more complicated than it needs to be. Also as Al points out,
deactivate_super has non-trivial locking implications so it's best to
avoid that if we can.
Instead, just cancel any pending oplock breaks for this filehandle
synchronously in cifsFileInfo_put after taking it off the lists.
That should ensure that this job doesn't outlive the structures it
depends on.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The recent fix to the above function causes this compiler warning to pop
on some gcc versions:
CC [M] fs/cifs/cifssmb.o
fs/cifs/cifssmb.c: In function ‘CIFSSMBQAllEAs’:
fs/cifs/cifssmb.c:5708: warning: ‘ea_name_len’ may be used uninitialized in
this function
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The code that matches EA names in CIFSSMBQAllEAs is incorrect. It
uses strncmp to do the comparison with the length limited to the
name_len sent in the response.
Problem: Suppose we're looking for an attribute named "foobar" and
have an attribute before it in the EA list named "foo". The
comparison will succeed since we're only looking at the first 3
characters. Fix this by also comparing the length of the provided
ea_name with the name_len in the response. If they're not equal then
it shouldn't match.
Reported-by: Jian Li <jiali@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Sniffing traffic on the wire shows that windows clients send a zeroed
out signature field in a NEGOTIATE request, and send "BSRSPYL" in the
signature field during SESSION_SETUP. Make the cifs client behave the
same way.
It doesn't seem to make much difference in any server that I've tested
against, but it's probably best to follow windows behavior as closely as
possible here.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Currently, we mirror the same size back to the server that it sends us.
That makes little sense. Instead we should be sending the server the
maximum buffer size that we can handle -- CIFSMaxBufSize minus the
4 byte RFC1001 header.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
For invalid extents, find other pages in the same fsblock and write them out.
[pnfsblock: write_begin]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Note: When upper layer's read/write request cannot be fulfilled, the block
layout driver shouldn't silently mark the page as error. It should do
what can be done and leave the rest to the upper layer. To do so, we
should set rdata/wdata->res.count properly.
When upper layer re-send the read/write request to finish the rest
part of the request, pgbase is the position where we should start at.
[pnfsblock: bl_write_pagelist support functions]
[pnfsblock: bl_write_pagelist adjust for missing PG_USE_PNFS]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: handle errors when read or write pagelist.]
Signed-off-by: Zhang Jingwang <yyalone@gmail.com>
[pnfs-block: use new write_pagelist api]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
[SQUASHME: pnfsblock: mds_offset is set in the generic layer]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[pnfsblock: mark IO error with NFS_LAYOUT_{RW|RO}_FAILED]
Signed-off-by: Peng Tao <peng_tao@emc.com>
[pnfsblock: SQUASHME: adjust to API change]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: fixup blksize alignment in bl_setup_layoutcommit]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[pnfsblock: bl_write_pagelist adjust for missing PG_USE_PNFS]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: handle errors when read or write pagelist.]
Signed-off-by: Zhang Jingwang <yyalone@gmail.com>
[pnfs-block: use new write_pagelist api]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Note: When upper layer's read/write request cannot be fulfilled, the block
layout driver shouldn't silently mark the page as error. It should do
what can be done and leave the rest to the upper layer. To do so, we
should set rdata/wdata->res.count properly.
When upper layer re-send the read/write request to finish the rest
part of the request, pgbase is the position where we should start at.
[pnfsblock: mark IO error with NFS_LAYOUT_{RW|RO}_FAILED]
Signed-off-by: Peng Tao <peng_tao@emc.com>
[pnfsblock: read path error handling]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: handle errors when read or write pagelist.]
Signed-off-by: Zhang Jingwang <yyalone@gmail.com>
[pnfs-block: use new read_pagelist api]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In blocklayout driver. There are two things happening
while layoutcommit/cleanup.
1. the modified extents are encoded.
2. On cleanup the extents are put back on the layout rw
extents list, for reads.
In the new system where actual xdr encoding is done in
encode_layoutcommit() directly into xdr buffer, these are
the new commit stages:
1. On setup_layoutcommit, the range is adjusted as before
and a structure is allocated for communication with
bl_encode_layoutcommit && bl_cleanup_layoutcommit
(Generic layer provides a void-star to hang it on)
2. bl_encode_layoutcommit is called to do the actual
encoding directly into xdr. The commit-extent-list is not
freed and is stored on above structure.
FIXME: The code is not yet converted to the new XDR cleanup
3. On cleanup the commit-extent-list is put back by a call
to set_to_rw() as before, but with no need for XDR decoding
of the list as before. And the commit-extent-list is freed.
Finally allocated structure is freed.
[rm inode and pnfs_layout_hdr args from cleanup_layoutcommit()]
Signed-off-by: Jim Rees <rees@umich.edu>
[pnfsblock: introduce bl_committing list]
Signed-off-by: Peng Tao <peng_tao@emc.com>
[pnfsblock: SQUASHME: adjust to API change]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[pnfsblock: fix bug setting up layoutcommit.]
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
[pnfsblock: cleanup_layoutcommit wants a status parameter]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In blocklayout driver. There are two things happening
while layoutcommit/cleanup.
1. the modified extents are encoded.
2. On cleanup the extents are put back on the layout rw
extents list, for reads.
In the new system where actual xdr encoding is done in
encode_layoutcommit() directly into xdr buffer, these are
the new commit stages:
1. On setup_layoutcommit, the range is adjusted as before
and a structure is allocated for communication with
bl_encode_layoutcommit && bl_cleanup_layoutcommit
(Generic layer provides a void-star to hang it on)
2. bl_encode_layoutcommit is called to do the actual
encoding directly into xdr. The commit-extent-list is not
freed and is stored on above structure.
FIXME: The code is not yet converted to the new XDR cleanup
3. On cleanup the commit-extent-list is put back by a call
to set_to_rw() as before, but with no need for XDR decoding
of the list as before. And the commit-extent-list is freed.
Finally allocated structure is freed.
[rm inode and pnfs_layout_hdr args from cleanup_layoutcommit()]
[pnfsblock: get rid of deprecated xdr macros]
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[pnfsblock: fix bug setting up layoutcommit.]
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
[pnfsblock: prevent commit list corruption]
[pnfsblock: fix layoutcommit with an empty opaque]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Adds working implementations of various support functions
to handle INVAL extents, needed by writes, such as
bl_mark_sectors_init and bl_is_sector_init.
[pnfsblock: fix 64-bit compiler warnings for extent manipulation]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[Implement release_inval_marks]
Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Implement bl_find_get_extent(), one of the core extent manipulation
routines.
[pnfsblock: Lookup list entry of layouts and tags in reverse order]
Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Jim Rees <rees@umich.edu>
pnfsblock: fix print format warnings for sector_t and size_t
gcc spews warnings about these on x86_64, e.g.:
fs/nfs/blocklayout/blocklayout.c:74: warning: format ‘%Lu’ expects type ‘long long unsigned int’, but argument 2 has type ‘sector_t’
fs/nfs/blocklayout/blocklayout.c:388: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
XDR decodes the block layout payload sent in LAYOUTGET result, storing
the result in an extent list.
[pnfsblock: get rid of deprecated xdr macros]
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: fix bug getting pnfs_layout_type in translate_devid().]
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Call GETDEVICELIST during mount, then call and parse GETDEVICEINFO
for each device returned.
[pnfsblock: get rid of deprecated xdr macros]
Signed-off-by: Jim Rees <rees@umich.edu>
[pnfsblock: fix pnfs_deviceid references]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: fix print format warnings for sector_t and size_t]
[pnfs-block: #include <linux/vmalloc.h>]
[pnfsblock: no PNFS_NFS_SERVER]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfsblock: fix bug determining size of striped volume]
[pnfsblock: fix oops when using multiple devices]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[pnfsblock: get rid of vmap and deviceid->area structure]
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Replace a stub, so that extents underlying the layouts are properly
added, merged, or ignored as necessary.
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: delete the new node before put it]
Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This gives layout driver a chance to cleanup structures they put in at
encode_layoutcommit.
Signed-off-by: Andy Adamson <andros@netapp.com>
[fixup layout header pointer for layoutcommit]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[rm inode and pnfs_layout_hdr args from cleanup_layoutcommit()]
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
To allow layout driver to issue getdevicelist at mount time, and clean up
at umount time.
[fixup non NFS_V4_1 set_pnfs_layoutdriver definition]
[pnfs: pass mntfh down the init_pnfs path]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Using NFS4_MAX_UINT64 will break current protocol.
[Needed in v3.0]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There can be multiple lseg per file, so layoutcommit should be
able to handle it.
[Needed in v3.0]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
No need to save it for every lseg.
No need to save it at every pnfs_set_layoutcommit.
[Needed in v3.0]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
No need to save it for every lseg.
[Needed in v3.0]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This fixes a soft lockup on conditions
a) the flusher is working on a work by __bdi_start_writeback(), while
b) someone else calls writeback_inodes_sb*() or sync_inodes_sb(), which
grab sb->s_umount and enqueue a new work for the flusher to execute
The s_umount grabbed by (b) will fail the grab_super_passive() in (a).
Then if the inode is requeued, wb_writeback() will busy retry on it.
As a result, wb_writeback() loops for ever without releasing
wb->list_lock, which further blocks other tasks.
Fix the busy loop by redirtying the inode. This may undesirably delay
the writeback of the inode, however most likely it will be picked up
soon by the queued work by writeback_inodes_sb*(), sync_inodes_sb() or
even writeback_inodes_wb().
bug url: http://www.spinics.net/lists/linux-fsdevel/msg47292.html
Reported-by: Christoph Hellwig <hch@infradead.org>
Tested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Print out the name of the file that triggers the cookie loop message to
make it slightly easier to track down the cause.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the directory contents change, then we have to accept that the
file->f_pos value may shrink if we do a 'search-by-cookie'. In that
case, we should turn off the loop detection and let the NFS client
try to recover.
The patch also fixes a second loop detection bug by ensuring
that after turning on the ctx->duped flag, we read at least one new
cookie into ctx->dir_cookie before attempting to match with
ctx->dup_cookie.
Reported-by: Petr Vandrovec <petr@vandrovec.name>
Cc: stable@kernel.org [2.6.39+]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We added some more error handling in b40971426a "ext4: add error
checking to calls to ext4_handle_dirty_metadata()". But we need to
call kfree() as well to avoid a memory leak.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Fix problems if fsync() races against a rename of a parent directory
as pointed out by Al Viro in his own inimitable way:
>While we are at it, could somebody please explain what the hell is ext4
>doing in
>static int ext4_sync_parent(struct inode *inode)
>{
> struct writeback_control wbc;
> struct dentry *dentry = NULL;
> int ret = 0;
>
> while (inode && ext4_test_inode_state(inode, EXT4_STATE_NEWENTRY)) {
> ext4_clear_inode_state(inode, EXT4_STATE_NEWENTRY);
> dentry = list_entry(inode->i_dentry.next,
> struct dentry, d_alias);
> if (!dentry || !dentry->d_parent || !dentry->d_parent->d_inode)
> break;
> inode = dentry->d_parent->d_inode;
> ret = sync_mapping_buffers(inode->i_mapping);
> ...
>Note that dentry obviously can't be NULL there. dentry->d_parent is never
>NULL. And dentry->d_parent would better not be negative, for crying out
>loud! What's worse, there's no guarantees that dentry->d_parent will
>remain our parent over that sync_mapping_buffers() *and* that inode won't
>just be freed under us (after rename() and memory pressure leading to
>eviction of what used to be our dentry->d_parent)......
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
ecryptfs: Make inode bdi consistent with superblock bdi
eCryptfs: Unlock keys needed by ecryptfsd
When commit 4e34e719e4 ("fs: take the ACL checks to common code")
changed the xyz_check_acl() functions into the more natural
xyz_get_acl() interface, we grew two copies of the
#define ext2_get_acl NULL
define for the non-acl case.
Remove the extra one.
Reported-by: Marco Stornelli <marco.stornelli@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit 4e34e719e4, that takes the ACL checks to common code,
accidentely broke the build when CONFIG_FS_POSIX_ACL is not set:
CC fs/xfs/linux-2.6/xfs_iops.o
fs/xfs/linux-2.6/xfs_iops.c:1025:14: error: ‘xfs_get_acl’ undeclared here (not in a function)
Fix this by declaring xfs_get_acl a static inline function.
Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Make the inode mapping bdi consistent with the superblock bdi so that
dirty pages are flushed properly.
Signed-off-by: Thieu Le <thieule@chromium.org>
Cc: <stable@kernel.org> [2.6.39+]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Fixes a regression caused by b5695d0463
Kernel keyring keys containing eCryptfs authentication tokens should not
be write locked when calling out to ecryptfsd to wrap and unwrap file
encryption keys. The eCryptfs kernel code can not hold the key's write
lock because ecryptfsd needs to request the key after receiving such a
request from the kernel.
Without this fix, all file opens and creates will timeout and fail when
using the eCryptfs PKI infrastructure. This is not an issue when using
passphrase-based mount keys, which is the most widely deployed eCryptfs
configuration.
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Acked-by: Roberto Sassu <roberto.sassu@polito.it>
Tested-by: Roberto Sassu <roberto.sassu@polito.it>
Tested-by: Alexis Hafner1 <haf@zurich.ibm.com>
Cc: <stable@kernel.org> [2.6.39+]
When someone writes to an inode, readers accessing the same inode via
ocfs2_readpage() just busyloop trying to get ip_alloc_sem because
do_generic_file_read() looks up the page again and retries ->readpage()
when previous attempt failed with AOP_TRUNCATED_PAGE. When there are enough
readers, they can occupy all CPUs and in non-preempt kernel the system is
deadlocked because writer holding ip_alloc_sem is never run to release the
semaphore. Fix the problem by making reader block on ip_alloc_sem to break
the busy loop.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Fix a corruption that can happen when we have (two or more) outstanding
aio's to an overlapping unaligned region. Ext4
(e9e3bcecf4) and xfs recently had to fix
similar issues.
In our case what happens is that we can have an outstanding aio on a region
and if a write comes in with some bytes overlapping the original aio we may
decide to read that region into a page before continuing (typically because
of buffered-io fallback). Since we have no ordering guarantees with the
aio, we can read stale or bad data into the page and then write it back out.
If the i/o is page and block aligned, then we avoid this issue as there
won't be any need to read data from disk.
I took the same approach as Eric in the ext4 patch and introduced some
serialization of unaligned async direct i/o. I don't expect this to have an
effect on the most common cases of AIO. Unaligned aio will be slower
though, but that's far more acceptable than data corruption.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (54 commits)
tpm_nsc: Fix bug when loading multiple TPM drivers
tpm: Move tpm_tis_reenable_interrupts out of CONFIG_PNP block
tpm: Fix compilation warning when CONFIG_PNP is not defined
TOMOYO: Update kernel-doc.
tpm: Fix a typo
tpm_tis: Probing function for Intel iTPM bug
tpm_tis: Fix the probing for interrupts
tpm_tis: Delay ACPI S3 suspend while the TPM is busy
tpm_tis: Re-enable interrupts upon (S3) resume
tpm: Fix display of data in pubek sysfs entry
tpm_tis: Add timeouts sysfs entry
tpm: Adjust interface timeouts if they are too small
tpm: Use interface timeouts returned from the TPM
tpm_tis: Introduce durations sysfs entry
tpm: Adjust the durations if they are too small
tpm: Use durations returned from TPM
TOMOYO: Enable conditional ACL.
TOMOYO: Allow using argv[]/envp[] of execve() as conditions.
TOMOYO: Allow using executable's realpath and symlink's target as conditions.
TOMOYO: Allow using owner/group etc. of file objects as conditions.
...
Fix up trivial conflict in security/tomoyo/realpath.c
The logical block number in map.l_blk is a __u32, and so before we
shift it left, by the block size, we neeed cast it to a 64-bit size.
Otherwise i_size can be corrupted on an ENOSPC.
# df -T /mnt/mp1
Filesystem Type 1K-blocks Used Available Use% Mounted on
/dev/sda6 ext4 9843276 153056 9190200 2% /mnt/mp1
# fallocate -o 0 -l 2199023251456 /mnt/mp1/testfile
fallocate: /mnt/mp1/testfile: fallocate failed: No space left on device
# stat /mnt/mp1/testfile
File: `/mnt/mp1/testfile'
Size: 4293656576 Blocks: 19380440 IO Block: 4096 regular file
Device: 806h/2054d Inode: 12 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2011-07-25 13:01:31.414490496 +0900
Modify: 2011-07-25 13:01:31.414490496 +0900
Change: 2011-07-25 13:01:31.454490495 +0900
Signed-off-by: Utako Kusaka <u-kusaka@wm.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
--
fs/ext4/extents.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
The old function ext4_ext_rm_idx is used only for truncate case
because it just remove last index in extent-index-block. When punching
hole, it usually needed to remove "middle" index, therefore we must
move indexes which after it forward.
(I create a file with 1 depth extent tree and punch hole in the middle
of it, the last index in index-block strangly gone, so I find out this
bug)
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The reserve_backup_gdb() function only needs the block group number;
there's no need to pass a pointer to struct ext4_new_group_data to it.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
add_new_gdb() only needs the block group number; there is no need to
pass a pointer to struct ext4_new_group_data to add_new_gdb().
Instead of filling in a pointer the struct buffer_head in
add_new_gdb(), it's simpler to have the caller fetch it from the
s_group_desc[] array.
[Fixed error path to handle the case where struct buffer_head *primary
hasn't been set yet. -- Ted]
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There is no need to lock the buffers since no one else should be
touching these buffers besides the file system.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors
Btrfs: use the commit_root for reading free_space_inode crcs
Btrfs: reduce extent_state lock contention for metadata
Btrfs: remove lockdep magic from btrfs_next_leaf
Btrfs: make a lockdep class for each root
Btrfs: switch the btrfs tree locks to reader/writer
Btrfs: fix deadlock when throttling transactions
Btrfs: stop using highmem for extent_buffers
Btrfs: fix BUG_ON() caused by ENOSPC when relocating space
Btrfs: tag pages for writeback in sync
Btrfs: fix enospc problems with delalloc
Btrfs: don't flush delalloc arbitrarily
Btrfs: use find_or_create_page instead of grab_cache_page
Btrfs: use a worker thread to do caching
Btrfs: fix how we merge extent states and deal with cached states
Btrfs: use the normal checksumming infrastructure for free space cache
Btrfs: serialize flushers in reserve_metadata_bytes
Btrfs: do transaction space reservation before joining the transaction
Btrfs: try to only do one btrfs_search_slot in do_setxattr
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: optimize the negative xattr caching
xfs: prevent against ioend livelocks in xfs_file_fsync
xfs: flag all buffers as metadata
xfs: encapsulate a block of debug code
* 'nfs-for-3.1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (44 commits)
NFSv4: Don't use the delegation->inode in nfs_mark_return_delegation()
nfs: don't use d_move in nfs_async_rename_done
RDMA: Increasing RPCRDMA_MAX_DATA_SEGS
SUNRPC: Replace xprt->resend and xprt->sending with a priority queue
SUNRPC: Allow caller of rpc_sleep_on() to select priority levels
SUNRPC: Support dynamic slot allocation for TCP connections
SUNRPC: Clean up the slot table allocation
SUNRPC: Initalise the struct xprt upon allocation
SUNRPC: Ensure that we grab the XPRT_LOCK before calling xprt_alloc_slot
pnfs: simplify pnfs files module autoloading
nfs: document nfsv4 sillyrename issues
NFS: Convert nfs4_set_ds_client to EXPORT_SYMBOL_GPL
SUNRPC: Convert the backchannel exports to EXPORT_SYMBOL_GPL
SUNRPC: sunrpc should not explicitly depend on NFS config options
NFS: Clean up - simplify the switch to read/write-through-MDS
NFS: Move the pnfs write code into pnfs.c
NFS: Move the pnfs read code into pnfs.c
NFS: Allow the nfs_pageio_descriptor to signal that a re-coalesce is needed
NFS: Use the nfs_pageio_descriptor->pg_bsize in the read/write request
NFS: Cache rpc_ops in struct nfs_pageio_descriptor
...
The btrfs transaction code will return any errors that come from
reserve_metadata_bytes. We need to make sure we don't return funny
things like 1 or EAGAIN.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since __proc_create() appends the name it is given to the end of the PDE
structure that it allocates, there isn't a need to store a name pointer.
Instead we can just replace the name pointer with a terminal char array of
_unspecified_ length. The compiler will simply append the string to statically
defined variables of PDE type overlapping any hole at the end of the structure
and, unlike specifying an explicitly _zero_ length array, won't give a warning
if you try to statically initialise it with a string of more than zero length.
Also, whilst we're at it:
(1) Move namelen to end just prior to name and reduce it to a single byte
(name shouldn't be longer than NAME_MAX).
(2) Move pde_unload_lock two places further on so that if it's four bytes in
size on a 64-bit machine, it won't cause an unused hole in the PDE struct.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we are using regular file crcs for the free space cache,
we can deadlock if we try to read the free_space_inode while we are
updating the crc tree.
This commit fixes things by using the commit_root to read the crcs. This is
safe because we the free space cache file would already be loaded if
that block group had been changed in the current transaction.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
For metadata buffers that don't straddle pages (all of them), btrfs
can safely use the page uptodate bits and extent_buffer uptodate bit
instead of needing to use the extent_state tree.
This greatly reduces contention on the state tree lock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before the reader/writer locks, btrfs_next_leaf needed to keep
the path blocking to avoid making lockdep upset.
Now that btrfs_next_leaf only takes read locks, this isn't required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch was originally from Tejun Heo. lockdep complains about the btrfs
locking because we sometimes take btree locks from two different trees at the
same time. The current classes are based only on level in the btree, which
isn't enough information for lockdep to figure out if the lock is safe.
This patch makes a class for each type of tree, and lumps all the FS trees that
actually have files and directories into the same class.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs metadata btree is the source of significant
lock contention, especially in the root node. This
commit changes our locking to use a reader/writer
lock.
The lock is built on top of rw spinlocks, and it
extends the lock tracking to remember if we have a
read lock or a write lock when we go to blocking. Atomics
count the number of blocking readers or writers at any
given time.
It removes all of the adaptive spinning from the old code
and uses only the spinning/blocking hints inside of btrfs
to decide when it should continue spinning.
In read heavy workloads this is dramatically faster. In write
heavy workloads we're still faster because of less contention
on the root node lock.
We suffer slightly in dbench because we schedule more often
during write locks, but all other benchmarks so far are improved.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Hit this nice little deadlock. What happens is this
__btrfs_end_transaction with throttle set, --use_count so it equals 0
btrfs_commit_transaction
<somebody else actually manages to start the commit>
btrfs_end_transaction --use_count so now its -1 <== BAD
we just return and wait on the transaction
This is bad because we just return after our use_count is -1 and don't let go
of our num_writer count on the transaction, so the guy committing the
transaction just sits there forever. Fix this by inc'ing our use_count if we're
going to call commit_transaction so that if we call btrfs_end_transaction it's
valid. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The extent_buffers have a very complex interface where
we use HIGHMEM for metadata and try to cache a kmap mapping
to access the memory.
The next commit adds reader/writer locks, and concurrent use
of this kmap cache would make it even more complex.
This commit drops the ability to use HIGHMEM with extent buffers,
and rips out all of the related code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we balanced the chunks across the devices, BUG_ON() in
__finish_chunk_alloc() was triggered.
------------[ cut here ]------------
kernel BUG at fs/btrfs/volumes.c:2568!
[SNIP]
Call Trace:
[<ffffffffa049525e>] btrfs_alloc_chunk+0x8e/0xa0 [btrfs]
[<ffffffffa04546b0>] do_chunk_alloc+0x330/0x3a0 [btrfs]
[<ffffffffa045c654>] btrfs_reserve_extent+0xb4/0x1f0 [btrfs]
[<ffffffffa045c86b>] btrfs_alloc_free_block+0xdb/0x350 [btrfs]
[<ffffffffa048a8d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
[<ffffffffa04476fd>] __btrfs_cow_block+0x14d/0x5e0 [btrfs]
[<ffffffffa044660d>] ? read_block_for_search+0x14d/0x4d0 [btrfs]
[<ffffffffa0447c9b>] btrfs_cow_block+0x10b/0x240 [btrfs]
[<ffffffffa044dd5e>] btrfs_search_slot+0x49e/0x7a0 [btrfs]
[<ffffffffa044f07d>] btrfs_insert_empty_items+0x8d/0xf0 [btrfs]
[<ffffffffa045e973>] insert_with_overflow+0x43/0x110 [btrfs]
[<ffffffffa045eb0d>] btrfs_insert_dir_item+0xcd/0x1f0 [btrfs]
[<ffffffffa0489bd0>] ? map_extent_buffer+0xb0/0xc0 [btrfs]
[<ffffffff812276ad>] ? rb_insert_color+0x9d/0x160
[<ffffffffa046cc40>] ? inode_tree_add+0xf0/0x150 [btrfs]
[<ffffffffa0474801>] btrfs_add_link+0xc1/0x1c0 [btrfs]
[<ffffffff811dacac>] ? security_inode_init_security+0x1c/0x30
[<ffffffffa04a28aa>] ? btrfs_init_acl+0x4a/0x180 [btrfs]
[<ffffffffa047492f>] btrfs_add_nondir+0x2f/0x70 [btrfs]
[<ffffffffa046af16>] ? btrfs_init_inode_security+0x46/0x60 [btrfs]
[<ffffffffa0474ac0>] btrfs_create+0x150/0x1d0 [btrfs]
[<ffffffff81159c63>] ? generic_permission+0x23/0xb0
[<ffffffff8115b415>] vfs_create+0xa5/0xc0
[<ffffffff8115ce6e>] do_last+0x5fe/0x880
[<ffffffff8115dc0d>] path_openat+0xcd/0x3d0
[<ffffffff8115e029>] do_filp_open+0x49/0xa0
[<ffffffff8116a965>] ? alloc_fd+0x95/0x160
[<ffffffff8114f0c7>] do_sys_open+0x107/0x1e0
[<ffffffff810bcc3f>] ? audit_syscall_entry+0x1bf/0x1f0
[<ffffffff8114f1e0>] sys_open+0x20/0x30
[<ffffffff81484ec2>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP [<ffffffffa049444a>] __finish_chunk_alloc+0x20a/0x220 [btrfs]
The reason is:
Task1 Space balance task
do_chunk_alloc()
__finish_chunk_alloc()
update device info
in the chunk tree
alloc system metadata block
relocate system metadata block group
set system metadata block group
readonly, This block group is the
only one that can allocate space. So
there is no free space that can be
allocated now.
find no space and don't try
to alloc new chunk, and then
return ENOSPC
BUG_ON() in __finish_chunk_alloc()
was triggered.
Fix this bug by allocating a new system metadata chunk before relocating the
old one if we find there is no free space which can be allocated after setting
the old block group to be read-only.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Everybody else does this, we need to do it too. If we're syncing, we need to
tag the pages we're going to write for writeback so we don't end up writing the
same stuff over and over again if somebody is constantly redirtying our file.
This will keep us from having latencies with heavy sync workloads. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
So I had this brilliant idea to use atomic counters for outstanding and reserved
extents, but this turned out to be a bad idea. Consider this where we have 1
outstanding extent and 1 reserved extent
Reserver Releaser
atomic_dec(outstanding) now 0
atomic_read(outstanding)+1 get 1
atomic_read(reserved) get 1
don't actually reserve anything because
they are the same
atomic_cmpxchg(reserved, 1, 0)
atomic_inc(outstanding)
atomic_add(0, reserved)
free reserved space for 1 extent
Then the reserver now has no actual space reserved for it, and when it goes to
finish the ordered IO it won't have enough space to do it's allocation and you
get those lovely warnings.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Kill the check to see if we have 512mb of reserved space in delalloc and
shrink_delalloc if we do. This causes unexpected latencies and we have other
logic to see if we need to throttle. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
grab_cache_page will use mapping_gfp_mask(), which for all inodes is set to
GFP_HIGHUSER_MOVABLE. So instead use find_or_create_page in all cases where we
need GFP_NOFS so we don't deadlock. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
A user reported a deadlock when copying a bunch of files. This is because they
were low on memory and kthreadd got hung up trying to migrate pages for an
allocation when starting the caching kthread. The page was locked by the person
starting the caching kthread. To fix this we just need to use the async thread
stuff so that the threads are already created and we don't have to worry about
deadlocks. Thanks,
Reported-by: Roman Mamedov <rm@romanrm.ru>
Signed-off-by: Josef Bacik <josef@redhat.com>
Since the addition of file capabilities every write needs to read xattrs to
check if we have any capabilities to clear. In Linux 3.0 Andi Kleen added
a flag to cache the fact that we do not have any attributes on an inode.
Make sure to already mark a file as not having any attributes when reading
it from disk in case it doesn't even have an attribute fork. Based on an
earlier patch from Andi Kleen.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
We need to take some locks to prevent new ioends from coming in when we wait
for all existing ones to go away. Up to Linux 3.0 that was done using the
i_mutex held by the VFS fsync code, but now that we are called without
it we need to take care of it ourselves. Use the I/O lock instead of
i_mutex just like we do in other places.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Now that REQ_META bios aren't treated specially in the CFQ I/O schedule
anymore, we can tag all buffers as metadata to make blktrace traces more
meaningful. Note that we use buffers also to zero out partial blocks
in the preallocation / hole punching code, and while they operate on
data blocks the zeros written certainly aren't data. I think this case
is borderline metadata enough to not bother special casing it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Pull into a helper function some debug-only code that validates a
xfs_da_blkinfo structure that's been read from disk.
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
This patch simplifies journal handling in setup_new_group_blocks().
In previous code, block bitmap is modified everywhere in
setup_new_group_blocks(), ext4_get_write_access() in
extend_or_restart_transaction() is used to guarantee that the block
bitmap stays in the new handle, this makes things complicated.
The previous commit changed things so that the modifications on the
block bitmap are batched and done by ext4_set_bits() at the end of the
for loop. This allows us to simplify things.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Rename mb_set_bits() to ext4_set_bits() and make it a global function
so that setup_new_group_blocks() can use it.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If ext4_group_add_blocks() is called with 0 block, make it return 0
without doing any extra work.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch lets ext4_group_add_blocks() return an error code if it
fails, so that upper functions can handle error correctly.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
A filesystem with errors is not allowed to being resized, otherwise,
it is easy to destroy the filesystem.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Before this patch, parallel resizers are allowed and protected by a
mutex lock, actually, there is no need to support parallel resizer, so
this patch prevents parallel resizers by atmoic bit ops, like
lock_page() and unlock_page() do.
To do this, the patch removed the mutex lock s_resize_lock from struct
ext4_sb_info and added a unsigned long field named s_resize_flags
which inidicates if there is a resizer.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
merge fchmod() and fchmodat() guts, kill ancient broken kludge
xfs: fix misspelled S_IS...()
xfs: get rid of open-coded S_ISREG(), etc.
vfs: document locking requirements for d_move, __d_move and d_materialise_unique
omfs: fix (mode & S_IFDIR) abuse
btrfs: S_ISREG(mode) is not mode & S_IFREG...
ima: fmode_t misspelled as mode_t...
pci-label.c: size_t misspelled as mode_t
jffs2: S_ISLNK(mode & S_IFMT) is pointless
snd_msnd ->mode is fmode_t, not mode_t
v9fs_iop_get_acl: get rid of unused variable
vfs: dont chain pipe/anon/socket on superblock s_inodes list
Documentation: Exporting: update description of d_splice_alias
fs: add missing unlock in default_llseek()
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
acct_arg_size() takes ->page_table_lock around add_mm_counter() if
!SPLIT_RSS_COUNTING. This is not needed after commit 172703b08c ("mm:
delete non-atomic mm counter implementation").
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If CONFIG_MODULES=n, it makes no sense to retry the list of binary formats
handler because the list will not be modified by request_module().
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Richard Weinberger <richard@nod.at>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, search_binary_handler() tries to load binary loader module
using request_module() if a loader for the requested program is not yet
loaded. But second attempt of request_module() does not affect the result
of search_binary_handler().
If request_module() triggered recursion, calling request_module() twice
causes 2 to the power of MAX_KMOD_CONCURRENT (= 50) repetitions. It is
not an infinite loop but is sufficient for users to consider as a hang up.
Therefore, this patch changes not to call request_module() twice, making 1
to the power of MAX_KMOD_CONCURRENT repetitions in case of recursion.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Richard Weinberger <richard@nod.at>
Tested-by: Richard Weinberger <richard@nod.at>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit a8bef8ff6e ("mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks") introduced a BUG_ON() to ensure that VM_STACK_FLAGS
and VM_STACK_INCOMPLETE_SETUP do not overlap. The check is a compile
time one, so BUILD_BUG_ON is more appropriate.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If an inode's mode permits opening /proc/PID/io and the resulting file
descriptor is kept across execve() of a setuid or similar binary, the
ptrace_may_access() check tries to prevent using this fd against the
task with escalated privileges.
Unfortunately, there is a race in the check against execve(). If
execve() is processed after the ptrace check, but before the actual io
information gathering, io statistics will be gathered from the
privileged process. At least in theory this might lead to gathering
sensible information (like ssh/ftp password length) that wouldn't be
available otherwise.
Holding task->signal->cred_guard_mutex while gathering the io
information should protect against the race.
The order of locking is similar to the one inside of ptrace_attach():
first goes cred_guard_mutex, then lock_task_sighand().
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the return value to ENOENT. This return value is then returned
when opening the proc entry that have been removed. For example,
open("/proc/bus/pci/XX/YY") when the corresponding device is being
hot-removed.
Signed-off-by: Daisuke Ogino <ogino.daisuke@jp.fujitsu.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_coredump() assumes that if format_corename() fails it should return
-ENOMEM. This is not true, for example cn_print_exe_file() can propagate
the error from d_path. Even if it was true, this is too fragile. Change
the code to check "ispipe < 0".
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Reviewed-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change every occurence of / in comm and hostname to !. If the process
changes its name to contain /, the core is not dumped (if the directory
tree doesn't exist like that). The same with hostname being something
like myhost/3. Fix this behaviour by using the escape loop used in %E.
(We extract it to a separate function.)
Now both with comm == myprocess/1 and hostname == myhost/1, the core is
dumped like (kernel.core_pattern='core.%p.%e.%h):
core.2349.myprocess!1.myhost!1
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we don't know the file corresponding to the binary (i.e. exe_file is
unknown), use "task->comm (path unknown)" instead of simple "(unknown)"
as suggested by ak.
The fallback is the same as %e except it will append "(path unknown)".
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
ceph: document unlocked d_parent accesses
ceph: explicitly reference rename old_dentry parent dir in request
ceph: document locking for ceph_set_dentry_offset
ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug
ceph: protect d_parent access in ceph_d_revalidate
ceph: protect access to d_parent
ceph: handle racing calls to ceph_init_dentry
ceph: set dir complete frag after adding capability
rbd: set blk_queue request sizes to object size
ceph: set up readahead size when rsize is not passed
rbd: cancel watch request when releasing the device
ceph: ignore lease mask
ceph: fix ceph_lookup_open intent usage
ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC
ceph: fix bad parent_inode calc in ceph_lookup_open
ceph: avoid carrying Fw cap during write into page cache
libceph: don't time out osd requests that haven't been received
ceph: report f_bfree based on kb_avail rather than diffing.
ceph: only queue capsnap if caps are dirty
ceph: fix snap writeback when racing with writes
...
The kludge in question is undocumented and doesn't work for 32bit
binaries on amd64, sparc64 and s390. Passing (mode_t)-1 as
mode had (since 0.99.14v and contrary to behaviour of any
other Unix, prescriptions of POSIX, SuS and our own manpages)
was kinda-sorta no-op. Note that any software relying on
that (and looking for examples shows none) would be visibly
broken on sparc64, where practically all userland is built
32bit. No such complaints noticed...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
jbd: change the field "b_cow_tid" of struct journal_head from type unsigned to tid_t
ext3.txt: update the links in the section "useful links" to the latest ones
ext3: Fix data corruption in inodes with journalled data
ext2: check xattr name_len before acquiring xattr_sem in ext2_xattr_get
ext3: Fix compilation with -DDX_DEBUG
quota: Remove unused declaration
jbd: Use WRITE_SYNC in journal checkpoint.
jbd: Fix oops in journal_remove_journal_head()
ext3: Return -EINVAL when start is beyond the end of fs in ext3_trim_fs()
ext3/ioctl.c: silence sparse warnings about different address spaces
ext3/ext4 Documentation: remove bh/nobh since it has been deprecated
ext3: Improve truncate error handling
ext3: use proper little-endian bitops
ext2: include fs.h into ext2_fs.h
ext3: Fix oops in ext3_try_to_allocate_with_rsv()
jbd: fix a bug of leaking jh->b_jcount
jbd: remove dependency on __GFP_NOFAIL
ext3: Convert ext3 to new truncate calling convention
jbd: Add fixed tracepoints
ext3: Add fixed tracepoints
Resolve conflicts in fs/ext3/fsync.c due to fsync locking push-down and
new fixed tracepoints.
For the most part we don't care about racing with rename when directing
MDS requests; either the old or new parent is fine. Document that, and
do some minor cleanup.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
We carry a pin on the parent directory for the rename source and dest
dentries. For the source it's r_locked_dir; we need to explicitly
reference the old_dentry parent as well, since the dentry's d_parent may
change between when the request was created and pinned and when it is
freed.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Have caller pass in a safely-obtained reference to the parent directory
for calculating a dentry's hash valud.
While we're here, simpify the flow through ceph_encode_fh() so that there
is a single exit point and cleanup.
Also fix a bug with the dentry hash calculation: calculate the hash for the
dentry we were given, not its parent.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Protect d_parent with d_lock. Carry a reference. Simplify the flow so
that there is a single exit point and cleanup.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
d_parent is protected by d_lock: use it when looking up a dentry's parent
directory inode. Also take a reference and drop it in the caller to avoid
a use-after-free.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
The ->lookup() and prepopulate_readdir() callers are working with unhashed
dentries, so we don't have to worry. The export.c callers, though, need
to initialize something they got back from d_obtain_alias() and are
potentially racing with other callers. Make sure we don't return unless
the dentry is properly initialized (by us or someone else).
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Curretly ceph_add_cap clears the complete bit if we are newly issued the
FILE_SHARED cap, which is normally the case for a newly issue cap on a new
directory. That means we clear the just-set bit. Move the check that sets
the flag to after the cap is added/updated.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
This should improve the default read performance, as without it
readahead is practically disabled.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
The lease mask is no longer used (and it changed a while back). Instead,
use a non-zero duration to indicate that there is a lease being issued.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
We weren't properly calling lookup_instantiate_filp when setting up the
lookup intent, which could lead to file leakage on errors. So:
- use separate helper for the hidden snapdir translation, immediately
following the mds request
- use ceph_finish_lookup for the final dentry/return value dance in the
exit path
- lookup_instantiate_filp on success
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
We only need to put these on the directory unsafe list if they have
side effects that fsync(2) should flush out.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
We were always getting NULL here because the intent file f_dentry is always
NULL at this point, which means we were always passing NULL to
ceph_mdsc_do_request. In reality, this was fine, since this isn't
currently ever a write operation that needs to get strung on the dir's
unsafe list.
Use the dir explicitly, and only pass it if this open has side-effects that
a dir fsync should flush.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
The generic_file_aio_write call may block on balance_dirty_pages while we
flush data to the OSDs. If we hold a reference to the FILE_WR cap during
that interval revocation by the MDS (e.g., to do a stat(2)) may be very
slow.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
We used to go into this branch if i_wrbuffer_ref_head was non-zero. This
was an ancient check from before we were careful about dealing with all
kinds of caps (and not just dirty pages). It is cleaner to only queue a
capsnap if there is an actual dirty cap. If we are racing with...
something...we will end up here with ci->i_wrbuffer_refs but no dirty
caps.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
There are two problems that come up when we try to queue a capsnap while a
write is in progress:
- The FILE_WR cap is held, but not yet dirty, so we may queue a capsnap
with dirty == 0. That will crash later in __ceph_flush_snaps(). Or
on the FILE_WR cap if a write is in progress.
- We may not have i_head_snapc set, which causes problems pretty quickly.
Look to the snaprealm in this case.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
This allows us to force IO through the sync path which you normally only
get when multiple clients are reading/writing to the same file or by
mounting with -o sync. Among other things, this lets test programs verify
correctness with a single mount.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: Cleanup: check return codes of crypto api calls
CIFS: Fix oops while mounting with prefixpath
[CIFS] Redundant null check after dereference
cifs: use cifs_dirent in cifs_save_resume_key
cifs: use cifs_dirent to replace cifs_get_name_from_search_buf
cifs: introduce cifs_dirent
cifs: cleanup cifs_filldir
Adding a comment to d_materialise_unique per Al's request...
d_move and __d_move have some pretty substantial locking requirements,
but they are not clearly documented. Add some comments spelling them
out. Also, document the requirement for the i_mutex of the parent in
d_materialise_unique.
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
granted, on a filesystem that has only regular files and directories
it happens to work, but really should be S_ISDIR(mode)...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Workloads using pipes and sockets hit inode_sb_list_lock contention.
superblock s_inodes list is needed for quota, dirty, pagecache and
fsnotify management. pipe/anon/socket fs are clearly not candidates for
these.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A recent change in linux-next, 982d816581 "fs: add SEEK_HOLE and
SEEK_DATA flags" added some direct returns on error, but it should
have been a goto out.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When journalling data for an inode (either because it is a symlink or
because the filesystem is mounted in data=journal mode), ext4_evict_inode()
can discard unwritten data by calling truncate_inode_pages(). This is
because we don't mark the buffer / page dirty when journalling data but only
add the buffer to the running transaction and thus mm does not know there
are still unwritten data.
Fix the problem by carefully tracking transaction containing inode's data,
committing this transaction, and writing uncheckpointed buffers when inode
should be reaped.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Depending upon the order of userspace/kernel during the
mount process, this can result in a hang without the
_all version of the completion.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Commit 4e34e719e4 ("fs: take the ACL checks to common code") removed
the use of the 'acl' variable in v9fs_iop_get_acl(), but left the
variable definition around. Remove it to get rid of the warning:
fs/9p/acl.c: In function ‘v9fs_iop_get_acl’:
fs/9p/acl.c:101:20: warning: unused variable ‘acl’
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
Squashfs: Make ZLIB compression support optional
Squashfs: Update documentation for XZ and add squashfs-tools devel tree
* 'for-3.1' of git://linux-nfs.org/~bfields/linux:
nfsd: don't break lease on CLAIM_DELEGATE_CUR
locks: rename lock-manager ops
nfsd4: update nfsv4.1 implementation notes
nfsd: turn on reply cache for NFSv4
nfsd4: call nfsd4_release_compoundargs from pc_release
nfsd41: Deny new lock before RECLAIM_COMPLETE done
fs: locks: remove init_once
nfsd41: check the size of request
nfsd41: error out when client sets maxreq_sz or maxresp_sz too small
nfsd4: fix file leak on open_downgrade
nfsd4: remember to put RW access on stateid destruction
NFSD: Added TEST_STATEID operation
NFSD: added FREE_STATEID operation
svcrpc: fix list-corrupting race on nfsd shutdown
rpc: allow autoloading of gss mechanisms
svcauth_unix.c: quiet sparse noise
svcsock.c: include sunrpc.h to quiet sparse noise
nfsd: Remove deprecated nfsctl system call and related code.
NFSD: allow OP_DESTROY_CLIENTID to be only op in COMPOUND
Fix up trivial conflicts in Documentation/feature-removal-schedule.txt
Commit e77819e57f ("vfs: move ACL cache lookup into generic code")
didn't take the FS_POSIX_ACL config variable into account - when that is
not set, ACL's go away, and the cache helper functions do not exist,
causing compile errors like
fs/namei.c: In function 'check_acl':
fs/namei.c:191:10: error: implicit declaration of function 'negative_cached_acl'
fs/namei.c:196:2: error: implicit declaration of function 'get_cached_acl'
fs/namei.c:196:6: warning: assignment makes pointer from integer without a cast
fs/namei.c:212:11: error: implicit declaration of function 'set_cached_acl'
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge akpm patch series: (122 commits)
drivers/connector/cn_proc.c: remove unused local
Documentation/SubmitChecklist: add RCU debug config options
reiserfs: use hweight_long()
reiserfs: use proper little-endian bitops
pnpacpi: register disabled resources
drivers/rtc/rtc-tegra.c: properly initialize spinlock
drivers/rtc/rtc-twl.c: check return value of twl_rtc_write_u8() in twl_rtc_set_time()
drivers/rtc: add support for Qualcomm PMIC8xxx RTC
drivers/rtc/rtc-s3c.c: support clock gating
drivers/rtc/rtc-mpc5121.c: add support for RTC on MPC5200
init: skip calibration delay if previously done
misc/eeprom: add eeprom access driver for digsy_mtc board
misc/eeprom: add driver for microwire 93xx46 EEPROMs
checkpatch.pl: update $logFunctions
checkpatch: make utf-8 test --strict
checkpatch.pl: add ability to ignore various messages
checkpatch: add a "prefer __aligned" check
checkpatch: validate signature styles and To: and Cc: lines
checkpatch: add __rcu as a sparse modifier
checkpatch: suggest using min_t or max_t
...
Did this as a merge because of (trivial) conflicts in
- Documentation/feature-removal-schedule.txt
- arch/xtensa/include/asm/uaccess.h
that were just easier to fix up in the merge than in the patch series.
Use hweight_long() to count free bits in the bitmap.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using __test_and_{set,clear}_bit_le() with ignoring its return value can
be replaced with __{set,clear}_bit_le().
This introduces reiserfs_{set,clear}_le_bit for __{set,clear}_bit_le and
does the above change with them.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Copy __generic_file_splice_read() and generic_file_splice_read() from
fs/splice.c to shmem_file_splice_read() in mm/shmem.c. Make
page_cache_pipe_buf_ops and spd_release_page() accessible to it.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/pid/oom_adj is deprecated and scheduled for removal in August 2012
according to Documentation/feature-removal-schedule.txt.
This patch makes the warning more verbose by making it appear as a more
serious problem (the presence of a stack trace and being multiline should
attract more attention) so that applications still using the old interface
can get fixed.
Very popular users of the old interface have been converted since the oom
killer rewrite has been introduced. udevd switched to the
/proc/pid/oom_score_adj interface for v162, kde switched in 4.6.1, and
opensshd switched in 5.7p1.
At the start of 2012, this should be changed into a WARN() to emit all
such incidents and then finally remove the tunable in August 2012 as
scheduled.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This:
vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT)
is incorrect on 32-bit. It causes us to & the pgoff with something that
looks like this (for a 4m hugepage): 0xfff003ff. The mask should be
flipped and *then* shifted, to give you 0x0000_03fff.
Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The "fsuid is the inode owner" case is not necessarily always the likely
case, but it's the case that doesn't do anything odd and that we want in
straight-line code. Make gcc not generate random "jump around for the
fun of it" code.
This just helps me read profiles. That thing is one of the hottest
parts of the whole pathname lookup.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Check return codes of crypto api calls and either log an error or log
an error and return from the calling function with error.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
commit fec11dd9a0 caused
a regression when we have already mounted //server/share/a
and want to mount //server/share/a/b.
The problem is that lookup_one_len calls __lookup_hash
with nd pointer as NULL. Then __lookup_hash calls
do_revalidate in the case when dentry exists and we end
up with NULL pointer deference in cifs_d_revalidate:
if (nd->flags & LOOKUP_RCU)
return -ECHILD;
Fix this by checking nd for NULL.
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
ocfs2 implements its own llseek() to provide the SEEK_HOLE/SEEK_DATA
functionality.
SEEK_HOLE sets the file pointer to the start of either a hole or an unwritten
(preallocated) extent, that is greater than or equal to the supplied offset.
SEEK_DATA sets the file pointer to the start of an allocated extent (not
unwritten) that is greater than or equal to the supplied offset.
If the supplied offset is on a desired region, then the file pointer is set
to it. Offsets greater than or equal to the file size return -ENXIO.
Unwritten (preallocated) extents are considered holes because the file system
treats reads to such regions in the same way as it does to holes.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
This allows us to parse the on the wire structures only once in
cifs_filldir.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Introduce a generic directory entry structure, and factor the parsing
of the various on the wire structures that can represent one into
a common helper. Switch cifs_entry_is_dot over to use it as a start.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
In addition to properly handling allocation failure from btrfs_alloc_path, I
also fixed up the kzalloc error handling code immediately below it.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
I also removed the BUG_ON from error return of find_next_chunk in
init_first_rw_device(). It turns out that the only caller of
init_first_rw_device() also BUGS on any nonzero return so no actual behavior
change has occurred here.
do_chunk_alloc() also needed an update since it calls btrfs_alloc_chunk()
which can now return -ENOMEM. Instead of setting space_info->full on any
error from btrfs_alloc_chunk() I catch and return every error value _except_
-ENOSPC. Thanks goes to Tsutomu Itoh for pointing that issue out.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Use sensible variable names and formatting and remove some superflous
checks on entry.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
fs: Merge split strings
treewide: fix potentially dangerous trailing ';' in #defined values/expressions
uwb: Fix misspelling of neighbourhood in comment
net, netfilter: Remove redundant goto in ebt_ulog_packet
trivial: don't touch files that are removed in the staging tree
lib/vsprintf: replace link to Draft by final RFC number
doc: Kconfig: `to be' -> `be'
doc: Kconfig: Typo: square -> squared
doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
drivers/net: static should be at beginning of declaration
drivers/media: static should be at beginning of declaration
drivers/i2c: static should be at beginning of declaration
XTENSA: static should be at beginning of declaration
SH: static should be at beginning of declaration
MIPS: static should be at beginning of declaration
ARM: static should be at beginning of declaration
rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
Update my e-mail address
PCIe ASPM: forcedly -> forcibly
gma500: push through device driver tree
...
Fix up trivial conflicts:
- arch/arm/mach-ep93xx/dma-m2p.c (deleted)
- drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
- drivers/net/r8169.c (just context changes)
Remove the definition and usages of the macro XFS_BUFTARG_NAME.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove the definition and usages of the macro XFS_BUF_TARGET
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove the macro XFS_BUF_SET_TARGET.
hch: As all the buffer allocator already set ->b_target it should be safe
to simply remove these calls.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Replace the macro XFS_BUF_ISPINNED with an inline helper function
xfs_buf_ispinned, and change all its usages.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove the definition and usages of the macro XFS_BUF_SET_PTR.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove the definition and usages of the macro XFS_BUF_PTR.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove the definition and usage of the macro XFS_BUF_SET_START.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove the definition and usage of the macro XFS_BUF_HOLD
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove the definitions and uses of the macros XFS_BUF_BUSY,
XFS_BUF_UNBUSY, and XFS_BUF_ISBUSY.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove the definitions and usage of the macros XFS_BUF_ERROR,
XFS_BUF_GETERROR and XFS_BUF_ISERROR.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove the definition of the macro XFS_BUF_BFLAGS and its usage.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
fs: take the ACL checks to common code
bury posix_acl_..._masq() variants
kill boilerplates around posix_acl_create_masq()
generic_acl: no need to clone acl just to push it to set_cached_acl()
kill boilerplate around posix_acl_chmod_masq()
reiserfs: cache negative ACLs for v1 stat format
xfs: cache negative ACLs if there is no attribute fork
9p: do no return 0 from ->check_acl without actually checking
vfs: move ACL cache lookup into generic code
CIFS: Fix oops while mounting with prefixpath
xfs: Fix wrong return value of xfs_file_aio_write
fix devtmpfs race
caam: don't pass bogus S_IFCHR to debugfs_create_...()
get rid of create_proc_entry() abuses - proc_mkdir() is there for purpose
asus-wmi: ->is_visible() can't return negative
fix jffs2 ACLs on big-endian with 16bit mode_t
9p: close ACL leaks
ocfs2_init_acl(): fix a leak
VFS : mount lock scalability for internal mounts
nfs_mark_return_delegation() is usually called without any locking, and
so it is not safe to dereference delegation->inode. Since the inode is
only used to discover the nfs_client anyway, it makes more sense to
have the callers pass a valid pointer to the nfs_server as a parameter.
Reported-by: Ian Kent <raven@themaw.net>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the task that initiated the sillyrename ends up being killed by a
fatal signal, then it will eventually return back to userspace and end
up releasing the i_mutex. d_move however needs to be done while holding
the i_mutex.
Instead of using d_move here, just unhash the old and new dentries to
prevent them from being found by lookups. With this change though, the
dentries are now incorrect post-rename and do not reflect the actual
name of the file on the server. I'm proceeding under the assumption
that since they are unhashed that this isn't really a problem.
In order for the sillydelete to still work though, the dname must be
copied earlier when setting up the sillydelete info, and the name must
be recopied if the sillydelete info has to be moved to a new dentry.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Replace the ->check_acl method with a ->get_acl method that simply reads an
ACL from disk after having a cache miss. This means we can replace the ACL
checking boilerplate code with a single implementation in namei.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
made static; no callers left outside of posix_acl.c. posix_acl_clone() also
has lost all external callers and became static...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new helper: posix_acl_create(&acl, gfp, mode_p). Replaces acl with
modified clone, on failure releases acl and replaces with NULL.
Returns 0 or -ve on error. All callers of posix_acl_create_masq()
switched.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new helper: posix_acl_chmod(&acl, gfp, mode). Replaces acl with modified
clone or with NULL if that has failed; returns 0 or -ve on error. All
callers of posix_acl_chmod_masq() switched to that - they'd been doing
exactly the same thing.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Always set up a negative ACL cache entry if the inode can't have ACLs.
That behaves much better than doing this check inside ->check_acl.
Also remove the left over MAY_NOT_BLOCK check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Always set up a negative ACL cache entry if the inode doesn't have an
attribute fork. That behaves much better than doing this check inside
->check_acl.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If we do not want to use ACLs we at least need to perform normal Unix
permission checks. From the comment I'm not quite sure that's what
is intended, but if 0p wants to do permission checks entirely on the
server it needs to do so in ->permission, not in ->check_acl.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This moves logic for checking the cached ACL values from low-level
filesystems into generic code. The end result is a streamlined ACL
check that doesn't need to load the inode->i_op->check_acl pointer at
all for the common cached case.
The filesystems also don't need to check for a non-blocking RCU walk
case in their acl_check() functions, because that is all handled at a
VFS layer.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
commit fec11dd9a0 caused
a regression when we have already mounted //server/share/a
and want to mount //server/share/a/b.
The problem is that lookup_one_len calls __lookup_hash
with nd pointer as NULL. Then __lookup_hash calls
do_revalidate in the case when dentry exists and we end
up with NULL pointer deference in cifs_d_revalidate:
if (nd->flags & LOOKUP_RCU)
return -ECHILD;
Fix this by checking nd for NULL.
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The fsync prototype change commit 02c24a8218 accidentally overwrote
the ssize_t return value of xfs_file_aio_write with 0 for SYNC type
writes. Fix this by checking if an error occured when calling
xfs_file_fsync and only change the return value in this case.
In addition xfs_file_fsync actually returns a normal negative error, so
fix this, too.
Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
block: strict rq_affinity
backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
block: fix patch import error in max_discard_sectors check
block: reorder request_queue to remove 64 bit alignment padding
CFQ: add think time check for group
CFQ: add think time check for service tree
CFQ: move think time check variables to a separate struct
fixlet: Remove fs_excl from struct task.
cfq: Remove special treatment for metadata rqs.
block: document blk_plug list access
block: avoid building too big plug list
compat_ioctl: fix make headers_check regression
block: eliminate potential for infinite loop in blkdev_issue_discard
compat_ioctl: fix warning caused by qemu
block: flush MEDIA_CHANGE from drivers on close(2)
blk-throttle: Make total_nr_queued unsigned
block: Add __attribute__((format(printf...) and fix fallout
fs/partitions/check.c: make local symbols static
block:remove some spare spaces in genhd.c
block:fix the comment error in blkdev.h
...
This patch address two shortcomings in ocfs2_page_mkwrite():
1. Makes the function return better VM_FAULT_* errors.
2. It handles a error that is triggered when a page is dropped from the mapping
due to memory pressure. This patch locks the page to prevent that.
[Patch was cleaned up by Sunil Mushran.]
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
The cluster up check only checks to see if the node is heartbeating or not.
If yes it continues assuming that the node is connected to all the nodes. But
if that is not the case, the cluster join aborts with a stack of errors that
are not easy to comprehend.
This patch adds the network connect check upfront and prints the nodes that
the node is not yet connected to, before aborting.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Patch adds function o2net_fill_node_map() to return the bitmap of nodes that
it is connected to. This bitmap is also accessible by the user via the debugfs
file, /sys/kernel/debug/o2net/connected_nodes.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
The o2hb debugfs file, elapsed_time_in_ms, should return values only after the
timer is armed atleast once.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
In dlmlock_remote(), we wait for the resource to stop being active before
setting the inprogress flag. Active includes recovery, migration, etc.
The problem here is that if the resource was being recovered or migrated, the
new owner could very well be that node itself (and thus not a remote node).
This problem was observed in Oracle bug#12583620. The error messages observed
were as follows:
dlm_send_remote_lock_request:337 ERROR: Error -40 (ELOOP) when sending message 503 (key 0xd6d8c7) to node 2
dlmlock_remote:271 ERROR: dlm status = DLM_BADARGS
dlmlock:751 ERROR: dlm status = DLM_BADARGS
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
The inflight reference count, in the lock resource, is taken to pin the resource
in memory. We take it when a new resource is created and release it after a
lock is attached to it. We do this to prevent the resource from getting purged
prematurely.
Earlier this reference count was being taken for locally mastered resources
only. This patch extends the same functionality for remotely mastered ones.
We are doing this because the same premature purging could occur for remotely
mastered resources if the remote node were to die before completion of the
create lock.
Fix for Oracle bug#12405575.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Currently if the heartbeat device is hard-ro, the o2hb thread keeps chugging
along and dumping errors along the way. The user needs to manually stop the
heartbeat.
The patch addresses this shortcoming by adding a limit to the number of times
the hb thread will iterate in an unsteady state. If the hb thread does not
ready steady state in that many interation, the start is aborted.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
casting int * to mode_t * is not a good thing - on a *lot* of big-endian
architectures mode_t happens to be smaller than int and there it breaks
quite spectaculary...
Fucked-up-by: commit cfc8dc6f6f
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For a number of file systems that don't have a mount point (e.g. sockfs
and pipefs), they are not marked as long term. Therefore in
mntput_no_expire, all locks in vfs_mount lock are taken instead of just
local cpu's lock to aggregate reference counts when we release
reference to file objects. In fact, only local lock need to have been
taken to update ref counts as these file systems are in no danger of
going away until we are ready to unregister them.
The attached patch marks file systems using kern_mount without
mount point as long term. The contentions of vfs_mount lock
is now eliminated. Before un-registering such file system,
kern_unmount should be called to remove the long term flag and
make the mount point ready to be freed.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix a system hang bug introduced by commit b7a2441f99 ("writeback:
remove writeback_control.more_io") and e8dfc3058 ("writeback: elevate
queue_io() into wb_writeback()") easily reproducible with high memory
pressure and lots of file creation/deletions, for example, a kernel
build in limited memory.
It hangs when some inode is in the I_NEW, I_FREEING or I_WILL_FREE
state, the flusher will get stuck busy retrying that inode, never
releasing wb->list_lock. The lock in turn blocks all kinds of other
tasks when they are trying to grab it.
As put by Jan, it's a safe change regarding data integrity. I_FREEING or
I_WILL_FREE inodes are written back by iput_final() and it is reclaim
code that is responsible for eventually removing them. So writeback code
can safely ignore them. I_NEW inodes should move out of this state when
they are fully set up and in the writeback round following that, we will
consider them for writeback. So the change makes sense.
CC: Jan Kara <jack@suse.cz>
Reported-by: Hugh Dickins <hughd@google.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
The debug message in ext4_ext_insert_extent before moving extent
is incorrect (the "from xx to xx").
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The argument "inode" in function ext4_ext_next_allocated_block looks useless,
so clean it.
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ac_repeats isn't referenced in the mballoc code. So remove it.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In ext4_mb_release, we use s_mb_buddies_generated++. Although
the output is OK, but I don't think we need this extra ++.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_mb_load_buddy() calls ext4_get_group_info() for setting both
"grp" and "e4b->bd_info", but it could do "e4b->bd_info = grp".
Reported-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
CLAIM_DELEGATE_CUR is used in response to a broken lease; allowing it
to break the lease and return EAGAIN leaves the client unable to make
progress in returning the delegation
nfs4_get_vfs_file() now takes struct nfsd4_open for access to the
claim type, and calls nfsd_open() with NFSD_MAY_NOT_BREAK_LEASE when
claim type is CLAIM_DELEGATE_CUR
Signed-off-by: Casey Bodley <cbodley@citi.umich.edu>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
unlinkat - Remove a directory entry
size[4] Tunlinkat tag[2] dirfid[4] name[s] flag[4]
size[4] Runlinkat tag[2]
older Tremove have the below request format
size[4] Tremove tag[2] fid[4]
The remove message is used to remove a directory entry either file or directory
The remove opreation is actually a directory opertation and should ideally have
dirfid, if not we cannot represent the fid on server with anything other than
name. We will have to derive the directory name from fid in the Tremove request.
NOTE: The operation doesn't clunk the unlink fid.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
renameat - change name of file or directory
size[4] Trenameat tag[2] olddirfid[4] oldname[s] newdirfid[4] newname[s]
size[4] Rrenameat tag[2]
older Trename have the below request format
size[4] Trename tag[2] fid[4] newdirfid[4] name[s]
The rename message is used to change the name of a file, possibly moving it
to a new directory. The rename opreation is actually a directory opertation
and should ideally have olddirfid, if not we cannot represent the fid on server
with anything other than name. We will have to derive the old directory name
from fid in the Trename request.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
This make sure we don't end up reusing the unlinked inode object.
The ideal way is to use inode i_generation. But i_generation is
not available in userspace always.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Without this fix, if any invalid mount options/args are passed while mouting
the 9p fs, no error (-EINVAL) is returned and default arg value is assigned.
This fix returns -EINVAL when an invalid arguement is found while parsing
mount options.
Signed-off-by: Prem Karat <prem.karat@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
This make sure we don't use wrong inode from the inode hash. The inode number
of the file deleted is reused by the next file system object created
and if we only use inode number for inode hash lookup we could end up
with wrong struct inode.
Also compare inode generation number. Not all Linux file system provide
st_gen in userspace. So it could be 0;
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Now that VFS does the right thing remove the work around.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits)
vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp
isofs: Remove global fs lock
jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory
fix IN_DELETE_SELF on overwriting rename() on ramfs et.al.
mm/truncate.c: fix build for CONFIG_BLOCK not enabled
fs:update the NOTE of the file_operations structure
Remove dead code in dget_parent()
AFS: Fix silly characters in a comment
switch d_add_ci() to d_splice_alias() in "found negative" case as well
simplify gfs2_lookup()
jfs_lookup(): don't bother with . or ..
get rid of useless dget_parent() in btrfs rename() and link()
get rid of useless dget_parent() in fs/btrfs/ioctl.c
fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
drivers: fix up various ->llseek() implementations
fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek
Ext4: handle SEEK_HOLE/SEEK_DATA generically
Btrfs: implement our own ->llseek
fs: add SEEK_HOLE and SEEK_DATA flags
reiserfs: make reiserfs default to barrier=flush
...
Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new
shrinker callout for the inode cache, that clashed with the xfs code to
start the periodic workers later.
When journalling data for an inode (either because it is a symlink or
because the filesystem is mounted in data=journal mode), ext3_evict_inode()
can discard unwritten data by calling truncate_inode_pages(). This is
because we don't mark the buffer / page dirty when journalling data but only
add the buffer to the running transaction and thus mm does not know there
are still unwritten data.
Fix the problem by carefully tracking transaction containing inode's data,
committing this transaction, and writing uncheckpointed buffers when inode
should be reaped.
Signed-off-by: Jan Kara <jack@suse.cz>
sbi->s_mutex isn't needed for isofs at all so we can just remove it. Generally,
since isofs is always mounted read-only, filesystem structure cannot change
under us. So buffer_head contents stays constant after it's filled in. That
leaves us with possible changes of global data structures. Superblock changes
only during filesystem mount (even remount does not change it), inodes are only
filled in during reading from disk. So there are no changes of these structures
to bother about.
Arguments why sbi->s_mutex can be removed at each place:
isofs_readdir: Accesses sb, inode, filp, local variables => s_mutex not needed
isofs_lookup: Protected by directory's i_mutex. Accesses sb, inode, dentry,
local variables => s_mutex not needed
rock_ridge_symlink_readpage: Protected by page lock. Accesses sb, inode,
local variables => s_mutex not needed.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We don't generate IN_DELETE_SELF on victim of overwriting rename() if
it happens to be a directory. Trivially fixed by doing to ->i_nlink
what we do ->pino_nlink a couple of lines later in jffs2_rename().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On ramfs and other simple_rename() users IN_DELETE_SELF is not generated
for victim of overwriting rename() if it's is a directory. Works on
most of the local filesystems and really trivial to fix...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pstore only allows one backend to be registered at present, but the
system may provide several. Add a parameter to allow the user to choose
which backend will be used rather than just relying on load order.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
We'll never have a negative part, so just make this an unsigned int.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
EFI only provides small amounts of individual storage, and conventionally
puts metadata in the storage variable name. Rather than add a metadata
header to the (already limited) variable storage, it's easier for us to
modify pstore to pass all the information we need to construct a unique
variable name to the appropriate functions.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Some pstore implementations may not have a static context, so extend the
API to pass the pstore_info struct to all calls and allow for a context
pointer.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
* 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (39 commits)
ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
ptrace: fix ptrace_signal() && STOP_DEQUEUED interaction
connector: add an event for monitoring process tracers
ptrace: dont send SIGSTOP on auto-attach if PT_SEIZED
ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task()
ptrace_init_task: initialize child->jobctl explicitly
has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/
ptrace: make former thread ID available via PTRACE_GETEVENTMSG after PTRACE_EVENT_EXEC stop
ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/
ptrace: kill real_parent_is_ptracer() in in favor of ptrace_reparented()
ptrace: ptrace_reparented() should check same_thread_group()
redefine thread_group_leader() as exit_signal >= 0
do not change dead_task->exit_signal
kill task_detached()
reparent_leader: check EXIT_DEAD instead of task_detached()
make do_notify_parent() __must_check, update the callers
__ptrace_detach: avoid task_detached(), check do_notify_parent()
kill tracehook_notify_death()
make do_notify_parent() return bool
ptrace: s/tracehook_tracer_task()/ptrace_parent()/
...
* 'for-linus' of git://oss.sgi.com/xfs/xfs: (49 commits)
xfs: add size update tracepoint to IO completion
xfs: convert AIL cursors to use struct list_head
xfs: remove confusing ail cursor wrapper
xfs: use a cursor for bulk AIL insertion
xfs: failure mapping nfs fh to inode should return ESTALE
xfs: Remove the second parameter to xfs_sb_count()
xfs: remove the dead XFS_DABUF_DEBUG code
xfs: remove leftovers of the old btree tracing code
xfs: remove the dead QUOTADEBUG code
xfs: remove the unused xfs_buf_delwri_sort function
xfs: remove wrappers around b_iodone
xfs: remove wrappers around b_fspriv
xfs: add a proper transaction pointer to struct xfs_buf
xfs: factor out xfs_da_grow_inode_int
xfs: factor out xfs_dir2_leaf_find_stale
xfs: cleanup struct xfs_dir2_free
xfs: reshuffle dir2 headers
xfs: start periodic workers later
Revert "xfs: fix filesystsem freeze race in xfs_trans_alloc"
xfs: remove variables that serve no purpose in xfs_alloc_ag_vextent_exact()
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
dlm: don't limit active work items
dlm: use workqueue for callbacks
dlm: remove deadlock debug print
dlm: improve rsb searches
dlm: keep lkbs in idr
dlm: fix kmalloc args
dlm: don't do pointless NULL check, use kzalloc and fix order of arguments
dlm: dump address of unknown node
dlm: use vmalloc for hash tables
dlm: show addresses in configfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus:
hfsplus: ensure bio requests are not smaller than the hardware sectors
hfsplus: Add additional range check to handle on-disk corruptions
hfsplus: Add error propagation for hfsplus_ext_write_extent_locked
hfsplus: add error checking for hfs_find_init()
hfsplus: lift the 2TB size limit
hfsplus: fix overflow in hfsplus_read_wrapper
hfsplus: fix overflow in hfsplus_get_block
hfsplus: assignments inside `if' condition clean-up
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw:
GFS2: combine duplicated block freeing routines
GFS2: Add S_NOSEC support
GFS2: Automatically adjust glock min hold time
GFS2: Cache dir hash table in a contiguous buffer
* 'linux-next' of git://git.infradead.org/ubifs-2.6: (32 commits)
MAINTAINERS: change e-mail of Adrian Hunter
UBIFS: fix master node recovery
UBIFS: improve power cut emulation testing
UBIFS: rename recovery testing variables
UBIFS: remove custom list of superblocks
UBIFS: stop re-defining UBI operations
UBIFS: switch to I/O helpers
UBIFS: switch to ubifs_leb_write
UBIFS: switch to ubifs_leb_read
UBIFS: introduce more I/O helpers
UBIFS: always print stacktrace when switching to R/O mode
UBIFS: remove unused and unneeded debugging function
UBIFS: add global debugfs knobs
UBIFS: introduce debugfs helpers
UBIFS: re-arrange debugging code a bit
UBIFS: be more informative in failure mode
UBIFS: switch self-check knobs to debugfs
UBIFS: lessen amount of debugging check types
UBIFS: introduce helper functions for debugging checks and tests
UBIFS: amend debugging inode size check function prototype
...
In ext2_xattr_get(), the code will acquire xattr_sem first, later checks
the length of xattr name_len > 255. It's unnecessarily time consuming and
also ext2_xattr_set() checks the length before other checks. So move the
check before acquiring xattr_sem to make these two functions consistent.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently all bio requests are 512 bytes, which may fail for media
whose physical sector size is larger than this. Ensure these
requests are not smaller than the block device logical block size.
BugLink: http://bugs.launchpad.net/bugs/734883
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
'recoff' is read from disk and used for an argument to memcpy, so if
the value read from disk is larger than the page size, it result to
"general protection fault". This patch add additional range check for
the value, so that disk fuzz won't cause such fault.
Signed-off-by: Naohiro Aota <naota@elisp.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Test-case:
void *tfunc(void *arg)
{
execvp("true", NULL);
return NULL;
}
int main(void)
{
int pid;
if (fork()) {
pthread_t t;
kill(getpid(), SIGSTOP);
pthread_create(&t, NULL, tfunc, NULL);
for (;;)
pause();
}
pid = getppid();
assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0);
while (wait(NULL) > 0)
ptrace(PTRACE_CONT, pid, 0,0);
return 0;
}
It is racy, exit_notify() does __wake_up_parent() too. But in the
likely case it triggers the problem: de_thread() does release_task()
and the old leader goes away without the notification, the tracer
sleeps in do_wait() without children/tracees.
Change de_thread() to do __wake_up_parent(traced_leader->parent).
Since it is already EXIT_DEAD we can do this without ptrace_unlink(),
EXIT_DEAD threads do not exist from do_wait's pov.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Squashfs now supports XZ and LZO compression in addition to ZLIB.
As such it no longer makes sense to always include ZLIB support.
In particular embedded systems may only use LZO or XZ compression, and
the ability to exclude ZLIB support will reduce kernel size.
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
It seems to hurt performance in real life. Yes, the inode will be used
later, but the conditional doesn't seem to predict all that well
(negative dentries are not uncommon) and it looks like the cost of
prefetching is simply higher than depending on the cache doing the right
thing.
As usual.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The compiler, at least for ix86 and m68k, validly warns that the
comparison:
next <= (loff_t)-1
is always true (and it's always true also for x86-64 and probably all
other arches - as long as pgoff_t isn't wider than loff_t). The
intention appears to be to avoid wrapping of "next", so rather than
eliminating the pointless comparison, fix the loop to indeed get exited
when "next" would otherwise wrap.
On m68k the following warning is observed:
fs/fscache/page.c: In function '__fscache_uncache_all_inode_pages':
fs/fscache/page.c:979: warning: comparison is always false due to limited range of data type
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reported-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Suresh Jayaraman <sjayaraman@suse.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix silly characters in a comment in AFS code (some weird characters replaced
the word 'flag' some point way back).
Reported-by: viro@ZenIV.linux.org.uk
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
both callers there have dentry->d_parent stabilized by the fact that
their caller had obtained dentry from lookup_one_len() and had not
dropped ->i_mutex on parent since then.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2. For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases
we just return -EINVAL, in others we do the normal generic thing, and in others
we're simply making sure that the properly due-dilligence is done. For example
in NFS/CIFS we need to make sure the file size is update properly for the
SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself
that is all we have to do. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Since Ext4 has its own lseek we need to make sure it handles
SEEK_HOLE/SEEK_DATA. For now just do the same thing that is done in the generic
case, somebody else can come along and make it do fancy things later. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In order to handle SEEK_HOLE/SEEK_DATA we need to implement our own llseek.
Basically for the normal SEEK_*'s we will just defer to the generic helper, and
for SEEK_HOLE/SEEK_DATA we will use our fiemap helper to figure out the nearest
hole or data. Currently this helper doesn't check for delalloc bytes for
prealloc space, so for now treat prealloc as data until that is fixed. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags. Turns out
using fiemap in things like cp cause more problems than it solves, so lets try
and give userspace an interface that doesn't suck. We need to match solaris
here, and the definitions are
*o* If /whence/ is SEEK_HOLE, the offset of the start of the
next hole greater than or equal to the supplied offset
is returned. The definition of a hole is provided near
the end of the DESCRIPTION.
*o* If /whence/ is SEEK_DATA, the file pointer is set to the
start of the next non-hole file region greater than or
equal to the supplied offset.
So in the generic case the entire file is data and there is a virtual hole at
the end. That means we will just return i_size for SEEK_HOLE and will return
the same offset for SEEK_DATA. This is how Solaris does it so we have to do it
the same way.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Change the default reiserfs mount option to barrier=flush. Based on a patch
from Jeff Mahoney in the SuSE tree.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch turns on barriers by default for ext3. mount -o barrier=0
will turn them off. Based on a patch from Chris Mason in the SuSE tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Acked-by: Ted Ts'o <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Moving the event counter into the dynamically allocated 'struc seq_file'
allows poll() support without the need to allocate its own tracking
structure.
All current users are switched over to use the new counter.
Requested-by: Andrew Morton akpm@linux-foundation.org
Acked-by: NeilBrown <neilb@suse.de>
Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For filesystems that delay their end_io processing we should keep our
i_dio_count until the the processing is done. Enable this by moving
the inode_dio_done call to the end_io handler if one exist. Note that
the actual move to the workqueue for ext4 and XFS is not done in
this patch yet, but left to the filesystem maintainers. At least
for XFS it's not needed yet either as XFS has an internal equivalent
to i_dio_count.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Simple filesystems always pass inode->i_sb_bdev as the block device
argument, and never need a end_io handler. Let's simply things for
them and for my grepping activity by dropping these arguments. The
only thing not falling into that scheme is ext4, which passes and
end_io handler without needing special flags (yet), but given how
messy the direct I/O code there is use of __blockdev_direct_IO
in one instead of two out of three cases isn't going to make a large
difference anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
This these filesystems to also protect truncate against direct I/O requests
by using common code. Right now the only non-DIO_LOCKING filesystem that
appears to do so is XFS, which uses an opencoded variant of the i_dio_count
scheme.
Behaviour doesn't change for filesystems never calling inode_dio_wait.
For ext4 behaviour changes when using the dioread_nonlock option, which
previously was missing any protection between truncate and direct I/O reads.
For ocfs2 that handcrafted i_dio_count manipulations are replaced with
the common code now enable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Let filesystems handle waiting for direct I/O requests themselves instead
of doing it beforehand. This means filesystem-specific locks to prevent
new dio referenes from appearing can be held. This is important to allow
generalizing i_dio_count to non-DIO_LOCKING filesystems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
i_alloc_sem is a rather special rw_semaphore. It's the last one that may
be released by a non-owner, and it's write side is always mirrored by
real exclusion. It's intended use it to wait for all pending direct I/O
requests to finish before starting a truncate.
Replace it with a hand-grown construct:
- exclusion for truncates is already guaranteed by i_mutex, so it can
simply fall way
- the reader side is replaced by an i_dio_count member in struct inode
that counts the number of pending direct I/O requests. Truncate can't
proceed as long as it's non-zero
- when i_dio_count reaches non-zero we wake up a pending truncate using
wake_up_bit on a new bit in i_flags
- new references to i_dio_count can't appear while we are waiting for
it to read zero because the direct I/O count always needs i_mutex
(or an equivalent like XFS's i_iolock) for starting a new operation.
This scheme is much simpler, and saves the space of a spinlock_t and a
struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
system).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reject zero sized reads as soon as we know our I/O length, and don't
borther with locks or allocations that might have to be cleaned up
otherwise.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rewrite ext4_page_mkwrite() to use __block_page_mkwrite() helper. This
removes the need of using i_alloc_sem to avoid races with truncate which
seems to be the wrong locking order according to lock ordering documented in
mm/rmap.c. Also calling ext4_da_write_begin() as used by the old code seems to
be problematic because we can decide to flush delay-allocated blocks which
will acquire s_umount semaphore - again creating unpleasant lock dependency
if not directly a deadlock.
Also add a check for frozen filesystem so that we don't busyloop in page fault
when the filesystem is frozen.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a new rw_semaphore to protect bmap against truncate. Previous
i_alloc_sem was abused for this, but it's going away in this series.
Note that we can't simply use i_mutex, given that the swapon code
calls ->bmap under it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The flags parameter went away in
d749519b444db985e40b897f73ce1898b11f997e
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Convert the inode reclaim shrinker to use the new per-sb shrinker
operations. This allows much bigger reclaim batches to be used, and
allows the XFS inode cache to be shrunk in proportion with the VFS
dentry and inode caches. This avoids the problem of the VFS caches
being shrunk significantly before the XFS inode cache is shrunk
resulting in imbalances in the caches during reclaim.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that the per-sb shrinker is responsible for shrinking 2 or more
caches, increase the batch size to keep econmies of scale for
shrinking each cache. Increase the shrinker batch size to 1024
objects.
To allow for a large increase in batch size, add a conditional
reschedule to prune_icache_sb() so that we don't hold the LRU spin
lock for too long. This mirrors the behaviour of the
__shrink_dcache_sb(), and allows us to increase the batch size
without needing to worry about problems caused by long lock hold
times.
To ensure that filesystems using the per-sb shrinker callouts don't
cause problems, document that the object freeing method must
reschedule appropriately inside loops.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now we have a per-superblock shrinker implementation, we can add a
filesystem specific callout to it to allow filesystem internal
caches to be shrunk by the superblock shrinker.
Rather than perpetuate the multipurpose shrinker callback API (i.e.
nr_to_scan == 0 meaning "tell me how many objects freeable in the
cache), two operations will be added. The first will return the
number of objects that are freeable, the second is the actual
shrinker call.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that we have per-sb shrinkers with a lifecycle that is a subset
of the superblock lifecycle and can reliably detect a filesystem
being unmounted, there is not longer any race condition for the
iprune_sem to protect against. Hence we can remove it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With context based shrinkers, we can implement a per-superblock
shrinker that shrinks the caches attached to the superblock. We
currently have global shrinkers for the inode and dentry caches that
split up into per-superblock operations via a coarse proportioning
method that does not batch very well. The global shrinkers also
have a dependency - dentries pin inodes - so we have to be very
careful about how we register the global shrinkers so that the
implicit call order is always correct.
With a per-sb shrinker callout, we can encode this dependency
directly into the per-sb shrinker, hence avoiding the need for
strictly ordering shrinker registrations. We also have no need for
any proportioning code for the shrinker subsystem already provides
this functionality across all shrinkers. Allowing the shrinker to
operate on a single superblock at a time means that we do less
superblock list traversals and locking and reclaim should batch more
effectively. This should result in less CPU overhead for reclaim and
potentially faster reclaim of items from each filesystem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Both the filesystem and the lock manager can associate operations with a
lock. Confusingly, one of them (fl_release_private) actually has the
same name in both operation structures.
It would save some confusion to give the lock-manager ops different
names.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
For improving insight into IO completion behaviour.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The list of active AIL cursors uses a roll-your-own linked list with
special casing for the AIL push cursor. Simplify this code by
replacing the list with standard struct list_head lists, and use a
separate list_head to track the active cursors. This allows us to
treat the AIL push cursor as a generic cursor rather than as a
special case, further simplifying the code.
Further, fix the duplicate push cursor initialisation that the
special case handling was hiding, and clean up all the comments
around the active cursor list handling.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
xfs_trans_ail_cursor_set() doesn't set the cursor to the current log
item, it sets it to the next item. There is already a function for
doing this - xfs_trans_ail_cursor_next() - and the _set function is
simply a two line wrapper. Remove it and open code the setting of
the cursor in the two locations that call it to remove the
confusion.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Delayed logging can insert tens of thousands of log items into the
AIL at the same LSN. When the committing of log commit records
occur, we can get insertions occurring at an LSN that is not at the
end of the AIL. If there are thousands of items in the AIL on the
tail LSN, each insertion has to walk the AIL to find the correct
place to insert the new item into the AIL. This can consume large
amounts of CPU time and block other operations from occurring while
the traversals are in progress.
To avoid this repeated walk, use a AIL cursor to record
where we should be inserting the new items into the AIL without
having to repeat the walk. The cursor infrastructure already
provides this functionality for push walks, so is a simple extension
of existing code. While this will not avoid the initial walk, it
will avoid repeating it tens of thousands of times during a single
checkpoint commit.
This version includes logic improvements from Christoph Hellwig.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
On xfs exports, nfsd is incorrectly returning ENOENT instead of
ESTALE on attempts to use a filehandle of a deleted file (spotted
with pynfs test PUTFH3). The ENOENT was coming from xfs_iget.
(It's tempting to wonder whether we should just map all xfs_iget
errors to ESTALE, but I don't believe so--xfs_iget can also return
ENOMEM at least, which we wouldn't want mapped to ESTALE.)
While we're at it, the other return of ENOENT in xfs_nfs_get_inode()
also looks wrong.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove the second parameter to xfs_sb_count() since all callers of
the function set them.
Also, fix the header comment regarding it being called periodically.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Compilation of ext3/namei.c brought up an error and warning messages
when compiled with -DDX_DEBUG.
Signed-off-by: Bernd Schubert<bernd.schubert@itwm.fraunhofer.de>
Signed-off-by: Jan Kara <jack@suse.cz>
The per-sb shrinker has the same requirement as the writeback
threads of ensuring that the superblock is usable and pinned for the
time it takes to run the work. Both need to take a passive reference
to the sb, take a read lock on the s_umount lock and then only
continue if an unmount is not in progress.
pin_sb_for_writeback() does this exactly, so move it to fs/super.c
and rename it to grab_super_passive() and exporting it via
fs/internal.h for all the VFS code to be able to use.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With the inode LRUs moving to per-sb structures, there is no longer
a need for a global inode_lru_lock. The locking can be made more
fine-grained by moving to a per-sb LRU lock, isolating the LRU
operations of different filesytsems completely from each other.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The inode unused list is currently a global LRU. This does not match
the other global filesystem cache - the dentry cache - which uses
per-superblock LRU lists. Hence we have related filesystem object
types using different LRU reclaimation schemes.
To enable a per-superblock filesystem cache shrinker, both of these
caches need to have per-sb unused object LRU lists. Hence this patch
converts the global inode LRU to per-sb LRUs.
The patch only does rudimentary per-sb propotioning in the shrinker
infrastructure, as this gets removed when the per-sb shrinker
callouts are introduced later on.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Before we split up the inode_lru_lock, the unused inode counter
needs to be made independent of the global inode_lru_lock. Convert
it to per-cpu counters to do this.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
d_splice_alias(NULL, dentry) is equivalent to d_add(dentry, NULL), NULL
so no need for that if (inode) ... in there (or ERR_PTR(0), for that
matter)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name).
Allocates dentry, sets its ->d_sb to given superblock and sets
->d_op accordingly. Old d_alloc(NULL, name) callers are converted
to that (all of them know what superblock they want). d_alloc()
itself is left only for parent != NULl case; uses __d_alloc(),
inserts result into the list of parent's children.
Note that now ->d_sb is assign-once and never NULL *and*
->d_parent is never NULL either.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
combination of kern_path_parent() and lookup_create(). Does *not*
expose struct nameidata to caller. Syscalls converted to that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of playing with removal of LOOKUP_OPEN, mangling (and
restoring) nd->path, just pass NULL to vfs_create(). The whole
point of what's being done there is to suppress any attempts
to open file by underlying fs, which is what nd == NULL indicates.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
a) check the right flags in ->create() (LOOKUP_OPEN, not LOOKUP_CREATE)
b) default (!LOOKUP_OPEN) open_flags is O_CREAT|O_EXCL|FMODE_READ, not 0
c) lookup_instantiate_filp() should be done only with LOOKUP_OPEN;
otherwise we need to issue CLOSE, lest we leak stateid on server.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and get rid of a bogus typecast, while we are at it; it's not
just that we want a function returning int and not void, but cast
to pointer to function taking void * and returning void would be
(void (*)(void *)) and not (void *)(void *), TYVM...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pass mask instead; kill security_inode_exec_permission() since we can use
security_inode_permission() instead.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new helper: would_dump(bprm, file). Checks if we are allowed to
read the file and if we are not - sets ENFORCE_NODUMP. Exported,
used in places that previously open-coded the same logics.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
capability overrides apply only to the default case; if fs has ->permission()
that does _not_ call generic_permission(), we have no business doing them.
Moreover, if it has ->permission() that does call generic_permission(), we
have no need to recheck capabilities.
Besides, the capability overrides should apply only if we got EACCES from
acl_permission_check(); any other value (-EIO, etc.) should be returned
to caller, capabilities or not capabilities.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Call the given function for all superblocks of given type. Function
gets a superblock (with s_umount locked shared) and (void *) argument
supplied by caller of iterator.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order
for readdir, but unfortunately in the case of anything that stats the files in
order that readdir spits back (like oh say ls) that means we still have to do
the normal lookup of the file, which means looking up our other index and then
looking up the inode. What I want is a way to create dummy dentries when we
find them in readdir so that when ls or anything else subsequently does a
stat(), we already have the location information in the dentry and can go
straight to the inode itself. The lookup stuff just assumes that if it finds a
dentry it is done, it doesn't perform a lookup. So add a DCACHE_NEED_LOOKUP
flag so that the lookup code knows it still needs to run i_op->lookup() on the
parent to get the inode for the dentry. I have tested this with btrfs and I
went from something that looks like this
http://people.redhat.com/jwhiter/ls-noreada.png
To this
http://people.redhat.com/jwhiter/ls-good.png
Thats a savings of 1300 seconds, or 22 minutes. That is a significant savings.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Assume that /sys/kernel/debug/dummy64 is debugfs file created by
debugfs_create_x64().
# cd /sys/kernel/debug
# echo 0x1234567812345678 > dummy64
# cat dummy64
0x0000000012345678
# echo 0x80000000 > dummy64
# cat dummy64
0xffffffff80000000
A value larger than INT_MAX cannot be written to the debugfs file created
by debugfs_create_u64 or debugfs_create_x64 on 32bit machine. Because
simple_attr_write() uses simple_strtol() for the conversion.
To fix this, use simple_strtoll() instead.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
vfs: fix race in rcu lookup of pruned dentry
Fix cifs_get_root()
[ Edited the last commit to get rid of a 'unused variable "seq"'
warning due to Al editing the patch. - Linus ]
Don't update *inode in __follow_mount_rcu() until we'd verified that
there is mountpoint there. Kudos to Hugh Dickins for catching that
one in the first place and eventually figuring out the solution (and
catching a braino in the earlier version of patch).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Allow multiple workqueue items (locks with callbacks) to be
processed concurrently. There should be no reason not to
take advantage of this workqueue feature.
Signed-off-by: David Teigland <teigland@redhat.com>
Add missing ->i_mutex, convert to lookup_one_len() instead of
(broken) open-coded analog, cope with getting something like
a//b as relative pathname. Simplify the hell out of it, while
we are there...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Changing the inode's metadata may require the 'security.evm' extended
attribute to be re-calculated and updated.
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
vfs_getxattr_alloc() and vfs_xattr_cmp() are two new kernel xattr helper
functions. vfs_getxattr_alloc() first allocates memory for the requested
xattr and then retrieves it. vfs_xattr_cmp() compares a given value with
the contents of an extended attribute.
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
This patch changes the security_inode_init_security API by adding a
filesystem specific callback to write security extended attributes.
This change is in preparation for supporting the initialization of
multiple LSM xattrs and the EVM xattr. Initially the callback function
walks an array of xattrs, writing each xattr separately, but could be
optimized to write multiple xattrs at once.
For existing security_inode_init_security() calls, which have not yet
been converted to use the new callback function, such as those in
reiserfs and ocfs2, this patch defines security_old_inode_init_security().
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
It's sort of ridiculous that we've never had a working reply cache for
NFSv4.
On the other hand, we may still not: our current reply cache is likely
not very good, especially in the TCP case (which is the only case that
matters for v4). What we really need here is some serious testing.
Anyway, here's a start.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If eh_entries is equal to (or greater than) eh_max, the operation of
inserting new extent_idx will make number of entries overflow.
So check eh_entries before inserting the new extent_idx.
Although there is no bug case according the code (function
ext4_ext_insert_index is called by ext4_ext_split and ext4_ext_split
is called only if the index block has free space), the right logic
should be "lookup the capacity before insertion".
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch avoids an extraneous lookup of the extent cache
in ext4_ext_map_blocks() when the flag
EXT4_GET_BLOCKS_PUNCH_OUT_EXT is absent.
The existing logic was performing the lookup but not making
use of the result. The patch simply reverses the order of evaluation
in the condition.
Since ext4_ext_in_cache() does not initialize newex on misses, bypassing
its invocation does not introduce any new issue in this regard.
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Eric Gouriou <egouriou@google.com>
... and it's getting it wrong, too - missing ->d_revalidate() calls when
it's dealing with filesystem (procfs) that has non-trivial ->d_revalidate()...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch removes the extra parameter in ext4_ext_remove_space()
which is no longer needed.
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch optimizes the punch hole operation by skipping the
tree walking code that is used by truncate. Since punch hole
is done through map blocks, the path to the extent is already
known in this function, so we do not need to look it up again.
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If the stripe width was set to 1, then this patch will ignore
that stripe width and ext4 will act as if the stripe width
were 0 with respect to optimizing allocations.
Signed-off-by: Dan Ehrenberg <dehrenberg@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Previously, if a stripe width was provided, then it would be used
as the preallocation granularity, with no santiy checking and no
way to override this. Now, mb_prealloc_size defaults to the smallest
multiple of stripe size that is greater than or equal to the old
default mb_prealloc_size, and this can be overridden with the sysfs
interface.
Signed-off-by: Dan Ehrenberg <dehrenberg@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
[CIFS] update cifs to version 1.74
[CIFS] update limit for snprintf in cifs_construct_tcon
cifs: Fix signing failure when server mandates signing for NTLMSSP
deal with d_move() races properly; rename_lock read-retry loop,
rcu_read_lock() held while walking to root, d_lock held over
subtraction from namelen and copying the component to stabilize
->d_name.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Compilation of ext4/namei.c brought up an error and warning messages
when compiled with -DDX_DEBUG
Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Embed the necessary alias into the module rather than waiting for
someone to add it to /etc/modprobe.conf
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Somebody working on this code asked what the deal was with NFSv4, since
this comment notes that it's v2/v3's statelessness that requires
sillyrename. Shouldn't hurt to document the answer.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Before nfs41 client's RECLAIM_COMPLETE done, nfs server should deny any
new locks or opens.
rfc5661:
" Whenever a client establishes a new client ID and before it does
the first non-reclaim operation that obtains a lock, it MUST send a
RECLAIM_COMPLETE with rca_one_fs set to FALSE, even if there are no
locks to reclaim. If non-reclaim locking operations are done before
the RECLAIM_COMPLETE, an NFS4ERR_GRACE error will be returned. "
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
From: Miklos Szeredi <mszeredi@suse.cz>
Remove SLAB initialization entirely, as suggested by Bruce and Linus.
Allocate with __GFP_ZERO instead and only initialize list heads.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Check in SEQUENCE that the request doesn't exceed maxreq_sz for the
given session.
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
According to RFC5661, 18.36.3,
"if the client selects a value for ca_maxresponsesize such that
a replier on a channel could never send a response,the server
SHOULD return NFS4ERR_TOOSMALL in the CREATE_SESSION reply."
So, error out when the client sets a maxreq_sz less than the minimum
possible SEQUENCE request size, or sets a maxresp_sz less than the
minimum possible SEQUENCE reply size.
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Stateid's hold a read reference for a read open, a write reference for a
write open, and an additional one of each for each read+write open. The
latter wasn't getting put on a downgrade, so something like:
open RW
open R
downgrade to R
was resulting in a file leak.
Also fix an imbalance in an error path.
Regression from 7d94784293 "nfsd4: fix
downgrade/lock logic".
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Without this, for example,
open read
open read+write
close
will result in a struct file leak.
Regression from 7d94784293 "nfsd4: fix
downgrade/lock logic".
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This operation is used by the client to check the validity of a list of
stateids.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This operation is used by the client to tell the server to free a
stateid.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
As promised in feature-removal-schedule.txt it is time to
remove the nfsctl system call.
Userspace has perferred to not use this call throughout 2.6 and it has been
excluded in the default configuration since 2.6.36 (9 months ago).
So this patch removes all the code that was being compiled out.
There are still references to sys_nfsctl in various arch systemcall tables
and related code. These should be cleaned out too, probably in the next
merge window.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
DESTROY_CLIENTID MAY be preceded with a SEQUENCE operation as long as
the client ID derived from the session ID of SEQUENCE is not the same
as the client ID to be destroyed. If the client IDs are the same,
then the server MUST return NFS4ERR_CLIENTID_BUSY.
(that's not implemented yet)
If DESTROY_CLIENTID is not prefixed by SEQUENCE, it MUST be the only
operation in the COMPOUND request (otherwise, the server MUST return
NFS4ERR_NOT_ONLY_OP).
This fixes the error return; before, we returned
NFS4ERR_OP_NOT_IN_SESSION; after this patch, we return NFS4ERR_NOTSUPP.
Signed-off-by: Benny Halevy <benny@tonian.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Instead of creating our own kthread (dlm_astd) to deliver
callbacks for all lockspaces, use a per-lockspace workqueue
to deliver the callbacks. This eliminates complications and
slowdowns from many lockspaces sharing the same thread.
Signed-off-by: David Teigland <teigland@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
fix loop checks in d_materialise_unique()
Fix ->d_lock locking order in unlazy_walk()
Change explicit references to CONFIG_NFS_V4_1 to implicit ones
Get rid of the unnecessary defines in backchannel_rqst.c and
bc_svc.c: the Makefile takes care of those dependency.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use nfs_pageio_reset_read_mds and nfs_pageio_reset_write_mds instead of
completely reinitialising the struct nfs_pageio_descriptor.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
...and ensure that we recoalese to take into account differences in
differences in block sizes when falling back to write through the MDS.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
...and ensure that we recoalese to take into account differences in
block sizes when falling back to read through the MDS.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If an attempt to do pNFS fails, and we have to fall back to writing through
the MDS, then we may want to re-coalesce the requests that we already have
since the block size for the MDS read/writes may be different to that of
the DS read/writes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Instead of looking up the rsize and wsize, the routines that generate the
RPC requests should really be using the pg_bsize, since that is what we
use when deciding whether or not to coalesce write requests...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
__gfs2_free_data and __gfs2_free_meta are almost identical, and
can be trivially combined.
[This is as per Eric's original patch minus gfs2_free_data() which had
no callers left and plus the conversion of the bmap.c calls to these
functions. All in all, a nice clean up]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds S_NOSEC support to GFS2. We set/reset the flag either when
a user calls setattr or when we have just regained the glock
from another node. The flag is only set if there are no xattrs
on the inode and there is no suid bit set.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
This patch is a performance improvement for GFS2 in a clustered
environment. It makes the glock hold time self-adjusting.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch adds a cache for the hash table to the directory code
in order to help simplify the way in which the hash table is
accessed. This is intended to be a first step towards introducing
some performance improvements in the directory code.
There are two follow ups that I'm hoping to see fairly shortly. One
is to simplify the hash table reading code now that we always read the
complete hash table, whether we want one entry or all of them. The
other is to introduce readahead on the heads of the hash chains
which are referred to from the table.
The hash table is a maximum of 128k in size, so it is not worth trying
to read it in small chunks.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Both __d_unalias() and __d_materialise_dentry() need loop prevention.
Grab rename_lock in caller, check for loops there...
As a side benefit, we have dentry_lock_for_move() called only under
rename_lock, which seriously reduces deadlock potential of the
execrable "locking order" used for ->d_lock.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dealing with this seems trivial - the only caller of btrfs_balance() is
btrfs_ioctl() which passes the error code directly back to userspace. There
also isn't much state to unwind (if I'm wrong about this point, we can
always safely move the allocation to the top of btrfs_balance() anyway).
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
btrfs_iget() also needed an update so that errors from btrfs_locked_inode()
are caught and bubbled back up.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
I moved the path allocation up a few lines to the top of the function so
that we couldn't get into the state where we've dropped delayed items and
the extent cache but fail due to -ENOMEM.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The two ->process_func call sites in tree-log.c which were ignoring a return
code have also been updated to gracefully exit as well.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:
- Callers of the function catch errors from it already so bubbling the
error up will be handled.
- Callers of the function might BUG_ON any nonzero return code in which
case there is no behavior changed (but we still got to remove a BUG_ON)
The following functions were updated:
btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
GFS2: Resolve inode eviction and ail list interaction bug
GFS2: Fix race during filesystem mount
GFS2: force a log flush when invalidating the rindex glock
This patch contains a few misc fixes which resolve a recently
reported issue. This patch has been a real team effort and has
received a lot of testing.
The first issue is that the ail lock needs to be held over a few
more operations. The lock thats added into gfs2_releasepage() may
possibly be a candidate for replacing with RCU at some future
point, but at this stage we've gone for the obvious fix.
The second issue is that gfs2_write_inode() can end up calling
a glock recursively when called from gfs2_evict_inode() via the
syncing code, so it needs a guard added.
The third issue is that we either need to not truncate the metadata
pages of inodes which have zero link count, but which we cannot
deallocate due to them still being in use by other nodes, or we need
to ensure that those pages have all made it through the journal and
ail lists first. This patch takes the former approach, but the
latter has also been tested and there is nothing to choose between
them performance-wise. So again, we could revise that decision
in the future.
Also, the inode eviction process is now better documented.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Tested-by: Abhijith Das <adas@redhat.com>
Reported-by: Barry J. Marson <bmarson@redhat.com>
Reported-by: David Teigland <teigland@redhat.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
SUNRPC: Fix use of static variable in rpcb_getport_async
NFSv4.1: update nfs4_fattr_bitmap_maxsz
SUNRPC: Fix a race between work-queue and rpc_killall_tasks
pnfs: write: Set mds_offset in the generic layer - it is needed by all LDs
Remove various bits left over from the old kdb-only btree tracing code, but
leave the actual trace point stubs in place to ease adding new event based
btree tracing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Remove the dead hash table test rid which has been rotting away under
QUOTADEBUG, including some code that was compiled for normal debug
builds, but not actually called without QUOTADEBUG, and enable a few
cheap debug checks that were hidden under QUOTADEBUG for normal
debug builds.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Replace the typeless b_fspriv2 and the ugly macros around it with a properly
typed transaction pointer. As a fallout the log buffer state debug checks
are also removed. We could have kept them using casts, but as they do
not have a real purpose we can as well just remove them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
xfs_da_grow_inode and xfs_dir2_grow_inode are mostly duplicate code. Factor
the meat of those two functions into a new common helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Change the bests array to be a proper variable sized entry. This is done
easily as no one relies on the size of the structure. Also change
XFS_DIR2_MAX_FREE_BESTS to an inline function while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Replace the current mess of dir2 headers with just three that have a clear
purpose:
- xfs_dir2_format.h for all format definitions, including the inline helpers
to access our variable size structures
- xfs_dir2_priv.h for all prototypes that are internal to the dir2 code
and not needed by anything outside of the directory code. For this
purpose xfs_da_btree.c, and phase6.c in xfs_repair are considered part
of the directory code.
- xfs_dir2.h for the public interface to the directory code
In addition to the reshuffle I have also update the comments to not only
match the new file structure, but also to describe the directory format
better.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Start the periodic sync workers only after we have finished xfs_mountfs
and thus fully set up the filesystem structures. Without this we can
call into xfs_qm_sync before the quotainfo strucute is set up if the
mount takes unusually long, and probably hit other incomplete states
as well.
Also clean up the xfs_fs_fill_super error path by using consistent
label names, and removing an impossible to reach case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Reviewed-by: Alex Elder <aelder@sgi.com>
Make sure that child is still a child of parent before nested locking
of child->d_lock in unlazy_walk(); otherwise we are risking a violation
of locking order and deadlocks.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
By pre-allocating rsb structs before searching the hash
table, they can be inserted immediately. This avoids
always having to repeat the search when adding the struct
to hash list.
This also adds space to the rsb struct for a max resource
name, so an rsb allocation can be used by any request.
The constant size also allows us to finally use a slab
for the rsb structs.
Signed-off-by: David Teigland <teigland@redhat.com>
In 34c87901e1 "Shrink stack space usage in cifs_construct_tcon" we
change the size of the username name buffer from MAX_USERNAME_SIZE
(256) to 28. This call to snprintf() needs to be updated as well.
Reported by Dan Carpenter.
Reviewed-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When using NTLMSSP authentication mechanism, if server mandates
signing, keep the flags in type 3 messages of the NTLMSSP exchange
same as in type 1 messages (i.e. keep the indicated capabilities same).
Some of the servers such as Samba, expect the flags such as
Negotiate_Key_Exchange in type 3 message of NTLMSSP exchange as well.
Some servers like Windows do not.
https://bugzilla.samba.org/show_bug.cgi?id=8212
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Split them up into two parts: one which sets up the struct nfs_read/write_data,
the other which sets up the actual RPC call or pNFS call.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we're writing back data, and the FLUSH_STABLE flag is set, then we
always want to use NFS_FILE_SYNC, since we're always in a situation where
we're doing page reclaim, and so we want to free up the page as quickly
as possible.
If we're in the FLUSH_COND_STABLE case, then we either want to use another
unstable write (if we have to do a commit anyway) or again, we want to
use NFS_FILE_SYNC because we know that we have no more pages to write
out.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Mark all deviceids established under an expired MDS clientid as invalid.
Stop all new i/o through DS and send through the MDS.
Don't use any new LAYOUTGETs that use the invalid deviceid. Purge all layouts
established under the expired MDS clientid.
Remove the MDS clientid deviceid and data servers reference
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since we take a reference to it, we really ought to pass the a pointer to
the layout header in the arguments instead of assuming that
NFS_I(inode)->layout will forever point to the correct object.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ask for whole file layouts. Until support for layout segments is fully
supported in the file layout code, discard non-whole file layouts.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The fact that the global device id cache holds a reference to the
nfs4_deviceid_node until it is invisible to rcu lookups implies that
we can always assume that the reference count is non-zero in
_find_get_deviceid.
Also clean up nfs4_put_deviceid_node and the removal of the device id
from the cache.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We need to ensure that the layouts are set up before we can decide to
coalesce requests. To do so, we want to further split up the struct
nfs_pageio_descriptor operations into an initialisation callback, a
coalescing test callback, and a 'do i/o' callback.
This patch cleans up the existing callback methods before adding the
'initialisation' callback.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When recovering open files and locks, the stateid should be tested
against the server and freed if it is invalid. This patch adds new
recovery functions for NFS v4.1.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
FREE_STATEID is used to tell the server that we want to free a stateid
that no longer has any locks associated with it. This allows the client
to reclaim locks without encountering edge conditions documented in
section 8.4.3 of RFC 5661.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch adds in the xdr for doing a TEST_STATEID call with a single
stateid. RFC 5661 allows multiple stateids to be tested in a single
call, but only testing one keeps things simpler for now.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the client is using NFS v4.1, then we can use SECINFO_NO_NAME to find
the secflavor for the initial mount. If the server doesn't support
SECINFO_NO_NAME then I fall back on the "guess and check" method used
for v4.0 mounts.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Layouts should be tracked per nfs_server (aka superblock)
instead of per struct nfs_client, which may have multiple FSIDs associated
with it.
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
can be skipped if the "eir_server_scope" from the exchange_id proc differs from
previous calls.
Also, in the future server_scope will be useful for determining whether client
trunking is available
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Don't just use the first addr in the multipath list - instead, loop
over addresses when calling nfs4_set_ds_client() (which calls connect)
until it is successful.
Although this is not real multipath support, it's a quick fix to handle when
an MDS sends a list of addresses for a DS and some of the addr families are
unsupported or misconfigured (like no routable ipv6 addr assigned).
This will attempt all paths to the DS before giving up, instead of immediately
falling back to the MDS.
As before, an error encountered after a successful connect() will cause all
i/o to fall back to the MDS.
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This parses and stores all addresses associated with each data server,
laying the groundwork for supporting multipath to data servers.
- Skips over addresses that cannot be parsed (ie IPv6 addrs if v6 is not
enabled). Only fails if none of the addresses are recognizable
- Currently only uses the first address that parsed cleanly
- Tested against pynfs server (modified to support multipath)
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Handle ipv6 remote addresses from GETDEVICEINFO
- supports netid "tcp" for ipv4 and "tcp6" for ipv6 as rfc 5665 specifies
- added ds_remotestr to avoid having to handle different AFs in every dprintk
- tested against pynfs 4.1 server, submitting ipv6 support patch to pynfs
- tested with IPv6 disabled, it compiles cleanly and relies on rpc_pton to
refuse to accept IPv6 addresses
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
lockd: server returns status 50331648
it's quite hard to understand that number in this message is 3 in big endian
Signed-off-by: Vasily Averin <vvs@sw.ru>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There is a potential race during filesystem mounting which has recently
been reported. It occurs when the userland gfs_controld is able to
process requests fast enough that it tries to use the sysfs interface
before the lock module is properly initialised. This is a pretty
unusual case as normally the lock module initialisation is very quick
compared with gfs_controld.
This patch adds an interruptible completion which is used to ensure that
userland will wait for the initialisation of the lock module to
complete.
There are other potential solutions to this problem, but this is the
quickest at this stage and has been tested both with and without
mount.gfs2 present in the system.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: David Booher <dbooher@adams.net>
Right now, there is nothing that forces the log to get flushed when a node
drops its rindex glock so that another node can grow the filesystem. If the
log doesn't get flushed, GFS2 can corrupt the sd_log_le_rg list in the
following way.
A node puts an rgd on the list in rg_lo_add(), and then the rindex glock is
dropped so the other node can grow the filesystem. When the node reacquires the
rindex glock, that rgd gets deleted in clear_rgrpdi() before ever being
removed from the list by gfs2_log_flush().
This code simply forces a log flush when the rindex glock is invalidated,
solving the problem.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
fs_excl is a poor man's priority inheritance for filesystems to hint to
the block layer that an operation is important. It was never clearly
specified, not widely adopted, and will not prevent starvation in many
cases (like across cgroups).
fs_excl was introduced with the time sliced CFQ IO scheduler, to
indicate when a process held FS exclusive resources and thus needed
a boost.
It doesn't cover all file systems, and it was never fully complete.
Lets kill it.
Signed-off-by: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Attribute IDs assigned in RFC 5661 now require three bitmaps.
Fixes hitting a BUG_ON in xdr_shrink_bufhead when getting ACLs.
Signed-off-by: Andy Adamson <andros@netapp.com>
Cc:stable@kernel.org [2.6.39]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The comment from Al Viro about possible race in the ext4_orphan_add() is
not justified. There is no race possible as we always have either i_mutex
locked, or the inode can not be referenced from outside hence the
J_ASSERS should not be hit from the reason described in comment.
This commit replaces it with notion that we are holding i_mutex so it
should not be possible for i_nlink to be changed while waiting for
s_orphan_lock.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If we meet with an error in ext4_mb_add_groupinfo, we kfree
sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)], but fail to
reset it to NULL. So the caller ext4_mb_init_backend will try to kfree
it again and causes a double free. So fix it by resetting it to NULL.
Some typo in comments of mballoc.c are also changed.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In ext4_groupinfo_create_slab, we create ext4_groupinfo_caches within
ext4_grpinfo_slab_create_mutex, but set it outside the lock, and there
does exist some case that we may create it twice and causes a memory
leak. So set it before we call mutex_unlock.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Optimize ext4_ext_insert_extent() by avoiding
ext4_ext_next_leaf_block() when the result is not used/needed.
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: drop spinlock before calling cifs_put_tlink
cifs: fix expand_dfs_referral
cifs: move bdi_setup_and_register outside of CONFIG_CIFS_DFS_UPCALL
cifs: factor smb_vol allocation out of cifs_setup_volume_info
cifs: have cifs_cleanup_volume_info not take a double pointer
cifs: fix build_unc_path_to_root to account for a prefixpath
cifs: remove bogus call to cifs_cleanup_volume_info
If eh->eh_entries is smaller than eh->eh_max, the routine will
go to the "repeat" and then go to "has_space" directlly ,
since argument "depth" and "eh" are not even changed.
Therefore, goto "has_space" directly and remove redundant "repeat" tag.
Signed-off-by: Robin Dong <sanbai@taobao.com>
This reverts commit 7a249cf83d.
That commit created a situation that could lead to a filesystem
hang. As Dave Chinner pointed out, xfs_trans_alloc() could hold a
reference to m_active_trans (i.e., keep it non-zero) and then wait
for SB_FREEZE_TRANS to complete. Meanwhile a filesystem freeze
request could set SB_FREEZE_TRANS and then wait for m_active_trans
to drop to zero. Nobody benefits from this sequence of events...
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
First, we can sometimes free the state we're merging, which means anybody who
calls merge_state() may have the state it passed in free'ed. This is
problematic because we could end up caching the state, which makes caching
useless as the state will no longer be part of the tree. So instead of free'ing
the state we passed into merge_state(), set it's end to the other->end and free
the other state. This way we are sure to cache the correct state. Also because
we can merge states together, instead of only using the cache'd state if it's
start == the start we are looking for, go ahead and use it if the start we are
looking for is within the range of the cached state. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We used to store the checksums of the space cache directly in the space cache,
however that doesn't work out too well if we have more space than we can fit the
checksums into the first page. So instead use the normal checksumming
infrastructure. There were problems with doing this originally but those
problems don't exist now so this works out fine. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We keep having problems with early enospc, and that's because our method of
making space is inherently racy. The problem is we can have one guy trying to
make space for himself, and in the meantime people come in and steal his
reservation. In order to stop this we make a waitqueue and put anybody who
comes into reserve_metadata_bytes on that waitqueue if somebody is trying to
make more space. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We have to do weird things when handling enospc in the transaction joining code.
Because we've already joined the transaction we cannot commit the transaction
within the reservation code since it will deadlock, so we have to return EAGAIN
and then make sure we don't retry too many times. Instead of doing this, just
do the reservation the normal way before we join the transaction, that way we
can do whatever we want to try and reclaim space, and then if it fails we know
for sure we are out of space and we can return ENOSPC. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I've been watching how many btrfs_search_slot()'s we do and I noticed that when
we create a file with selinux enabled we were doing 2 each time we initialize
the security context. That's because we lookup the xattr first so we can delete
it if we're setting a new value to an existing xattr. But in the create case we
don't have any xattrs, so it is completely useless to have the extra lookup. So
re-arrange things so that we only lookup first if we specifically have
XATTR_REPLACE. That way in the basic case we only do 1 search, and in the more
complicated case we do the normal 2 lookups. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This is simpler and quicker than the hash table, and
avoids needing to search the hash list for every new
lkid to check if it's used.
Signed-off-by: David Teigland <teigland@redhat.com>
In fs/dlm/lock.c in the dlm_scan_waiters() function there are 3 small
issues:
1) There's no need to test the return value of the allocation and do a
memset if is succeedes. Just use kzalloc() to obtain zeroed memory.
2) Since kfree() handles NULL pointers gracefully, the test of
'warned' against NULL before the kfree() after the loop is completely
pointless. Remove it.
3) The arguments to kmalloc() (now kzalloc()) were swapped. Thanks to
Dr. David Alan Gilbert for pointing this out.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: David Teigland <teigland@redhat.com>
at ext4_trim_all_free() comment, there is no longer an @e4b parameter,
instead it is @group.
Reported-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In ext4, when FITRIM is called every time, we iterate all the
groups and do trim one by one. It is a bit time wasting if the
group has been trimmed and there is no change since the last
trim.
So this patch adds a new flag in ext4_group_info->bb_state to
indicate that the group has been trimmed, and it will be cleared
if some blocks is freed(in release_blocks_on_commit). Another
trim_minlen is added in ext4_sb_info to record the last minlen
we use to trim the volume, so that if the caller provide a small
one, we will go on the trim regardless of the bb_state.
A simple test with my intel x25m ssd:
df -h shows:
/dev/sdb1 40G 21G 17G 56% /mnt/ext4
Block size: 4096
run the FITRIM with the following parameter:
range.start = 0;
range.len = UINT64_MAX;
range.minlen = 1048576;
without the patch:
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real 0m5.505s
user 0m0.000s
sys 0m1.224s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real 0m5.359s
user 0m0.000s
sys 0m1.178s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real 0m5.228s
user 0m0.000s
sys 0m1.151s
with the patch:
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real 0m5.625s
user 0m0.000s
sys 0m1.269s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real 0m0.002s
user 0m0.000s
sys 0m0.001s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real 0m0.002s
user 0m0.000s
sys 0m0.001s
A big improvement for the 2nd and 3rd run.
Even after I delete some big image files, it is still much
faster than iterating the whole disk.
[root@boyu-tm test]# time ./ftrim /mnt/ext4/a
real 0m1.217s
user 0m0.000s
sys 0m0.196s
Cc: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When we trim some free blocks in a group of ext4, we need to
calculate the free blocks properly and check whether there are
enough freed blocks left for us to trim. Current solution will
only calculate free spaces if they are large for a trim which
isn't appropriate.
Let us see a small example:
a group has 1.5M free which are 300k, 300k, 300k, 300k, 300k.
And minblocks is 1M. With current solution, we have to iterate
the whole group since these 300k will never be subtracted from
1.5M. But actually we should exit after we find the first 2
free spaces since the left 3 chunks only sum up to 900K if we
subtract the first 600K although they can't be trimed.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In 0f0a25b, we adjust 'len' with s_first_data_block - start, but
it could underflow in case blocksize=1K, fstrim_range.len=512 and
fstrim_range.start = 0. In this case, when we run the code:
len -= first_data_blk - start; len will be underflow to -1ULL.
In the end, although we are safe that last_group check later will limit
the trim to the whole volume, but that isn't what the user really want.
So this patch fix it. It also adds the check for 'start' like ext3 so that
we can break immediately if the start is invalid.
Cc: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Using function calls in TP_printk causes perf heartburn, so print the
MAJOR/MINOR device numbers instead.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Upon corrupted inode or disk failures, we may fail after we already
allocate some blocks from the inode or take some blocks from the
inode's preallocation list, but before we successfully insert the
corresponding extent to the extent tree. In this case, we should free
any allocated blocks and discard the inode's preallocated blocks
because the entries in the inode's preallocation list may be in an
inconsistent state.
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
The current implementation of ext4_free_blocks() always calls
dquot_free_block This looks quite sensible in the most cases: blocks
to be freed are associated with inode and were accounted in quota and
i_blocks some time ago.
However, there is a case when blocks to free were not accounted by the
time calling ext4_free_blocks() yet:
1. delalloc is on, write_begin pre-allocated some space in quota
2. write-back happens, ext4 allocates some blocks in ext4_ext_map_blocks()
3. then ext4_ext_map_blocks() gets an error (e.g. ENOSPC) from
ext4_ext_insert_extent() and calls ext4_free_blocks().
In this scenario, ext4_free_blocks() calls dquot_free_block() who, in
turn, decrements i_blocks for blocks which were not accounted yet (due
to delalloc) After clean umount, e2fsck reports something like:
> Inode 21, i_blocks is 5080, should be 5128. Fix<y>?
because i_blocks was erroneously decremented as explained above.
The patch fixes the problem by passing the new flag
EXT4_FREE_BLOCKS_NO_QUOT_UPDATE to ext4_free_blocks(), to request
that the dquot_free_block() call be skipped.
Signed-off-by: Maxim Patlasov <maxim.patlasov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
concern of not holding I_SYNC for too long. (At least, that was the
comment previously.) This doesn't make sense now because the only
time we wait for I_SYNC is if we are calling sync or fsync, and in
that case we need to write out all of the data anyway. Previously
there may have been other code paths that waited on I_SYNC, but not
any more. -- Theodore Ts'o
So remove the MAX_WRITEBACK_PAGES constraint. The writeback pages
will adapt to as large as the storage device can write within 500ms.
XFS is observed to do IO completions in a batch, and the batch size is
equal to the write chunk size. To avoid dirty pages to suddenly drop
out of balance_dirty_pages()'s dirty control scope and create large
fluctuations, the chunk size is also limited to half the control scope.
The balance_dirty_pages() control scrope is
[(background_thresh + dirty_thresh) / 2, dirty_thresh]
which is by default [15%, 20%] of global dirty pages, whose range size
is dirty_thresh / DIRTY_FULL_SCOPE.
The adpative write chunk size will be rounded to the nearest 4MB
boundary.
http://bugzilla.kernel.org/show_bug.cgi?id=13930
CC: Theodore Ts'o <tytso@mit.edu>
CC: Dave Chinner <david@fromorbit.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
The start of a heavy weight application (ie. KVM) may instantly knock
down determine_dirtyable_memory() if the swap is not enabled or full.
global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi
dirty thresholds that are _much_ lower than the global/bdi dirty pages.
balance_dirty_pages() will then heavily throttle all dirtiers including
the light ones, until the dirty pages drop below the new dirty thresholds.
During this _deep_ dirty-exceeded state, the system may appear rather
unresponsive to the users.
About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty
threshold to heavy dirtiers than light ones, and the dirty pages will
be throttled around the heavy dirtiers' dirty threshold and reasonably
below the light dirtiers' dirty threshold. In this state, only the heavy
dirtiers will be throttled and the dirty pages are carefully controlled
to not exceed the light dirtiers' dirty threshold. However if the
threshold itself suddenly drops below the number of dirty pages, the
light dirtiers will get heavily throttled.
So introduce global_dirty_limit for tracking the global dirty threshold
with policies
- follow downwards slowly
- follow up in one shot
global_dirty_limit can effectively mask out the impact of sudden drop of
dirtyable memory. It will be used in the next patch for two new type of
dirty limits. Note that the new dirty limits are not going to avoid
throttling the light dirtiers, but could limit their sleep time to 200ms.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
The estimation value will start from 100MB/s and adapt to the real
bandwidth in seconds.
It tries to update the bandwidth only when disk is fully utilized.
Any inactive period of more than one second will be skipped.
The estimated bandwidth will be reflecting how fast the device can
writeout when _fully utilized_, and won't drop to 0 when it goes idle.
The value will remain constant at disk idle time. At busy write time, if
not considering fluctuations, it will also remain high unless be knocked
down by possible concurrent reads that compete for the disk time and
bandwidth with async writes.
The estimation is not done purely in the flusher because there is no
guarantee for write_cache_pages() to return timely to update bandwidth.
The bdi->avg_write_bandwidth smoothing is very effective for filtering
out sudden spikes, however may be a little biased in long term.
The overheads are low because the bdi bandwidth update only occurs at
200ms intervals.
The 200ms update interval is suitable, because it's not possible to get
the real bandwidth for the instance at all, due to large fluctuations.
The NFS commits can be as large as seconds worth of data. One XFS
completion may be as large as half second worth of data if we are going
to increase the write chunk to half second worth of data. In ext4,
fluctuations with time period of around 5 seconds is observed. And there
is another pattern of irregular periods of up to 20 seconds on SSD tests.
That's why we are not only doing the estimation at 200ms intervals, but
also averaging them over a period of 3 seconds and then go further to do
another level of smoothing in avg_write_bandwidth.
CC: Li Shaohua <shaohua.li@intel.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.
struct writeback_control is basically designed to control writeback of a
single file, but we keep abuse it for writing multiple files in
writeback_sb_inodes() and its callers.
It immediately clean things up, e.g. suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes it can always start with a clean
zero value.
It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
remained quota in wbc->nr_to_write.
Acked-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Regression introduced in commit 724d9f1cfb.
Prior to that, expand_dfs_referral would regenerate the mount data string
and then call cifs_parse_mount_options to re-parse it (klunky, but it
worked). The above commit moved cifs_parse_mount_options out of cifs_mount,
so the re-parsing of the new mount options no longer occurred. Fix it by
making expand_dfs_referral re-parse the mount options.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This needs to be done regardless of whether that KConfig option is set
or not.
Reported-by: Sven-Haegar Koch <haegar@sdinet.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
btrfs: fix oops when doing space balance
Btrfs: don't panic if we get an error while balancing V2
btrfs: add missing options displayed in mount output
Remove two variables that serve no purpose in
xfs_alloc_ag_vextent_exact().
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Pavol pointed out that there is one silent error case in the mount
path, and that others are rather uninformative.
I've taken Pavol's suggested patch and extended it a bit to also:
* fix a message which says "turned off" but actually errors out
* consolidate the vaguely differentiated "SB sanity check [12]"
messages, and hexdump the superblock for analysis
Original-patch-by: Pavol Gono <Pavol.Gono@siemens.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
There is no need for a pre-flush when doing writing the second part of a
split log buffer, and if we are using an external log there is no need
to do a full cache flush of the log device at all given that all writes
to it use the FUA flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Remove the unused and misnamed _XBF_RUN_QUEUES flag, rename XBF_LOG_BUFFER
to the more fitting XBF_SYNCIO, and split XBF_ORDERED into XBF_FUA and
XBF_FLUSH to allow more fine grained control over the bio flags. Also
cleanup processing of the flags in _xfs_buf_ioapply to make more sense,
and renumber the sparse flag number space to group flags by purpose.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
All other xfs_buf_get/read-like helpers return the buffer locked, make sure
xfs_buf_get_uncached isn't different for no reason. Half of the callers
already lock it directly after, and the others probably should also keep
it locked if only for consistency and beeing able to use xfs_buf_rele,
but I'll leave that for later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Rename xfs_buf_cond_lock and reverse it's return value to fit most other
trylock operations in the Kernel and XFS (with the exception of down_trylock,
after which xfs_buf_cond_lock was modelled), and replace xfs_buf_lock_val
with an xfs_buf_islocked for use in asserts, or and opencoded variant in
tracing. remove the XFS_BUF_* wrappers for all the locking helpers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Micro-optimize various comparisms by always byteswapping the constant
instead of the variable, which allows to do the swap at compile instead
of runtime.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Switch the shortform directory code over to use the generic
get_unaligned_beXX helpers instead of reinventing them. As a result
kill off xfs_arch.h and move the setting of XFS_NATIVE_HOST into
xfs_linux.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Simplify the confusing xfs_dir2_leaf structure. It is supposed to describe
an XFS dir2 leaf format btree block, but due to the variable sized nature
of almost all elements in it it can't actuall do anything close to that
job. Remove the members that are after the first variable sized array,
given that they could only be used for sizeof expressions that can as well
just use the underlying types directly, and make the ents array a real
C99 variable sized array.
Also factor out the xfs_dir2_leaf_size, to make the sizing of a leaf
entry which already was convoluted somewhat readable after using the
longer type names in the sizeof expressions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Remove the tag member which is at a variable offset after the actual
name, and make name a real variable sized C99 array instead of the incorrect
one-sized array which confuses (not only) gcc.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Remove the confusing xfs_dir2_data structure. It is supposed to describe
an XFS dir2 data btree block, but due to the variable sized nature of
almost all elements in it it can't actuall do anything close to that
job. In addition to accessing the fixed offset header structure it was
only used to get a pointer to the first dir or unused entry after it,
which can be trivially replaced by pointer arithmetics on the header
pointer. For most users that is actually more natural anyway, as they
don't use a typed pointer but rather a character pointer for further
arithmetics.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In most places we can simply pass around and use the struct xfs_dir2_data_hdr,
which is the first and most important member of struct xfs_dir2_data instead
of the full structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Remove the confusing xfs_dir2_block structure. It is supposed to describe
an XFS dir2 block format btree block, but due to the variable sized nature
of almost all elements in it it can't actuall do anything close to that
job. In addition to accessing the fixed offset header structure it was
only used to get a pointer to the first dir or unused entry after it,
which can be trivially replaced by pointer arithmetics on the header
pointer. For most users that is actually more natural anyway, as they
don't use a typed pointer but rather a character pointer for further
arithmetics.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In most places we can simply pass around and use the struct xfs_dir2_data_hdr,
which is the first and most important member of struct xfs_dir2_block instead
of the full structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Remove the inumber member which is at a variable offset after the actual
name, and make name a real variable sized C99 array instead of the incorrect
one-sized array which confuses (not only) gcc. Based on this clean up
the helpers to calculate the entry size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The list field of it is never cactually used, so all uses can simply be
replaced with the xfs_dir2_sf_hdr_t type that it has as first member.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Refactor the shortform directory helpers that deal with the 32-bit vs
64-bit wide inode numbers into more sensible helpers, and kill the
xfs_intino_t typedef that is now superflous.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Add a new xfs_dir2_leaf_find_entry helper to factor out some duplicate code
from xfs_dir2_leaf_addname xfs_dir2_leafn_add. Found by Eric Sandeen using
an automated code duplication checker.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Remove the transaction pointer in the inode. It's only used to avoid
passing down an argument in the bmap code, and for a few asserts in
the transaction code right now.
Also use the local variable ip in a few more places in xfs_inode_item_unlock,
so that it isn't only used for debug builds after the above change.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
As pointed out by Jan xfs_trans_alloc can race with a concurrent filesystem
freeze when it sleeps during the memory allocation. Fix this by moving the
wait_for_freeze call after the memory allocation. This means moving the
freeze into the low-level _xfs_trans_alloc helper, which thus grows a new
argument. Also fix up some comments in that area while at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
The following script from Wu Fengguang shows very bad behaviour in XFS
when aggressively dirtying data during a sync on XFS, with sync times
up to almost 10 times as long as ext4.
A large part of the issue is that XFS writes data out itself two times
in the ->sync_fs method, overriding the livelock protection in the core
writeback code, and another issue is the lock-less xfs_ioend_wait call,
which doesn't prevent new ioend from being queue up while waiting for
the count to reach zero.
This patch removes the XFS-internal sync calls and relies on the VFS
to do it's work just like all other filesystems do. Note that the
i_iocount wait which is rather suboptimal is simply removed here.
We already do it in ->write_inode, which keeps the current supoptimal
behaviour. We'll eventually need to remove that as well, but that's
material for a separate commit.
------------------------------ snip ------------------------------
#!/bin/sh
umount /dev/sda7
mkfs.xfs -f /dev/sda7
# mkfs.ext4 /dev/sda7
# mkfs.btrfs /dev/sda7
mount /dev/sda7 /fs
echo $((50<<20)) > /proc/sys/vm/dirty_bytes
pid=
for i in `seq 10`
do
dd if=/dev/zero of=/fs/zero-$i bs=1M count=1000 &
pid="$pid $!"
done
sleep 1
tic=$(date +'%s')
sync
tac=$(date +'%s')
echo
echo sync time: $((tac-tic))
egrep '(Dirty|Writeback|NFS_Unstable)' /proc/meminfo
pidof dd > /dev/null && { kill -9 $pid; echo sync NOT livelocked; }
------------------------------ snip ------------------------------
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Split the guts of xfs_itruncate_finish that loop over the existing extents
and calls xfs_bunmapi on them into a new helper, xfs_itruncate_externs.
Make xfs_attr_inactive call it directly instead of xfs_itruncate_finish,
which allows to simplify the latter a lot, by only letting it deal with
the data fork. As a result xfs_itruncate_finish is renamed to
xfs_itruncate_data to make its use case more obvious.
Also remove the sync parameter from xfs_itruncate_data, which has been
unessecary since the introduction of the busy extent list in 2002, and
completely dead code since 2003 when the XFS_BMAPI_ASYNC parameter was
made a no-op.
I can't actually see why the xfs_attr_inactive needs to set the transaction
sync, but let's keep this patch simple and without changes in behaviour.
Also avoid passing a useless argument to xfs_isize_check, and make it
private to xfs_inode.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
xfs_itruncate_start is a rather length wrapper that evaluates to a call
to xfs_ioend_wait and xfs_tosspages, and only has two callers.
Instead of using the complicated checks left over from IRIX where we
can to truncate the pagecache just call xfs_tosspages
(aka truncate_inode_pages) directly as we want to get rid of all data
after i_size, and truncate_inode_pages handles incorrect alignments
and too large offsets just fine.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Get rid of the special case where we use unlogged timestamp updates for
a truncate to the current inode size, and just call xfs_setattr_nonsize
for it to treat it like a utimes calls.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Split up xfs_setattr into two functions, one for the complex truncate
handling, and one for the trivial attribute updates. Also move both
new routines to xfs_iops.c as they are fairly Linux-specific.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
GCC 4.6 complains about an array subscript is above array bounds when
using the btree index to index into the agf_levels array. The only
two indices passed in are 0 and 1, and we have an assert insuring that.
Replace the trick of using the array index directly with using constants
in the already existing branch for assigning the XFS_BTREE_LASTREC_UPDATE
flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The non-blockig behaviour in xfs_vm_writepage currently is conditional on
having both the WB_SYNC_NONE sync_mode and the nonblocking flag set.
The latter used to be used by both pdflush, kswapd and a few other places
in older kernels, but has been fading out starting with the introduction
of the per-bdi flusher threads.
Enable the non-blocking behaviour for all WB_SYNC_NONE calls to get back
the behaviour we want.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Now that we reject direct reclaim in addition to always using GFP_NOFS
allocation there's no chance we'll ever end up in ->writepage with
PF_FSTRANS set. Add a WARN_ON if we hit this case, and stop checking
if we'd actually need to start a transaction.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When the 1st LEB was unmapped and written but 2nd LEB not,
the master node recovery doesn't succeed after power cut.
We see following error when mounting UBIFS partition on NOR
flash:
UBIFS error (pid 1137): ubifs_recover_master_node: failed to recover master node
Correct 2nd master node offset check is needed to fix the
problem. If the 2nd master node is at the end in the 2nd LEB,
first master node is used for recovery. When checking for this
condition we should check whether the master node is exactly at
the end of the LEB (without remaining empty space) or whether
it is followed by an empty space less than the master node size.
Artem: when the error happened, offs2 = 261120, sz = 512, c->leb_size = 262016.
Signed-off-by: Anatolij Gustschin <agust@denx.de>
Signed-off-by: Artem Bityutskiy <dedekind1@gmail.com>
Add an FS-Cache helper to bulk uncache pages on an inode. This will
only work for the circumstance where the pages in the cache correspond
1:1 with the pages attached to an inode's page cache.
This is required for CIFS and NFS: When disabling inode cookie, we were
returning the cookie and setting cifsi->fscache to NULL but failed to
invalidate any previously mapped pages. This resulted in "Bad page
state" errors and manifested in other kind of errors when running
fsstress. Fix it by uncaching mapped pages when we disable the inode
cookie.
This patch should fix the following oops and "Bad page state" errors
seen during fsstress testing.
------------[ cut here ]------------
kernel BUG at fs/cachefiles/namei.c:201!
invalid opcode: 0000 [#1] SMP
Pid: 5, comm: kworker/u:0 Not tainted 2.6.38.7-30.fc15.x86_64 #1 Bochs Bochs
RIP: 0010: cachefiles_walk_to_object+0x436/0x745 [cachefiles]
RSP: 0018:ffff88002ce6dd00 EFLAGS: 00010282
RAX: ffff88002ef165f0 RBX: ffff88001811f500 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000282
RBP: ffff88002ce6dda0 R08: 0000000000000100 R09: ffffffff81b3a300
R10: 0000ffff00066c0a R11: 0000000000000003 R12: ffff88002ae54840
R13: ffff88002ae54840 R14: ffff880029c29c00 R15: ffff88001811f4b0
FS: 00007f394dd32720(0000) GS:ffff88002ef00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fffcb62ddf8 CR3: 000000001825f000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/u:0 (pid: 5, threadinfo ffff88002ce6c000, task ffff88002ce55cc0)
Stack:
0000000000000246 ffff88002ce55cc0 ffff88002ce6dd58 ffff88001815dc00
ffff8800185246c0 ffff88001811f618 ffff880029c29d18 ffff88001811f380
ffff88002ce6dd50 ffffffff814757e4 ffff88002ce6dda0 ffffffff8106ac56
Call Trace:
cachefiles_lookup_object+0x78/0xd4 [cachefiles]
fscache_lookup_object+0x131/0x16d [fscache]
fscache_object_work_func+0x1bc/0x669 [fscache]
process_one_work+0x186/0x298
worker_thread+0xda/0x15d
kthread+0x84/0x8c
kernel_thread_helper+0x4/0x10
RIP cachefiles_walk_to_object+0x436/0x745 [cachefiles]
---[ end trace 1d481c9af1804caa ]---
I tested the uncaching by the following means:
(1) Create a big file on my NFS server (104857600 bytes).
(2) Read the file into the cache with md5sum on the NFS client. Look in
/proc/fs/fscache/stats:
Pages : mrk=25601 unc=0
(3) Open the file for read/write ("bash 5<>/warthog/bigfile"). Look in proc
again:
Pages : mrk=25601 unc=25601
Reported-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement error propagation through the callers of
hfsplus_ext_write_extent_locked().
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Christoph Hellwig <hch@lst.de>
hfs_find_init() may fail with ENOMEM, but there are places, where
the returned value is not checked. The consequences can be very
unpleasant, e.g. kfree uninitialized pointer and
inappropriate mutex unlocking.
The patch adds checks for errors in hfs_find_init().
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Christoph Hellwig <hch@lst.de>
A user reported an error where if we try to balance an fs after a device has
been removed it will blow up. This is because we get an EIO back and this is
where BUG_ON(ret) bites us in the ass. To fix we just exit. Thanks,
Reported-by: Anand Jain <Anand.Jain@oracle.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are three missed mount options settable by user which are not
currently displayed in mount output.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When the dlm fails to make a network connection to another
node, include the address of the node in the error message.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
When inodes are marked stale in a transaction, they are treated
specially when the inode log item is being inserted into the AIL.
It tries to avoid moving the log item forward in the AIL due to a
race condition with the writing the underlying buffer back to disk.
The was "fixed" in commit de25c18 ("xfs: avoid moving stale inodes
in the AIL").
To avoid moving the item forward, we return a LSN smaller than the
commit_lsn of the completing transaction, thereby trying to trick
the commit code into not moving the inode forward at all. I'm not
sure this ever worked as intended - it assumes the inode is already
in the AIL, but I don't think the returned LSN would have been small
enough to prevent moving the inode. It appears that the reason it
worked is that the lower LSN of the inodes meant they were inserted
into the AIL and flushed before the inode buffer (which was moved to
the commit_lsn of the transaction).
The big problem is that with delayed logging, the returning of the
different LSN means insertion takes the slow, non-bulk path. Worse
yet is that insertion is to a position -before- the commit_lsn so it
is doing a AIL traversal on every insertion, and has to walk over
all the items that have already been inserted into the AIL. It's
expensive.
To compound the matter further, with delayed logging inodes are
likely to go from clean to stale in a single checkpoint, which means
they aren't even in the AIL at all when we come across them at AIL
insertion time. Hence these were all getting inserted into the AIL
when they simply do not need to be as inodes marked XFS_ISTALE are
never written back.
Transactional/recovery integrity is maintained in this case by the
other items in the unlink transaction that were modified (e.g. the
AGI btree blocks) and committed in the same checkpoint.
So to fix this, simply unpin the stale inodes directly in
xfs_inode_item_committed() and return -1 to indicate that the AIL
insertion code does not need to do any further processing of these
inodes.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
...as that makes for a cumbersome interface. Make it take a regular
smb_vol pointer and rely on the caller to zero it out if needed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Regression introduced by commit f87d39d951.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This call to cifs_cleanup_volume_info is clearly wrong. As soon as it's
called the following call to cifs_get_tcp_session will oops as the
volume_info pointer will then be NULL.
The caller of cifs_mount should clean up this data since it passed it
in. There's no need for us to call this here.
Regression introduced by commit 724d9f1cfb.
Reported-by: Adam Williamson <awilliam@redhat.com>
Cc: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The shdr4extnum variable isn't being freed in the cleanup process of
elf_fdpic_core_dump().
Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
locks_alloc_lock() assumed that the allocated struct file_lock is
already initialized to zero members. This is only true for the first
allocation of the structure, after reuse some of the members will have
random values.
This will for example result in passing random fl_start values to
userspace in fuse for FL_FLOCK locks, which is an information leak at
best.
Fix by reinitializing those members which may be non-zero after freeing.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: fix sync and dio writes across stripe boundaries
libceph: fix page calculation for non-page-aligned io
ceph: fix page alignment corrections
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus:
hfsplus: Fix double iput of the same inode in hfsplus_fill_super()
hfsplus: add missing call to bio_put()
This patch cleans-up and improves the power cut testing:
1. Kill custom 'simple_random()' function and use 'random32()' instead.
2. Make timeout larger
3. When cutting the buffer - fill the end with random data sometimes, not
only with 0xFFs.
4. Some times cut in the middle of the buffer, not always at the end.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Since the recovery testing is effectively about emulating power cuts by UBIFS,
use "power cut" as the base term for all the related variables and name them
correspondingly. This is just a minor clean-up for the sake of readability.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This is a clean-up of the power-cut emulation code - remove the custom list of
superblocks which we maintained to find the superblock by the UBI volume
descriptor. We do not need that crud any longer, because now we can get the
superblock as a function argument.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Now when we use UBIFS helpers for all the I/O, we can remove the horrible hack
of re-defining UBI I/O functions.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Instead of using 'ubi_read()' function directly, used the 'ubifs_leb_read()'
helper function instead. This allows to get rid of several redundant error
messages and make sure that we always have a stack dump on read errors.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Introduce the following I/O helper functions: 'ubifs_leb_read()',
'ubifs_leb_write()', 'ubifs_leb_change()', 'ubifs_leb_unmap()',
'ubifs_leb_map()', 'ubifs_is_mapped().
The idea is to wrap all UBI I/O functions in order to encapsulate various
assertions and error path handling (error message, stack dump, switching to R/O
mode). And there are some other benefits of this which will be used in the
following patches.
This patch does not switch whole UBIFS to use these functions yet.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
When switching to R/O mode due to an I/O error, always dump the stack, not only
when debugging is enabled.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This patch contains several minor clean-up and preparational cahnges.
1. Remove 'dbg_read()', 'dbg_write()', 'dbg_change()', and 'dbg_leb_erase()'
functions as they are not used.
2. Remove 'dbg_leb_read()' and 'dbg_is_mapped()' as they are not really needed,
it is fine to let reads go through in failure mode.
3. Rename 'offset' argument to 'offs' to be consistent with the rest of UBIFS
code.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Now we have per-FS (superblock) debugfs knobs, but they have one drawback - you
have to first mount the FS and only after this you can switch self-checks
on/off. But often we want to have the checks enabled during the mount.
Introduce global debugging knobs for this purpose.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Separate out pieces of code from the debugfs file read/write functions and
create separate 'interpret_user_input()'/'provide_user_output()' helpers. These
helpers will be needed in one of the following patches, so this is just a
preparational change.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Move 'dbg_debugfs_init()' and 'dbg_debugfs_exit()' functions which initialize
debugfs for whole UBIFS subsystem below the code which initializes debugfs for
a particular UBIFS instance. And do the same for 'ubifs_debugging_init()' and
'ubifs_debugging_exit()' functions. This layout is a bit better for the next
patches, so this is just a preparation.
Also, rename 'open_debugfs_file()' into 'dfs_file_open()' for consistency.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
When we are testing UBIFS recovery, it is better to print in which eraseblock
we are going to fail. Currently UBIFS prints it only if recovery debugging
messages are enabled, but this is not very practical. So change 'dbg_rcvry()'
messages to 'ubifs_warn()' messages.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
UBIFS has many built-in self-check functions which can be enabled using the
debug_chks module parameter or the corresponding sysfs file
(/sys/module/ubifs/parameters/debug_chks). However, this is not flexible enough
because it is not per-filesystem. This patch moves this to debugfs interfaces.
We already have debugfs support, so this patch just adds more debugfs files.
While looking at debugfs support I've noticed that it is racy WRT file-system
unmount, and added a TODO entry for that. This problem has been there for long
time and it is quite standard debugfs PITA. The plan is to fix this later.
This patch is simple, but it is large because it changes many places where we
check if a particular type of checks is enabled or disabled.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
We have too many different debugging checks - lessen the amount by merging all
index-related checks into one. At the same time, move the "force in-the-gap"
test to the "index checks" class, because it is too heavy for the "general"
class.
This patch merges TNC, Old index, and Index size check and calles this just
"index checks".
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This patch introduces helper functions for all debugging checks, so instead of
doing
if (!(ubifs_chk_flags & UBIFS_CHK_GEN))
we now do
if (!dbg_is_chk_gen(c))
This is a preparation to further changes where the flags will go away, and
we'll need to only change the helper functions, but the code which utilizes
them won't be touched.
At the same time this patch removes 'dbg_force_in_the_gaps()',
'dbg_force_in_the_gaps_enabled()', and dbg_failure_mode helpers for
consistency.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Add 'const struct ubifs_info *c' parameter to 'dbg_check_synced_i_size()'
function because we'll need it in the next patch when we switch to debugfs.
So this patch is just a preparation.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Add 'struct ubifs_info *c' parameter to the 'dbg_check_name()' debugging
function - it will be needed in one of the following commits where we switch to
debugfs. So this is just a preparation.
Mark parameters as 'const' while on it.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Add a couple of comments - while looking into TNC I could not easily figure out
few facts, so it is a good idea to document them in the code.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The UBIFS lpt tree is in many aspects similar to the TNC tree, and we have
similar flags for these trees. And by mistake we use the COW_ZNODE flag for
LPT in some places, instead of the right flag COW_CNODE. And this works
only because these two constants have the same value.
This patch makes all the LPT code to use COW_CNODE and also changes COW_CNODE
constant value to make sure we do not misuse the flags any more.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
We have 3 znode flags: cow, obsolete, dirty. For the last flag we have a
'ubifs_zn_dirty()' helper function, but for the other 2 flags we use
'test_bit()' directly.
This patch makes the situation more consistent and introduces helpers for the
other 2 flags: 'ubifs_zn_cow()' and 'ubifs_zn_obsolete()'.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Remove dead pieces of code under "if (c->min_io_size == 1)" statement -
we never execute it because in UBIFS 'c->min_io_size' is always at least 8.
This are leftovers from old pre-mainline prototype.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Remove unnecessary brackets in "inode->i_flags |= (S_NOCMTIME)" statement to
make the code not look silly.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Instead of using long "(inode->i_mode & S_IFMT) != S_IFREG" expression, use
shorted "!S_ISREG(inode->i_mode)".
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Teach 'dbg_dump_inode()' dump directory entries for directory inodes.
This requires few additional changes:
1. The 'c' argument of 'dbg_dump_inode()' cannot be const any more.
2. Users of 'dbg_dump_inode()' should not have 'tnc_mutex' locked.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This patch lessens the 'struct ubifs_debug_info' size by 90 bytes by
allocating less bytes for the debugfs root directory name. It introduces macros
for the name patter an length instead of hard-coding 100 bytes. It also makes
UBIFS use 'snprintf()' and teaches it to gracefully catch situations when the
name array is too short.
Additionally, this patch makes 2 unrelated changes - I just thought they do not
deserve separate commits: simplifies 'ubifs_assert()' for non-debugging case
and makes 'dbg_debugfs_init()' properly verify debugfs return code which may be
an error code or NULL, so we should you 'IS_ERR_OR_NULL()' instead of
'IS_ERR()'.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
If commit failed and it is in broken state, UBIFS switches to R/O mode. Most
operations return -EROFS in this case, except of commit which returns -EINVAL.
Make it return -EROFS too for consistency. This is also important for our power
cut emulation testing.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Allocate dlm hash tables in the vmalloc area to allow a greater
maximum size without restructuring of the hash table code.
Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
On Linux x86_64 host with 32bit userspace, running
qemu or even just "qemu-img create -f qcow2 some.img 1G"
causes a kernel warning:
ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(00005326){t:'S';sz:0} arg(7fffffff) on some.img
ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(801c0204){t:02;sz:28} arg(fff77350) on some.img
ioctl 00005326 is CDROM_DRIVE_STATUS,
ioctl 801c0204 is FDGETPRM.
The warning appears because the Linux compat-ioctl handler for these
ioctls only applies to block devices, while qemu also uses the ioctls on
plain files.
Signed-off-by: Johannes Stezenbach <js@sig21.net>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
When multithreaded program execs under ptrace,
all traced threads report WIFEXITED status, except for
thread group leader and the thread which execs.
Unless tracer tracks thread group relationship between tracees,
which is a nontrivial task, it will not detect that
execed thread no longer exists.
This patch allows tracer to figure out which thread
performed this exec, by requesting PTRACE_GETEVENTMSG
in PTRACE_EVENT_EXEC stop.
Another, samller problem which is solved by this patch
is that tracer now can figure out which of the several
concurrent execs in multithreaded program succeeded.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Benjamin S. reported that he was unable to suspend his machine while
it had a cifs share mounted. The freezer caused this to spew when he
tried it:
-----------------------[snip]------------------
PM: Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.01 seconds) done.
Freezing remaining freezable tasks ...
Freezing of tasks failed after 20.01 seconds (1 tasks refusing to freeze, wq_busy=0):
cifsd S ffff880127f7b1b0 0 1821 2 0x00800000
ffff880127f7b1b0 0000000000000046 ffff88005fe008a8 ffff8800ffffffff
ffff880127cee6b0 0000000000011100 ffff880127737fd8 0000000000004000
ffff880127737fd8 0000000000011100 ffff880127f7b1b0 ffff880127736010
Call Trace:
[<ffffffff811e85dd>] ? sk_reset_timer+0xf/0x19
[<ffffffff8122cf3f>] ? tcp_connect+0x43c/0x445
[<ffffffff8123374e>] ? tcp_v4_connect+0x40d/0x47f
[<ffffffff8126ce41>] ? schedule_timeout+0x21/0x1ad
[<ffffffff8126e358>] ? _raw_spin_lock_bh+0x9/0x1f
[<ffffffff811e81c7>] ? release_sock+0x19/0xef
[<ffffffff8123e8be>] ? inet_stream_connect+0x14c/0x24a
[<ffffffff8104485b>] ? autoremove_wake_function+0x0/0x2a
[<ffffffffa02ccfe2>] ? ipv4_connect+0x39c/0x3b5 [cifs]
[<ffffffffa02cd7b7>] ? cifs_reconnect+0x1fc/0x28a [cifs]
[<ffffffffa02cdbdc>] ? cifs_demultiplex_thread+0x397/0xb9f [cifs]
[<ffffffff81076afc>] ? perf_event_exit_task+0xb9/0x1bf
[<ffffffffa02cd845>] ? cifs_demultiplex_thread+0x0/0xb9f [cifs]
[<ffffffffa02cd845>] ? cifs_demultiplex_thread+0x0/0xb9f [cifs]
[<ffffffff810444a1>] ? kthread+0x7a/0x82
[<ffffffff81002d14>] ? kernel_thread_helper+0x4/0x10
[<ffffffff81044427>] ? kthread+0x0/0x82
[<ffffffff81002d10>] ? kernel_thread_helper+0x0/0x10
Restarting tasks ... done.
-----------------------[snip]------------------
We do attempt to perform a try_to_freeze in cifs_reconnect, but the
connection attempt itself seems to be taking longer than 20s to time
out. The connect timeout is governed by the socket send and receive
timeouts, so we can shorten that period by setting those timeouts
before attempting the connect instead of after.
Adam Williamson tested the patch and said that it seems to have fixed
suspending on his laptop when a cifs share is mounted.
Reported-by: Benjamin S <da_joind@gmx.net>
Tested-by: Adam Williamson <awilliam@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Currently, only open(2) is defined as the 'clearing' point. It has
two roles - first, it's an acknowledgement from userland indicating
that the event has been received and kernel can clear pending states
and proceed to generate more events. Secondly, it's passed on to
device drivers as a hint indicating that a synchronization point has
been reached and it might want to take a deeper look at the device.
The latter currently is only used by sr which uses two different
mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY
to discover events, where the former is lighter weight and safe to be
used repeatedly but may not provide full coverage. Among other
things, GET_EVENT can't detect media removal while TUR can.
This patch makes close(2) - blkdev_put() - indicate clearing hint for
MEDIA_CHANGE to drivers. disk_check_events() is renamed to
disk_flush_events() and updated to take @mask for events to flush
which is or'd to ev->clearing and will be passed to the driver on the
next ->check_events() invocation.
This change makes sr generate MEDIA_CHANGE when media is ejected from
userland - e.g. with eject(1).
Note: Given the current usage, it seems @clearing hint is needlessly
complex. disk_clear_events() can simply clear all events and the hint
can be boolean @flush.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Display all addresses the dlm is using for the local node
from the configfs file config/dlm/<cluster>/comms/<comm>/addr_list
Also make the addr file write only.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Replace the hardcoded 2TB limit with a dynamic limit based on the block
size now that we have fixed the few overflows preventing operation
with large volumes.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
For partitions larger than 2TB or at such an offset the hfs wrapper code
in hfsplus might overflow the range representable in a 32-bit
data type. Make sure we use a sector_t for the arithmetics leading to it.
I'm not sure this code can be readed at all as hfs itself never supported
such large volumes.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
For filesystems larger than 2TB the final sector number passed to
map_bh might overflow the range representable in a 32-bit data type.
Make sure we use a sector_t for it and the arithmetics calculating it.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Make assignments outside `if' conditions for unicode.c module
where the checkpatch.pl script reported this coding style error.
Signed-off-by: Anton Salikhmetov <alexo@tuxera.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
There is a misprint in resource deallocation code on error path in
hfsplus_fill_super(): the sbi->alloc_file inode is iput twice,
while the root inode in not iput at all.
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Christoph Hellwig <hch@lst.de>
hfsplus leaks bio objects by failing to call bio_put() on the bios
it allocates. Add the missing call to fix the leak.
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Cc: <stable@kernel.org> # .38.x, .39.x
Signed-off-by: Christoph Hellwig <hch@lst.de>
These days, bio_alloc() is guaranteed to never fail (as long as nvecs
is less than BIO_MAX_PAGES), so we don't need the loop around the
struct bio allocation.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In current pnfs tree, all the layouts set mds_offset in their
.write_pagelist member.
mds_offset is only used by generic layer and should be handled by it.
This patch is for upstream. It is needed in this -rc series to fix a
bug in objects layout_commit.
I'll send patches for objects and blocks to be
squashed into current pnfs tree.
TODO: It looks like the read path needs the same patch.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
/proc/PID/io may be used for gathering private information. E.g. for
openssh and vsftpd daemons wchars/rchars may be used to learn the
precise password length. Restrict it to processes being able to ptrace
the target process.
ptrace_may_access() is needed to prevent keeping open file descriptor of
"io" file, executing setuid binary and gathering io information of the
setuid'ed process.
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I found that ext4_ext_find_goal() and ext4_find_near()
share the same code for returning a coloured start block
based on i_block_group.
We can refactor this into a common function so that they
don't diverge in the future.
Thanks to adilger for suggesting the new function name.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Under heavy memory and filesystem load, users observe the assertion
mapping->nrpages == 0 in end_writeback() trigger. This can be caused by
page reclaim reclaiming the last page from a mapping in the following
race:
CPU0 CPU1
...
shrink_page_list()
__remove_mapping()
__delete_from_page_cache()
radix_tree_delete()
evict_inode()
truncate_inode_pages()
truncate_inode_pages_range()
pagevec_lookup() - finds nothing
end_writeback()
mapping->nrpages != 0 -> BUG
page->mapping = NULL
mapping->nrpages--
Fix the problem by doing a reliable check of mapping->nrpages under
mapping->tree_lock in end_writeback().
Analyzed by Jay <jinshan.xiong@whamcloud.com>, lost in LKML, and dug out
by Miklos Szeredi <mszeredi@suse.de>.
Cc: Jay <jinshan.xiong@whamcloud.com>
Cc: Miklos Szeredi <mszeredi@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
romfs_get_unmapped_area() checks argument `len' without considering
PAGE_ALIGN which will cause do_mmap_pgoff() return -EINVAL error after
commit f67d9b1576 ("nommu: add page_align to mmap").
Fix the check by changing it in same way ramfs_nommu_get_unmapped_area()
was changed in ramfs/file-nommu.c.
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Acked-by: Greg Ungerer <gerg@snapgear.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch moves functions from inode.c to indirect.c.
The moved functions are ext4_ind_* functions and their helpers.
Functions called from inode.c are declared extern.
Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Move two functions that will be needed by the indirect functions to be
moved to indirect.c as well as inode.c to truncate.h as inline
functions, so that we can avoid having duplicate copies of the
function (which can be a maintenance problem) without having to expose
them as globally functions.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In preparation for moving the indirect functions to a separate file,
move __ext4_check_blockref() to block_validity.c and rename it to
ext4_check_blockref() which is exported as globally visible function.
Also, rename the cpp macro ext4_check_inode_blockref() to
ext4_ind_check_inode(), to make it clear that it is only valid for use
with non-extent mapped inodes.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In journal checkpoint, we write the buffer and wait for its finish.
But in cfq, the async queue has a very low priority, and in our test,
if there are too many sync queues and every queue is filled up with
requests, and the process will hang waiting for the log space.
So this patch tries to use WRITE_SYNC in __flush_batch so that the request will
be moved into sync queue and handled by cfq timely. We also use the new plug,
sot that all the WRITE_SYNC requests can be given as a whole when we unplug it.
Reported-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We are going to move all ext4_ind_* functions to indirect.c.
Before we do that, let's rename 2 functions called ext4_indirect_*
to ext4_ind_*, to keep to the naming convention.
Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We are about to move all indirect inode functions to a new file.
Before we do that, let's split ext4_ind_truncate() out of ext4_truncate()
leaving only generic code in the latter, so we will be able to move
ext4_ind_truncate() to the new file.
Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
btrfs: fix inconsonant inode information
Btrfs: make sure to update total_bitmaps when freeing cache V3
Btrfs: fix type mismatch in find_free_extent()
Btrfs: make sure to record the transid in new inodes
In function ext4_ext_insert_index when eh_entries of curp is
bigger than eh_max, error messages will be printed out, but the content
is about logical and ei_block, that's incorret.
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Change de_thread() to set old_leader->exit_signal = -1. This is
good for the consistency, it is no longer the leader and all
sub-threads have exit_signal = -1 set by copy_process(CLONE_THREAD).
And this allows us to micro-optimize thread_group_leader(), it can
simply check exit_signal >= 0. This also makes sense because we
should move ->group_leader from task_struct to signal_struct.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
In journal checkpoint, we write the buffer and wait for its finish.
But in cfq, the async queue has a very low priority, and in our test,
if there are too many sync queues and every queue is filled up with
requests, the write request will be delayed for quite a long time and
all the tasks which are waiting for journal space will end with errors like:
INFO: task attr_set:3816 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
attr_set D ffff880028393480 0 3816 1 0x00000000
ffff8802073fbae8 0000000000000086 ffff8802140847c8 ffff8800283934e8
ffff8802073fb9d8 ffffffff8103e456 ffff8802140847b8 ffff8801ed728080
ffff8801db4bc080 ffff8801ed728450 ffff880028393480 0000000000000002
Call Trace:
[<ffffffff8103e456>] ? __dequeue_entity+0x33/0x38
[<ffffffff8103caad>] ? need_resched+0x23/0x2d
[<ffffffff814006a6>] ? thread_return+0xa2/0xbc
[<ffffffffa01f6224>] ? jbd2_journal_dirty_metadata+0x116/0x126 [jbd2]
[<ffffffffa01f6224>] ? jbd2_journal_dirty_metadata+0x116/0x126 [jbd2]
[<ffffffff81400d31>] __mutex_lock_common+0x14e/0x1a9
[<ffffffffa021dbfb>] ? brelse+0x13/0x15 [ext4]
[<ffffffff81400ddb>] __mutex_lock_slowpath+0x19/0x1b
[<ffffffff81400b2d>] mutex_lock+0x1b/0x32
[<ffffffffa01f927b>] __jbd2_journal_insert_checkpoint+0xe3/0x20c [jbd2]
[<ffffffffa01f547b>] start_this_handle+0x438/0x527 [jbd2]
[<ffffffff8106f491>] ? autoremove_wake_function+0x0/0x3e
[<ffffffffa01f560b>] jbd2_journal_start+0xa1/0xcc [jbd2]
[<ffffffffa02353be>] ext4_journal_start_sb+0x57/0x81 [ext4]
[<ffffffffa024a314>] ext4_xattr_set+0x6c/0xe3 [ext4]
[<ffffffffa024aaff>] ext4_xattr_user_set+0x42/0x4b [ext4]
[<ffffffff81145adb>] generic_setxattr+0x6b/0x76
[<ffffffff81146ac0>] __vfs_setxattr_noperm+0x47/0xc0
[<ffffffff81146bb8>] vfs_setxattr+0x7f/0x9a
[<ffffffff81146c88>] setxattr+0xb5/0xe8
[<ffffffff81137467>] ? do_filp_open+0x571/0xa6e
[<ffffffff81146d26>] sys_fsetxattr+0x6b/0x91
[<ffffffff81002d32>] system_call_fastpath+0x16/0x1b
So this patch tries to use WRITE_SYNC in __flush_batch so that the request will
be moved into sync queue and handled by cfq timely. We also use the new plug,
sot that all the WRITE_SYNC requests can be given as a whole when we unplug it.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Reported-by: Robin Dong <sanbai@taobao.com>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: prevent bogus assert when trying to remove non-existent attribute
xfs: clear XFS_IDIRTY_RELEASE on truncate down
xfs: reset inode per-lifetime state when recycling it
When iputting the inode, We may leave the delayed nodes if they have some
delayed items that have not been dealt with. So when the inode is read again,
we must look up the relative delayed node, and use the information in it to
initialize the inode. Or we will get inconsonant inode information, it may
cause that the same directory index number is allocated again, and hit the
following oops:
[ 5447.554187] err add delayed dir index item(name: pglog_0.965_0) into the
insertion tree of the delayed node(root id: 262, inode id: 258, errno: -17)
[ 5447.569766] ------------[ cut here ]------------
[ 5447.575361] kernel BUG at fs/btrfs/delayed-inode.c:1301!
[SNIP]
[ 5447.790721] Call Trace:
[ 5447.793191] [<ffffffffa0641c4e>] btrfs_insert_dir_item+0x189/0x1bb [btrfs]
[ 5447.800156] [<ffffffffa0651a45>] btrfs_add_link+0x12b/0x191 [btrfs]
[ 5447.806517] [<ffffffffa0651adc>] btrfs_add_nondir+0x31/0x58 [btrfs]
[ 5447.812876] [<ffffffffa0651d6a>] btrfs_create+0xf9/0x197 [btrfs]
[ 5447.818961] [<ffffffff8111f840>] vfs_create+0x72/0x92
[ 5447.824090] [<ffffffff8111fa8c>] do_last+0x22c/0x40b
[ 5447.829133] [<ffffffff8112076a>] path_openat+0xc0/0x2ef
[ 5447.834438] [<ffffffff810c58e2>] ? __perf_event_task_sched_out+0x24/0x44
[ 5447.841216] [<ffffffff8103ecdd>] ? perf_event_task_sched_out+0x59/0x67
[ 5447.847846] [<ffffffff81121a79>] do_filp_open+0x3d/0x87
[ 5447.853156] [<ffffffff811e126c>] ? strncpy_from_user+0x43/0x4d
[ 5447.859072] [<ffffffff8111f1f5>] ? getname_flags+0x2e/0x80
[ 5447.864636] [<ffffffff8111f179>] ? do_getname+0x14b/0x173
[ 5447.870112] [<ffffffff8111f1b7>] ? audit_getname+0x16/0x26
[ 5447.875682] [<ffffffff8112b1ab>] ? spin_lock+0xe/0x10
[ 5447.880882] [<ffffffff81112d39>] do_sys_open+0x69/0xae
[ 5447.886153] [<ffffffff81112db1>] sys_open+0x20/0x22
[ 5447.891114] [<ffffffff813b9aab>] system_call_fastpath+0x16/0x1b
Fix it by reusing the old delayed node.
Reported-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The function ecryptfs_keyring_auth_tok_for_sig() has been modified in order
to search keys of both 'user' and 'encrypted' types.
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Acked-by: Gianluca Ramunno <ramunno@polito.it>
Acked-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Some eCryptfs specific definitions, such as the current version and the
authentication token structure, are moved to the new include file
'include/linux/ecryptfs.h', in order to be available for all kernel
subsystems.
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Acked-by: Gianluca Ramunno <ramunno@polito.it>
Acked-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
journal_remove_journal_head() can oops when trying to access journal_head
returned by bh2jh(). This is caused for example by the following race:
TASK1 TASK2
journal_commit_transaction()
...
processing t_forget list
__journal_refile_buffer(jh);
if (!jh->b_transaction) {
jbd_unlock_bh_state(bh);
journal_try_to_free_buffers()
journal_grab_journal_head(bh)
jbd_lock_bh_state(bh)
__journal_try_to_free_buffer()
journal_put_journal_head(jh)
journal_remove_journal_head(bh);
journal_put_journal_head() in TASK2 sees that b_jcount == 0 and buffer is not
part of any transaction and thus frees journal_head before TASK1 gets to doing
so. Note that even buffer_head can be released by try_to_free_buffers() after
journal_put_journal_head() which adds even larger opportunity for oops (but I
didn't see this happen in reality).
Fix the problem by making transactions hold their own journal_head reference
(in b_jcount). That way we don't have to remove journal_head explicitely via
journal_remove_journal_head() and instead just remove journal_head when
b_jcount drops to zero. The result of this is that [__]journal_refile_buffer(),
[__]journal_unfile_buffer(), and __journal_remove_checkpoint() can free
journal_head which needs modification of a few callers. Also we have to be
careful because once journal_head is removed, buffer_head might be freed as
well. So we have to get our own buffer_head reference where it matters.
Signed-off-by: Jan Kara <jack@suse.cz>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
cifs: propagate errors from cifs_get_root() to mount(2)
cifs: tidy cifs_do_mount() up a bit
cifs: more breakage on mount failures
cifs: close sget() races
cifs: pull freeing mountdata/dropping nls/freeing cifs_sb into cifs_umount()
cifs: move cifs_umount() call into ->kill_sb()
cifs: pull cifs_mount() call up
sanitize cifs_umount() prototype
cifs: initialize ->tlink_tree in cifs_setup_cifs_sb()
cifs: allocate mountdata earlier
cifs: leak on mount if we share superblock
cifs: don't pass superblock to cifs_mount()
cifs: don't leak nls on mount failure
cifs: double free on mount failure
take bdi setup/destruction into cifs_mount/cifs_umount
Acked-by: Steve French <smfrench@gmail.com>
We should return -EINVAL when the FITRIM parameters are not sane, but
currently we are exiting silently if start is beyond the end of the
file system. This commit fixes this so we return -EINVAL as other file
systems do.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
CC: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
The 'from' argument for copy_from_user and the 'to' argument for
copy_to_user should both be tagged as __user address space.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
New truncate calling convention allows us to handle errors from
ext3_block_truncate_page(). So reorganize the code so that
ext3_block_truncate_page() is called before we change inode size.
This also removes unnecessary block zeroing from error recovery after failed
buffered writes (zeroing isn't needed because we could have never written
non-zero data to disk). We have to be careful and keep zeroing in direct IO
write error recovery because there we might have already overwritten end of the
last file block.
Signed-off-by: Jan Kara <jack@suse.cz>
Block allocation is called from two places: ext3_get_blocks_handle() and
ext3_xattr_block_set(). These two callers are not necessarily synchronized
because xattr code holds only xattr_sem and i_mutex, and
ext3_get_blocks_handle() may hold only truncate_mutex when called from
writepage() path. Block reservation code does not expect two concurrent
allocations to happen to the same inode and thus assertions can be triggered
or reservation structure corruption can occur.
Fix the problem by taking truncate_mutex in xattr code to serialize
allocations.
CC: Sage Weil <sage@newdream.net>
CC: stable@kernel.org
Reported-by: Fyodor Ustinov <ufm@ufm.su>
Signed-off-by: Jan Kara <jack@suse.cz>
journal_get_create_access should drop jh->b_jcount in error handling path
Signed-off-by: Ding Dinghua <dingdinghua@nrchpc.ac.cn>
Signed-off-by: Jan Kara <jack@suse.cz>
The callers of start_this_handle() (or better ext3_journal_start()) are not
really prepared to handle allocation failures. Such failures can for example
result in silent data loss when it happens in ext3_..._writepage(). OTOH
__GFP_NOFAIL is going away so we just retry allocation in start_this_handle().
This loop is potentially dangerous because the oom killer cannot be invoked
for GFP_NOFS allocation, so there is a potential for infinitely looping.
But still this is better than silent data loss.
Signed-off-by: Jan Kara <jack@suse.cz>
Mostly trivial conversion. We fix a bug that IS_IMMUTABLE and IS_APPEND files
could not be truncated during failed writes as we change the code. In fact the
test is not needed at all because both IS_IMMUTABLE and IS_APPEND is tested in
upper layers in do_sys_[f]truncate(), may_write(), etc.
Signed-off-by: Jan Kara <jack@suse.cz>
This commit adds fixed tracepoint for jbd. It has been based on fixed
tracepoints for jbd2, however there are missing those for collecting
statistics, since I think that it will require more intrusive patch so I
should have its own commit, if someone decide that it is needed. Also
there are new tracepoints in __journal_drop_transaction() and
journal_update_superblock().
The list of jbd tracepoints:
jbd_checkpoint
jbd_start_commit
jbd_commit_locking
jbd_commit_flushing
jbd_commit_logging
jbd_drop_transaction
jbd_end_commit
jbd_do_submit_data
jbd_cleanup_journal_tail
jbd_update_superblock_end
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
This commit adds fixed tracepoints to the ext3 code. It is based on ext4
tracepoints, however due to the differences of both file systems, there
are some tracepoints missing (those for delaloc and for multi-block
allocator) and there are some ext3 specific as well (for reservation
windows).
Here is a list:
ext3_free_inode
ext3_request_inode
ext3_allocate_inode
ext3_evict_inode
ext3_drop_inode
ext3_mark_inode_dirty
ext3_write_begin
ext3_ordered_write_end
ext3_writeback_write_end
ext3_journalled_write_end
ext3_ordered_writepage
ext3_writeback_writepage
ext3_journalled_writepage
ext3_readpage
ext3_releasepage
ext3_invalidatepage
ext3_discard_blocks
ext3_request_blocks
ext3_allocate_blocks
ext3_free_blocks
ext3_sync_file_enter
ext3_sync_file_exit
ext3_sync_fs
ext3_rsv_window_add
ext3_discard_reservation
ext3_alloc_new_reservation
ext3_reserved
ext3_forget
ext3_read_block_bitmap
ext3_direct_IO_enter
ext3_direct_IO_exit
ext3_unlink_enter
ext3_unlink_exit
ext3_truncate_enter
ext3_truncate_exit
ext3_get_blocks_enter
ext3_get_blocks_exit
ext3_load_inode
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
A user reported this bug again where we have more bitmaps than we are supposed
to. This is because we failed to load the free space cache, but don't update
the ctl->total_bitmaps counter when we remove entries from the tree. This patch
fixes this problem and we should be good to go again. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
data parameter should be u64 because a full-sized chunk flags field is
passed instead of 0/1 for distinguishing data from metadata. All
underlying functions expect u64.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
... instead of just failing with -EINVAL
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
if cifs_get_root() fails, we end up with ->mount() returning NULL,
which is not what callers expect. Moreover, in case of superblock
reuse we end up leaking a superblock reference...
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
have ->s_fs_info set by the set() callback passed to sget()
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
all callers of cifs_umount() proceed to do the same thing; pull it into
cifs_umount() itself.
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
instead of calling it manually in case if cifs_read_super() fails
to set ->s_root, just call it from ->kill_sb(). cifs_put_super()
is gone now *and* we have cifs_sb shutdown and destruction done
after the superblock is gone from ->s_instances.
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... to the point prior to sget(). Now we have cifs_sb set up early
enough.
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
a) superblock argument is unused
b) it always returns 0
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
no need to wait until cifs_read_super() and we need it done
by the time cifs_mount() will be called.
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pull mountdata allocation up, so that it won't stand in the way when
we lift cifs_mount() to location before sget().
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
cifs_sb and nls end up leaked...
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
To close sget() races we'll need to be able to set cifs_sb up before
we get the superblock, so we'll want to be able to do cifs_mount()
earlier. Fortunately, it's easy to do - setting ->s_maxbytes can
be done in cifs_read_super(), ditto for ->s_time_gran and as for
putting MS_POSIXACL into ->s_flags, we can mirror it in ->mnt_cifs_flags
until cifs_read_super() is called. Kill unused 'devname' argument,
while we are at it...
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
if cifs_sb allocation fails, we still need to drop nls we'd stashed
into volume_info - the one we would've copied to cifs_sb if we could
allocate the latter.
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
if we get to out_super with ->s_root already set (e.g. with
cifs_get_root() failure), we'll end up with cifs_put_super()
called and ->mountdata freed twice. We'll also get cifs_sb
freed twice and cifs_sb->local_nls dropped twice. The problem
is, we can get to out_super both with and without ->s_root,
which makes ->put_super() a bad place for such work.
Switch to ->kill_sb(), have all that work done there after
kill_anon_super(). Unlike ->put_super(), ->kill_sb() is
called by deactivate_locked_super() whether we have ->s_root
or not.
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This does not work properly with CIFS as current servers do not
enable support for the FILE_OPEN_BY_FILE_ID on SMB NTCreateX
and not all NFS clients handle ESTALE.
For now, it just plain doesn't work. Mark it BROKEN to discourage
distros from enabling it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When we create a new inode, we aren't filling in the
field that records the transaction that last changed this
inode.
If we then go to fsync that inode, it will be skipped because the field
isn't filled in.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This is currently leaked in the rc == 0 case.
Reported-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://git.kernel.dk/linux-block:
block: add REQ_SECURE to REQ_COMMON_MASK
block: use the passed in @bdev when claiming if partno is zero
block: Add __attribute__((format(printf...) and fix fallout
block: make disk_block_events() properly wait for work cancellation
block: remove non-syncing __disk_block_events() and fold it into disk_block_events()
block: don't use non-syncing event blocking in disk_check_events()
cfq-iosched: fix locking around ioc->ioc_data assignment
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: fix wsize negotiation to respect max buffer size and active signing (try #4)
CIFS: Fix problem with 3.0-rc1 null user mount failure
It was pointed out by 'make versioncheck' that some includes of
linux/version.h were not needed in fs/ (fs/btrfs/ctree.h and
fs/omfs/file.c).
This patch removes them.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the attribute fork on an inode is in btree format and has
multiple levels (i.e node format rather than leaf format), then a
lookup failure will trigger an assert failure in xfs_da_path_shift
if the flag XFS_DA_OP_OKNOENT is not set. This flag is used to
indicate to the directory btree code that not finding an entry is
not a fatal error. In the case of doing a lookup for a directory
name removal, this is valid as a user cannot insert an arbitrary
name to remove from the directory btree.
However, in the case of the attribute tree, a user has direct
control over the attribute name and can ask for any random name to
be removed without any validation. In this case, fsstress is asking
for a non-existent user.selinux attribute to be removed, and that is
causing xfs_da_path_shift() to fall off the bottom of the tree where
it asserts that a lookup failure is allowed. Because the flag is not
set, we die a horrible death on a debug enable kernel.
Prevent this assert from firing on attribute removes by adding the
op_flag XFS_DA_OP_OKNOENT to atribute removal operations.
Discovered when testing on a SELinux enabled system by fsstress in
test 070 by trying to remove a non-existent user.selinux attribute.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
When an inode is truncated down, speculative preallocation is
removed from the inode. This should also reset the state bits for
controlling whether preallocation is subsequently removed when the
file is next closed. The flag is not being cleared, so repeated
operations on a file that first involve a truncate (e.g. multiple
repeated dd invocations on a file) give different file layouts for
the second and subsequent invocations.
Fix this by clearing the XFS_IDIRTY_RELEASE state bit when the
XFS_ITRUNCATED bit is detected in xfs_release() and hence ensure
that speculative delalloc is removed on files that have been
truncated down.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
XFS inodes has several per-lifetime state fields that determine the
behaviour of the inode. These state fields are not all reset when an
inode is reused from the reclaimable state.
This can lead to unexpected behaviour of the new inode such as
speculative preallocation not being truncated away in the expected
manner for local files until the inode is subsequently truncated,
freed or cycles out of the cache. It can also lead to an inode being
considered to be a filestream inode or having been truncated when
that is not the case.
Rework the reinitialisation of the inode when it is recycled to
ensure that it is pristine before it is reused. While there, also
fix the resetting of state flags in the recycling error paths so the
inode does not become unreclaimable.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Hopefully last version. Base signing check on CAP_UNIX instead of
tcon->unix_ext, also clean up the comments a bit more.
According to Hongwei Sun's blog posting here:
http://blogs.msdn.com/b/openspecification/archive/2009/04/10/smb-maximum-transmit-buffer-size-and-performance-tuning.aspx
CAP_LARGE_WRITEX is ignored when signing is active. Also, the maximum
size for a write without CAP_LARGE_WRITEX should be the maxBuf that
the server sent in the NEGOTIATE request.
Fix the wsize negotiation to take this into account. While we're at it,
alter the other wsize definitions to use sizeof(WRITE_REQ) to allow for
slightly larger amounts of data to potentially be written per request.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6:
jfs: agstart field must be 64 bits
JFS: Don't save agno in the inode
jfs: Update agstart when resizing volume
jfs: old_agsize should be 64 bits in jfs_extendfs
Figured it out: it was broken by b946845a9d commit - "cifs: cifs_parse_mount_options: do not tokenize mount options in-place". So, as a quick fix I suggest to apply this patch.
[PATCH] CIFS: Fix kfree() with constant string in a null user case
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
tracehook.h is on the way out. Rename tracehook_tracer_task() to
ptrace_parent() and move it from tracehook.h to ptrace.h.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
At this point, tracehooks aren't useful to mainline kernel and mostly
just add an extra layer of obfuscation. Although they have comments,
without actual in-kernel users, it is difficult to tell what are their
assumptions and they're actually trying to achieve. To mainline
kernel, they just aren't worth keeping around.
This patch kills the following clone and exec related tracehooks.
tracehook_prepare_clone()
tracehook_finish_clone()
tracehook_report_clone()
tracehook_report_clone_complete()
tracehook_unsafe_exec()
The changes are mostly trivial - logic is moved to the caller and
comments are merged and adjusted appropriately.
The only exception is in check_unsafe_exec() where LSM_UNSAFE_PTRACE*
are OR'd to bprm->unsafe instead of setting it, which produces the
same result as the field is always zero on entry. It also tests
p->ptrace instead of (p->ptrace & PT_PTRACED) for consistency, which
also gives the same result.
This doesn't introduce any behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
At this point, tracehooks aren't useful to mainline kernel and mostly
just add an extra layer of obfuscation. Although they have comments,
without actual in-kernel users, it is difficult to tell what are their
assumptions and they're actually trying to achieve. To mainline
kernel, they just aren't worth keeping around.
This patch kills the following trivial tracehooks.
* Ones testing whether task is ptraced. Replace with ->ptrace test.
tracehook_expect_breakpoints()
tracehook_consider_ignored_signal()
tracehook_consider_fatal_signal()
* ptrace_event() wrappers. Call directly.
tracehook_report_exec()
tracehook_report_exit()
tracehook_report_vfork_done()
* ptrace_release_task() wrapper. Call directly.
tracehook_finish_release_task()
* noop
tracehook_prepare_release_task()
tracehook_report_death()
This doesn't introduce any behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Fix decode_secinfo_maxsz
NFSv4.1: Fix an off-by-one error in pnfs_generic_pg_test
NFSv4.1: Fix some issues with pnfs_generic_pg_test
NFSv4.1: file layout must consider pg_bsize for coalescing
pnfs-obj: No longer needed to take an extra ref at add_device
SUNRPC: Ensure the RPC client only quits on fatal signals
NFSv4: Fix a readdir regression
nfs4.1: mark layout as bad on error path in _pnfs_return_layout
nfs4.1: prevent race that allowed use of freed layout in _pnfs_return_layout
NFSv4.1: need to put_layout_hdr on _pnfs_return_layout error path
NFS: (d)printks should use %zd for ssize_t arguments
NFSv4.1: fix break condition in pnfs_find_lseg
nfs4.1: fix several problems with _pnfs_return_layout
NFSv4.1: allow zero fh array in filelayout decode layout
NFSv4.1: allow nfs_fhget to succeed with mounted on fileid
NFSv4.1: Fix a refcounting issue in the pNFS device id cache
NFSv4.1: deprecate headerpadsz in CREATE_SESSION
NFS41: do not update isize if inode needs layoutcommit
NLM: Don't hang forever on NLM unlock requests
NFS: fix umount of pnfs filesystems
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
jbd2: Fix oops in jbd2_journal_remove_journal_head()
jbd2: Remove obsolete parameters in the comments for some jbd2 functions
ext4: fixed tracepoints cleanup
ext4: use FIEMAP_EXTENT_LAST flag for last extent in fiemap
ext4: Fix max file size and logical block counting of extent format file
ext4: correct comments for ext4_free_blocks()
I initially did the calculation in bytes, and not words
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
1. If the intention is to coalesce requests 'prev' and 'req' then we
have to ensure at least that we have a layout starting at
req_offset(prev).
2. If we're only requesting a minimal layout of length desc->pg_count,
we need to test the length actually returned by the server before
we allow the coalescing to occur.
3. We need to deal correctly with (pgio->lseg == NULL)
4. Fixup the test guarding the pnfs_update_layout.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-2.6.40' of git://linux-nfs.org/~bfields/linux:
nfsd4: fix break_lease flags on nfsd open
nfsd: link returns nfserr_delay when breaking lease
nfsd: v4 support requires CRYPTO
nfsd: fix dependency of nfsd on auth_rpcgss
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
devcgroup_inode_permission: take "is it a device node" checks to inlined wrapper
fix comment in generic_permission()
kill obsolete comment for follow_down()
proc_sys_permission() is OK in RCU mode
reiserfs_permission() doesn't need to bail out in RCU mode
proc_fd_permission() is doesn't need to bail out in RCU mode
nilfs2_permission() doesn't need to bail out in RCU mode
logfs doesn't need ->permission() at all
coda_ioctl_permission() is safe in RCU mode
cifs_permission() doesn't need to bail out in RCU mode
bad_inode_permission() is safe from RCU mode
ubifs: dereferencing an ERR_PTR in ubifs_mount()
The previous patch added the agstart field to jfs_ip, but declared
it a long. We need to make sure its 64 bits on every platform.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Otherwise we end up overflowing the rpc buffer size on the receive end.
Signed-off-by: Benny Halevy <benny@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: avoid delayed metadata items during commits
btrfs: fix uninitialized return value
btrfs: fix wrong reservation when doing delayed inode operations
btrfs: Remove unused sysfs code
btrfs: fix dereference of ERR_PTR value
Btrfs: fix relocation races
Btrfs: set no_trans_join after trying to expand the transaction
Btrfs: protect the pending_snapshots list with trans_lock
Btrfs: fix path leakage on subvol deletion
Btrfs: drop the delalloc_bytes check in shrink_delalloc
Btrfs: check the return value from set_anon_super
Resizing the file system can result in an in-memory inode being remapped
to a different aggregate group (AG). A cached AG number can cause
problems when trying to free or allocate inodes. Instead, save the IAG's
agstart address and calculate the agno when we need it.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
A comment indicates that the IAG's agstart does not need to be updated
since it will always point to a block in the same aggregate group, but
jfs_fsck isn't so forgiving and reports it as an error.
I'm fixing this in jfsutils as well, so either a new kernel or new
utilities will be sufficient to fix the problem.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
nothing blocking there, since all instances of sysctl
->permissions() method are non-blocking - both of them,
that is.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and never did, what with its ->permission() being what we do by default
when ->permission is NULL...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
return -EIO; is *not* a blocking operation, thank you very much.
Nick, what the hell have you been smoking?
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
d251ed271d "ubifs: fix sget races" left out the goto from this
error path so the static checkers complain that we're dereferencing
"sb" when it's an ERR_PTR.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Thanks to Casey Bodley for pointing out that on a read open we pass 0,
instead of O_RDONLY, to break_lease, with the result that a read open is
treated like a write open for the purposes of lease breaking!
Reported-by: Casey Bodley <cbodley@citi.umich.edu>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Many stupid corrections of duplicated includes based on the output of
scripts/checkincludes.pl.
Signed-off-by: Vitaliy Ivanov <vitalivanov@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Andy's last device_cache patches, already take an extra
reference on the newly inserted device_id. So we can remove it
from obj-io.
Without this patch the device_ids are leaked.
Andy's patches are not in Linus tree yet. So I'm not sure if they are
scheduled for this Kernel or the next. This patch should be added as
part of these.
CC: Andy Adamson <andros@netapp.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tools/perf: Fix static build of perf tool
tracing: Fix regression in printk_formats file
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
generic-ipi: Fix kexec boot crash by initializing call_single_queue before enabling interrupts
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
clocksource: Make watchdog robust vs. interruption
timerfd: Fix wakeup of processes when timer is cancelled on clock change
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, MAINTAINERS: Add x86 MCE people
x86, efi: Do not reserve boot services regions within reserved areas
In isofs_fill_super(), when an iso_primary_descriptor is found, it is
kept in pri_bh. The error cases don't properly release it. Fix it.
Reported-and-tested-by: 김원석 <stanley.will.kim@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Snapshot creation has two phases. One is the initial snapshot setup,
and the second is done during commit, while nobody is allowed to modify
the root we are snapshotting.
The delayed metadata insertion code can break that rule, it does a
delayed inode update on the inode of the parent of the snapshot,
and delayed directory item insertion.
This makes sure to run the pending delayed operations before we
record the snapshot root, which avoids corruptions.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When allocation fails in btrfs_read_fs_root_no_name, ret is not set
although it is returned, holding a garbage value.
Signed-off-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Removes code no longer used. The sysfs file itself is kept, because the
btrfs developers expressed interest in putting new entries to sysfs.
Signed-off-by: Maarten Lankhorst <m.b.lankhorst@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The recent commit to get rid of our trans_mutex introduced
some races with block group relocation. The problem is that relocation
needs to do some record keeping about each root, and it was relying
on the transaction mutex to coordinate things in subtle ways.
This fix adds a mutex just for the relocation code and makes sure
it doesn't have a big impact on normal operations. The race is
really fixed in btrfs_record_root_in_trans, which is where we
step back and wait for the relocation code to finish accounting
setup.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
____call_usermodehelper() now erases any credentials set by the
subprocess_inf::init() function. The problem is that commit
17f60a7da1 ("capabilites: allow the application of capability limits
to usermode helpers") creates and commits new credentials with
prepare_kernel_cred() after the call to the init() function. This wipes
all keyrings after umh_keys_init() is called.
The best way to deal with this is to put the init() call just prior to
the commit_creds() call, and pass the cred pointer to init(). That
means that umh_keys_init() and suchlike can modify the credentials
_before_ they are published and potentially in use by the rest of the
system.
This prevents request_key() from working as it is prevented from passing
the session keyring it set up with the authorisation token to
/sbin/request-key, and so the latter can't assume the authority to
instantiate the key. This causes the in-kernel DNS resolver to fail
with ENOKEY unconditionally.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Eric Paris <eparis@redhat.com>
Tested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 7ebb9315 (NFS: use secinfo when crossing mountpoints) introduces
a regression when decoding an NFSv4 readdir entry that sets the
rdattr_error field.
By treating the resulting value as if it is a decoding error, the current
code may cause us to skip valid readdir entries.
Reported-by: Andy Adamson <andros@netapp.com>
Cc: stable@kernel.org [2.6.39]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
AFS: Use i_generation not i_version for the vnode uniquifier
AFS: Set s_id in the superblock to the volume name
vfs: Fix data corruption after failed write in __block_write_begin()
afs: afs_fill_page reads too much, or wrong data
VFS: Fix vfsmount overput on simultaneous automount
fix wrong iput on d_inode introduced by e6bc45d65d
Delay struct net freeing while there's a sysfs instance refering to it
afs: fix sget() races, close leak on umount
ubifs: fix sget races
ubifs: split allocation of ubifs_info into a separate function
fix leak in proc_set_super()
There's no reason not to support cache flushing on external log devices.
The only thing this really requires is flushing the data device first
both in fsync and log commits. A side effect is that we also have to
remove the barrier write test during mount, which has been superflous
since the new FLUSH+FUA code anyway. Also use the chance to flush the
RT subvolume write cache before the fsync commit, which is required
for correct semantics.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Store the AFS vnode uniquifier in the i_generation field, not the i_version
field of the inode struct. i_version can then be given the AFS data version
number.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Set s_id in the superblock to the name of the AFS volume that this superblock
corresponds to.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I've got a report of a file corruption from fsxlinux on ext3. The important
operations to the page were:
mapwrite to a hole
partial write to the page
read - found the page zeroed from the end of the normal write
The culprit seems to be that if get_block() fails in __block_write_begin()
(e.g. transient ENOSPC in ext3), the function does ClearPageUptodate(page).
Thus when we retry the write, the logic in __block_write_begin() thinks zeroing
of the page is needed and overwrites old data. In fact, I don't see why we
should ever need to zero the uptodate bit here - either the page was uptodate
when we entered __block_write_begin() and it should stay so when we leave it,
or it was not uptodate and noone had right to set it uptodate during
__block_write_begin() so it remains !uptodate when we leave as well. So just
remove clearing of the bit.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
afs_fill_page should read the page that is about to be written but
the current implementation has a number of issues. If we aren't
extending the file we always read PAGE_CACHE_SIZE at offset 0. If we
are extending the file we try to read the entire file.
Change afs_fill_page to read PAGE_CACHE_SIZE at the right offset,
clamped to i_size.
While here, avoid calling afs_fill_page when we are doing a
PAGE_CACHE_SIZE write.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[Kudos to dhowells for tracking that crap down]
If two processes attempt to cause automounting on the same mountpoint at the
same time, the vfsmount holding the mountpoint will be left with one too few
references on it, causing a BUG when the kernel tries to clean up.
The problem is that lock_mount() drops the caller's reference to the
mountpoint's vfsmount in the case where it finds something already mounted on
the mountpoint as it transits to the mounted filesystem and replaces path->mnt
with the new mountpoint vfsmount.
During a pathwalk, however, we don't take a reference on the vfsmount if it is
the same as the one in the nameidata struct, but do_add_mount() doesn't know
this.
The fix is to make sure we have a ref on the vfsmount of the mountpoint before
calling do_add_mount(). However, if lock_mount() doesn't transit, we're then
left with an extra ref on the mountpoint vfsmount which needs releasing.
We can handle that in follow_managed() by not making assumptions about what
we can and what we cannot get from lookup_mnt() as the current code does.
The callers of follow_managed() expect that reference to path->mnt will be
grabbed iff path->mnt has been changed. follow_managed() and follow_automount()
keep track of whether such reference has been grabbed and assume that it'll
happen in those and only those cases that'll have us return with changed
path->mnt. That assumption is almost correct - it breaks in case of
racing automounts and in even harder to hit race between following a mountpoint
and a couple of mount --move. The thing is, we don't need to make that
assumption at all - after the end of loop in follow_manage() we can check
if path->mnt has ended up unchanged and do mntput() if needed.
The BUG can be reproduced with the following test program:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/wait.h>
int main(int argc, char **argv)
{
int pid, ws;
struct stat buf;
pid = fork();
stat(argv[1], &buf);
if (pid > 0) wait(&ws);
return 0;
}
and the following procedure:
(1) Mount an NFS volume that on the server has something else mounted on a
subdirectory. For instance, I can mount / from my server:
mount warthog:/ /mnt -t nfs4 -r
On the server /data has another filesystem mounted on it, so NFS will see
a change in FSID as it walks down the path, and will mark /mnt/data as
being a mountpoint. This will cause the automount code to be triggered.
!!! Do not look inside the mounted fs at this point !!!
(2) Run the above program on a file within the submount to generate two
simultaneous automount requests:
/tmp/forkstat /mnt/data/testfile
(3) Unmount the automounted submount:
umount /mnt/data
(4) Unmount the original mount:
umount /mnt
At this point the kernel should throw a BUG with something like the
following:
BUG: Dentry ffff880032e3c5c0{i=2,n=} still in use (1) [unmount of nfs4 0:12]
Note that the bug appears on the root dentry of the original mount, not the
mountpoint and not the submount because sys_umount() hasn't got to its final
mntput_no_expire() yet, but this isn't so obvious from the call trace:
[<ffffffff8117cd82>] shrink_dcache_for_umount+0x69/0x82
[<ffffffff8116160e>] generic_shutdown_super+0x37/0x15b
[<ffffffffa00fae56>] ? nfs_super_return_all_delegations+0x2e/0x1b1 [nfs]
[<ffffffff811617f3>] kill_anon_super+0x1d/0x7e
[<ffffffffa00d0be1>] nfs4_kill_super+0x60/0xb6 [nfs]
[<ffffffff81161c17>] deactivate_locked_super+0x34/0x83
[<ffffffff811629ff>] deactivate_super+0x6f/0x7b
[<ffffffff81186261>] mntput_no_expire+0x18d/0x199
[<ffffffff811862a8>] mntput+0x3b/0x44
[<ffffffff81186d87>] release_mounts+0xa2/0xbf
[<ffffffff811876af>] sys_umount+0x47a/0x4ba
[<ffffffff8109e1ca>] ? trace_hardirqs_on_caller+0x1fd/0x22f
[<ffffffff816ea86b>] system_call_fastpath+0x16/0x1b
as do_umount() is inlined. However, you can see release_mounts() in there.
Note also that it may be necessary to have multiple CPU cores to be able to
trigger this bug.
Tested-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Git bisection shows that commit e6bc45d65d causes
BUG_ONs under high I/O load:
kernel BUG at fs/inode.c:1368!
[ 2862.501007] Call Trace:
[ 2862.501007] [<ffffffff811691d8>] d_kill+0xf8/0x140
[ 2862.501007] [<ffffffff81169c19>] dput+0xc9/0x190
[ 2862.501007] [<ffffffff8115577f>] fput+0x15f/0x210
[ 2862.501007] [<ffffffff81152171>] filp_close+0x61/0x90
[ 2862.501007] [<ffffffff81152251>] sys_close+0xb1/0x110
[ 2862.501007] [<ffffffff814c14fb>] system_call_fastpath+0x16/0x1b
A reliable way to reproduce this bug is:
Login to KDE, run 'rsnapshot sync', and apt-get install openjdk-6-jdk,
and apt-get remove openjdk-6-jdk.
The buggy part of the patch is this:
struct inode *inode = NULL;
.....
- if (nd.last.name[nd.last.len])
- goto slashes;
inode = dentry->d_inode;
- if (inode)
- ihold(inode);
+ if (nd.last.name[nd.last.len] || !inode)
+ goto slashes;
+ ihold(inode)
...
if (inode)
iput(inode); /* truncate the inode here */
If nd.last.name[nd.last.len] is nonzero (and thus goto slashes branch is taken),
and dentry->d_inode is non-NULL, then this code now does an additional iput on
the inode, which is wrong.
Fix this by only setting the inode variable if nd.last.name[nd.last.len] is 0.
Reference: https://lkml.org/lkml/2011/6/15/50
Reported-by: Norbert Preining <preining@logic.at>
Reported-by: Török Edwin <edwintorok@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Török Edwin <edwintorok@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This reverts commit 7f81c8890c.
It turns out that it's not actually a build-time check on x86-64 UML,
which does some seriously crazy stuff with VM_STACK_FLAGS.
The VM_STACK_FLAGS define depends on the arch-supplied
VM_STACK_DEFAULT_FLAGS value, and on x86-64 UML we have
arch/um/sys-x86_64/shared/sysdep/vm-flags.h:
#define VM_STACK_DEFAULT_FLAGS \
(test_thread_flag(TIF_IA32) ? vm_stack_flags32 : vm_stack_flags)
#define VM_STACK_DEFAULT_FLAGS vm_stack_flags
(yes, seriously: two different #define's for that thing, with the first
one being inside an "#ifdef TIF_IA32")
It's possible that it is UML that should just be fixed in this area, but
for now let's just undo the (very small) optimization.
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit a8bef8ff6e ("mm: migration: avoid race between shift_arg_pages()
and rmap_walk() during migration by not migrating temporary stacks")
introduced a BUG_ON() to ensure that VM_STACK_FLAGS and
VM_STACK_INCOMPLETE_SETUP do not overlap. The check is a compile time
one, so BUILD_BUG_ON is more appropriate.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't call iput with the inode half setup to be a namespace filedescriptor.
Instead rearrange the code so that we don't initialize ei->ns_ops until
after I ns_ops->get succeeds, preventing us from invoking ns_ops->put
when ns_ops->get failed.
Reported-by: Ingo Saitz <Ingo.Saitz@stud.uni-hannover.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
We can lockup if we try to allow new writers join the transaction and we have
flushoncommit set or have a pending snapshot. This is because we set
no_trans_join and then loop around and try to wait for ordered extents again.
The problem is the ordered endio stuff needs to join the transaction, which it
can't do because no_trans_join is set. So instead wait until after this loop to
set no_trans_join and then make sure to wait for num_writers == 1 in case
anybody got started in between us exiting the loop and setting no_trans_join.
This could easily be reproduced by mounting -o flushoncommit and running xfstest
13. It cannot be reproduced with this patch. Thanks,
Reported-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently there is nothing protecting the pending_snapshots list on the
transaction. We only hold the directory mutex that we are snapshotting and a
read lock on the subvol_sem, so we could race with somebody else creating a
snapshot in a different directory and end up with list corruption. So protect
this list with the trans_lock. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The delayed ref patch accidently removed the btrfs_free_path in
btrfs_unlink_subvol, this puts it back and means we don't leak a path. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
mark_matching_lsegs_invalid could put the last ref to the layout, so
the get_layout_hdr needs to be called first.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We always get a reference on the layout header and we rely on
nfs4_layoutreturn_release to put it. If we hit an allocation error
before starting the rpc proc we bail out early without dereferncing
the layout header properly.
Signed-off-by: Benny Halevy <benny@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
(d)printks should use %zd for ssize_t arguments not %ld, otherwise they might
get a warning. I see the following with MN10300.
fs/nfs/objlayout/objlayout.c: In function 'objlayout_read_done':
fs/nfs/objlayout/objlayout.c:294: warning: format '%ld' expects type 'long int', but argument 3 has type 'ssize_t'
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Trond Myklebust <Trond.Myklebust@netapp.com>
cc: linux-nfs@vger.kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The break condition to skip out of the loop got broken when cmp_layout
was change. Essentially, we want to stop looking once we know no layout
on the remainder of the list can match the first byte of the looked-up
range.
Reported-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Benny Halevy <benny@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
_pnfs_return_layout had the following problems:
- it did not call pnfs_free_lseg_list on all paths
- it unintentionally did a forgetful return when there was no outstanding io
- it raced with concurrent LAYOUTGETS
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Commit 28331a46d8 "Ensure we request the
ordinary fileid when doing readdirplus"
changed the meaning of NFS_ATTR_FATTR_FILEID which used to be set when
FATTR4_WORD1_MOUNTED_ON_FILED was requested.
Allow nfs_fhget to succeed with only a mounted on fileid when crossing
a mountpoint or a referral.
Ask for the fileid of the absent file system if mounted_on_fileid is not
supported.
Signed-off-by: Andy Adamson <andros@netapp.com>
cc:stable@kernel.org [2.6.39]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When we add something to the global device id cache, we need to bump the
reference count, so that the cache itself holds a reference.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We don't support header padding yet so better off ditching it
Reported-by: Sid Moore <learnmost@gmail.com>
Signed-off-by: Benny Halevy <benny@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_update_inode will update isize if there is no queued pages. For pNFS,
layoutcommit is supposed to change file size on server, the same effect as queued
pages. nfs_update_inode may be called when dirty pages are written back (nfsi->npages==0)
but layoutcommit is not sent, and it will change client file size according to server
file size. Then client ends up losing what it just writes back in pNFS path.
So we should skip updating client file size if file needs layoutcommit.
Signed-off-by: Peng Tao <peng_tao@emc.com>
Cc: stable@kernel.org [2.6.39]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the NLM daemon is killed on the NFS server, we can currently end up
hanging forever on an 'unlock' request, instead of aborting. Basically,
if the rpcbind request fails, or the server keeps returning garbage, we
really want to quit instead of retrying.
Tested-by: Vasily Averin <vvs@sw.ru>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Unmounting a pnfs filesystem hangs using filelayout and possibly others.
This fixes the use of the rcu protected node by making use of a new 'tmpnode'
for the temporary purge list. Also, the spinlock shouldn't be held when calling
synchronize_rcu().
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
->mknod() should return negative on errors and PTR_ERR() gives
already negative value...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Currently processes waiting with poll on cancelable timerfd timers are
not woken up when the timers are canceled. When the system time is set
the clock_was_set() function calls timerfd_clock_was_set() to cancel
and wake up processes waiting on potential cancelable timerfd
timers. However the wake up currently has no effect because in the
case of timerfd_read it is dependent on ctx->ticks not being
0. timerfd_poll also requires ctx->ticks being non zero. As a
consequence processes waiting on cancelable timers only get woken up
when the timers expire. This patch fixes this by incrementing
ctx->ticks before calling wake_up.
Signed-off-by: Max Asbock <masbock@linux.vnet.ibm.com>
Cc: kay.sievers@vrfy.org
Cc: virtuoso@slind.org
Cc: johnstul <johnstul@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1307985512.4710.41.camel@w-amax.beaverton.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We were iterating across stripe boundaries properly, but not moving the
write buffer pointer forward. This caused us to rewrite the same data
after the break. Fix by adjusting the data pointer forward, and
recalculating the io and buffer alignment after the break.
Signed-off-by: Sage Weil <sage@newdream.net>
Long ago (in commit 00e485b0), I added some code to handle share-level
passwords in CIFSTCon. That code ignored the fact that it's legit to
pass in a NULL tcon pointer when connecting to the IPC$ share on the
server.
This wasn't really a problem until recently as we only called CIFSTCon
this way when the server returned -EREMOTE. With the introduction of
commit c1508ca2 however, it gets called this way on every mount, causing
an oops when share-level security is in effect.
Fix this by simply treating a NULL tcon pointer as if user-level
security were in effect. I'm not aware of any servers that protect the
IPC$ share with a specific password anyway. Also, add a comment to the
top of CIFSTCon to ensure that we don't make the same mistake again.
Cc: <stable@kernel.org>
Reported-by: Martijn Uffing <mp3project@sarijopen.student.utwente.nl>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
It's possible for the following set of events to happen:
cifsd calls cifs_reconnect which reconnects the socket. A userspace
process then calls cifs_negotiate_protocol to handle the NEGOTIATE and
gets a reply. But, while processing the reply, cifsd calls
cifs_reconnect again. Eventually the GlobalMid_Lock is dropped and the
reply from the earlier NEGOTIATE completes and the tcpStatus is set to
CifsGood. cifs_reconnect then goes through and closes the socket and sets the
pointer to zero, but because the status is now CifsGood, the new socket
is not created and cifs_reconnect exits with the socket pointer set to
NULL.
Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is
CifsNeedNegotiate, and by making sure that generic_ip_connect is always
called at least once in cifs_reconnect.
Note that this is not a perfect fix for this issue. It's still possible
that the NEGOTIATE reply is handled after the socket has been closed and
reconnected. In that case, the socket state will look correct but it no
NEGOTIATE was performed on it be for the wrong socket. In that situation
though the server should just shut down the socket on the next attempted
send, rather than causing the oops that occurs today.
Cc: <stable@kernel.org> # .38.x: fd88ce9: [CIFS] cifs: clarify the meaning of tcpStatus == CifsGood
Reported-and-Tested-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
cifs_sb_master_tlink was declared as inline, but without a definition.
Remove the declaration and move the definition up.
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
jbd2_journal_remove_journal_head() can oops when trying to access
journal_head returned by bh2jh(). This is caused for example by the
following race:
TASK1 TASK2
jbd2_journal_commit_transaction()
...
processing t_forget list
__jbd2_journal_refile_buffer(jh);
if (!jh->b_transaction) {
jbd_unlock_bh_state(bh);
jbd2_journal_try_to_free_buffers()
jbd2_journal_grab_journal_head(bh)
jbd_lock_bh_state(bh)
__journal_try_to_free_buffer()
jbd2_journal_put_journal_head(jh)
jbd2_journal_remove_journal_head(bh);
jbd2_journal_put_journal_head() in TASK2 sees that b_jcount == 0 and
buffer is not part of any transaction and thus frees journal_head
before TASK1 gets to doing so. Note that even buffer_head can be
released by try_to_free_buffers() after
jbd2_journal_put_journal_head() which adds even larger opportunity for
oops (but I didn't see this happen in reality).
Fix the problem by making transactions hold their own journal_head
reference (in b_jcount). That way we don't have to remove journal_head
explicitely via jbd2_journal_remove_journal_head() and instead just
remove journal_head when b_jcount drops to zero. The result of this is
that [__]jbd2_journal_refile_buffer(),
[__]jbd2_journal_unfile_buffer(), and
__jdb2_journal_remove_checkpoint() can free journal_head which needs
modification of a few callers. Also we have to be careful because once
journal_head is removed, buffer_head might be freed as well. So we
have to get our own buffer_head reference where it matters.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: unwind canceled flock state
ceph: fix ENOENT logic in striped_read
ceph: fix short sync reads from the OSD
ceph: fix sync vs canceled write
ceph: use ihold when we already have an inode ref
6b4517a791 (block: implement bd_claiming and claiming block)
introduced claiming block to support O_EXCL blkdev opens properly.
bd_start_claiming() looks up the part 0 bdev and starts claiming
block. The function assumed that there is only one part 0 bdev and
always used bdget_disk(disk, 0) to look it up; unfortunately, this
isn't true for some drivers (floppy) which use multiple block devices
to denote different operating parameters for the same physical device.
There can be multiple part 0 bdev's for the same device number.
This incorrect assumption caused the wrong bdev to be used during
claiming leading to unbalanced bd_holders as reported in the following
bug.
https://bugzilla.kernel.org/show_bug.cgi?id=28522
This patch updates bd_start_claiming() such that it uses the bdev
specified as argument if its partno is zero.
Note that this means that different bdev's can be used for the same
device and O_EXCL check can be effectively bypassed. It has always
been broken that way and floppy is fortunately on its way out. Leave
that breakage alone.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
Tested-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
Cc: stable@kernel.org # >= v2.6.36
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The symbols part_ro_show, part_alignment_offset_show, and
part_discard_alignment_show are not used outside this file and
should be marked static.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
credits isn't a parameter for jbd2_journal_get_write_access and
jbd2_journal_get_undo_access. So remove the corresponding comments.
Acked-by: Jan Kara <jack@suse.cz>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* new refcount in struct net, controlling actual freeing of the memory
* new method in kobj_ns_type_operations (->drop_ns())
* ->current_ns() semantics change - it's supposed to be followed by
corresponding ->drop_ns(). For struct net in case of CONFIG_NET_NS it bumps
the new refcount; net_drop_ns() decrements it and calls net_free() if the
last reference has been dropped. Method renamed to ->grab_current_ns().
* old net_free() callers call net_drop_ns() instead.
* sysfs_exit_ns() is gone, along with a large part of callchain
leading to it; now that the references stored in ->ns[...] stay valid we
do not need to hunt them down and replace them with NULL. That fixes
problems in sysfs_lookup() and sysfs_readdir(), along with getting rid
of sb->s_instances abuse.
Note that struct net *shutdown* logics has not changed - net_cleanup()
is called exactly when it used to be called. The only thing postponed by
having a sysfs instance refering to that struct net is actual freeing of
memory occupied by struct net.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* set ->s_fs_info in set() callback passed to sget()
* allocate the thing and set it up enough for afs_test_super() before
making it visible
* have it freed in ->kill_sb() (current tree simply leaks it)
* have ->put_super() leave ->s_fs_info->volume alone; it's too early for
dropping it; do that from ->kill_sb() after having called kill_anon_super().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* allocate ubifs_info in ->mount(), fill it enough for sb_test() and
set ->s_fs_info to it in set() callback passed to sget().
* do *not* free it in ->put_super(); do that in ->kill_sb() after we'd
done kill_anon_super().
* don't free it in ubifs_fill_super() either - deactivate_locked_super()
done by caller when ubifs_fill_super() returns an error will take care
of that sucker.
* get rid of kludge with passing ubi to ubifs_fill_super() in ->s_fs_info;
we only need it in alloc_ubifs_info(), so ubifs_fill_super() will need
only ubifs_info. Which it will find in ->s_fs_info just fine, no need to
reassign anything...
As the result, sb_test() becomes safe to apply to all superblocks that
can be found by sget() (and a kludge with temporary use of ->s_fs_info
to store a pointer to very different structure goes away).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: use join_transaction in btrfs_evict_inode()
Btrfs - use %pU to print fsid
Btrfs: fix extent state leak on failed nodatasum reads
btrfs: fix unlocked access of delalloc_inodes
Btrfs: avoid stack bloat in btrfs_ioctl_fs_info()
btrfs: remove 64bit alignment padding to allow extent_buffer to fit into one fewer cacheline
Btrfs: clear current->journal_info on async transaction commit
Btrfs: make sure to recheck for bitmaps in clusters
btrfs: remove unneeded includes from scrub.c
btrfs: reinitialize scrub workers
btrfs: scrub: errors in tree enumeration
Btrfs: don't map extent buffer if path->skip_locking is set
Btrfs: unlock the trans lock properly
Btrfs: don't map extent buffer if path->skip_locking is set
Btrfs: fix duplicate checking logic
Btrfs: fix the allocator loop logic
Btrfs: fix bitmap regression
Btrfs: don't commit the transaction if we dont have enough pinned bytes
Btrfs: noinline the cluster searching functions
Btrfs: cache bitmaps when searching for a cluster
The WARN_ON() in start_transaction() was triggered while balancing.
The cause is btrfs_relocate_chunk() started a transaction and
then called iput() on the inode that stores free space cache,
and iput() called btrfs_start_transaction() again.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Checkpoint generation interval of nilfs goes wrong after user has
changed the interval parameter with nilfs-tune tool.
segctord starting. Construction interval = 5 seconds,
CP frequency < 30 seconds
segctord starting. Construction interval = 0 seconds,
CP frequency < 30 seconds
This turned out to be caused by a trivial bug in initialization code
of log writer. This will fix it.
Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
nilfs_btree_delete function does not terminate part of virtual block
addresses when shrinking the last remaining child node into the root
node. The missing address termination causes that dead btree node
blocks persist and chip away free disk space.
This fixes the leak bug on the btree node deletion.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
nilfs_btree_delete function wrongly terminates virtual block address
of the btree node held by its parent at index 0. When concatenating
the index-0 node with its right sibling node, nilfs_btree_delete
terminates the block address of index-0 node instead of the right
sibling node which should be deleted.
This bug not only wears disk space in the long run, but also causes
file system corruption. This will fix it.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Get rid of FIXME comment. Uuids from dmesg are now the same as uuids
given by btrfs-progs.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When encountering an EIO while reading from a nodatasum extent, we
insert an error record into the inode's failure tree.
btrfs_readpage_end_io_hook returns early for nodatasum inodes. We'd
better clear the failure tree in that case, otherwise the kernel
complains about
BUG extent_state: Objects remaining on kmem_cache_close()
on rmmod.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
list_splice_init will make delalloc_inodes empty, but without a spinlock
around, this may produce corrupted list head, accessed in many placess,
The race window is very tight and nobody seems to have hit it so far.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The size of struct btrfs_ioctl_fs_info_args is as big as 1KB, so
don't declare the variable on stack.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reorder extent_buffer to remove 8 bytes of alignment padding on 64 bit
builds. This shrinks its size to 128 bytes allowing it to fit into one
fewer cache lines and allows more objects per slab in its kmem_cache.
slabinfo extent_buffer reports :-
before:-
Sizes (bytes) Slabs
----------------------------------
Object : 136 Total : 123
SlabObj: 136 Full : 121
SlabSiz: 4096 Partial: 0
Loss : 0 CpuSlab: 2
Align : 8 Objects: 30
after :-
Object : 128 Total : 4
SlabObj: 128 Full : 2
SlabSiz: 4096 Partial: 0
Loss : 0 CpuSlab: 2
Align : 8 Objects: 32
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Normally current->jouranl_info is cleared by commit_transaction. For an
async snap or subvol creation, though, it runs in a work queue. Clear
it in btrfs_commit_transaction_async() to avoid leaking a non-NULL
journal_info when we return to userspace. When the actual commit runs in
the other thread it won't care that it's current->journal_info is already
NULL.
Signed-off-by: Sage Weil <sage@newdream.net>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef recently changed the free extent cache to look in
the block group cluster for any bitmaps before trying to
add a new bitmap for the same offset. This avoids BUG_ON()s due
covering duplicate ranges.
But it didn't go quite far enough. A given free range might span
between one or more bitmaps or free space entries. The code has
looping to cover this, but it doesn't check for clustered bitmaps
every time.
This shuffles our gotos to check for a bitmap in the cluster
for every new bitmap entry we try to add.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Scrub starts the workers each time a scrub starts and stops them after it
finished. This patch adds an initialization for the workers before each
start, otherwise the workers behave strangely.
Signed-off-by: Arne Jansen <sensille@gmx.net>
due to the semantics of btrfs_search_slot the path can point to an
invalid slot when ret > 0. This condition went unnoticed, which in
turn could have led to an incomplete scrubbing.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Arne's scrub stuff exposed a problem with mapping the extent buffer in
reada_for_search. He searches the commit root with multiple threads and with
skip_locking set, so we can race and overwrite node->map_token since node isn't
locked. So fix this so that we only map the extent buffer if we don't already
have a map_token and skip_locking isn't set. Without this patch scrub would
panic almost immediately, with the patch it doesn't panic anymore. Thanks,
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
Unconditionally changing the address limit to USER_DS and not restoring
it to its old value in the error path is wrong because it prevents us
using kernel memory on repeated calls to this function. This, in fact,
breaks the fallback of hard coded paths to the init program from being
ever successful if the first candidate fails to load.
With this patch applied switching to USER_DS is delayed until the point
of no return is reached which makes it possible to have a multi-arch
rootfs with one arch specific init binary for each of the (hard coded)
probed paths.
Since the address limit is already set to USER_DS when start_thread()
will be invoked, this redundancy can be safely removed.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In btrfs_wait_for_commit if we came upon a transaction that had committed we
just exited, but that's bad since we are holding the trans_lock. So break
instead so that the lock is dropped. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Arne's scrub stuff exposed a problem with mapping the extent buffer in
reada_for_search. He searches the commit root with multiple threads and with
skip_locking set, so we can race and overwrite node->map_token since node isn't
locked. So fix this so that we only map the extent buffer if we don't already
have a map_token and skip_locking isn't set. Without this patch scrub would
panic almost immediately, with the patch it doesn't panic anymore. Thanks,
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: trivial: add space in fsc error message
cifs: silence printk when establishing first session on socket
CIFS ACL support needs CONFIG_KEYS, so depend on it
possible memory corruption in cifs_parse_mount_options()
cifs: make CIFS depend on CRYPTO_ECB
cifs: fix the kernel release version in the default security warning message
When merging my code into the integration test the second check for duplicate
entries got screwed up. This patch fixes it by dropping ret2 and just using ret
for the return value, and checking if we got an error before adding the bitmap
to the local list. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I was testing with empty_cluster = 0 to try and reproduce a problem and kept
hitting early enospc panics. This was because our loop logic was a little
confused. So this is what I did
1) Make the loop variable the ultimate decider on wether we should loop again
isntead of checking to see if we had an uncached bg, empty size or empty
cluster.
2) Increment loop before checking to see what we are on to make the loop
definitions make more sense.
3) If we are on the chunk alloc loop don't set empty_size/empty_cluster to 0
unless we didn't actually allocate a chunk. If we did allocate a chunk we
should be able to easily setup a new cluster so clearing
empty_size/empty_cluster makes us less efficient.
This kept me from hitting panics while trying to reproduce the other problem.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In cleaning up the clustering code I accidently introduced a regression by
adding bitmap entries to the cluster rb tree. The problem is if we've maxed out
the number of bitmaps we can have for the block group we can only add free space
to the bitmaps, but since the bitmap is on the cluster we can't find it and we
try to create another one. This would result in a panic because the total
bitmaps was bigger than the max bitmaps that were allowed. This patch fixes
this by checking to see if we have a cluster, and then looking at the cluster rb
tree to see if it has a bitmap entry and if it does and that space belongs to
that bitmap, go ahead and add it to that bitmap.
I could hit this panic every time with an fs_mark test within a couple of
minutes. With this patch I no longer hit the panic and fs_mark goes to
completion. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I noticed when running an enospc test that we would get stuck committing the
transaction in check_data_space even though we truly didn't have enough space.
So check to see if bytes_pinned is bigger than num_bytes, if it's not don't
commit the transaction. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When profiling the find cluster code it's hard to tell where we are spending our
time because the bitmap and non-bitmap functions get inlined by the compiler, so
make that not happen. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If we are looking for a cluster in a particularly sparse or fragmented block
group, we will do a lot of looping through the free space tree looking for
various things, and if we need to look at bitmaps we will endup doing the whole
dance twice. So instead add the bitmap entries to a temporary list so if we
have to do the bitmap search we can just look through the list of entries we've
found quickly instead of having to loop through the entire tree again. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If we request a lock and then abort (e.g., ^C), we need to send a matching
unlock request to the MDS to unwind our lock attempt to avoid indefinitely
blocking other clients.
Reported-by: Brian Chrisman <brchrisman@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Getting ENOENT is equivalent to reading 0 bytes. Make that correction
before setting up the hit_stripe and was_short flags.
Fixes the following case:
dd if=/dev/zero of=/mnt/fs_depot/dd3 bs=1 seek=1048576 count=0
dd if=/mnt/fs_depot/dd3 of=/root/ddout1 skip=8 bs=500 count=2 iflag=direct
Reported-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
If we get a short read from the OSD because the object is small, we need to
zero the remainder of the buffer. For O_DIRECT reads, the attempted range
is not trimmed to i_size by the VFS, so we were actually looping
indefinitely.
Fix by trimming by i_size, and the unconditionally zeroing the trailing
range.
Reported-by: Jeff Wu <cpwu@tnsoft.com.cn>
Signed-off-by: Sage Weil <sage@newdream.net>
We should use ihold whenever we already have a stable inode ref, even
when we aren't holding i_lock. This avoids adding new and unnecessary
locking dependencies.
Signed-off-by: Sage Weil <sage@newdream.net>
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
vfs: make unlink() and rmdir() return ENOENT in preference to EROFS
lmLogOpen() broken failure exit
usb: remove bad dput after dentry_unhash
more conservative S_NOSEC handling
Note that it adds a little overheads to account the moved/enqueued
inodes from b_dirty to b_io. The "moved" accounting may be later used to
limit the number of inodes that can be moved in one shot, in order to
keep spinlock hold time under control.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
It is valuable to know how the dirty inodes are iterated and their IO size.
"writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"
- "state" reflects inode->i_state at the end of writeback_single_inode()
- "index" reflects mapping->writeback_index after the ->writepages() call
- "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode()
- "wrote" is the number of pages actually written
v2: add trace event writeback_single_inode_requeue as proposed by Dave.
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Remove two unused struct writeback_control fields:
.encountered_congestion (completely unused)
.nonblocking (never set, checked/showed in XFS,NFS/btrfs)
The .for_background check in nfs_write_inode() is also removed btw,
as .for_background implies WB_SYNC_NONE.
Reviewed-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
When wbc.more_io was first introduced, it indicates whether there are
at least one superblock whose s_more_io contains more IO work. Now with
the per-bdi writeback, it can be replaced with a simple b_more_io test.
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
This removes writeback_control.wb_start and does more straightforward
sync livelock prevention by setting .older_than_this to prevent extra
inodes from being enqueued in the first place.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Code refactor for more logical code layout.
No behavior change.
- remove the mis-named __writeback_inodes_sb()
- wb_writeback()/writeback_inodes_wb() will decide when to queue_io()
before calling __writeback_inodes_wb()
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
as it's currently the most contended lock in the system for metadata
heavy workloads. It won't help for single-filesystem workloads for
which we'll need the I/O-less balance_dirty_pages, but at least we
can dedicate a cpu to spinning on each bdi now for larger systems.
Based on earlier patches from Nick Piggin and Dave Chinner.
It reduces lock contentions to 1/4 in this test case:
10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram
lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla 2.6.39-rc3:
inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23
------------------
inode_wb_list_lock 2 [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
inode_wb_list_lock 34 [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
inode_wb_list_lock 12893 [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
inode_wb_list_lock 10702 [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
------------------
inode_wb_list_lock 2 [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
inode_wb_list_lock 19 [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
inode_wb_list_lock 5550 [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
inode_wb_list_lock 8511 [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157
2.6.39-rc3 + patch:
&(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37
------------------------
&(&wb->list_lock)->rlock 10 [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
&(&wb->list_lock)->rlock 1493 [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
&(&wb->list_lock)->rlock 3652 [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
&(&wb->list_lock)->rlock 1412 [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
------------------------
&(&wb->list_lock)->rlock 3 [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
&(&wb->list_lock)->rlock 6 [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
&(&wb->list_lock)->rlock 2061 [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
&(&wb->list_lock)->rlock 2629 [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
There is no point to carry different refill policies between for_kupdate
and other type of works. Use a consistent "refill b_io iff empty" policy
which can guarantee fairness in an easy to understand way.
A b_io refill will setup a _fixed_ work set with all currently eligible
inodes and start a new round of walk through b_io. The "fixed" work set
means no new inodes will be added to the work set during the walk.
Only when a complete walk over b_io is done, new inodes that are
eligible at the time will be enqueued and the walk be started over.
This procedure provides fairness among the inodes because it guarantees
each inode to be synced once and only once at each round. So all inodes
will be free from starvations.
This change relies on wb_writeback() to keep retrying as long as we made
some progress on cleaning some pages and/or inodes. Without that ability,
the old logic on background works relies on aggressively queuing all
eligible inodes into b_io at every time. But that's not a guarantee.
The below test script completes a slightly faster now:
2.6.39-rc3 2.6.39-rc3-dyn-expire+
------------------------------------------------
all elapsed 256.043 252.367
stddev 24.381 12.530
tar elapsed 30.097 28.808
dd elapsed 13.214 11.782
#!/bin/zsh
cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/
umount /dev/sda7
mkfs.xfs -f /dev/sda7
mount /dev/sda7 /fs
echo 3 > /proc/sys/vm/drop_caches
tic=$(cat /proc/uptime|cut -d' ' -f2)
cd /fs
time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &
wait
sync
tac=$(cat /proc/uptime|cut -d' ' -f2)
echo elapsed: $((tac - tic))
It maintains roughly the same small vs. large file writeout shares, and
offers large files better chances to be written in nice 4M chunks.
Analyzes from Dave Chinner in great details:
Let's say we have lots of inodes with 100 dirty pages being created,
and one large writeback going on. We expire 8 new inodes for every
1024 pages we write back.
With the old code, we do:
b_more_io (large inode) -> b_io (1l)
8 newly expired inodes -> b_io (1l, 8s)
writeback large inode 1024 pages -> b_more_io
b_more_io (large inode) -> b_io (8s, 1l)
8 newly expired inodes -> b_io (8s, 1l, 8s)
writeback 8 small inodes 800 pages
1 large inode 224 pages -> b_more_io
b_more_io (large inode) -> b_io (8s, 1l)
8 newly expired inodes -> b_io (8s, 1l, 8s)
.....
Your new code:
b_more_io (large inode) -> b_io (1l)
8 newly expired inodes -> b_io (1l, 8s)
writeback large inode 1024 pages -> b_more_io
(b_io == 8s)
writeback 8 small inodes 800 pages
b_io empty: (1800 pages written)
b_more_io (large inode) -> b_io (1l)
14 newly expired inodes -> b_io (1l, 14s)
writeback large inode 1024 pages -> b_more_io
(b_io == 14s)
writeback 10 small inodes 1000 pages
1 small inode 24 pages -> b_more_io (1l, 1s(24))
writeback 5 small inodes 500 pages
b_io empty: (2548 pages written)
b_more_io (large inode) -> b_io (1l, 1s(24))
20 newly expired inodes -> b_io (1l, 1s(24), 20s)
......
Rough progression of pages written at b_io refill:
Old code:
total large file % of writeback
1024 224 21.9% (fixed)
New code:
total large file % of writeback
1800 1024 ~55%
2550 1024 ~40%
3050 1024 ~33%
3500 1024 ~29%
3950 1024 ~26%
4250 1024 ~24%
4500 1024 ~22.7%
4700 1024 ~21.7%
4800 1024 ~21.3%
4800 1024 ~21.3%
(pretty much steady state from here)
Ok, so the steady state is reached with a similar percentage of
writeback to the large file as the existing code. Ok, that's good,
but providing some evidence that is doesn't change the shared of
writeback to the large should be in the commit message ;)
The other advantage to this is that we always write 1024 page chunks
to the large file, rather than smaller "whatever remains" chunks.
CC: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Dynamically compute the dirty expire timestamp at queue_io() time.
writeback_control.older_than_this used to be determined at entrance to
the kupdate writeback work. This _static_ timestamp may go stale if the
kupdate work runs on and on. The flusher may then stuck with some old
busy inodes, never considering newly expired inodes thereafter.
This has two possible problems:
- It is unfair for a large dirty inode to delay (for a long time) the
writeback of small dirty inodes.
- As time goes by, the large and busy dirty inode may contain only
_freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
delaying the expired dirty pages to the end of LRU lists, triggering
the evil pageout(). Nevertheless this patch merely addresses part
of the problem.
v2: keep policy changes inside wb_writeback() and keep the
wbc.older_than_this visibility as suggested by Dave.
CC: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
they only populate possibly a subset of eligible inodes into b_io at
entrance time. When the queued set of inodes are all synced, they just
return, possibly with all queued inode pages written but still
wbc.nr_to_write > 0.
For kupdate and background writeback, there may be more eligible inodes
sitting in b_dirty when the current set of b_io inodes are completed. So
it is necessary to try another round of writeback as long as we made some
progress in this round. When there are no more eligible inodes, no more
inodes will be enqueued in queue_io(), hence nothing could/will be
synced and we may safely bail.
For example, imagine 100 inodes
i0, i1, i2, ..., i90, i91, i99
At queue_io() time, i90-i99 happen to be expired and moved to s_io for
IO. When finished successfully, if their total size is less than
MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
quit the background work (w/o this patch) while it's still over
background threshold. This will be a fairly normal/frequent case I guess.
Now that we do tagged sync and update inode->dirtied_when after the sync,
this change won't livelock sync(1). I actually tried to write 1 page
per 1ms with this command
write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
and do sync(1) at the same time. The sync completes quickly on ext4,
xfs, btrfs.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
The flusher works on dirty inodes in batches, and may quit prematurely
if the batch of inodes happen to be metadata-only dirtied: in this case
wbc->nr_to_write won't be decreased at all, which stands for "no pages
written" but also mis-interpreted as "no progress".
So introduce writeback_control.inodes_written to count the inodes get
cleaned from VFS POV. A non-zero value means there are some progress on
writeback, in which case more writeback can be tried.
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Explicitly update .dirtied_when on synced inodes, so that they are no
longer considered for writeback in the next round.
It can prevent both of the following livelock schemes:
- while true; do echo data >> f; done
- while true; do touch f; done (in theory)
The exact livelock condition is, during sync(1):
(1) no new inodes are dirtied
(2) an inode being actively dirtied
On (2), the inode will be tagged and synced with .nr_to_write=LONG_MAX.
When finished, it will be redirty_tail()ed because it's still dirty
and (.nr_to_write > 0). redirty_tail() won't update its ->dirtied_when
on condition (1). The sync work will then revisit it on the next
queue_io() and find it eligible again because its old ->dirtied_when
predates the sync work start time.
We'll do more aggressive "keep writeback as long as we wrote something"
logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit
b9543dac5b ("writeback: avoid livelocking WB_SYNC_ALL writeback") will
no longer be enough to stop sync livelock.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
do livelock prevention for it, too.
Jan's commit f446daaea9 ("mm: implement writeback livelock avoidance
using page tagging") is a partial fix in that it only fixed the
WB_SYNC_ALL phase livelock.
Although ext4 is tested to no longer livelock with commit f446daaea9,
it may due to some "redirty_tail() after pages_skipped" effect which
is by no means a guarantee for _all_ the file systems.
Note that writeback_inodes_sb() is called by not only sync(), they are
treated the same because the other callers also need livelock prevention.
Impact: It changes the order in which pages/inodes are synced to disk.
Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
until finished with the current inode.
Acked-by: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
If user space attempts to remove a non-existent file or directory, and
the file system is mounted read-only, return ENOENT instead of EROFS.
Either error code is arguably valid/correct, but ENOENT is a more
specific error message.
Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Callers of lmLogOpen() expect it to return -E... on failure exits, which
is what it returns, except for the case of blkdev_get_by_dev() failure.
It that case lmLogOpen() return the error with the wrong sign...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
When signing is enabled, the first session that's established on a
socket will cause a printk like this to pop:
CIFS VFS: Unexpected SMB signature
This is because the key exchange hasn't happened yet, so the signature
field is bogus. Don't try to check the signature on the socket until the
first session has been established. Also, eliminate the specific check
for SMB_COM_NEGOTIATE since this check covers that case too.
Cc: stable@kernel.org
Cc: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
fix for commit 4795bb37ef, nfsd: break
lease on unlink, link, and rename
if the LINK operation breaks a delegation, it returns NFS4ERR_NOENT
(which is not a valid error in rfc 5661) instead of NFS4ERR_DELAY.
the return value of nfsd_break_lease() in nfsd_link() must be
converted from host_err to err
Signed-off-by: Casey Bodley <cbodley@citi.umich.edu>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
nfsd V4 support uses crypto interfaces, so select CRYPTO
to fix build errors in 2.6.39:
ERROR: "crypto_destroy_tfm" [fs/nfsd/nfsd.ko] undefined!
ERROR: "crypto_alloc_base" [fs/nfsd/nfsd.ko] undefined!
Reported-by: Wakko Warner <wakko@animx.eu.org>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Commit b0b0c0a26e "nfsd: add proc file listing kernel's gss_krb5
enctypes" added an nunnecessary dependency of nfsd on the auth_rpcgss
module.
It's a little ad hoc, but since the only piece of information nfsd needs
from rpcsec_gss_krb5 is a single static string, one solution is just to
share it with an include file.
Cc: stable@kernel.org
Reported-by: Michael Guntsche <mike@it-loops.com>
Cc: Kevin Coffman <kwc@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Build fails if CONFIG_KEYS is not selected.
Signed-off-by: Darren Salt <linux@youmustbejoking.demon.co.uk>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
While creating fixed tracepoints for ext3, basically by porting them
from ext4, I found a lot of useless retyping, wrong type usage, useless
variable passing and other inconsistencies in the ext4 fixed tracepoint
code.
This patch cleans the fixed tracepoint code for ext4 and also simplify
some of them.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently we are not marking the extent as the last one
(FIEMAP_EXTENT_LAST) if there is a hole at the end of the file. This is
because we just do not check for it right now and continue searching for
next extent. But at the point we hit the hole at the end of the file, it
is too late.
This commit adds check for the allocated block in subsequent extent and
if there is no more extents (block = EXT_MAX_BLOCKS) just flag the
current one as the last one.
This behaviour has been spotted unintentionally by 252 xfstest, when the
test hangs out, because of wrong loop condition. However on other
filesystems (like xfs) it will exit anyway, because we notice the last
extent flag and exit.
With this patch xfstest 252 does not hang anymore, ext4 fiemap
implementation still reports bad extent type in some cases, however
this seems to be different issue.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Kazuya Mio reported that he was able to hit BUG_ON(next == lblock)
in ext4_ext_put_gap_in_cache() while creating a sparse file in extent
format and fill the tail of file up to its end. We will hit the BUG_ON
when we write the last block (2^32-1) into the sparse file.
The root cause of the problem lies in the fact that we specifically set
s_maxbytes so that block at s_maxbytes fit into on-disk extent format,
which is 32 bit long. However, we are not storing start and end block
number, but rather start block number and length in blocks. It means
that in order to cover extent from 0 to EXT_MAX_BLOCK we need
EXT_MAX_BLOCK+1 to fit into len (because we counting block 0 as well) -
and it does not.
The only way to fix it without changing the meaning of the struct
ext4_extent members is, as Kazuya Mio suggested, to lower s_maxbytes
by one fs block so we can cover the whole extent we can get by the
on-disk extent format.
Also in many places EXT_MAX_BLOCK is used as length instead of maximum
logical block number as the name suggests, it is all a bit messy. So
this commit renames it to EXT_MAX_BLOCKS and change its usage in some
places to actually be maximum number of blocks in the extent.
The bug which this commit fixes can be reproduced as follows:
dd if=/dev/zero of=/mnt/mp1/file bs=<blocksize> count=1 seek=$((2**32-2))
sync
dd if=/dev/zero of=/mnt/mp1/file bs=<blocksize> count=1 seek=$((2**32-1))
Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
metadata is not parameter of ext4_free_blocks() any more.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
btrfs: fix uninitialized variable warning
btrfs: add helper for fs_info->closing
Btrfs: add mount -o inode_cache
btrfs: scrub: add explicit plugging
btrfs: use btrfs_ino to access inode number
Btrfs: don't save the inode cache if we are deleting this root
btrfs: false BUG_ON when degraded
Btrfs: don't save the inode cache in non-FS roots
Btrfs: make sure we don't overflow the free space cache crc page
Btrfs: fix uninit variable in the delayed inode code
btrfs: scrub: don't reuse bios and pages
Btrfs: leave spinning on lookup and map the leaf
Btrfs: check for duplicate entries in the free space cache
Btrfs: don't try to allocate from a block group that doesn't have enough space
Btrfs: don't always do readahead
Btrfs: try not to sleep as much when doing slow caching
Btrfs: kill BTRFS_I(inode)->block_group
Btrfs: don't look at the extent buffer level 3 times in a row
Btrfs: map the node block when looking for readahead targets
Btrfs: set range_start to the right start in count_range_bits
...
JOBCTL_TRAPPING indicates that ptracer is waiting for tracee to
(re)transit into TRACED. task_clear_jobctl_pending() must be called
when either tracee enters TRACED or the transition is cancelled for
some reason. The former is achieved by explicitly calling
task_clear_jobctl_pending() in ptrace_stop() and the latter by calling
it at the end of do_signal_stop().
Calling task_clear_jobctl_trapping() at the end of do_signal_stop()
limits the scope TRAPPING can be used and is fragile in that seemingly
unrelated changes to tracee's control flow can lead to stuck TRAPPING.
We already have task_clear_jobctl_pending() calls on those cancelling
events to clear JOBCTL_STOP_PENDING. Cancellations can be handled by
making those call sites use JOBCTL_PENDING_MASK instead and updating
task_clear_jobctl_pending() such that task_clear_jobctl_trapping() is
called automatically if no stop/trap is pending.
This patch makes the above changes and removes the fallback
task_clear_jobctl_trapping() call from do_signal_stop().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
This patch introduces JOBCTL_PENDING_MASK and replaces
task_clear_jobctl_stop_pending() with task_clear_jobctl_pending()
which takes an extra @mask argument.
JOBCTL_PENDING_MASK is currently equal to JOBCTL_STOP_PENDING but
future patches will add more bits. recalc_sigpending_tsk() is updated
to use JOBCTL_PENDING_MASK instead.
task_clear_jobctl_pending() takes @mask which in subset of
JOBCTL_PENDING_MASK and clears the relevant jobctl bits. If
JOBCTL_STOP_PENDING is set, other STOP bits are cleared together. All
task_clear_jobctl_stop_pending() users are updated to call
task_clear_jobctl_pending() with JOBCTL_STOP_PENDING which is
functionally identical to task_clear_jobctl_stop_pending().
This patch doesn't cause any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
signal->group_stop currently hosts mostly group stop related flags;
however, it's gonna be used for wider purposes and the GROUP_STOP_
flag prefix becomes confusing. Rename signal->group_stop to
signal->jobctl and rename all GROUP_STOP_* flags to JOBCTL_*.
Bit position macros JOBCTL_*_BIT are defined and JOBCTL_* flags are
defined in terms of them to allow using bitops later.
While at it, reassign JOBCTL_TRAPPING to bit 22 to better accomodate
future additions.
This doesn't cause any functional change.
-v2: JOBCTL_*_BIT macros added as suggested by Linus.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
With Linus' tree, today's linux-next build (powercp ppc64_defconfig)
produced this warning:
fs/btrfs/delayed-inode.c: In function 'btrfs_delayed_update_inode':
fs/btrfs/delayed-inode.c:1598:6: warning: 'ret' may be used
uninitialized in this function
Introduced by commit 16cdcec736 ("btrfs: implement delayed inode items
operation").
This fixes a bug in btrfs_update_inode(): if the returned value from
btrfs_delayed_update_inode is a nonzero garbage, inode stat data are not
updated and several call paths may hit a BUG_ON or fail with strange
code.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David Sterba <dsterba@suse.cz>
This makes the inode map cache default to off until we
fix the overflow problem when the free space crcs don't fit
inside a single page.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With the removal of the implicit plugging scrub ends up doing more and
smaller I/O than necessary. This patch adds explicit plugging per chunk.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
commit 4cb5300bc ("Btrfs: add mount -o auto_defrag") accesses inode
number directly while it should use the helper with the new inode
number allocator.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With xfstest 254 I can panic the box every time with the inode number caching
stuff on. This is because we clean the inodes out when we delete the subvolume,
but then we write out the inode cache which adds an inode to the subvolume inode
tree, and then when it gets evicted again the root gets added back on the dead
roots list and is deleted again, so we have a double free. To stop this from
happening just return 0 if refs is 0 (and we're not the tree root since tree
root always has refs of 0). With this fix 254 no longer panics. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In degraded mode the struct btrfs_device of missing devs don't have
device->name set. A kstrdup of NULL correctly returns NULL. Don't
BUG in this case.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This adds extra checks to make sure the inode map we are caching really
belongs to a FS root instead of a special relocation tree. It
prevents crashes during balancing operations.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The free space cache uses only one page for crcs right now,
which means we can't have a cache file bigger than the
crcs we can fit in the first page. This adds a check to
enforce that restriction.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The current scrub implementation reuses bios and pages as often as possible,
allocating them only on start and releasing them when finished. This leads
to more problems with the block layer than it's worth. The elevator gets
confused when there are more pages added to the bio than bi_size suggests.
This patch completely rips out the reuse of bios and pages and allocates
them freshly for each submit.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Maosn <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.dk/linux-block:
block: Use hlist_entry() for io_context.cic_list.first
cfq-iosched: Remove bogus check in queue_fail path
xen/blkback: potential null dereference in error handling
xen/blkback: don't call vbd_size() if bd_disk is NULL
block: blkdev_get() should access ->bd_disk only after success
CFQ: Fix typo and remove unnecessary semicolon
block: remove unwanted semicolons
Revert "block: Remove extra discard_alignment from hd_struct."
nbd: adjust 'max_part' according to part_shift
nbd: limit module parameters to a sane value
nbd: pass MSG_* flags to kernel_recvmsg()
block: improve the bio_add_page() and bio_add_pc_page() descriptions
Caching "we have already removed suid/caps" was overenthusiastic as merged.
On network filesystems we might have had suid/caps set on another client,
silently picked by this client on revalidate, all of that *without* clearing
the S_NOSEC flag.
AFAICS, the only reasonably sane way to deal with that is
* new superblock flag; unless set, S_NOSEC is not going to be set.
* local block filesystems set it in their ->mount() (more accurately,
mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
local block ones clear it)
* if any network filesystem (or a cluster one) wants to use S_NOSEC,
it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
inode attribute changes are picked from other clients.
It's not an earth-shattering hole (anybody that can set suid on another client
will almost certainly be able to write to the file before doing that anyway),
but it's a bug that needs fixing.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When CONFIG_CRYPTO_ECB is not set, trying to mount a CIFS share with NTLM
security resulted in mount failure with the following error:
"CIFS VFS: could not allocate des crypto API"
Seems like a leftover from commit 43988d7.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
CC: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When ntlm security mechanim is used, the message that warns about the upgrade
to ntlmv2 got the kernel release version wrong (Blame it on Linus :). Fix it.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The free space fixup is currently initiated during mount after the call to
ubifs_write_master() which results in a write to PEBs; this has been observed
with the patch 'assert no fixup when writing a node' applied:
Move the free space fixup on mount to before the calls to
ubifs_recover_inl_heads() and ubifs_write_master(). This results in no
assertions with the previously mentioned patch applied.
Artem: tweaked the patch a bit
Signed-off-by: Ben Gardiner <bengardiner@nanometrics>
Reviewed-by: Matthew L. Creech <mlcreech@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The current 'mount_ubifs()' implementation does not initialize the LPT until the
the master node is marked dirty. Move the LPT initialization to before marking
the master node dirty. This is a preparation for the next patch which will move
the free-space-fixup check to before marking the master node dirty, because we
have to fix-up the free space before doing any writes.
Artem: massaged the patch and commit message.
Signed-off-by: Ben Gardiner <bengardiner@nanometrics.ca>
Reviewed-by: Matthew L. Creech <mlcreech@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The current free space fixup can result in some writing to the UBI volume
when the space_fixup flag is set.
To catch instances where UBIFS is writing to the NAND while the space_fixup
flag is set, add an assert to ubifs_write_node().
Artem: tweaked the patch, added similar assertion to the write buffer
write path.
Signed-off-by: Ben Gardiner <bengardiner@nanometrics.ca>
Reviewed-by: Matthew L. Creech <mlcreech@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
UBIFS maintains per-filesystem and global clean znode counters
('c->clean_zn_cnt' and 'ubifs_clean_zn_cnt'). It is important to maintain
correct values there since the shrinker relies on 'ubifs_clean_zn_cnt'.
However, in case of failures during commit the counters were corrupted. E.g.,
if a failure happens in the middle of 'write_index()', then some nodes in the
commit list ('c->cnext') are marked as clean, and some are marked as dirty. And
the 'ubifs_destroy_tnc_subtree()' frees does not retrun correct count, and we
end up with non-zero 'c->clean_zn_cnt' when unmounting. This means that if we
have 2 file-sytem and one of them fails, and we unmount it,
'ubifs_clean_zn_cnt' stays incorrect and confuses the shrinker.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
UBIFS leaks memory on error path in 'ubifs_jnl_update()' in case of write
failure because it forgets to free the 'struct ubifs_dent_node *dent' object.
Although the object is small, the alignment can make it large - e.g., 2KiB
if the min. I/O unit is 2KiB.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: stable@kernel.org
Sometimes VM asks the shrinker to return amount of objects it can shrink,
and we return the ubifs_clean_zn_cnt in that case. However, it is possible
that this counter is negative for a short period of time, due to the way
UBIFS TNC code updates it. And I can observe the following warnings sometimes:
shrink_slab: ubifs_shrinker+0x0/0x2b7 [ubifs] negative objects to delete nr=-8541616642706119788
This patch makes sure UBIFS never returns negative count of objects.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: stable@kernel.org
Fix void function parameter list sparse warning:
fs/fuse/inode.c:74:44: warning: non-ANSI function declaration of function 'fuse_alloc_forget'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Unfortunately, the recovery fix d1606a59b6be4ea392eabd40d1250aa1eeb19efb
(UBIFS: fix extremely rare mount failure) broke recovery. This commit make
UBIFS drop the last min. I/O unit in all journal heads, but this is needed only
for the GC head. And this does not work for non-GC heads. For example, if
suppose we have min. I/O units A and B, and A contains a valid node X, which
was fsynced, and then a group of nodes Y which spans the rest of A and B. In
this case we'll drop not only Y, but also X, which is obviously incorrect.
This patch fixes the issue and additionally makes recovery to drop last min.
I/O unit only for the GC head, and leave things as they have been for ages for
the other heads - this is safer.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Instead of passing "grouped" parameter to 'ubifs_recover_leb()' which tells
whether the nodes are grouped in the LEB to recover, pass the journal head
number and let 'ubifs_recover_leb()' look at the journal head's 'grouped' flag.
This patch is a preparation to a further fix where we'll need to know the
journal head number for other purposes.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Journal heads are different in a way how UBIFS writes nodes there. All normal
journal heads receive grouped nodes, while the GC journal heads receives
ungrouped nodes. This patch adds a 'grouped' flag to 'struct ubifs_jhead' which
describes this property.
This patch is a preparation to a further recovery fix.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Commit ab51afe05273741f72383529ef488aa1ea598ec6 was a good clean-up, but
it introduced a regression - now UBIFS prints scary error messages during
recovery on all corrupted nodes, even though the corruptions are expected
(due to a power cut). This patch fixes the issue.
Additionally fix a typo in a commentary introduced by the same commit.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
d4dc210f69 (block: don't block events on excl write for non-optical
devices) added dereferencing of bdev->bd_disk to test
GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE; however, bdev->bd_disk can be
%NULL if open failed which can lead to an oops.
Test the flag after testing open was successful, not before.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Miller <davem@davemloft.net>
Tested-by: David Miller <davem@davemloft.net>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Using __test_and_{set,clear}_bit_le() with ignoring its return value
can be replaced with __{set,clear}_bit_le().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: ocfs2-devel@oss.oracle.com
Signed-off-by: Joel Becker <jlbec@evilplan.org>
The original code had a null derefence in the error handling.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>