linux

Author	SHA1	Message	Date
Don Zickus	abd4d5587b	pstore: change mutex locking to spin_locks pstore was using mutex locking to protect read/write access to the backend plug-ins. This causes problems when pstore is executed in an NMI context through panic() -> kmsg_dump(). This patch changes the mutex to a spin_lock_irqsave then also checks to see if we are in an NMI context. If we are in an NMI and can't get the lock, just print a message stating that and blow by the locking. All this is probably a hack around the bigger locking problem but it solves my current situation of trying to sleep in an NMI context. Tested by loading the lkdtm module and executing a HARDLOCKUP which will cause the machine to panic inside the nmi handler. Signed-off-by: Don Zickus <dzickus@redhat.com> Acked-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2011-08-16 11:55:58 -07:00
Luck, Tony	6dda926691	pstore: defer inserting OOPS entries into pstore Life is simple for all the kernel terminating types of kmsg_dump call backs - pstore just saves the tail end of the console log. But for "oops" the situation is more complex - the kernel may carry on running (possibly for ever). So we'd like to make the logged copy of the oops appear in the pstore filesystem - so that the user has a handle to clear the entry from the persistent backing store (if we don't, the store may fill with "oops" entries (that are also safely stashed in /var/log/messages) leaving no space for real errors. Current code calls pstore_mkfile() immediately. But this may not be safe. The oops could have happened with arbitrary locks held, or in interrupt or NMI context. So allocating memory and calling into generic filesystem code seems unwise. This patch defers making the entry appear. At the time of the oops, we merely set a flag "pstore_new_entry" noting that a new entry has been added. A periodic timer checks once a minute to see if the flag is set - if so, it schedules a work queue to rescan the backing store and make all new entries appear in the pstore filesystem. Signed-off-by: Tony Luck <tony.luck@intel.com>	2011-08-16 11:53:01 -07:00
Jeff Layton	fa71f44706	cifs: demote cERROR in build_path_from_dentry to cFYI Running the cthon tests on a recent kernel caused this message to pop occasionally: CIFS VFS: did not end path lookup where expected namelen is 0 Some added debugging showed that namelen and dfsplen were both 0 when this occurred. That means that the read_seqretry returned true. Assuming that the comment inside the if statement is true, this should be harmless and just means that we raced with a rename. If that is the case, then there's no need for alarm and we can demote this to cFYI. While we're at it, print the dfsplen too so that we can see what happened here if the message pops during debugging. Cc: stable@kernel.org Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-08-16 13:07:24 +00:00
Sage Weil	795858dbd2	ceph: fix encoding of ino only (not relative) paths A 'path' consists of a starting ino and relative component. Encode even when there is no relative component. This is primarily needed by the NFS reexport code. Signed-off-by: Sage Weil <sage@newdream.net>	2011-08-15 13:03:56 -07:00
Linus Torvalds	a0b3447fb1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6: jfs: flush journal completely before releasing metadata inodes	2011-08-15 08:40:24 -07:00
Linus Torvalds	aa2b1cf5ca	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: Do not set cifs/ntfs acl using a file handle (try #4) [CIFS] Cleanup use of CONFIG_CIFS_STATS2 ifdef to make transport routines more readable	2011-08-15 08:36:30 -07:00
Theodore Ts'o	9dd75f1f1a	ext4: fix nomblk_io_submit option so it correctly converts uninit blocks Bug discovered by Jan Kara: Finally, commit `1449032be1` returned back the old IO submission code but apparently it forgot to return the old handling of uninitialized buffers so we unconditionnaly call block_write_full_page() without specifying end_io function. So AFAICS we never convert unwritten extents to written in some cases. For example when I mount the fs as: mount -t ext4 -o nomblk_io_submit,dioread_nolock /dev/ubdb /mnt and do int fd = open(argv[1], O_RDWR \| O_CREAT \| O_TRUNC, 0600); char buf[1024]; memset(buf, 'a', sizeof(buf)); fallocate(fd, 0, 0, 16384); write(fd, buf, sizeof(buf)); I get a file full of zeros (after remounting the filesystem so that pagecache is dropped) instead of seeing the first KB contain 'a's. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-08-13 12:58:21 -04:00
Tao Ma	32c80b32c0	ext4: Resolve the hang of direct i/o read in handling EXT4_IO_END_UNWRITTEN. EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten should be done simultaneously since ext4_end_io_nolock always clear the flag and decrease the counter in the same time. We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN so it will go nagative and causes some process to wait forever. Part of the patch came from Eric in his e-mail, but it doesn't fix the problem met by Michael actually. http://marc.info/?l=linux-ext4&m=131316851417460&w=2 Reported-and-Tested-by: Michael Tokarev<mjt@tls.msk.ru> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-08-13 12:30:59 -04:00
Jiaying Zhang	2581fdc810	ext4: call ext4_ioend_wait and ext4_flush_completed_IO in ext4_evict_inode Flush inode's i_completed_io_list before calling ext4_io_wait to prevent the following deadlock scenario: A page fault happens while some process is writing inode A. During page fault, shrink_icache_memory is called that in turn evicts another inode B. Inode B has some pending io_end work so it calls ext4_ioend_wait() that waits for inode B's i_ioend_count to become zero. However, inode B's ioend work was queued behind some of inode A's ioend work on the same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten thread on that cpu is processing inode A's ioend work, it tries to grab inode A's i_mutex lock. Since the i_mutex lock of inode A is still hold before the page fault happened, we enter a deadlock. Also moves ext4_flush_completed_IO and ext4_ioend_wait from ext4_destroy_inode() to ext4_evict_inode(). During inode deleteion, ext4_evict_inode() is called before ext4_destroy_inode() and in ext4_evict_inode(), we may call ext4_truncate() without holding i_mutex lock. As a result, there is a race between flush_completed_IO that is called from ext4_ext_truncate() and ext4_end_io_work, which may cause corruption on an io_end structure. This change moves ext4_flush_completed_IO and ext4_ioend_wait from ext4_destroy_inode() to ext4_evict_inode() to resolve the race between ext4_truncate() and ext4_end_io_work during inode deletion. Signed-off-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-08-13 12:17:13 -04:00
Curt Wohlgemuth	441c850857	ext4: Fix ext4_should_writeback_data() for no-journal mode ext4_should_writeback_data() had an incorrect sequence of tests to determine if it should return 0 or 1: in particular, even in no-journal mode, 0 was being returned for a non-regular-file inode. This meant that, in non-journal mode, we would use ext4_journalled_aops for directories, symlinks, and other non-regular files. However, calling journalled aop callbacks when there is no valid handle, can cause problems. This would cause a kernel crash with Jan Kara's commit `2d859db3e4` ("ext4: fix data corruption in inodes with journalled data"), because we now dereference 'handle' in ext4_journalled_write_end(). I also added BUG_ONs to check for a valid handle in the obviously journal-only aops callbacks. I tested this running xfstests with a scratch device in these modes: - no-journal - data=ordered - data=writeback - data=journal All work fine; the data=journal run has many failures and a crash in xfstests 074, but this is no different from a vanilla kernel. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-08-13 11:25:18 -04:00
Linus Torvalds	e68ff9cd15	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: replace xfs_buf_geterror() with bp->b_error xfs: Check the return value of xfs_buf_read() for NULL "xfs: fix error handling for synchronous writes" revisited xfs: set cursor in xfs_ail_splice() even when AIL was empty xfs: Remove the macro XFS_BUFTARG_NAME xfs: Remove the macro XFS_BUF_TARGET xfs: Remove the macro XFS_BUF_SET_TARGET Replace the macro XFS_BUF_ISPINNED with helper xfs_buf_ispinned xfs: Remove the macro XFS_BUF_SET_PTR xfs: Remove the macro XFS_BUF_PTR xfs: Remove macro XFS_BUF_SET_START xfs: Remove macro XFS_BUF_HOLD xfs: Remove macro XFS_BUF_BUSY and family xfs: Remove the macro XFS_BUF_ERROR and family xfs: Remove the macro XFS_BUF_BFLAGS	2011-08-12 20:43:01 -07:00
Christoph Hellwig	c59d87c460	xfs: remove subdirectories Use the move from Linux 2.6 to Linux 3.x as an excuse to kill the annoying subdirectories in the XFS source code. Besides the large amount of file rename the only changes are to the Makefile, a few files including headers with the subdirectory prefix, and the binary sysctl compat code that includes a header under fs/xfs/ from kernel/. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-08-12 16:21:35 -05:00
Alex Elder	06f8e2d675	xfs: don't expect xfs headers to be in subdirectories Fix up some #include directives in preparation for moving a few header files out of xfs source subdirectories. Note that "xfs_linux.h" also got its quoting convention for included files switched. Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2011-08-12 13:57:55 -05:00
Chandra Seetharaman	e570280521	xfs: replace xfs_buf_geterror() with bp->b_error Since we just checked bp for NULL, it is ok to replace xfs_buf_geterror() with bp->b_error in these places. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-08-12 13:39:40 -05:00
Chandra Seetharaman	ac4d6888b2	xfs: Check the return value of xfs_buf_read() for NULL Check the return value of xfs_buf_read() for NULL and return ENOMEM if it is NULL. This is necessary in a few spots to avoid subsequent code blindly dereferencing the null buffer pointer. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-08-12 13:39:29 -05:00
Linus Torvalds	ce8a84ef1e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits) e1000e: increase driver version number e1000e: alternate MAC address update e1000e: do not disable receiver on 82574/82583 e1000e: alternate MAC address does not work on device id 0x1060 PCnet: Fix section mismatch bnx2x: disable dcb on 578xx since not supported yet bnx2x: properly clean indirect addresses bnx2x: prevent race between undi_unload and load flows bnx2x: fix select_queue when FCoE is disabled bnx2x: init FCOE FP only once ipv4: some rt_iif -> rt_route_iif conversions net/bridge/netfilter/ebtables.c: use available error handling code net/netlabel/netlabel_kapi.c: add missing cleanup code net/irda: sh_sir: tidyup compile warning net/irda: sh_sir: add missing header net/irda: sh_irda: add missing header slcan: ldisc generated skbs are received in softirq context scm: Capture the full credentials of the scm sender tcp: initialize variable ecn_ok in syncookies path drivers/net/wireless/wl1251: add missing kfree ...	2011-08-12 06:43:53 -07:00
Mimi Zohar	f995e74087	CIFS: remove local xattr definitions Local XATTR_TRUSTED_PREFIX_LEN and XATTR_SECURITY_PREFIX_LEN definitions redefined ones in 'linux/xattr.h'. This was caused by commit `9d8f13ba3f` ("security: new security_inode_init_security API adds function callback") including 'linux/xattr.h' in 'linux/security.h'. In file included from include/linux/security.h:39, from include/net/sock.h:54, from fs/cifs/cifspdu.h:25, from fs/cifs/xattr.c:26: This patch removes the local definitions. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>	2011-08-12 12:49:58 +10:00
Boaz Harrosh	8cf1fb2163	pnfs: Automatically select blocks & objects layouts Just like files-layout, blocks & objects layouts are part of the NFS 4.1 protocol and should be automatically selected if NFS_4_1 is selected. The small problem is that these depend on other Kernel support being present, while files only depends on NFS itself. This patch removes from the user choice the presence of objects and blocks layout. But makes sure these are selected only if the depended subsystems are present in the Kernel. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Acked-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-11 17:51:27 -07:00
Eric Sandeen	8c20871998	ext4: Properly count journal credits for long symlinks Commit `df5e622340` ("ext4: fix deadlock in ext4_symlink() in ENOSPC conditions") recalculated the number of credits needed for a long symlink, in the process of splitting it into two transactions. However, the first credit calculation under-counted because if selinux is enabled, credits are needed to create the selinux xattr as well. Overrunning the reservation will result in an OOPS in jbd2_journal_dirty_metadata() due to this assert: J_ASSERT_JH(jh, handle->h_buffer_credits > 0); Fix this by increasing the reservation size. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-11 17:23:40 -07:00
Eric Sandeen	d2db60df1e	ext3: Properly count journal credits for long symlinks Commit `ae54870a1d` ("ext3: Fix lock inversion in ext3_symlink()") recalculated the number of credits needed for a long symlink, in the process of splitting it into two transactions. However, the first credit calculation under-counted because if selinux is enabled, credits are needed to create the selinux xattr as well. Overrunning the reservation will result in an OOPS in journal_dirty_metadata() due to this assert: J_ASSERT_JH(jh, handle->h_buffer_credits > 0); Fix this by increasing the reservation size. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-11 17:23:40 -07:00
Vasiliy Kulikov	72fa59970f	move RLIMIT_NPROC check from set_user() to do_execve_common() The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC check in set_user() to check for NPROC exceeding via setuid() and similar functions. Before the check there was a possibility to greatly exceed the allowed number of processes by an unprivileged user if the program relied on rlimit only. But the check created new security threat: many poorly written programs simply don't check setuid() return code and believe it cannot fail if executed with root privileges. So, the check is removed in this patch because of too often privilege escalations related to buggy programs. The NPROC can still be enforced in the common code flow of daemons spawning user processes. Most of daemons do fork()+setuid()+execve(). The check introduced in execve() (1) enforces the same limit as in setuid() and (2) doesn't create similar security issues. Neil Brown suggested to track what specific process has exceeded the limit by setting PF_NPROC_EXCEEDED process flag. With the change only this process would fail on execve(), and other processes' execve() behaviour is not changed. Solar Designer suggested to re-check whether NPROC limit is still exceeded at the moment of execve(). If the process was sleeping for days between set*uid() and execve(), and the NPROC counter step down under the limit, the defered execve() failure because NPROC limit was exceeded days ago would be unexpected. If the limit is not exceeded anymore, we clear the flag on successful calls to execve() and fork(). The flag is also cleared on successful calls to set_user() as the limit was exceeded for the previous user, not the current one. Similar check was introduced in -ow patches (without the process flag). v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user(). Reviewed-by: James Morris <jmorris@namei.org> Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-11 11:24:42 -07:00
Shirish Pargaonkar	e22906c564	cifs: Do not set cifs/ntfs acl using a file handle (try #4 ) Set security descriptor using path name instead of a file handle. We can't be sure that the file handle has adequate permission to set a security descriptor (to modify DACL). Function set_cifs_acl_by_fid() has been removed since we can't be sure how a file was opened for writing, a valid request can fail if the file was not opened with two above mentioned permissions. We could have opted to add on WRITE_DAC and WRITE_OWNER permissions to file opens and then use that file handle but adding addtional permissions such as WRITE_DAC and WRITE_OWNER could cause an any open to fail. And it was incorrect to look for read file handle to set a security descriptor anyway. Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-08-11 18:23:45 +00:00
Steve French	789e666123	[CIFS] Cleanup use of CONFIG_CIFS_STATS2 ifdef to make transport routines more readable Christoph had requested that the stats related code (in CONFIG_CIFS_STATS2) be moved into helpers to make code flow more readable. This patch should help. For example the following section from transport.c spin_unlock(&GlobalMid_Lock); atomic_inc(&ses->server->num_waiters); wait_event(ses->server->request_q, atomic_read(&ses->server->inFlight) < cifs_max_pending); atomic_dec(&ses->server->num_waiters); spin_lock(&GlobalMid_Lock); becomes simpler (with the patch below): spin_unlock(&GlobalMid_Lock); cifs_num_waiters_inc(server); wait_event(server->request_q, atomic_read(&server->inFlight) < cifs_max_pending); cifs_num_waiters_dec(server); spin_lock(&GlobalMid_Lock); Reviewed-by: Jeff Layton <jlayton@redhat.com> CC: Christoph Hellwig <hch@infradead.org> Signed-off-by: Steve French <sfrench@us.ibm.com> Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>	2011-08-11 18:23:45 +00:00
Peng Tao	54a33b190a	NFS41: make PNFS_BLOCK selectable PNFS_BLOCK needs BLK_DEV_DM/MD, which is not a dependency for other pnfs layout drivers. Seperate it out so others can still build when BLK_DEV_DM/MD is not enabled. Also change select to depends on to avoid build failures. Reported-and-tested-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Peng Tao <peng_tao@emc.com> Acked-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-11 08:58:02 -07:00
Ajeet Yadav	9e978d8f7d	"xfs: fix error handling for synchronous writes" revisited xfs: fix for hang during synchronous buffer write error If removed storage while synchronous buffer write underway, "xfslogd" hangs. Detailed log http://oss.sgi.com/archives/xfs/2011-07/msg00740.html Related work `bfc60177f8` "xfs: fix error handling for synchronous writes" Given that xfs_bwrite actually does the shutdown already after waiting for the b_iodone completion and given that we actually found that calling xfs_force_shutdown from inside xfs_buf_iodone_callbacks was a major contributor the problem it better to drop this call. Signed-off-by: Ajeet Yadav <ajeet.yadav.77@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-08-10 17:00:21 -05:00
Linus Torvalds	d55140ce3a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6: Ecryptfs: Add mount option to check uid of device being mounted = expect uid eCryptfs: Fix payload_len unitialized variable warning eCryptfs: fix compile error eCryptfs: Return error when lower file pointer is NULL	2011-08-10 11:08:06 -07:00
John Johansen	764355487e	Ecryptfs: Add mount option to check uid of device being mounted = expect uid Close a TOCTOU race for mounts done via ecryptfs-mount-private. The mount source (device) can be raced when the ownership test is done in userspace. Provide Ecryptfs a means to force the uid check at mount time. Signed-off-by: John Johansen <john.johansen@canonical.com> Cc: <stable@kernel.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2011-08-09 23:29:01 -05:00
Alex Elder	e44f4112a4	xfs: set cursor in xfs_ail_splice() even when AIL was empty In xfs_ail_splice(), if a cursor is provided it is updated to point to the last item on the list being spliced into the AIL. But if the AIL was found to be empty, the cursor (if provided) is just initialized instead. There is no reason the empty AIL case needs to be treated any differently. And treating it the same way allows this code to be rearranged a bit, with a somewhat tidier result. Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-08-09 15:30:43 -05:00
Tyler Hicks	99b373ff2d	eCryptfs: Fix payload_len unitialized variable warning fs/ecryptfs/keystore.c: In function ‘ecryptfs_generate_key_packet_set’: fs/ecryptfs/keystore.c:1991:28: warning: ‘payload_len’ may be used uninitialized in this function [-Wuninitialized] fs/ecryptfs/keystore.c:1976:9: note: ‘payload_len’ was declared here Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2011-08-09 13:42:46 -05:00
Roberto Sassu	4b6fee17b1	eCryptfs: fix compile error This patch fixes the compile error reported at the address: https://bugzilla.kernel.org/show_bug.cgi?id=40292 The problem arises when compiling eCryptfs as built-in and the 'encrypted' key type as a module. The patch prevents this combination from being set in the kernel configuration, by fixing the eCryptfs dependencies. Signed-off-by: Roberto Sassu <roberto.sassu@polito.it> Reported-by: David Hill <hilld@binarystorm.net> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2011-08-09 13:42:46 -05:00
Tyler Hicks	f61500e000	eCryptfs: Return error when lower file pointer is NULL When an eCryptfs inode's lower file has been closed, and the pointer has been set to NULL, return an error when trying to do a lower read or write rather than calling BUG(). https://bugzilla.kernel.org/show_bug.cgi?id=37292 Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Cc: <stable@kernel.org>	2011-08-09 13:42:45 -05:00
James Morris	5a2f3a02ae	Merge branch 'next-evm' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/ima-2.6 into next Conflicts: fs/attr.c Resolve conflict manually. Signed-off-by: James Morris <jmorris@namei.org>	2011-08-09 10:31:03 +10:00
Linus Torvalds	2f84dd7091	autofs4: fix debug printk warning uncovered by cleanup The previous comit made the autofs4 debug printouts check types against the printout format, and uncovered this bug: fs/autofs4/waitq.c:106:2: warning: format ‘%08lx’ expects type ‘long unsigned int’, but argument 4 has type ‘autofs_wqt_t’ which is due to the insane type for wait_queue_token. That thing should be some fixed well-defined size (preferably just 'unsigned int' or 'u32') but for unexplained reasons it is randomly either 'unsigned long' or 'unsigned int' depending on the architecture. For now, cast it to 'unsigned long' for printing, the way we do elsewhere. Somebody else can try to explain the typedef mess. (There's a reason we don't support excessive use of typedefs in the kernel: it's usually just a good way of confusing yourself). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-08 12:02:43 -07:00
Linus Torvalds	c3ad996246	autofs4: clean up uaotfs use of debug/info/warning printouts Use 'pr_debug()' for DPRINTK, which will do the proper type checking on the arguments (without generating code) even when DEBUG isn't #defined. Also, use the standard __VA_ARGS__ for the macros, and stop the pointless abuse of 'do { xyz } while (0)' when the macro is already a perfectly well-formed single statement. Reported-by: David Howells <dhowells@redhat.com> Suggested-by: Joe Perches <joe@perches.com> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-08 11:35:17 -07:00
Johannes Weiner	478e0841b3	fuse: mark pages accessed when written to As fuse does not use the page cache library functions when userspace writes to a file, it did not benefit from 'c8236db mm: mark page accessed before we write_end()' that made sure pages are properly marked accessed when written to. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2011-08-08 16:08:08 +02:00
Johannes Weiner	b40cdd56df	fuse: delete dead .write_begin and .write_end aops Ever since 'ea9b990 fuse: implement perform_write', the .write_begin and .write_end aops have been dead code. Their task - acquiring a page from the page cache, sending out a write request and releasing the page again - is now done batch-wise to maximize the number of pages send per userspace request. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2011-08-08 16:08:08 +02:00
Miklos Szeredi	37fb3a30b4	fuse: fix flock Commit `a9ff4f87` "fuse: support BSD locking semantics" overlooked a number of issues with supporing flock locks over existing POSIX locking infrastructure: - it's not backward compatible, passing flock(2) calls to userspace unconditionally (if userspace sets FUSE_POSIX_LOCKS) - it doesn't cater for the fact that flock locks are automatically unlocked on file release - it doesn't take into account the fact that flock exclusive locks (write locks) don't need an fd opened for write. The last one invalidates the original premise of the patch that flock locks can be emulated with POSIX locks. This patch fixes the first two issues. The last one needs to be fixed in userspace if the filesystem assumed that a write lock will happen only on a file operned for write (as in the case of the current fuse library). Reported-by: Sebastian Pipping <webmaster@hartwork.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2011-08-08 16:08:08 +02:00
Alex Elder	2ddb4e9406	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux	2011-08-08 07:06:24 -05:00
Florian Westphal	8bab6f1408	compat_ioctl: add compat handler for PPPIOCGL2TPSTATS fixes following error seen on x86_64 kernel: ioctl32(openl2tpd:7480): Unknown cmd fd(14) cmd(80487436){t:'t';sz:72} arg(ffa7e6c0) on socket:[105094] The argument (struct pppol2tp_ioc_stats) uses "aligned_u64" and thus doesn't need fixups. Cc: James Chapman <jchapman@katalix.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-07 22:24:41 -07:00
Linus Torvalds	7813b94a54	vfs: rename 'do_follow_link' to 'should_follow_link' Al points out that the do_follow_link() helper function really is misnamed - it's about whether we should try to follow a symlink or not, not about actually doing the following. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-07 13:42:25 -07:00
Ari Savolainen	206b1d09a5	Fix POSIX ACL permission check After commit `3567866bf2`: "RCUify freeing acls, let check_acl() go ahead in RCU mode if acl is cached" posix_acl_permission is being called with an unsupported flag and the permission check fails. This patch fixes the issue. Signed-off-by: Ari Savolainen <ari.m.savolainen@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-07 04:52:23 -04:00
Linus Torvalds	c2f340a69c	Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd * 'for-linus' of git://git.open-osd.org/linux-open-osd: ore: Make ore its own module exofs: Rename raid engine from exofs/ios.c => ore exofs: ios: Move to a per inode components & device-table exofs: Move exofs specific osd operations out of ios.c exofs: Add offset/length to exofs_get_io_state exofs: Fix truncate for the raid-groups case exofs: Small cleanup of exofs_fill_super exofs: BUG: Avoid sbi realloc exofs: Remove pnfs-osd private definitions nfs_xdr: Move nfs4_string definition out of #ifdef CONFIG_NFS_V4	2011-08-06 22:56:03 -07:00
Linus Torvalds	3ddcd0569c	vfs: optimize inode cache access patterns The inode structure layout is largely random, and some of the vfs paths really do care. The path lookup in particular is already quite D$ intensive, and profiles show that accessing the 'inode->i_op->xyz' fields is quite costly. We already optimized the dcache to not unnecessarily load the d_op structure for members that are often NULL using the DCACHE_OP_xyz bits in dentry->d_flags, and this does something very similar for the inode ops that are used during pathname lookup. It also re-orders the fields so that the fields accessed by 'stat' are together at the beginning of the inode structure, and roughly in the order accessed. The effect of this seems to be in the 1-2% range for an empty kernel "make -j" run (which is fairly kernel-intensive, mostly in filename lookup), so it's visible. The numbers are fairly noisy, though, and likely depend a lot on exact microarchitecture. So there's more tuning to be done. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-06 22:53:23 -07:00
Linus Torvalds	830c0f0edc	vfs: renumber DCACHE_xyz flags, remove some stale ones Gcc tends to generate better code with small integers, including the DCACHE_xyz flag tests - so move the common ones to be first in the list. Also just remove the unused DCACHE_INOTIFY_PARENT_WATCHED and DCACHE_AUTOFS_PENDING values, their users no longer exists in the source tree. And add a "unlikely()" to the DCACHE_OP_COMPARE test, since we want the common case to be a nice straight-line fall-through. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-06 22:52:40 -07:00
Boaz Harrosh	cf283ade08	ore: Make ore its own module Export everything from ore need exporting. Change Kbuild and Kconfig to build ore.ko as an independent module. Import ore from exofs Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-08-06 19:36:19 -07:00
Boaz Harrosh	8ff660ab85	exofs: Rename raid engine from exofs/ios.c => ore ORE stands for "Objects Raid Engine" This patch is a mechanical rename of everything that was in ios.c and its API declaration to an ore.c and an osd_ore.h header. The ore engine will later be used by the pnfs objects layout driver. * File ios.c => ore.c * Declaration of types and API are moved from exofs.h to a new osd_ore.h * All used types are prefixed by ore_ from their exofs_ name. * Shift includes from exofs.h to osd_ore.h so osd_ore.h is independent, include it from exofs.h. Other than a pure rename there are no other changes. Next patch will move the ore into it's own module and will export the API to be used by exofs and later the layout driver Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-08-06 19:36:18 -07:00
Boaz Harrosh	9e9db45649	exofs: ios: Move to a per inode components & device-table Exofs raid engine was saving on memory space by having a single layout-info, single pid, and a single device-table, global to the filesystem. Then passing a credential and object_id info at the io_state level, private for each inode. It would also devise this contraption of rotating the device table view for each inode->ino to spread out the device usage. This is not compatible with the pnfs-objects standard, demanding that each inode can have it's own layout-info, device-table, and each object component it's own pid, oid and creds. So: Bring exofs raid engine to be usable for generic pnfs-objects use by: * Define an exofs_comp structure that holds obj_id and credential info. * Break up exofs_layout struct to an exofs_components structure that holds a possible array of exofs_comp and the array of devices + the size of the arrays. * Add a "comps" parameter to get_io_state() that specifies the ids creds and device array to use for each IO. This enables to keep the layout global, but the device-table view, creds and IDs at the inode level. It only adds two 64bit to each inode, since some of these members already existed in another form. * ios raid engine now access layout-info and comps-info through the passed pointers. Everything is pre-prepared by caller for generic access of these structures and arrays. At the exofs Level: * Super block holds an exofs_components struct that holds the device array, previously in layout. The devices there are in device-table order. The device-array is twice bigger and repeats the device-table twice so now each inode's device array can point to a random device and have a round-robin view of the table, making it compatible to previous exofs versions. * Each inode has an exofs_components struct that is initialized at load time, with it's own view of the device table IDs and creds. When doing IO this gets passed to the io_state together with the layout. While preforming this change. Bugs where found where credentials with the wrong IDs where used to access the different SB objects (super.c). As well as some dead code. It was never noticed because the target we use does not check the credentials. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-08-06 19:35:32 -07:00
Boaz Harrosh	85e44df474	exofs: Move exofs specific osd operations out of ios.c ios.c will be moving to an external library, for use by the objects-layout-driver. Remove from it some exofs specific functions. Also g_attr_logical_length is used both by inode.c and ios.c move definition to the later, to keep it independent Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-08-06 19:35:31 -07:00
Boaz Harrosh	e1042ba099	exofs: Add offset/length to exofs_get_io_state In future raid code we will need to know the IO offset/length and if it's a read or write to determine some of the array sizes we'll need. So add a new exofs_get_rw_state() API for use when writeing/reading. All other simple cases are left using the old way. The major change to this is that now we need to call exofs_get_io_state later at inode.c::read_exec and inode.c::write_exec when we actually know these things. So this patch is kept separate so I can test things apart from other changes. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-08-06 19:35:31 -07:00
Linus Torvalds	1957e7fdef	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: cope with negative dentries in cifs_get_root cifs: convert prefixpath delimiters in cifs_build_path_to_root CIFS: Fix missing a decrement of inFlight value cifs: demote DFS referral lookup errors to cFYI Revert "cifs: advertise the right receive buffer size to the server"	2011-08-06 13:54:36 -07:00
Linus Torvalds	1117f72ea0	vfs: show O_CLOEXE bit properly in /proc/<pid>/fdinfo/<fd> files The CLOEXE bit is magical, and for performance (and semantic) reasons we don't actually maintain it in the file descriptor itself, but in a separate bit array. Which means that when we show f_flags, the CLOEXE status is shown incorrectly: we show the status not as it is now, but as it was when the file was opened. Fix that by looking up the bit properly in the 'fdt->close_on_exec' bit array. Uli needs this in order to re-implement the pfiles program: "For normal file descriptors (not sockets) this was the last piece of information which wasn't available. This is all part of my 'give Solaris users no reason to not switch' effort. I intend to offer the code to the util-linux-ng maintainers." Requested-by: Ulrich Drepper <drepper@akkadia.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-06 11:51:33 -07:00
Linus Torvalds	c21427043d	oom_ajd: don't use WARN_ONCE, just use printk_once WARN_ONCE() is very annoying, in that it shows the stack trace that we don't care about at all, and also triggers various user-level "kernel oopsed" logic that we really don't care about. And it's not like the user can do anything about the applications (sshd) in question, it's a distro issue. Requested-by: Andi Kleen <andi@firstfloor.org> (and many others) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-06 11:43:08 -07:00
Chris Mason	2ab1ba68ae	Btrfs: force unplugs when switching from high to regular priority bios Btrfs does bio submissions from a worker thread, and each device has a list of high priority bios and regular priority bios. Synchronous writes go to the high priority thread while async writes go to regular list. This commit brings back an explicit unplug any time we switch from high to regular priority, which makes it easier for the block layer to give us low latencies. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-05 13:48:18 -04:00
Jeff Layton	80975d21aa	cifs: cope with negative dentries in cifs_get_root The loop around lookup_one_len doesn't handle the case where it might return a negative dentry, which can cause an oops on the next pass through the loop. Check for that and break out of the loop with an error of -ENOENT if there is one. Fixes the panic reported here: https://bugzilla.redhat.com/show_bug.cgi?id=727927 Reported-by: TR Bentley <home@trarbentley.net> Reported-by: Iain Arnell <iarnell@gmail.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: stable@kernel.org Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-08-05 15:03:09 +00:00
Jeff Layton	f9e8c45002	cifs: convert prefixpath delimiters in cifs_build_path_to_root Regression from 2.6.39... The delimiters in the prefixpath are not being converted based on whether posix paths are in effect. Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=727834 Reported-and-Tested-by: Iain Arnell <iarnell@gmail.com> Reported-by: Patrick Oltmann <patrick.oltmann@gmx.net> Cc: Pavel Shilovsky <piastryyy@gmail.com> Cc: stable@kernel.org Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-08-05 14:55:15 +00:00
Linus Torvalds	24f0eed266	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: RCUify freeing acls, let check_acl() go ahead in RCU mode if acl is cached get rid of boilerplate switches in posix_acl.h fix block device fallout from ->fsync() changes	2011-08-04 16:44:40 -10:00
Boaz Harrosh	16f75bb35d	exofs: Fix truncate for the raid-groups case In the general raid-group case the truncate was wrong in that it did not also fix the object length of the neighboring groups. There are two bad cases in the old code: 1. Space that should be freed was not. 2. If a file That was big is truncated small, then made bigger again, the holes would not contain zeros but could expose old data. (If the growing of the file expands to more than a full groups cycle + group size (> S + T)) Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-08-04 12:35:25 -07:00
Boaz Harrosh	9ce730475e	exofs: Small cleanup of exofs_fill_super Small cleanup that unifies duplicated code used in both the error and success cases Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-08-04 12:35:23 -07:00
Boaz Harrosh	6d4073e881	exofs: BUG: Avoid sbi realloc Since the beginning we realloced the sbi structure when a bigger then one device table was specified. (I know that was really stupid). Then much later when "register bdi" was added (By Jens) it was registering the pointer to sbi->bdi before the realloc. We never saw this problem because up till now the realloc did not do anything since the device table was small enough to fit in the original allocation. But once we starting testing with large device tables (Bigger then 28) we noticed the crash of writeback operating on a deallocated pointer. * Avoid the all mess by allocating the device-table as a second array and get rid of the variable-sized structure and the rest of this mess. * Take the chance to clean near by structures and comments. * Add a needed dprint on startup to indicate the loaded layout. * Also move the bdi registration to the very end because it will only fail in a low memory, which will probably fail before hand. There are many more likely causes to not load before that. This way the error handling is made simpler. (Just doing this would be enough to fix the BUG) Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-08-04 12:35:20 -07:00
Boaz Harrosh	26ae93c2dc	exofs: Remove pnfs-osd private definitions Now that pnfs-osd has hit mainline we can remove exofs's private header. (And the FIXME comment) Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-08-04 12:35:18 -07:00
Trond Myklebust	910ac68a2b	NFSv4.1: Return NFS4ERR_BADSESSION to callbacks during session resets If the client is in the process of resetting the session when it receives a callback, then returning NFS4ERR_DELAY may cause a deadlock with the DESTROY_SESSION call. Basically, if the client returns NFS4ERR_DELAY in response to the CB_SEQUENCE call, then the server is entitled to believe that the client is busy because it is already processing that call. In that case, the server is perfectly entitled to respond with a NFS4ERR_BACK_CHAN_BUSY to any DESTROY_SESSION call. Fix this by having the client reply with a NFS4ERR_BADSESSION in response to the callback if it is resetting the session. Cc: stable@kernel.org [2.6.38+] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-08-04 11:55:35 -04:00
Trond Myklebust	55a673990e	NFSv4.1: Fix the callback 'highest_used_slotid' behaviour Currently, there is no guarantee that we will call nfs4_cb_take_slot() even though nfs4_callback_compound() will consistently call nfs4_cb_free_slot() provided the cb_process_state has set the 'clp' field. The result is that we can trigger the BUG_ON() upon the next call to nfs4_cb_take_slot(). This patch fixes the above problem by using the slot id that was taken in the CB_SEQUENCE operation as a flag for whether or not we need to call nfs4_cb_free_slot(). It also fixes an atomicity problem: we need to set tbl->highest_used_slotid atomically with the check for NFS4_SESSION_DRAINING, otherwise we end up racing with the various tests in nfs4_begin_drain_session(). Cc: stable@kernel.org [2.6.38+] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-08-04 11:55:35 -04:00
Boaz Harrosh	9af7db3228	pnfs-obj: Fix the comp_index != 0 case There were bugs in the case of partial layout where olo_comp_index is not zero. This used to work and was tested but one of the later cleanup SQUASHMEs broke it and was not tested since. Also add a dprint that specify those received layout parameters. Everything else was already printed. [Needed in v3.0] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-08-04 11:54:48 -04:00
Boaz Harrosh	20618b21da	pnfs-obj: Bug when we are running out of bio When we have a situation that the number of pages we want to encode is bigger then the size of the bio. (Which can currently happen only when all IO is going to a single device .e.g group_width==1) then the IO is submitted short and we report back only the amount of bytes we actually wrote/read and all is fine. BUT ... There was a bug that the current length counter was advanced before the fail to add the extra page, and we come to a situation that the CDB length was one-page longer then the actual bio size, which is of course rejected by the osd-target. While here also fix the bio size calculation, in the case that we received more then one group of devices. CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-08-04 11:54:38 -04:00
Heiko Carstens	88c9e42196	nfs: add missing prefetch.h include Fix this compile error on s390: CC [M] fs/nfs/blocklayout/blocklayout.o fs/nfs/blocklayout/blocklayout.c: In function 'bl_end_io_read': fs/nfs/blocklayout/blocklayout.c:201:4: error: implicit declaration of function 'prefetchw' Introduced with `9549ec01` "pnfsblock: bl_read_pagelist". Cc: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-08-04 11:54:25 -04:00
Linus Torvalds	14595f708e	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: use kzalloc in ext4_kzalloc()	2011-08-03 15:09:10 -10:00
Robert P. J. Day	206506ccf0	tmpfs: expand "help" to explain value of TMPFS_POSIX_ACL Expand the fs/Kconfig "help" info to clarify why it's a bad idea to deselect the TMPFS_POSIX_ACL config variable. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-03 14:25:25 -10:00
Hugh Dickins	31475dd611	mm: a few small updates for radix-swap Remove PageSwapBacked (!page_is_file_cache) cases from add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now go through shmem_add_to_page_cache(). Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(), and add a comment on swap entries to invalidate_mapping_pages(). And mincore_page() uses find_get_page() on what might be shmem or a tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-03 14:25:24 -10:00
Randy Dunlap	2af1416265	fs/dcache.c: fix new kernel-doc warning Fix new kernel-doc warning in fs/dcache.c: Warning(fs/dcache.c:797): No description found for parameter 'sb' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-03 14:25:21 -10:00
Pavel Shilovsky	0193e07226	CIFS: Fix missing a decrement of inFlight value if we failed on getting mid entry in cifs_call_async. Cc: stable@kernel.org Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-08-03 19:42:12 +00:00
Mathias Krause	db9481c047	ext4: use kzalloc in ext4_kzalloc() Commit 9933fc0i (ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and ext4_kvfree()) intruduced wrappers around k*alloc/vmalloc but introduced a typo for ext4_kzalloc() by not using kzalloc() but kmalloc(). Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-08-03 14:57:11 -04:00
Linus Torvalds	ed8f37370d	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (31 commits) Btrfs: don't call writepages from within write_full_page Btrfs: Remove unused variable 'last_index' in file.c Btrfs: clean up for find_first_extent_bit() Btrfs: clean up for wait_extent_bit() Btrfs: clean up for insert_state() Btrfs: remove unused members from struct extent_state Btrfs: clean up code for merging extent maps Btrfs: clean up code for extent_map lookup Btrfs: clean up search_extent_mapping() Btrfs: remove redundant code for dir item lookup Btrfs: make acl functions really no-op if acl is not enabled Btrfs: remove remaining ref-cache code Btrfs: remove a BUG_ON() in btrfs_commit_transaction() Btrfs: use wait_event() Btrfs: check the nodatasum flag when writing compressed files Btrfs: copy string correctly in INO_LOOKUP ioctl Btrfs: don't print the leaf if we had an error btrfs: make btrfs_set_root_node void Btrfs: fix oops while writing data to SSD partitions Btrfs: Protect the readonly flag of block group ... Fix up trivial conflicts (due to acl and writeback cleanups) in - fs/btrfs/acl.c - fs/btrfs/ctree.h - fs/btrfs/extent_io.c	2011-08-02 21:14:05 -10:00
Al Viro	3567866bf2	RCUify freeing acls, let check_acl() go ahead in RCU mode if acl is cached Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-03 00:58:42 -04:00
Jeff Layton	b802898334	cifs: demote DFS referral lookup errors to cFYI cifs: demote DFS referral lookup errors to cFYI Now that we call into this routine on every mount, anyone who doesn't have the upcall configured will get multiple printks about failed lookups. Reported-and-Tested-by: Martijn Uffing <mp3project@sarijopen.student.utwente.nl> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-08-03 03:19:28 +00:00
Steve French	fc05a78efb	Revert "cifs: advertise the right receive buffer size to the server" This reverts commit `c4d3396b26`. Problems discovered with readdir to Samba due to not accounting for header size properly with this change	2011-08-03 03:17:43 +00:00
Rafael J. Wysocki	da5aa861be	fix block device fallout from ->fsync() changes blkdev_fsync() needs to write pages in pagecache... Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 21:33:47 -04:00
Linus Torvalds	60ad446682	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (60 commits) ext4: prevent memory leaks from ext4_mb_init_backend() on error path ext4: use EXT4_BAD_INO for buddy cache to avoid colliding with valid inode # ext4: use ext4_msg() instead of printk in mballoc ext4: use ext4_kvzalloc()/ext4_kvmalloc() for s_group_desc and s_group_info ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and ext4_kvfree() ext4: use the correct error exit path in ext4_init_inode_table() ext4: add missing kfree() on error return path in add_new_gdb() ext4: change umode_t in tracepoint headers to be an explicit __u16 ext4: fix races in ext4_sync_parent() ext4: Fix overflow caused by missing cast in ext4_fallocate() ext4: add action of moving index in ext4_ext_rm_idx for Punch Hole ext4: simplify parameters of reserve_backup_gdb() ext4: simplify parameters of add_new_gdb() ext4: remove lock_buffer in bclean() and setup_new_group_blocks() ext4: simplify journal handling in setup_new_group_blocks() ext4: let setup_new_group_blocks() set multiple bits at a time ext4: fix a typo in ext4_group_extend() ext4: let ext4_group_add_blocks() handle 0 blocks quickly ext4: let ext4_group_add_blocks() return an error code ext4: rename ext4_add_groupblocks() to ext4_group_add_blocks() ... Fix up conflict in fs/ext4/inode.c: commit `aacfc19c62` ("fs: simplify the blockdev_direct_IO prototype") had changed the ext4_ind_direct_IO() function for the new simplified calling convention, while commit `dae1e52cb1` ("ext4: move ext4_ind_* functions from inode.c to indirect.c") moved the function to another file.	2011-08-01 13:56:03 -10:00
Linus Torvalds	1b8e94993c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: xfs: Fix build breakage in xfs_iops.c when CONFIG_FS_POSIX_ACL is not set VFS: Reorganise shrink_dcache_for_umount_subtree() after demise of dcache_lock VFS: Remove dentry->d_lock locking from shrink_dcache_for_umount_subtree() VFS: Remove detached-dentry counter from shrink_dcache_for_umount_subtree() switch posix_acl_chmod() to umode_t switch posix_acl_from_mode() to umode_t switch posix_acl_equiv_mode() to umode_t * switch posix_acl_create() to umode_t * block: initialise bd_super in bdget() vfs: avoid call to inode_lru_list_del() if possible vfs: avoid taking inode_hash_lock on pipes and sockets vfs: conditionally call inode_wb_list_del() VFS: Fix automount for negative autofs dentries Btrfs: load the key from the dir item in readdir into a fake dentry devtmpfs: missing initialialization in never-hit case hppfs: missing include	2011-08-01 13:48:31 -10:00
Linus Torvalds	a2d7730235	Merge branch 'pstore-efi' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 * 'pstore-efi' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6: efivars: Introduce PSTORE_EFI_ATTRIBUTES efivars: Use string functions in pstore_write efivars: introduce utf16_strncmp efivars: String functions efi: Add support for using efivars as a pstore backend pstore: Allow the user to explicitly choose a backend pstore: Make "part" unsigned pstore: Add extra context for writes and erases pstore: Extend API for more flexibility in new backends	2011-08-01 13:40:51 -10:00
Yu Jian	79a77c5ac3	ext4: prevent memory leaks from ext4_mb_init_backend() on error path In ext4_mb_init(), if the s_locality_group allocation fails it will currently cause the allocations made in ext4_mb_init_backend() to be leaked. Moving the ext4_mb_init_backend() allocation after the s_locality_group allocation avoids that problem. Signed-off-by: Yu Jian <yujian@whamcloud.com> Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-08-01 17:41:46 -04:00
Yu Jian	48e6061bf4	ext4: use EXT4_BAD_INO for buddy cache to avoid colliding with valid inode # Signed-off-by: Yu Jian <yujian@whamcloud.com> Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-08-01 17:41:39 -04:00
Theodore Ts'o	9d8b9ec442	ext4: use ext4_msg() instead of printk in mballoc Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-08-01 17:41:35 -04:00
Josef Bacik	0d10ee2e6d	Btrfs: don't call writepages from within write_full_page When doing a writepage we call writepages to try and write out any other dirty pages in the area. This could cause problems where we commit a transaction and then have somebody else dirtying metadata in the area as we could end up writing out a lot more than we care about, which could cause latency on anybody who is waiting for the transaction to completely finish committing. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:37:36 -04:00
Mitch Harder	341d14f161	Btrfs: Remove unused variable 'last_index' in file.c The variable 'last_index' is calculated in the __btrfs_buffered_write function and passed as a parameter to the prepare_pages function, but is not used anywhere in the prepare_pages function. Remove instances of 'last_index' in these functions. Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:32:39 -04:00
Xiao Guangrong	69261c4b6a	Btrfs: clean up for find_first_extent_bit() find_first_extent_bit() and find_first_extent_bit_state() share most of the code, and we can just make the former call the latter. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:32:39 -04:00
Xiao Guangrong	ded91f0814	Btrfs: clean up for wait_extent_bit() We can just use cond_resched_lock(). Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:32:38 -04:00
Xiao Guangrong	3150b69969	Btrfs: clean up for insert_state() Don't duplicate set_state_bits(). Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:32:30 -04:00
Xiao Guangrong	3a6d457ec7	Btrfs: remove unused members from struct extent_state These members are not used at all. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:50 -04:00
Li Zefan	4d2c8f62f1	Btrfs: clean up code for merging extent maps unpin_extent_cache() and add_extent_mapping() shares the same code that merges extent maps. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:50 -04:00
Li Zefan	ed64f06652	Btrfs: clean up code for extent_map lookup lookup_extent_map() and search_extent_map() can share most of code. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:49 -04:00
Li Zefan	7e016a038e	Btrfs: clean up search_extent_mapping() rb_node returned by __tree_search() can be a valid pointer or NULL, but won't be some errno. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:49 -04:00
Li Zefan	85d85a743d	Btrfs: remove redundant code for dir item lookup When we search a dir item with a specific hash code, we can just return NULL without further checking if btrfs_search_slot() returns 1. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:48 -04:00
Li Zefan	9b89d95a14	Btrfs: make acl functions really no-op if acl is not enabled So there's no overhead for something we don't use. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:48 -04:00
Li Zefan	15de900d08	Btrfs: remove remaining ref-cache code Since commit `f2a97a9dbd` ("btrfs: remove all unused functions"), there's no extern functions at all in ref-cache.c, so just remove the remaining dead code. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:47 -04:00
Li Zefan	b9c8300c2a	Btrfs: remove a BUG_ON() in btrfs_commit_transaction() wait_for_commit() always returns 0. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:47 -04:00
Li Zefan	72d63ed642	Btrfs: use wait_event() Use wait_event() when possible to avoid code duplication. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:46 -04:00
Li Zefan	e55179b3d7	Btrfs: check the nodatasum flag when writing compressed files If mounting with nodatasum option, we won't csum file data for general write or direct-io write, and this rule should also be applied when writing compressed files. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:46 -04:00
Li Zefan	77906a5075	Btrfs: copy string correctly in INO_LOOKUP ioctl Memory areas [ptr, ptr+total_len] and [name, name+total_len] may overlap, so it's wrong to use memcpy(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:45 -04:00
Josef Bacik	b783e62d96	Btrfs: don't print the leaf if we had an error In __btrfs_free_extent we will print the leaf if we fail to find the extent we wanted, but the problem is if we get an error we won't have a leaf so often this leads to a NULL pointer dereference and we lose the error that actually occurred. So only print the leaf if ret > 0, which means we didn't find the item we were looking for but we didn't error either. This way the error is preserved. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:45 -04:00
Mark Fasheh	bf5f32ecb6	btrfs: make btrfs_set_root_node void This is fairly trivial - btrfs_set_root_node() - always returns zero so we can just make it void. All callers ignore the return code now anyway. I also made sure to check that none of the functions that btrfs_set_root_node() calls returns an error that we might have needed to catch and pass back. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:44 -04:00
liubo	ff1f2b4407	Btrfs: fix oops while writing data to SSD partitions Here I have a two SSD-partitions btrfs, and they are defaultly set to "data=raid0, metadata=raid1", then I try to fill my btrfs partition till "No space left on device", via "dd if=/dev/zero of=/mnt/btrfs/tmp". I get an oops panic from kernel BUG at fs/btrfs/extent-tree.c:5199!, which refers to find_free_extent's BUG_ON(index != get_block_group_index(block_group)); In SSD mode, in order to find enough space to alloc, we may check the block_group cache which has been checked sometime before, but the index is not updated, where it hits the BUG_ON. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Acked-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:44 -04:00
WuBo	61cfea9bb8	Btrfs: Protect the readonly flag of block group The access for ro in btrfs_block_group_cache should be protected because of the racy lock in relocation. Signed-off-by: Wu Bo <wu.bo@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:43 -04:00
Jeff Mahoney	1bf85046e4	btrfs: Make extent-io callbacks that never fail return void The set/clear bit and the extent split/merge hooks only ever return 0. Changing them to return void simplifies the error handling cases later. This patch changes the hook prototypes, the single implementation of each, and the functions that call them to return void instead. Since all four of these hooks execute under a spinlock, they're necessarily simple. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:43 -04:00
Li Zefan	b6973aa622	Btrfs: fix readahead in file defrag We passed the wrong value to btrfs_force_ra(). Fix this by changing the argument of btrfs_force_ra() from last_index to nr_page. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:42 -04:00
Tsutomu Itoh	b532402e4d	Btrfs: return error to caller when btrfs_unlink() failes When btrfs_unlink_inode() and btrfs_orphan_add() in btrfs_unlink() are error, the error code is returned to the caller instead of BUG_ON(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:42 -04:00
Wanlong Gao	a0f98dde11	Btrfs:don't check the return value of __btrfs_add_inode_defrag Don't need to check the return value of __btrfs_add_inode_defrag(), since it will always return 0. Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:41 -04:00
Chris Mason	b43b31bdf2	Merge branch 'alloc_path' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/btrfs-error-handling into for-linus	2011-08-01 14:27:34 -04:00
Dave Kleikamp	1c8007b076	jfs: flush journal completely before releasing metadata inodes This fixes a race during unmount. We need to not only make sure that the journal is completely written, but that the metadata changes make it to disk before releasing ipimap and ipbmap. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2011-08-01 12:41:00 -05:00
Linus Torvalds	5f66d2b58c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: CIFS: Cleanup demupltiplex thread exiting code CIFS: Move mid search to a separate function CIFS: Move RFC1002 check to a separate function CIFS: Simplify socket reading in demultiplex thread CIFS: Move buffer allocation to a separate function cifs: remove unneeded variable initialization in cifs_reconnect_tcon cifs: simplify refcounting for oplock breaks cifs: fix compiler warning in CIFSSMBQAllEAs cifs: fix name parsing in CIFSSMBQAllEAs cifs: don't start signing too early cifs: trivial: goto out here is unnecessary cifs: advertise the right receive buffer size to the server	2011-08-01 06:14:25 -10:00
Pavel Shilovsky	762dfd1057	CIFS: Cleanup demupltiplex thread exiting code Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-08-01 12:49:45 +00:00
Pavel Shilovsky	ad69bae178	CIFS: Move mid search to a separate function Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-08-01 12:49:42 +00:00
Pavel Shilovsky	98bac62c9f	CIFS: Move RFC1002 check to a separate function Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-08-01 12:49:38 +00:00
Pavel Shilovsky	e7015fb1c5	CIFS: Simplify socket reading in demultiplex thread Move reading to separate function and remove csocket variable. Also change semantic in a little: goto incomplete_rcv only when we get -EAGAIN (or a familiar error) while reading rfc1002 header. In this case we don't check for echo timeout when we don't get whole header at once, as it was before. Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-08-01 12:49:34 +00:00
Theodore Ts'o	f18a5f21c2	ext4: use ext4_kvzalloc()/ext4_kvmalloc() for s_group_desc and s_group_info Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-08-01 08:45:38 -04:00
Theodore Ts'o	9933fc0ac1	ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and ext4_kvfree() Introduce new helper functions which try kmalloc, and then fall back to vmalloc if necessary, and use them for allocating and deallocating s_flex_groups. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-08-01 08:45:02 -04:00
Pavel Shilovsky	3d9c2472a5	CIFS: Move buffer allocation to a separate function Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-08-01 12:33:44 +00:00
Yongqiang Yang	33853a0dde	ext4: use the correct error exit path in ext4_init_inode_table() This patch lets ext4_init_inode_table() handle errors right. ext4_init_inode_table() should down_write() alloc_sem which has been up_write()ed and stop the started journal handle. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-08-01 06:32:19 -04:00
Markus Trippelsdorf	206d440f64	xfs: Fix build breakage in xfs_iops.c when CONFIG_FS_POSIX_ACL is not set commit `4e34e719e4`, that takes the ACL checks to common code, accidentely broke the build when CONFIG_FS_POSIX_ACL is not set: CC fs/xfs/linux-2.6/xfs_iops.o fs/xfs/linux-2.6/xfs_iops.c:1025:14: error: ‘xfs_get_acl’ undeclared here (not in a function) Fix this by declaring xfs_get_acl a static inline function. Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 02:35:04 -04:00
David Howells	43c1c9cd24	VFS: Reorganise shrink_dcache_for_umount_subtree() after demise of dcache_lock Reorganise shrink_dcache_for_umount_subtree() in light of the demise of dcache_lock. Without that dcache_lock, there is no need for the batching of removal of dentries from the system under it (we wanted to make intensive use of the locked data whilst we held it, but didn't want to hold it for long at a time). This works, provided the preceding patch is correct in its removal of locking on dentry->d_lock on the basis that no one should be locking these dentries any more as the whole superblock is defunct. With this patch, the calls to dentry_lru_del() and __d_shrink() are placed at the point where each dentry is detached handled. It is possible that, as an alternative, the batching should still be done - but only for dentry_lru_del() of all a dentry's children in one go. In such a case, the batching would be done under dcache_lru_lock. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 02:27:57 -04:00
David Howells	c6627c60c0	VFS: Remove dentry->d_lock locking from shrink_dcache_for_umount_subtree() Locks of the dcache_lock were replaced by locks of dentry->d_lock in commits such as: `2304450783` `2fd6b7f507` as part of the RCU-based pathwalk changes, despite the fact that the caller (shrink_dcache_for_umount()) notes in the banner comment the reasons that d_lock is not necessary in these functions: /* * destroy the dentries attached to a superblock on unmounting * - we don't need to use dentry->d_lock because: * - the superblock is detached from all mountings and open files, so the * dentry trees will not be rearranged by the VFS * - s_umount is write-locked, so the memory pressure shrinker will ignore * any dentries belonging to this superblock that it comes across * - the filesystem itself is no longer permitted to rearrange the dentries * in this superblock */ So remove these locks. If the locks are actually necessary, then this banner comment should be altered instead. The hash table chains are protected by 1-bit locks in the hash table heads, so those shouldn't be a problem. Note that to make this work, __d_drop() has to be split so that the RCUwalk barrier can be avoided. This causes problems otherwise as it has an assertion that dentry->d_lock is locked - but there is no need for that as no one else can be trying to access this dentry, except to step over it (and that should be handled by d_free(), I think). Signed-off-by: David Howells <dhowells@redhat.com> Cc: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 02:27:57 -04:00
David Howells	35f40ef002	VFS: Remove detached-dentry counter from shrink_dcache_for_umount_subtree() Remove the detached-dentry counter from shrink_dcache_for_umount_subtree() as the value it computes is no longer used as of commit `312d3ca856` which made the nr_dentry counters summed per-CPU rather than global atomic. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 02:27:57 -04:00
Al Viro	86bc704db0	switch posix_acl_chmod() to umode_t again, that's what all callers pass to it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 02:10:32 -04:00
Al Viro	3a5fba19b0	switch posix_acl_from_mode() to umode_t ... seeing that this is what all callers pass to it anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 02:10:20 -04:00
Al Viro	d6952123b5	switch posix_acl_equiv_mode() to umode_t * ... so that &inode->i_mode could be passed to it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 02:10:06 -04:00
Al Viro	d3fb612076	switch posix_acl_create() to umode_t * so we can pass &inode->i_mode to it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 02:09:42 -04:00
Lachlan McIlroy	782b94cdf5	block: initialise bd_super in bdget() bd_super is currently reset to NULL in kill_block_super() so we rely on previous users of the block_device object to initialise this value for the next user. This quirk was exposed on RHEL5 when a third party filesystem did not always use kill_block_super() and therefore bd_super wasn't being reset when a block_device object was recycled within the cache. This may not be a problem upstream but makes sense to be defensive. Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 01:57:44 -04:00
Eric Dumazet	c4ae0c6545	vfs: avoid call to inode_lru_list_del() if possible inode_lru_list_del() is expensive because of per superblock lru locking, while some inodes are not in lru list. Adding a check in iput_final() can speedup pipe/sockets workloads on SMP. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 01:41:17 -04:00
Eric Dumazet	f2ee7abf4c	vfs: avoid taking inode_hash_lock on pipes and sockets Some inodes (pipes, sockets, ...) are not hashed, no need to take contended inode_hash_lock at dismantle time. nice speedup on SMP machines on socket intensive workloads. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 01:41:17 -04:00
Eric Dumazet	b12362bdb6	vfs: conditionally call inode_wb_list_del() Some inodes (pipes, sockets, ...) are not in bdi writeback list. evict() can avoid calling inode_wb_list_del() and its expensive spinlock by checking inode i_wb_list being empty or not. At this point, no other cpu/user can concurrently manipulate this inode i_wb_list Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 01:41:17 -04:00
David Howells	5a30d8a2b8	VFS: Fix automount for negative autofs dentries Autofs may set the DCACHE_NEED_AUTOMOUNT flag on negative dentries. These need attention from the automounter daemon regardless of the LOOKUP_FOLLOW flag. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 01:38:01 -04:00
Josef Bacik	b4aff1f874	Btrfs: load the key from the dir item in readdir into a fake dentry In btrfs we have 2 indexes for inodes. One is for readdir, it's in this nice sequential order and works out brilliantly for readdir. However if you use ls, it usually stat's each file it gets from readdir. This is where the second index comes in, which is based on a hash of the name of the file. So then the lookup has to lookup this index, and then lookup the inode. The index lookup is going to be in random order (since its based on the name hash), which gives us less than stellar performance. Since we know the inode location from the readdir index, I create a dummy dentry and copy the location key into dentry->d_fsdata. Then on lookup if we have d_fsdata we use that location to lookup the inode, avoiding looking up the other directory index. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 01:31:42 -04:00
Trond Myklebust	a00ed25cce	NFS: Re-enable compilation of nfs with !CONFIG_NFS_V4 \|\| !CONFIG_NFS_V4_1 Fix two recently introduced compile problems: Fix a typo in fs/nfs/pnfs.h Move the pnfs_blksize declaration outside the CONFIG_NFS_V4 section in struct nfs_server. Reported-by: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-31 14:27:04 -10:00
Jeff Layton	c4a5534a1b	cifs: remove unneeded variable initialization in cifs_reconnect_tcon Reported-and-acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-31 21:27:16 +00:00
Jeff Layton	ad635942c8	cifs: simplify refcounting for oplock breaks Currently, we take a sb->s_active reference and a cifsFileInfo reference when an oplock break workqueue job is queued. This is unnecessary and more complicated than it needs to be. Also as Al points out, deactivate_super has non-trivial locking implications so it's best to avoid that if we can. Instead, just cancel any pending oplock breaks for this filehandle synchronously in cifsFileInfo_put after taking it off the lists. That should ensure that this job doesn't outlive the structures it depends on. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-31 21:21:20 +00:00
Jeff Layton	5980fc966b	cifs: fix compiler warning in CIFSSMBQAllEAs The recent fix to the above function causes this compiler warning to pop on some gcc versions: CC [M] fs/cifs/cifssmb.o fs/cifs/cifssmb.c: In function ‘CIFSSMBQAllEAs’: fs/cifs/cifssmb.c:5708: warning: ‘ea_name_len’ may be used uninitialized in this function Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-31 21:21:13 +00:00
Jeff Layton	91d065c473	cifs: fix name parsing in CIFSSMBQAllEAs The code that matches EA names in CIFSSMBQAllEAs is incorrect. It uses strncmp to do the comparison with the length limited to the name_len sent in the response. Problem: Suppose we're looking for an attribute named "foobar" and have an attribute before it in the EA list named "foo". The comparison will succeed since we're only looking at the first 3 characters. Fix this by also comparing the length of the provided ea_name with the name_len in the response. If they're not equal then it shouldn't match. Reported-by: Jian Li <jiali@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-31 21:21:09 +00:00
Jeff Layton	998d6fcb24	cifs: don't start signing too early Sniffing traffic on the wire shows that windows clients send a zeroed out signature field in a NEGOTIATE request, and send "BSRSPYL" in the signature field during SESSION_SETUP. Make the cifs client behave the same way. It doesn't seem to make much difference in any server that I've tested against, but it's probably best to follow windows behavior as closely as possible here. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-31 21:21:06 +00:00
Jeff Layton	1f1cff0be0	cifs: trivial: goto out here is unnecessary ...and remove some obsolete comments. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-31 21:21:02 +00:00
Jeff Layton	c4d3396b26	cifs: advertise the right receive buffer size to the server Currently, we mirror the same size back to the server that it sends us. That makes little sense. Instead we should be sending the server the maximum buffer size that we can handle -- CIFSMaxBufSize minus the 4 byte RFC1001 header. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-31 21:20:58 +00:00
Linus Torvalds	24c3047095	Merge branch 'nfs-for-3.1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs * 'nfs-for-3.1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits) pnfsblock: write_pagelist handle zero invalid extents pnfsblock: note written INVAL areas for layoutcommit pnfsblock: bl_write_pagelist pnfsblock: bl_read_pagelist pnfsblock: cleanup_layoutcommit pnfsblock: encode_layoutcommit pnfsblock: merge rw extents pnfsblock: add extent manipulation functions pnfsblock: bl_find_get_extent pnfsblock: xdr decode pnfs_block_layout4 pnfsblock: call and parse getdevicelist pnfsblock: merge extents pnfsblock: lseg alloc and free pnfsblock: remove device operations pnfsblock: add device operations pnfsblock: basic extent code pnfsblock: use pageio_ops api pnfsblock: add blocklayout Kconfig option, Makefile, and stubs pnfs: cleanup_layoutcommit pnfs: ask for layout_blksize and save it in nfs_server ...	2011-07-31 06:26:50 -10:00
Peng Tao	71cdd40fd4	pnfsblock: write_pagelist handle zero invalid extents For invalid extents, find other pages in the same fsblock and write them out. [pnfsblock: write_begin] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:17 -04:00
Fred Isaman	31e6306a40	pnfsblock: note written INVAL areas for layoutcommit Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:17 -04:00
Fred Isaman	650e2d39bd	pnfsblock: bl_write_pagelist Note: When upper layer's read/write request cannot be fulfilled, the block layout driver shouldn't silently mark the page as error. It should do what can be done and leave the rest to the upper layer. To do so, we should set rdata/wdata->res.count properly. When upper layer re-send the read/write request to finish the rest part of the request, pgbase is the position where we should start at. [pnfsblock: bl_write_pagelist support functions] [pnfsblock: bl_write_pagelist adjust for missing PG_USE_PNFS] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> [pnfsblock: handle errors when read or write pagelist.] Signed-off-by: Zhang Jingwang <yyalone@gmail.com> [pnfs-block: use new write_pagelist api] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> [SQUASHME: pnfsblock: mds_offset is set in the generic layer] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> [pnfsblock: mark IO error with NFS_LAYOUT_{RW\|RO}_FAILED] Signed-off-by: Peng Tao <peng_tao@emc.com> [pnfsblock: SQUASHME: adjust to API change] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> [pnfsblock: fixup blksize alignment in bl_setup_layoutcommit] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> [pnfsblock: bl_write_pagelist adjust for missing PG_USE_PNFS] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> [pnfsblock: handle errors when read or write pagelist.] Signed-off-by: Zhang Jingwang <yyalone@gmail.com> [pnfs-block: use new write_pagelist api] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:17 -04:00
Fred Isaman	9549ec01b0	pnfsblock: bl_read_pagelist Note: When upper layer's read/write request cannot be fulfilled, the block layout driver shouldn't silently mark the page as error. It should do what can be done and leave the rest to the upper layer. To do so, we should set rdata/wdata->res.count properly. When upper layer re-send the read/write request to finish the rest part of the request, pgbase is the position where we should start at. [pnfsblock: mark IO error with NFS_LAYOUT_{RW\|RO}_FAILED] Signed-off-by: Peng Tao <peng_tao@emc.com> [pnfsblock: read path error handling] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> [pnfsblock: handle errors when read or write pagelist.] Signed-off-by: Zhang Jingwang <yyalone@gmail.com> [pnfs-block: use new read_pagelist api] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:17 -04:00
Fred Isaman	b2be7811dd	pnfsblock: cleanup_layoutcommit In blocklayout driver. There are two things happening while layoutcommit/cleanup. 1. the modified extents are encoded. 2. On cleanup the extents are put back on the layout rw extents list, for reads. In the new system where actual xdr encoding is done in encode_layoutcommit() directly into xdr buffer, these are the new commit stages: 1. On setup_layoutcommit, the range is adjusted as before and a structure is allocated for communication with bl_encode_layoutcommit && bl_cleanup_layoutcommit (Generic layer provides a void-star to hang it on) 2. bl_encode_layoutcommit is called to do the actual encoding directly into xdr. The commit-extent-list is not freed and is stored on above structure. FIXME: The code is not yet converted to the new XDR cleanup 3. On cleanup the commit-extent-list is put back by a call to set_to_rw() as before, but with no need for XDR decoding of the list as before. And the commit-extent-list is freed. Finally allocated structure is freed. [rm inode and pnfs_layout_hdr args from cleanup_layoutcommit()] Signed-off-by: Jim Rees <rees@umich.edu> [pnfsblock: introduce bl_committing list] Signed-off-by: Peng Tao <peng_tao@emc.com> [pnfsblock: SQUASHME: adjust to API change] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> [blocklayout: encode_layoutcommit implementation] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> [pnfsblock: fix bug setting up layoutcommit.] Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn> [pnfsblock: cleanup_layoutcommit wants a status parameter] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:17 -04:00
Fred Isaman	90ace12ac4	pnfsblock: encode_layoutcommit In blocklayout driver. There are two things happening while layoutcommit/cleanup. 1. the modified extents are encoded. 2. On cleanup the extents are put back on the layout rw extents list, for reads. In the new system where actual xdr encoding is done in encode_layoutcommit() directly into xdr buffer, these are the new commit stages: 1. On setup_layoutcommit, the range is adjusted as before and a structure is allocated for communication with bl_encode_layoutcommit && bl_cleanup_layoutcommit (Generic layer provides a void-star to hang it on) 2. bl_encode_layoutcommit is called to do the actual encoding directly into xdr. The commit-extent-list is not freed and is stored on above structure. FIXME: The code is not yet converted to the new XDR cleanup 3. On cleanup the commit-extent-list is put back by a call to set_to_rw() as before, but with no need for XDR decoding of the list as before. And the commit-extent-list is freed. Finally allocated structure is freed. [rm inode and pnfs_layout_hdr args from cleanup_layoutcommit()] [pnfsblock: get rid of deprecated xdr macros] Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> [blocklayout: encode_layoutcommit implementation] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> [pnfsblock: fix bug setting up layoutcommit.] Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn> [pnfsblock: prevent commit list corruption] [pnfsblock: fix layoutcommit with an empty opaque] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:17 -04:00
Fred Isaman	9f3770422c	pnfsblock: merge rw extents Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:17 -04:00
Fred Isaman	c1c2a4cd35	pnfsblock: add extent manipulation functions Adds working implementations of various support functions to handle INVAL extents, needed by writes, such as bl_mark_sectors_init and bl_is_sector_init. [pnfsblock: fix 64-bit compiler warnings for extent manipulation] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> [Implement release_inval_marks] Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:17 -04:00
Fred Isaman	6d742ba538	pnfsblock: bl_find_get_extent Implement bl_find_get_extent(), one of the core extent manipulation routines. [pnfsblock: Lookup list entry of layouts and tags in reverse order] Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Jim Rees <rees@umich.edu> pnfsblock: fix print format warnings for sector_t and size_t gcc spews warnings about these on x86_64, e.g.: fs/nfs/blocklayout/blocklayout.c:74: warning: format ‘%Lu’ expects type ‘long long unsigned int’, but argument 2 has type ‘sector_t’ fs/nfs/blocklayout/blocklayout.c:388: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’ Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:16 -04:00
Fred Isaman	e9437ccef9	pnfsblock: xdr decode pnfs_block_layout4 XDR decodes the block layout payload sent in LAYOUTGET result, storing the result in an extent list. [pnfsblock: get rid of deprecated xdr macros] Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> [pnfsblock: fix bug getting pnfs_layout_type in translate_devid().] Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:16 -04:00
Fred Isaman	2f9fd18260	pnfsblock: call and parse getdevicelist Call GETDEVICELIST during mount, then call and parse GETDEVICEINFO for each device returned. [pnfsblock: get rid of deprecated xdr macros] Signed-off-by: Jim Rees <rees@umich.edu> [pnfsblock: fix pnfs_deviceid references] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> [pnfsblock: fix print format warnings for sector_t and size_t] [pnfs-block: #include <linux/vmalloc.h>] [pnfsblock: no PNFS_NFS_SERVER] Signed-off-by: Benny Halevy <bhalevy@panasas.com> [pnfsblock: fix bug determining size of striped volume] [pnfsblock: fix oops when using multiple devices] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> [pnfsblock: get rid of vmap and deviceid->area structure] Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:16 -04:00
Fred Isaman	03341d2cc9	pnfsblock: merge extents Replace a stub, so that extents underlying the layouts are properly added, merged, or ignored as necessary. Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> [pnfsblock: delete the new node before put it] Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:16 -04:00
Fred Isaman	a60d2ebd93	pnfsblock: lseg alloc and free Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> [pnfsblock: fix bug getting pnfs_layout_type in translate_devid().] Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Zhang Jingwang <Jingwang.Zhang@emc.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:16 -04:00
Jim Rees	025a70ed65	pnfsblock: remove device operations Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> [upcall bugfixes] Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:16 -04:00
Jim Rees	fe0a9b7408	pnfsblock: add device operations Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> [upcall bugfixes] Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:16 -04:00
Fred Isaman	9e69296999	pnfsblock: basic extent code Adds structures and basic create/delete code for extents. Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Zhang Jingwang <Jingwang.Zhang@emc.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:16 -04:00
Benny Halevy	e9643fe80d	pnfsblock: use pageio_ops api [pnfsblock: use pnfs_generic_pg_init_read/write] Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:15 -04:00
Fred Isaman	155e7524f2	pnfsblock: add blocklayout Kconfig option, Makefile, and stubs Define a configuration variable to enable/disable compilation of the block driver code. Add the minimal structure for a pnfs block layout driver, and empty list-heads that will hold the extent data [pnfsblock: make NFS_V4_1 select PNFS_BLOCK] Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Benny Halevy <bhalevy@panasas.com> [pnfs-block: fix CONFIG_PNFS_BLOCK dependencies] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> [pnfsblock: SQUASHME: adjust to API change] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> [pnfs: move pnfs_layout_type inline in nfs_inode] Signed-off-by: Benny Halevy <bhalevy@panasas.com> [blocklayout: encode_layoutcommit implementation] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> [pnfsblock: layout alloc and free] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> [pnfs: move pnfs_layout_type inline in nfs_inode] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> [pnfsblock: define module alias] Signed-off-by: Peng Tao <peng_tao@emc.com> [rm inode and pnfs_layout_hdr args from cleanup_layoutcommit()] Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:15 -04:00
Andy Adamson	db29c08909	pnfs: cleanup_layoutcommit This gives layout driver a chance to cleanup structures they put in at encode_layoutcommit. Signed-off-by: Andy Adamson <andros@netapp.com> [fixup layout header pointer for layoutcommit] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> [rm inode and pnfs_layout_hdr args from cleanup_layoutcommit()] Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:15 -04:00
Fred Isaman	dae100c2b1	pnfs: ask for layout_blksize and save it in nfs_server Block layout needs it to determine IO size. Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Tao Guo <glorioustao@gmail.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:15 -04:00
Benny Halevy	738fd0f360	pnfs: add set-clear layoutdriver interface To allow layout driver to issue getdevicelist at mount time, and clean up at umount time. [fixup non NFS_V4_1 set_pnfs_layoutdriver definition] [pnfs: pass mntfh down the init_pnfs path] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:15 -04:00
Andy Adamson	7f11d8d38d	pnfs: GETDEVICELIST The block driver uses GETDEVICELIST Signed-off-by: Andy Adamson <andros@netapp.com> [pass struct nfs_server * to getdevicelist] [get machince creds for getdevicelist] [fix getdevicelist decode sizing] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:15 -04:00
Peng Tao	3557c6c3be	pnfs: use lwb as layoutcommit length Using NFS4_MAX_UINT64 will break current protocol. [Needed in v3.0] CC: Stable Tree <stable@kernel.org> Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:15 -04:00
Peng Tao	a9bae5666d	pnfs: let layoutcommit handle a list of lseg There can be multiple lseg per file, so layoutcommit should be able to handle it. [Needed in v3.0] CC: Stable Tree <stable@kernel.org> Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:15 -04:00
Peng Tao	9fa4075878	pnfs: save layoutcommit cred at layout header init No need to save it for every lseg. No need to save it at every pnfs_set_layoutcommit. [Needed in v3.0] CC: Stable Tree <stable@kernel.org> Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:14 -04:00
Peng Tao	acff588053	pnfs: save layoutcommit lwb at layout header No need to save it for every lseg. [Needed in v3.0] CC: Stable Tree <stable@kernel.org> Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-31 12:18:14 -04:00
Wu Fengguang	0e995816f4	don't busy retry the inode on failed grab_super_passive() This fixes a soft lockup on conditions a) the flusher is working on a work by __bdi_start_writeback(), while b) someone else calls writeback_inodes_sb() or sync_inodes_sb(), which grab sb->s_umount and enqueue a new work for the flusher to execute The s_umount grabbed by (b) will fail the grab_super_passive() in (a). Then if the inode is requeued, wb_writeback() will busy retry on it. As a result, wb_writeback() loops for ever without releasing wb->list_lock, which further blocks other tasks. Fix the busy loop by redirtying the inode. This may undesirably delay the writeback of the inode, however most likely it will be picked up soon by the queued work by writeback_inodes_sb(), sync_inodes_sb() or even writeback_inodes_wb(). bug url: http://www.spinics.net/lists/linux-fsdevel/msg47292.html Reported-by: Christoph Hellwig <hch@infradead.org> Tested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-07-31 22:52:08 +08:00
Bryan Schumaker	374e4e3ec3	Additional readdir cookie loop information Print out the name of the file that triggers the cookie loop message to make it slightly easier to track down the cause. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-30 14:37:14 -04:00
Trond Myklebust	0c0308066c	NFS: Fix spurious readdir cookie loop messages If the directory contents change, then we have to accept that the file->f_pos value may shrink if we do a 'search-by-cookie'. In that case, we should turn off the loop detection and let the NFS client try to recover. The patch also fixes a second loop detection bug by ensuring that after turning on the ctx->duped flag, we read at least one new cookie into ctx->dir_cookie before attempting to match with ctx->dup_cookie. Reported-by: Petr Vandrovec <petr@vandrovec.name> Cc: stable@kernel.org [2.6.39+] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-30 14:34:50 -04:00
Dan Carpenter	c49bafa384	ext4: add missing kfree() on error return path in add_new_gdb() We added some more error handling in `b40971426a` "ext4: add error checking to calls to ext4_handle_dirty_metadata()". But we need to call kfree() as well to avoid a memory leak. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-30 12:58:41 -04:00
Theodore Ts'o	d59729f4e7	ext4: fix races in ext4_sync_parent() Fix problems if fsync() races against a rename of a parent directory as pointed out by Al Viro in his own inimitable way: >While we are at it, could somebody please explain what the hell is ext4 >doing in >static int ext4_sync_parent(struct inode inode) >{ > struct writeback_control wbc; > struct dentry dentry = NULL; > int ret = 0; > > while (inode && ext4_test_inode_state(inode, EXT4_STATE_NEWENTRY)) { > ext4_clear_inode_state(inode, EXT4_STATE_NEWENTRY); > dentry = list_entry(inode->i_dentry.next, > struct dentry, d_alias); > if (!dentry \|\| !dentry->d_parent \|\| !dentry->d_parent->d_inode) > break; > inode = dentry->d_parent->d_inode; > ret = sync_mapping_buffers(inode->i_mapping); > ... >Note that dentry obviously can't be NULL there. dentry->d_parent is never >NULL. And dentry->d_parent would better not be negative, for crying out >loud! What's worse, there's no guarantees that dentry->d_parent will >remain our parent over that sync_mapping_buffers() and that inode won't >just be freed under us (after rename() and memory pressure leading to >eviction of what used to be our dentry->d_parent)...... Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-30 12:34:19 -04:00
Linus Torvalds	983236b574	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: Fix build breakage in xfs_iops.c when CONFIG_FS_POSIX_ACL is not set	2011-07-29 23:45:06 -07:00
Linus Torvalds	74aec4e0dd	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6 * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6: ecryptfs: Make inode bdi consistent with superblock bdi eCryptfs: Unlock keys needed by ecryptfsd	2011-07-29 23:43:50 -07:00
Linus Torvalds	59ed2bb274	ext2: remove duplicate 'ext2_get_acl()' define When commit `4e34e719e4` ("fs: take the ACL checks to common code") changed the xyz_check_acl() functions into the more natural xyz_get_acl() interface, we grew two copies of the #define ext2_get_acl NULL define for the non-acl case. Remove the extra one. Reported-by: Marco Stornelli <marco.stornelli@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-29 23:21:50 -07:00
Markus Trippelsdorf	a5a7bbcc01	xfs: Fix build breakage in xfs_iops.c when CONFIG_FS_POSIX_ACL is not set commit `4e34e719e4`, that takes the ACL checks to common code, accidentely broke the build when CONFIG_FS_POSIX_ACL is not set: CC fs/xfs/linux-2.6/xfs_iops.o fs/xfs/linux-2.6/xfs_iops.c:1025:14: error: ‘xfs_get_acl’ undeclared here (not in a function) Fix this by declaring xfs_get_acl a static inline function. Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-29 12:26:14 -05:00
Thieu Le	985ca0e626	ecryptfs: Make inode bdi consistent with superblock bdi Make the inode mapping bdi consistent with the superblock bdi so that dirty pages are flushed properly. Signed-off-by: Thieu Le <thieule@chromium.org> Cc: <stable@kernel.org> [2.6.39+] Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2011-07-28 23:48:26 -05:00
Tyler Hicks	b2987a5e05	eCryptfs: Unlock keys needed by ecryptfsd Fixes a regression caused by `b5695d0463` Kernel keyring keys containing eCryptfs authentication tokens should not be write locked when calling out to ecryptfsd to wrap and unwrap file encryption keys. The eCryptfs kernel code can not hold the key's write lock because ecryptfsd needs to request the key after receiving such a request from the kernel. Without this fix, all file opens and creates will timeout and fail when using the eCryptfs PKI infrastructure. This is not an issue when using passphrase-based mount keys, which is the most widely deployed eCryptfs configuration. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Acked-by: Roberto Sassu <roberto.sassu@polito.it> Tested-by: Roberto Sassu <roberto.sassu@polito.it> Tested-by: Alexis Hafner1 <haf@zurich.ibm.com> Cc: <stable@kernel.org> [2.6.39+]	2011-07-28 23:30:09 -05:00
Jan Kara	c7e25e6e0b	ocfs2: Avoid livelock in ocfs2_readpage() When someone writes to an inode, readers accessing the same inode via ocfs2_readpage() just busyloop trying to get ip_alloc_sem because do_generic_file_read() looks up the page again and retries ->readpage() when previous attempt failed with AOP_TRUNCATED_PAGE. When there are enough readers, they can occupy all CPUs and in non-preempt kernel the system is deadlocked because writer holding ip_alloc_sem is never run to release the semaphore. Fix the problem by making reader block on ip_alloc_sem to break the busy loop. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Joel Becker <jlbec@evilplan.org>	2011-07-28 02:07:19 -07:00
Mark Fasheh	a11f7e63c5	ocfs2: serialize unaligned aio Fix a corruption that can happen when we have (two or more) outstanding aio's to an overlapping unaligned region. Ext4 (`e9e3bcecf4`) and xfs recently had to fix similar issues. In our case what happens is that we can have an outstanding aio on a region and if a write comes in with some bytes overlapping the original aio we may decide to read that region into a page before continuing (typically because of buffered-io fallback). Since we have no ordering guarantees with the aio, we can read stale or bad data into the page and then write it back out. If the i/o is page and block aligned, then we avoid this issue as there won't be any need to read data from disk. I took the same approach as Eric in the ext4 patch and introduced some serialization of unaligned async direct i/o. I don't expect this to have an effect on the most common cases of AIO. Unaligned aio will be slower though, but that's far more acceptable than data corruption. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <jlbec@evilplan.org>	2011-07-28 02:07:16 -07:00
Linus Torvalds	95b6886526	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (54 commits) tpm_nsc: Fix bug when loading multiple TPM drivers tpm: Move tpm_tis_reenable_interrupts out of CONFIG_PNP block tpm: Fix compilation warning when CONFIG_PNP is not defined TOMOYO: Update kernel-doc. tpm: Fix a typo tpm_tis: Probing function for Intel iTPM bug tpm_tis: Fix the probing for interrupts tpm_tis: Delay ACPI S3 suspend while the TPM is busy tpm_tis: Re-enable interrupts upon (S3) resume tpm: Fix display of data in pubek sysfs entry tpm_tis: Add timeouts sysfs entry tpm: Adjust interface timeouts if they are too small tpm: Use interface timeouts returned from the TPM tpm_tis: Introduce durations sysfs entry tpm: Adjust the durations if they are too small tpm: Use durations returned from TPM TOMOYO: Enable conditional ACL. TOMOYO: Allow using argv[]/envp[] of execve() as conditions. TOMOYO: Allow using executable's realpath and symlink's target as conditions. TOMOYO: Allow using owner/group etc. of file objects as conditions. ... Fix up trivial conflict in security/tomoyo/realpath.c	2011-07-27 19:26:38 -07:00
Al Viro	d6b722aa38	hppfs: missing include Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-27 22:21:58 -04:00
Utako Kusaka	29ae07b702	ext4: Fix overflow caused by missing cast in ext4_fallocate() The logical block number in map.l_blk is a __u32, and so before we shift it left, by the block size, we neeed cast it to a 64-bit size. Otherwise i_size can be corrupted on an ENOSPC. # df -T /mnt/mp1 Filesystem Type 1K-blocks Used Available Use% Mounted on /dev/sda6 ext4 9843276 153056 9190200 2% /mnt/mp1 # fallocate -o 0 -l 2199023251456 /mnt/mp1/testfile fallocate: /mnt/mp1/testfile: fallocate failed: No space left on device # stat /mnt/mp1/testfile File: `/mnt/mp1/testfile' Size: 4293656576 Blocks: 19380440 IO Block: 4096 regular file Device: 806h/2054d Inode: 12 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2011-07-25 13:01:31.414490496 +0900 Modify: 2011-07-25 13:01:31.414490496 +0900 Change: 2011-07-25 13:01:31.454490495 +0900 Signed-off-by: Utako Kusaka <u-kusaka@wm.jp.nec.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> -- fs/ext4/extents.c \| 2 +- 1 files changed, 1 insertions(+), 1 deletions(-)	2011-07-27 22:11:20 -04:00
Robin Dong	0e1147b001	ext4: add action of moving index in ext4_ext_rm_idx for Punch Hole The old function ext4_ext_rm_idx is used only for truncate case because it just remove last index in extent-index-block. When punching hole, it usually needed to remove "middle" index, therefore we must move indexes which after it forward. (I create a file with 1 depth extent tree and punch hole in the middle of it, the last index in index-block strangly gone, so I find out this bug) Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-27 21:29:33 -04:00
Yongqiang Yang	668f4dc559	ext4: simplify parameters of reserve_backup_gdb() The reserve_backup_gdb() function only needs the block group number; there's no need to pass a pointer to struct ext4_new_group_data to it. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>	2011-07-27 21:23:13 -04:00
Yongqiang Yang	2f91971014	ext4: simplify parameters of add_new_gdb() add_new_gdb() only needs the block group number; there is no need to pass a pointer to struct ext4_new_group_data to add_new_gdb(). Instead of filling in a pointer the struct buffer_head in add_new_gdb(), it's simpler to have the caller fetch it from the s_group_desc[] array. [Fixed error path to handle the case where struct buffer_head *primary hasn't been set yet. -- Ted] Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-27 21:16:33 -04:00
Yongqiang Yang	e6075e984d	ext4: remove lock_buffer in bclean() and setup_new_group_blocks() There is no need to lock the buffers since no one else should be touching these buffers besides the file system. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-27 20:40:18 -04:00
Linus Torvalds	22712200e1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors Btrfs: use the commit_root for reading free_space_inode crcs Btrfs: reduce extent_state lock contention for metadata Btrfs: remove lockdep magic from btrfs_next_leaf Btrfs: make a lockdep class for each root Btrfs: switch the btrfs tree locks to reader/writer Btrfs: fix deadlock when throttling transactions Btrfs: stop using highmem for extent_buffers Btrfs: fix BUG_ON() caused by ENOSPC when relocating space Btrfs: tag pages for writeback in sync Btrfs: fix enospc problems with delalloc Btrfs: don't flush delalloc arbitrarily Btrfs: use find_or_create_page instead of grab_cache_page Btrfs: use a worker thread to do caching Btrfs: fix how we merge extent states and deal with cached states Btrfs: use the normal checksumming infrastructure for free space cache Btrfs: serialize flushers in reserve_metadata_bytes Btrfs: do transaction space reservation before joining the transaction Btrfs: try to only do one btrfs_search_slot in do_setxattr	2011-07-27 16:43:52 -07:00
Linus Torvalds	597a67e0ba	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: optimize the negative xattr caching xfs: prevent against ioend livelocks in xfs_file_fsync xfs: flag all buffers as metadata xfs: encapsulate a block of debug code	2011-07-27 13:41:51 -07:00
Linus Torvalds	28890d3598	Merge branch 'nfs-for-3.1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs * 'nfs-for-3.1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (44 commits) NFSv4: Don't use the delegation->inode in nfs_mark_return_delegation() nfs: don't use d_move in nfs_async_rename_done RDMA: Increasing RPCRDMA_MAX_DATA_SEGS SUNRPC: Replace xprt->resend and xprt->sending with a priority queue SUNRPC: Allow caller of rpc_sleep_on() to select priority levels SUNRPC: Support dynamic slot allocation for TCP connections SUNRPC: Clean up the slot table allocation SUNRPC: Initalise the struct xprt upon allocation SUNRPC: Ensure that we grab the XPRT_LOCK before calling xprt_alloc_slot pnfs: simplify pnfs files module autoloading nfs: document nfsv4 sillyrename issues NFS: Convert nfs4_set_ds_client to EXPORT_SYMBOL_GPL SUNRPC: Convert the backchannel exports to EXPORT_SYMBOL_GPL SUNRPC: sunrpc should not explicitly depend on NFS config options NFS: Clean up - simplify the switch to read/write-through-MDS NFS: Move the pnfs write code into pnfs.c NFS: Move the pnfs read code into pnfs.c NFS: Allow the nfs_pageio_descriptor to signal that a re-coalesce is needed NFS: Use the nfs_pageio_descriptor->pg_bsize in the read/write request NFS: Cache rpc_ops in struct nfs_pageio_descriptor ...	2011-07-27 13:23:02 -07:00
Chris Mason	ff95acb673	Merge branch 'integration' into for-linus	2011-07-27 16:18:13 -04:00
Chris Mason	75c195a2ca	Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors The btrfs transaction code will return any errors that come from reserve_metadata_bytes. We need to make sure we don't return funny things like 1 or EAGAIN. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-07-27 16:11:41 -04:00
David Howells	09570f9149	proc: make struct proc_dir_entry::name a terminal array rather than a pointer Since __proc_create() appends the name it is given to the end of the PDE structure that it allocates, there isn't a need to store a name pointer. Instead we can just replace the name pointer with a terminal char array of _unspecified_ length. The compiler will simply append the string to statically defined variables of PDE type overlapping any hole at the end of the structure and, unlike specifying an explicitly _zero_ length array, won't give a warning if you try to statically initialise it with a string of more than zero length. Also, whilst we're at it: (1) Move namelen to end just prior to name and reduce it to a single byte (name shouldn't be longer than NAME_MAX). (2) Move pde_unload_lock two places further on so that if it's four bytes in size on a 64-bit machine, it won't cause an unused hole in the PDE struct. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-27 12:50:45 -07:00
Chris Mason	2cf8572dac	Btrfs: use the commit_root for reading free_space_inode crcs Now that we are using regular file crcs for the free space cache, we can deadlock if we try to read the free_space_inode while we are updating the crc tree. This commit fixes things by using the commit_root to read the crcs. This is safe because we the free space cache file would already be loaded if that block group had been changed in the current transaction. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-07-27 12:46:48 -04:00
Chris Mason	19b6caf4ac	Btrfs: reduce extent_state lock contention for metadata For metadata buffers that don't straddle pages (all of them), btrfs can safely use the page uptodate bits and extent_buffer uptodate bit instead of needing to use the extent_state tree. This greatly reduces contention on the state tree lock. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-07-27 12:46:47 -04:00
Chris Mason	31533fb263	Btrfs: remove lockdep magic from btrfs_next_leaf Before the reader/writer locks, btrfs_next_leaf needed to keep the path blocking to avoid making lockdep upset. Now that btrfs_next_leaf only takes read locks, this isn't required. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-07-27 12:46:47 -04:00
Chris Mason	85d4e46111	Btrfs: make a lockdep class for each root This patch was originally from Tejun Heo. lockdep complains about the btrfs locking because we sometimes take btree locks from two different trees at the same time. The current classes are based only on level in the btree, which isn't enough information for lockdep to figure out if the lock is safe. This patch makes a class for each type of tree, and lumps all the FS trees that actually have files and directories into the same class. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-07-27 12:46:46 -04:00
Chris Mason	bd681513fa	Btrfs: switch the btrfs tree locks to reader/writer The btrfs metadata btree is the source of significant lock contention, especially in the root node. This commit changes our locking to use a reader/writer lock. The lock is built on top of rw spinlocks, and it extends the lock tracking to remember if we have a read lock or a write lock when we go to blocking. Atomics count the number of blocking readers or writers at any given time. It removes all of the adaptive spinning from the old code and uses only the spinning/blocking hints inside of btrfs to decide when it should continue spinning. In read heavy workloads this is dramatically faster. In write heavy workloads we're still faster because of less contention on the root node lock. We suffer slightly in dbench because we schedule more often during write locks, but all other benchmarks so far are improved. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-07-27 12:46:46 -04:00
Josef Bacik	81317fdedd	Btrfs: fix deadlock when throttling transactions Hit this nice little deadlock. What happens is this __btrfs_end_transaction with throttle set, --use_count so it equals 0 btrfs_commit_transaction <somebody else actually manages to start the commit> btrfs_end_transaction --use_count so now its -1 <== BAD we just return and wait on the transaction This is bad because we just return after our use_count is -1 and don't let go of our num_writer count on the transaction, so the guy committing the transaction just sits there forever. Fix this by inc'ing our use_count if we're going to call commit_transaction so that if we call btrfs_end_transaction it's valid. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-07-27 12:46:46 -04:00
Chris Mason	a65917156e	Btrfs: stop using highmem for extent_buffers The extent_buffers have a very complex interface where we use HIGHMEM for metadata and try to cache a kmap mapping to access the memory. The next commit adds reader/writer locks, and concurrent use of this kmap cache would make it even more complex. This commit drops the ability to use HIGHMEM with extent buffers, and rips out all of the related code. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-07-27 12:46:45 -04:00
Miao Xie	199c36eaa9	Btrfs: fix BUG_ON() caused by ENOSPC when relocating space When we balanced the chunks across the devices, BUG_ON() in __finish_chunk_alloc() was triggered. ------------[ cut here ]------------ kernel BUG at fs/btrfs/volumes.c:2568! [SNIP] Call Trace: [<ffffffffa049525e>] btrfs_alloc_chunk+0x8e/0xa0 [btrfs] [<ffffffffa04546b0>] do_chunk_alloc+0x330/0x3a0 [btrfs] [<ffffffffa045c654>] btrfs_reserve_extent+0xb4/0x1f0 [btrfs] [<ffffffffa045c86b>] btrfs_alloc_free_block+0xdb/0x350 [btrfs] [<ffffffffa048a8d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs] [<ffffffffa04476fd>] __btrfs_cow_block+0x14d/0x5e0 [btrfs] [<ffffffffa044660d>] ? read_block_for_search+0x14d/0x4d0 [btrfs] [<ffffffffa0447c9b>] btrfs_cow_block+0x10b/0x240 [btrfs] [<ffffffffa044dd5e>] btrfs_search_slot+0x49e/0x7a0 [btrfs] [<ffffffffa044f07d>] btrfs_insert_empty_items+0x8d/0xf0 [btrfs] [<ffffffffa045e973>] insert_with_overflow+0x43/0x110 [btrfs] [<ffffffffa045eb0d>] btrfs_insert_dir_item+0xcd/0x1f0 [btrfs] [<ffffffffa0489bd0>] ? map_extent_buffer+0xb0/0xc0 [btrfs] [<ffffffff812276ad>] ? rb_insert_color+0x9d/0x160 [<ffffffffa046cc40>] ? inode_tree_add+0xf0/0x150 [btrfs] [<ffffffffa0474801>] btrfs_add_link+0xc1/0x1c0 [btrfs] [<ffffffff811dacac>] ? security_inode_init_security+0x1c/0x30 [<ffffffffa04a28aa>] ? btrfs_init_acl+0x4a/0x180 [btrfs] [<ffffffffa047492f>] btrfs_add_nondir+0x2f/0x70 [btrfs] [<ffffffffa046af16>] ? btrfs_init_inode_security+0x46/0x60 [btrfs] [<ffffffffa0474ac0>] btrfs_create+0x150/0x1d0 [btrfs] [<ffffffff81159c63>] ? generic_permission+0x23/0xb0 [<ffffffff8115b415>] vfs_create+0xa5/0xc0 [<ffffffff8115ce6e>] do_last+0x5fe/0x880 [<ffffffff8115dc0d>] path_openat+0xcd/0x3d0 [<ffffffff8115e029>] do_filp_open+0x49/0xa0 [<ffffffff8116a965>] ? alloc_fd+0x95/0x160 [<ffffffff8114f0c7>] do_sys_open+0x107/0x1e0 [<ffffffff810bcc3f>] ? audit_syscall_entry+0x1bf/0x1f0 [<ffffffff8114f1e0>] sys_open+0x20/0x30 [<ffffffff81484ec2>] system_call_fastpath+0x16/0x1b [SNIP] RIP [<ffffffffa049444a>] __finish_chunk_alloc+0x20a/0x220 [btrfs] The reason is: Task1 Space balance task do_chunk_alloc() __finish_chunk_alloc() update device info in the chunk tree alloc system metadata block relocate system metadata block group set system metadata block group readonly, This block group is the only one that can allocate space. So there is no free space that can be allocated now. find no space and don't try to alloc new chunk, and then return ENOSPC BUG_ON() in __finish_chunk_alloc() was triggered. Fix this bug by allocating a new system metadata chunk before relocating the old one if we find there is no free space which can be allocated after setting the old block group to be read-only. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-07-27 12:46:45 -04:00
Josef Bacik	f7aaa06bff	Btrfs: tag pages for writeback in sync Everybody else does this, we need to do it too. If we're syncing, we need to tag the pages we're going to write for writeback so we don't end up writing the same stuff over and over again if somebody is constantly redirtying our file. This will keep us from having latencies with heavy sync workloads. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-07-27 12:46:44 -04:00
Josef Bacik	9e0baf60de	Btrfs: fix enospc problems with delalloc So I had this brilliant idea to use atomic counters for outstanding and reserved extents, but this turned out to be a bad idea. Consider this where we have 1 outstanding extent and 1 reserved extent Reserver Releaser atomic_dec(outstanding) now 0 atomic_read(outstanding)+1 get 1 atomic_read(reserved) get 1 don't actually reserve anything because they are the same atomic_cmpxchg(reserved, 1, 0) atomic_inc(outstanding) atomic_add(0, reserved) free reserved space for 1 extent Then the reserver now has no actual space reserved for it, and when it goes to finish the ordered IO it won't have enough space to do it's allocation and you get those lovely warnings. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-07-27 12:46:44 -04:00
Josef Bacik	a599142806	Btrfs: don't flush delalloc arbitrarily Kill the check to see if we have 512mb of reserved space in delalloc and shrink_delalloc if we do. This causes unexpected latencies and we have other logic to see if we need to throttle. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-07-27 12:46:43 -04:00
Josef Bacik	a94733d0bc	Btrfs: use find_or_create_page instead of grab_cache_page grab_cache_page will use mapping_gfp_mask(), which for all inodes is set to GFP_HIGHUSER_MOVABLE. So instead use find_or_create_page in all cases where we need GFP_NOFS so we don't deadlock. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-07-27 12:46:43 -04:00
Josef Bacik	bab39bf998	Btrfs: use a worker thread to do caching A user reported a deadlock when copying a bunch of files. This is because they were low on memory and kthreadd got hung up trying to migrate pages for an allocation when starting the caching kthread. The page was locked by the person starting the caching kthread. To fix this we just need to use the async thread stuff so that the threads are already created and we don't have to worry about deadlocks. Thanks, Reported-by: Roman Mamedov <rm@romanrm.ru> Signed-off-by: Josef Bacik <josef@redhat.com>	2011-07-27 12:46:25 -04:00
Linus Torvalds	5fd00b0315	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6: jfs: clean up some compiler warnings	2011-07-27 09:26:39 -07:00
Linus Torvalds	333c066bb7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: GFS2: Fix mount hang caused by certain access pattern to sysfs files	2011-07-27 09:26:22 -07:00
Christoph Hellwig	510792ee29	xfs: optimize the negative xattr caching Since the addition of file capabilities every write needs to read xattrs to check if we have any capabilities to clear. In Linux 3.0 Andi Kleen added a flag to cache the fact that we do not have any attributes on an inode. Make sure to already mark a file as not having any attributes when reading it from disk in case it doesn't even have an attribute fork. Based on an earlier patch from Andi Kleen. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-26 22:06:50 -05:00
Christoph Hellwig	d1166ec792	xfs: prevent against ioend livelocks in xfs_file_fsync We need to take some locks to prevent new ioends from coming in when we wait for all existing ones to go away. Up to Linux 3.0 that was done using the i_mutex held by the VFS fsync code, but now that we are called without it we need to take care of it ourselves. Use the I/O lock instead of i_mutex just like we do in other places. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-26 22:06:39 -05:00
Christoph Hellwig	34951f5cb7	xfs: flag all buffers as metadata Now that REQ_META bios aren't treated specially in the CFQ I/O schedule anymore, we can tag all buffers as metadata to make blktrace traces more meaningful. Note that we use buffers also to zero out partial blocks in the preallocation / hole punching code, and while they operate on data blocks the zeros written certainly aren't data. I think this case is borderline metadata enough to not bother special casing it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-26 22:05:48 -05:00
Alex Elder	1c4f33296e	xfs: encapsulate a block of debug code Pull into a helper function some debug-only code that validates a xfs_da_blkinfo structure that's been read from disk. Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@infradead.org>	2011-07-26 22:05:34 -05:00
Yongqiang Yang	6d40bc5a7e	ext4: simplify journal handling in setup_new_group_blocks() This patch simplifies journal handling in setup_new_group_blocks(). In previous code, block bitmap is modified everywhere in setup_new_group_blocks(), ext4_get_write_access() in extend_or_restart_transaction() is used to guarantee that the block bitmap stays in the new handle, this makes things complicated. The previous commit changed things so that the modifications on the block bitmap are batched and done by ext4_set_bits() at the end of the for loop. This allows us to simplify things. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-26 22:24:41 -04:00
Yongqiang Yang	c3e94d1df9	ext4: let setup_new_group_blocks() set multiple bits at a time Rename mb_set_bits() to ext4_set_bits() and make it a global function so that setup_new_group_blocks() can use it. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-26 22:05:53 -04:00
Yongqiang Yang	2b79b09d13	ext4: fix a typo in ext4_group_extend() Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-26 21:53:35 -04:00
Yongqiang Yang	4740b830ed	ext4: let ext4_group_add_blocks() handle 0 blocks quickly If ext4_group_add_blocks() is called with 0 block, make it return 0 without doing any extra work. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-26 21:51:08 -04:00
Yongqiang Yang	cc7365dfe4	ext4: let ext4_group_add_blocks() return an error code This patch lets ext4_group_add_blocks() return an error code if it fails, so that upper functions can handle error correctly. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-26 21:46:07 -04:00
Yongqiang Yang	0529155e8a	ext4: rename ext4_add_groupblocks() to ext4_group_add_blocks() Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-26 21:43:56 -04:00
Yongqiang Yang	ce723c31b5	ext4: prevent a fs with errors from being resized A filesystem with errors is not allowed to being resized, otherwise, it is easy to destroy the filesystem. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-26 21:39:09 -04:00
Yongqiang Yang	8f82f840ec	ext4: prevent parallel resizers by atomic bit ops Before this patch, parallel resizers are allowed and protected by a mutex lock, actually, there is no need to support parallel resizer, so this patch prevents parallel resizers by atmoic bit ops, like lock_page() and unlock_page() do. To do this, the patch removed the mutex lock s_resize_lock from struct ext4_sb_info and added a unsigned long field named s_resize_flags which inidicates if there is a resizer. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-26 21:35:44 -04:00
Linus Torvalds	e371d46ae4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: merge fchmod() and fchmodat() guts, kill ancient broken kludge xfs: fix misspelled S_IS...() xfs: get rid of open-coded S_ISREG(), etc. vfs: document locking requirements for d_move, __d_move and d_materialise_unique omfs: fix (mode & S_IFDIR) abuse btrfs: S_ISREG(mode) is not mode & S_IFREG... ima: fmode_t misspelled as mode_t... pci-label.c: size_t misspelled as mode_t jffs2: S_ISLNK(mode & S_IFMT) is pointless snd_msnd ->mode is fmode_t, not mode_t v9fs_iop_get_acl: get rid of unused variable vfs: dont chain pipe/anon/socket on superblock s_inodes list Documentation: Exporting: update description of d_splice_alias fs: add missing unlock in default_llseek()	2011-07-26 18:30:20 -07:00
Arun Sharma	60063497a9	atomic: use <linux/atomic.h> This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:47 -07:00
Oleg Nesterov	32e107f71e	fs/exec.c:acct_arg_size(): ptl is no longer needed for add_mm_counter() acct_arg_size() takes ->page_table_lock around add_mm_counter() if !SPLIT_RSS_COUNTING. This is not needed after commit `172703b08c` ("mm: delete non-atomic mm counter implementation"). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:44 -07:00
Tetsuo Handa	b4edf8bd06	exec: do not retry load_binary method if CONFIG_MODULES=n If CONFIG_MODULES=n, it makes no sense to retry the list of binary formats handler because the list will not be modified by request_module(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Richard Weinberger <richard@nod.at> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:44 -07:00
Tetsuo Handa	912193521b	exec: do not call request_module() twice from search_binary_handler() Currently, search_binary_handler() tries to load binary loader module using request_module() if a loader for the requested program is not yet loaded. But second attempt of request_module() does not affect the result of search_binary_handler(). If request_module() triggered recursion, calling request_module() twice causes 2 to the power of MAX_KMOD_CONCURRENT (= 50) repetitions. It is not an infinite loop but is sufficient for users to consider as a hang up. Therefore, this patch changes not to call request_module() twice, making 1 to the power of MAX_KMOD_CONCURRENT repetitions in case of recursion. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Richard Weinberger <richard@nod.at> Tested-by: Richard Weinberger <richard@nod.at> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:44 -07:00
Michal Hocko	aacb3d17a7	fs/exec.c: use BUILD_BUG_ON for VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP Commit `a8bef8ff6e` ("mm: migration: avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks") introduced a BUG_ON() to ensure that VM_STACK_FLAGS and VM_STACK_INCOMPLETE_SETUP do not overlap. The check is a compile time one, so BUILD_BUG_ON is more appropriate. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Richard Weinberger <richard@nod.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:44 -07:00
Vasiliy Kulikov	293eb1e777	proc: fix a race in do_io_accounting() If an inode's mode permits opening /proc/PID/io and the resulting file descriptor is kept across execve() of a setuid or similar binary, the ptrace_may_access() check tries to prevent using this fd against the task with escalated privileges. Unfortunately, there is a race in the check against execve(). If execve() is processed after the ptrace check, but before the actual io information gathering, io statistics will be gathered from the privileged process. At least in theory this might lead to gathering sensible information (like ssh/ftp password length) that wouldn't be available otherwise. Holding task->signal->cred_guard_mutex while gathering the io information should protect against the race. The order of locking is similar to the one inside of ptrace_attach(): first goes cred_guard_mutex, then lock_task_sighand(). Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:43 -07:00
Daisuke Ogino	d2857e79a2	procfs: return ENOENT on opening a being-removed proc entry Change the return value to ENOENT. This return value is then returned when opening the proc entry that have been removed. For example, open("/proc/bus/pci/XX/YY") when the corresponding device is being hot-removed. Signed-off-by: Daisuke Ogino <ogino.daisuke@jp.fujitsu.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Acked-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:43 -07:00
Oleg Nesterov	99b6456748	do_coredump: fix the "ispipe" error check do_coredump() assumes that if format_corename() fails it should return -ENOMEM. This is not true, for example cn_print_exe_file() can propagate the error from d_path. Even if it was true, this is too fragile. Change the code to check "ispipe < 0". Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Reviewed-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:43 -07:00
Jiri Slaby	2c563731fe	coredump: escape / in hostname and comm Change every occurence of / in comm and hostname to !. If the process changes its name to contain /, the core is not dumped (if the directory tree doesn't exist like that). The same with hostname being something like myhost/3. Fix this behaviour by using the escape loop used in %E. (We extract it to a separate function.) Now both with comm == myprocess/1 and hostname == myhost/1, the core is dumped like (kernel.core_pattern='core.%p.%e.%h): core.2349.myprocess!1.myhost!1 Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:43 -07:00
Jiri Slaby	3141c8b165	coredump: use task comm instead of (unknown) If we don't know the file corresponding to the binary (i.e. exe_file is unknown), use "task->comm (path unknown)" instead of simple "(unknown)" as suggested by ak. The fallback is the same as %e except it will append "(path unknown)". Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:43 -07:00
Linus Torvalds	ba5b56cb3e	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits) ceph: document unlocked d_parent accesses ceph: explicitly reference rename old_dentry parent dir in request ceph: document locking for ceph_set_dentry_offset ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug ceph: protect d_parent access in ceph_d_revalidate ceph: protect access to d_parent ceph: handle racing calls to ceph_init_dentry ceph: set dir complete frag after adding capability rbd: set blk_queue request sizes to object size ceph: set up readahead size when rsize is not passed rbd: cancel watch request when releasing the device ceph: ignore lease mask ceph: fix ceph_lookup_open intent usage ceph: only link open operations to directory unsafe list if O_CREAT\|O_TRUNC ceph: fix bad parent_inode calc in ceph_lookup_open ceph: avoid carrying Fw cap during write into page cache libceph: don't time out osd requests that haven't been received ceph: report f_bfree based on kb_avail rather than diffing. ceph: only queue capsnap if caps are dirty ceph: fix snap writeback when racing with writes ...	2011-07-26 13:38:50 -07:00
Al Viro	e57712ebeb	merge fchmod() and fchmodat() guts, kill ancient broken kludge The kludge in question is undocumented and doesn't work for 32bit binaries on amd64, sparc64 and s390. Passing (mode_t)-1 as mode had (since 0.99.14v and contrary to behaviour of any other Unix, prescriptions of POSIX, SuS and our own manpages) was kinda-sorta no-op. Note that any software relying on that (and looking for examples shows none) would be visibly broken on sparc64, where practically all userland is built 32bit. No such complaints noticed... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-26 15:07:43 -04:00
Al Viro	03209378b4	xfs: fix misspelled S_IS...() mode_t is not a bitmap... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-26 15:05:30 -04:00
Al Viro	abbede1b3a	xfs: get rid of open-coded S_ISREG(), etc. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-26 15:05:16 -04:00
Linus Torvalds	2ac232f37f	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: jbd: change the field "b_cow_tid" of struct journal_head from type unsigned to tid_t ext3.txt: update the links in the section "useful links" to the latest ones ext3: Fix data corruption in inodes with journalled data ext2: check xattr name_len before acquiring xattr_sem in ext2_xattr_get ext3: Fix compilation with -DDX_DEBUG quota: Remove unused declaration jbd: Use WRITE_SYNC in journal checkpoint. jbd: Fix oops in journal_remove_journal_head() ext3: Return -EINVAL when start is beyond the end of fs in ext3_trim_fs() ext3/ioctl.c: silence sparse warnings about different address spaces ext3/ext4 Documentation: remove bh/nobh since it has been deprecated ext3: Improve truncate error handling ext3: use proper little-endian bitops ext2: include fs.h into ext2_fs.h ext3: Fix oops in ext3_try_to_allocate_with_rsv() jbd: fix a bug of leaking jh->b_jcount jbd: remove dependency on __GFP_NOFAIL ext3: Convert ext3 to new truncate calling convention jbd: Add fixed tracepoints ext3: Add fixed tracepoints Resolve conflicts in fs/ext3/fsync.c due to fsync locking push-down and new fixed tracepoints.	2011-07-26 11:34:40 -07:00
Sage Weil	d79698da32	ceph: document unlocked d_parent accesses For the most part we don't care about racing with rename when directing MDS requests; either the old or new parent is fine. Document that, and do some minor cleanup. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:31:26 -07:00
Sage Weil	41b02e1f9b	ceph: explicitly reference rename old_dentry parent dir in request We carry a pin on the parent directory for the rename source and dest dentries. For the source it's r_locked_dir; we need to explicitly reference the old_dentry parent as well, since the dentry's d_parent may change between when the request was created and pinned and when it is freed. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:31:14 -07:00
Sage Weil	4f17726452	ceph: document locking for ceph_set_dentry_offset Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:31:08 -07:00
Sage Weil	e5f86dc377	ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug Have caller pass in a safely-obtained reference to the parent directory for calculating a dentry's hash valud. While we're here, simpify the flow through ceph_encode_fh() so that there is a single exit point and cleanup. Also fix a bug with the dentry hash calculation: calculate the hash for the dentry we were given, not its parent. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:30:55 -07:00
Sage Weil	bf1c6aca96	ceph: protect d_parent access in ceph_d_revalidate Protect d_parent with d_lock. Carry a reference. Simplify the flow so that there is a single exit point and cleanup. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:30:43 -07:00
Sage Weil	5f21c96dd5	ceph: protect access to d_parent d_parent is protected by d_lock: use it when looking up a dentry's parent directory inode. Also take a reference and drop it in the caller to avoid a use-after-free. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:30:29 -07:00
Sage Weil	48d0cbd124	ceph: handle racing calls to ceph_init_dentry The ->lookup() and prepopulate_readdir() callers are working with unhashed dentries, so we don't have to worry. The export.c callers, though, need to initialize something they got back from d_obtain_alias() and are potentially racing with other callers. Make sure we don't return unless the dentry is properly initialized (by us or someone else). Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:30:15 -07:00
Sage Weil	dfabbed6fd	ceph: set dir complete frag after adding capability Curretly ceph_add_cap clears the complete bit if we are newly issued the FILE_SHARED cap, which is normally the case for a newly issue cap on a new directory. That means we clear the just-set bit. Move the check that sets the flag to after the cap is added/updated. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:30:02 -07:00
Yehuda Sadeh	e985222743	ceph: set up readahead size when rsize is not passed This should improve the default read performance, as without it readahead is practically disabled. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>	2011-07-26 11:29:14 -07:00
Sage Weil	2f90b852e3	ceph: ignore lease mask The lease mask is no longer used (and it changed a while back). Instead, use a non-zero duration to indicate that there is a lease being issued. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:28:25 -07:00
Sage Weil	468640e32c	ceph: fix ceph_lookup_open intent usage We weren't properly calling lookup_instantiate_filp when setting up the lookup intent, which could lead to file leakage on errors. So: - use separate helper for the hidden snapdir translation, immediately following the mds request - use ceph_finish_lookup for the final dentry/return value dance in the exit path - lookup_instantiate_filp on success Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:28:11 -07:00
Sage Weil	9bae113a08	ceph: only link open operations to directory unsafe list if O_CREAT\|O_TRUNC We only need to put these on the directory unsafe list if they have side effects that fsync(2) should flush out. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:27:59 -07:00
Sage Weil	acda765788	ceph: fix bad parent_inode calc in ceph_lookup_open We were always getting NULL here because the intent file f_dentry is always NULL at this point, which means we were always passing NULL to ceph_mdsc_do_request. In reality, this was fine, since this isn't currently ever a write operation that needs to get strung on the dir's unsafe list. Use the dir explicitly, and only pass it if this open has side-effects that a dir fsync should flush. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:27:48 -07:00
Sage Weil	d8de9ab63a	ceph: avoid carrying Fw cap during write into page cache The generic_file_aio_write call may block on balance_dirty_pages while we flush data to the OSDs. If we hold a reference to the FILE_WR cap during that interval revocation by the MDS (e.g., to do a stat(2)) may be very slow. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:27:34 -07:00
Greg Farnum	8f04d42276	ceph: report f_bfree based on kb_avail rather than diffing. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>	2011-07-26 11:27:06 -07:00
Sage Weil	e77dc3e9c0	ceph: only queue capsnap if caps are dirty We used to go into this branch if i_wrbuffer_ref_head was non-zero. This was an ancient check from before we were careful about dealing with all kinds of caps (and not just dirty pages). It is cleaner to only queue a capsnap if there is an actual dirty cap. If we are racing with... something...we will end up here with ci->i_wrbuffer_refs but no dirty caps. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:26:41 -07:00
Sage Weil	af0ed569d7	ceph: fix snap writeback when racing with writes There are two problems that come up when we try to queue a capsnap while a write is in progress: - The FILE_WR cap is held, but not yet dirty, so we may queue a capsnap with dirty == 0. That will crash later in __ceph_flush_snaps(). Or on the FILE_WR cap if a write is in progress. - We may not have i_head_snapc set, which causes problems pretty quickly. Look to the snaprealm in this case. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:26:31 -07:00
Sage Weil	9cfa1098dc	ceph: use flag bit for at_end readdir flag This saves us a word of memory per file. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:26:18 -07:00
Sage Weil	4918b6d140	ceph: add F_SYNC file flag to force sync (non-O_DIRECT) io This allows us to force IO through the sync path which you normally only get when multiple clients are reading/writing to the same file or by mounting with -o sync. Among other things, this lets test programs verify correctness with a single mount. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:26:07 -07:00
Sage Weil	252c6728de	ceph: add flags field to file_info Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:25:27 -07:00
Linus Torvalds	1d87c28e68	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: Cleanup: check return codes of crypto api calls CIFS: Fix oops while mounting with prefixpath [CIFS] Redundant null check after dereference cifs: use cifs_dirent in cifs_save_resume_key cifs: use cifs_dirent to replace cifs_get_name_from_search_buf cifs: introduce cifs_dirent cifs: cleanup cifs_filldir	2011-07-26 11:11:28 -07:00
Jeff Layton	c46c887744	vfs: document locking requirements for d_move, __d_move and d_materialise_unique Adding a comment to d_materialise_unique per Al's request... d_move and __d_move have some pretty substantial locking requirements, but they are not clearly documented. Add some comments spelling them out. Also, document the requirement for the i_mutex of the parent in d_materialise_unique. Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-26 13:41:14 -04:00
Linus Torvalds	f01ef569cd	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits) mm: properly reflect task dirty limits in dirty_exceeded logic writeback: don't busy retry writeback on new/freeing inodes writeback: scale IO chunk size up to half device bandwidth writeback: trace global_dirty_state writeback: introduce max-pause and pass-good dirty limits writeback: introduce smoothed global dirty limit writeback: consolidate variable names in balance_dirty_pages() writeback: show bdi write bandwidth in debugfs writeback: bdi write bandwidth estimation writeback: account per-bdi accumulated written pages writeback: make writeback_control.nr_to_write straight writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr() writeback: trace event writeback_queue_io writeback: trace event writeback_single_inode writeback: remove .nonblocking and .encountered_congestion writeback: remove writeback_control.more_io writeback: skip balance_dirty_pages() for in-memory fs writeback: add bdi_dirty_limit() kernel-doc writeback: avoid extra sync work at enqueue time writeback: elevate queue_io() into wb_writeback() ... Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c	2011-07-26 10:39:54 -07:00
Al Viro	41c96486f2	omfs: fix (mode & S_IFDIR) abuse granted, on a filesystem that has only regular files and directories it happens to work, but really should be S_ISDIR(mode)... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-26 13:05:28 -04:00
Al Viro	569254b0cc	btrfs: S_ISREG(mode) is not mode & S_IFREG... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-26 13:05:05 -04:00
Al Viro	61effb519c	jffs2: S_ISLNK(mode & S_IFMT) is pointless it's S_ISLNK(mode), TYVM... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-26 13:00:35 -04:00
Al Viro	24a01d4ee4	v9fs_iop_get_acl: get rid of unused variable Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-26 12:57:42 -04:00
Eric Dumazet	a209dfc7b0	vfs: dont chain pipe/anon/socket on superblock s_inodes list Workloads using pipes and sockets hit inode_sb_list_lock contention. superblock s_inodes list is needed for quota, dirty, pagecache and fsnotify management. pipe/anon/socket fs are clearly not candidates for these. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-26 12:57:09 -04:00
Dan Carpenter	bacb2d816c	fs: add missing unlock in default_llseek() A recent change in linux-next, `982d816581` "fs: add SEEK_HOLE and SEEK_DATA flags" added some direct returns on error, but it should have been a goto out. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-26 12:57:09 -04:00
Jan Kara	2d859db3e4	ext4: fix data corruption in inodes with journalled data When journalling data for an inode (either because it is a symlink or because the filesystem is mounted in data=journal mode), ext4_evict_inode() can discard unwritten data by calling truncate_inode_pages(). This is because we don't mark the buffer / page dirty when journalling data but only add the buffer to the running transaction and thus mm does not know there are still unwritten data. Fix the problem by carefully tracking transaction containing inode's data, committing this transaction, and writing uncheckpointed buffers when inode should be reaped. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-26 09:07:11 -04:00
Paul Bolle	10bd2473b1	eventpoll: fix comment typo 'evenpoll' Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-07-26 11:44:54 +02:00
Steven Whitehouse	1923703991	GFS2: Fix mount hang caused by certain access pattern to sysfs files Depending upon the order of userspace/kernel during the mount process, this can result in a hang without the _all version of the completion. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-07-26 10:18:37 +01:00
Linus Torvalds	e08dc1325f	p9: avoid unused variable warning Commit `4e34e719e4` ("fs: take the ACL checks to common code") removed the use of the 'acl' variable in v9fs_iop_get_acl(), but left the variable definition around. Remove it to get rid of the warning: fs/9p/acl.c: In function ‘v9fs_iop_get_acl’: fs/9p/acl.c:101:20: warning: unused variable ‘acl’ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-25 23:43:53 -07:00
Linus Torvalds	91d44d9999	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus * git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus: Squashfs: Make ZLIB compression support optional Squashfs: Update documentation for XZ and add squashfs-tools devel tree	2011-07-25 22:50:35 -07:00
Linus Torvalds	2dad3206db	Merge branch 'for-3.1' of git://linux-nfs.org/~bfields/linux * 'for-3.1' of git://linux-nfs.org/~bfields/linux: nfsd: don't break lease on CLAIM_DELEGATE_CUR locks: rename lock-manager ops nfsd4: update nfsv4.1 implementation notes nfsd: turn on reply cache for NFSv4 nfsd4: call nfsd4_release_compoundargs from pc_release nfsd41: Deny new lock before RECLAIM_COMPLETE done fs: locks: remove init_once nfsd41: check the size of request nfsd41: error out when client sets maxreq_sz or maxresp_sz too small nfsd4: fix file leak on open_downgrade nfsd4: remember to put RW access on stateid destruction NFSD: Added TEST_STATEID operation NFSD: added FREE_STATEID operation svcrpc: fix list-corrupting race on nfsd shutdown rpc: allow autoloading of gss mechanisms svcauth_unix.c: quiet sparse noise svcsock.c: include sunrpc.h to quiet sparse noise nfsd: Remove deprecated nfsctl system call and related code. NFSD: allow OP_DESTROY_CLIENTID to be only op in COMPOUND Fix up trivial conflicts in Documentation/feature-removal-schedule.txt	2011-07-25 22:49:19 -07:00
Linus Torvalds	84635d68be	vfs: fix check_acl compile error when CONFIG_FS_POSIX_ACL is not set Commit `e77819e57f` ("vfs: move ACL cache lookup into generic code") didn't take the FS_POSIX_ACL config variable into account - when that is not set, ACL's go away, and the cache helper functions do not exist, causing compile errors like fs/namei.c: In function 'check_acl': fs/namei.c:191:10: error: implicit declaration of function 'negative_cached_acl' fs/namei.c:196:2: error: implicit declaration of function 'get_cached_acl' fs/namei.c:196:6: warning: assignment makes pointer from integer without a cast fs/namei.c:212:11: error: implicit declaration of function 'set_cached_acl' Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Acked-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-25 22:47:03 -07:00
Linus Torvalds	45b583b10a	Merge 'akpm' patch series * Merge akpm patch series: (122 commits) drivers/connector/cn_proc.c: remove unused local Documentation/SubmitChecklist: add RCU debug config options reiserfs: use hweight_long() reiserfs: use proper little-endian bitops pnpacpi: register disabled resources drivers/rtc/rtc-tegra.c: properly initialize spinlock drivers/rtc/rtc-twl.c: check return value of twl_rtc_write_u8() in twl_rtc_set_time() drivers/rtc: add support for Qualcomm PMIC8xxx RTC drivers/rtc/rtc-s3c.c: support clock gating drivers/rtc/rtc-mpc5121.c: add support for RTC on MPC5200 init: skip calibration delay if previously done misc/eeprom: add eeprom access driver for digsy_mtc board misc/eeprom: add driver for microwire 93xx46 EEPROMs checkpatch.pl: update $logFunctions checkpatch: make utf-8 test --strict checkpatch.pl: add ability to ignore various messages checkpatch: add a "prefer __aligned" check checkpatch: validate signature styles and To: and Cc: lines checkpatch: add __rcu as a sparse modifier checkpatch: suggest using min_t or max_t ... Did this as a merge because of (trivial) conflicts in - Documentation/feature-removal-schedule.txt - arch/xtensa/include/asm/uaccess.h that were just easier to fix up in the merge than in the patch series.	2011-07-25 21:00:19 -07:00
Akinobu Mita	9d6bf5aa17	reiserfs: use hweight_long() Use hweight_long() to count free bits in the bitmap. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-25 20:57:17 -07:00
Akinobu Mita	0c2fd1bfb1	reiserfs: use proper little-endian bitops Using __test_and_{set,clear}_bit_le() with ignoring its return value can be replaced with __{set,clear}_bit_le(). This introduces reiserfs_{set,clear}_le_bit for __{set,clear}_bit_le and does the above change with them. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-25 20:57:17 -07:00
Hugh Dickins	708e3508c2	tmpfs: clone shmem_file_splice_read() Copy __generic_file_splice_read() and generic_file_splice_read() from fs/splice.c to shmem_file_splice_read() in mm/shmem.c. Make page_cache_pipe_buf_ops and spd_release_page() accessible to it. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-25 20:57:11 -07:00
David Rientjes	be8f684d73	oom: make deprecated use of oom_adj more verbose /proc/pid/oom_adj is deprecated and scheduled for removal in August 2012 according to Documentation/feature-removal-schedule.txt. This patch makes the warning more verbose by making it appear as a more serious problem (the presence of a stack trace and being multiline should attract more attention) so that applications still using the old interface can get fixed. Very popular users of the old interface have been converted since the oom killer rewrite has been introduced. udevd switched to the /proc/pid/oom_score_adj interface for v162, kde switched in 4.6.1, and opensshd switched in 5.7p1. At the start of 2012, this should be changed into a WARN() to emit all such incidents and then finally remove the tunable in August 2012 as scheduled. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-25 20:57:09 -07:00
Becky Bruce	2b37c35e65	fs/hugetlbfs/inode.c: fix pgoff alignment checking on 32-bit This: vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT) is incorrect on 32-bit. It causes us to & the pgoff with something that looks like this (for a 4m hugepage): 0xfff003ff. The mask should be flipped and then shifted, to give you 0x0000_03fff. Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-25 20:57:07 -07:00
Linus Torvalds	14067ff536	vfs: make gcc generate more obvious code for acl permission checking The "fsuid is the inode owner" case is not necessarily always the likely case, but it's the case that doesn't do anything odd and that we want in straight-line code. Make gcc not generate random "jump around for the fun of it" code. This just helps me read profiles. That thing is one of the hottest parts of the whole pathname lookup. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-25 19:55:52 -07:00
Shirish Pargaonkar	14cae3243b	cifs: Cleanup: check return codes of crypto api calls Check return codes of crypto api calls and either log an error or log an error and return from the calling function with error. Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-25 22:12:10 +00:00
Pavel Shilovsky	f5bc1e755d	CIFS: Fix oops while mounting with prefixpath commit `fec11dd9a0` caused a regression when we have already mounted //server/share/a and want to mount //server/share/a/b. The problem is that lookup_one_len calls __lookup_hash with nd pointer as NULL. Then __lookup_hash calls do_revalidate in the case when dentry exists and we end up with NULL pointer deference in cifs_d_revalidate: if (nd->flags & LOOKUP_RCU) return -ECHILD; Fix this by checking nd for NULL. Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com> CC: Stable <stable@kernel.org> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-25 22:06:40 +00:00
Steve French	e010a5ef95	[CIFS] Redundant null check after dereference Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-25 22:04:32 +00:00
Sunil Mushran	93862d5e1a	ocfs2: Implement llseek() ocfs2 implements its own llseek() to provide the SEEK_HOLE/SEEK_DATA functionality. SEEK_HOLE sets the file pointer to the start of either a hole or an unwritten (preallocated) extent, that is greater than or equal to the supplied offset. SEEK_DATA sets the file pointer to the start of an allocated extent (not unwritten) that is greater than or equal to the supplied offset. If the supplied offset is on a desired region, then the file pointer is set to it. Offsets greater than or equal to the file size return -ENXIO. Unwritten (preallocated) extents are considered holes because the file system treats reads to such regions in the same way as it does to holes. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>	2011-07-25 14:58:15 -07:00
Christoph Hellwig	eaf35b1ea8	cifs: use cifs_dirent in cifs_save_resume_key Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-25 21:43:14 +00:00
Christoph Hellwig	f16d59b417	cifs: use cifs_dirent to replace cifs_get_name_from_search_buf This allows us to parse the on the wire structures only once in cifs_filldir. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-25 21:40:53 +00:00
Christoph Hellwig	cda0ec6a86	cifs: introduce cifs_dirent Introduce a generic directory entry structure, and factor the parsing of the various on the wire structures that can represent one into a common helper. Switch cifs_entry_is_dot over to use it as a start. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-25 21:36:44 +00:00
Mark Fasheh	38a1a91953	btrfs: don't BUG_ON allocation errors in btrfs_drop_snapshot In addition to properly handling allocation failure from btrfs_alloc_path, I also fixed up the kzalloc error handling code immediately below it. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2011-07-25 14:35:15 -07:00
Mark Fasheh	92b8e897f6	btrfs: Don't BUG_ON alloc_path errors in find_next_chunk I also removed the BUG_ON from error return of find_next_chunk in init_first_rw_device(). It turns out that the only caller of init_first_rw_device() also BUGS on any nonzero return so no actual behavior change has occurred here. do_chunk_alloc() also needed an update since it calls btrfs_alloc_chunk() which can now return -ENOMEM. Instead of setting space_info->full on any error from btrfs_alloc_chunk() I catch and return every error value _except_ -ENOSPC. Thanks goes to Tsutomu Itoh for pointing that issue out. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2011-07-25 14:34:54 -07:00
Christoph Hellwig	9feed6f8fb	cifs: cleanup cifs_filldir Use sensible variable names and formatting and remove some superflous checks on entry. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-25 21:05:10 +00:00
Linus Torvalds	d3ec4844d4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits) fs: Merge split strings treewide: fix potentially dangerous trailing ';' in #defined values/expressions uwb: Fix misspelling of neighbourhood in comment net, netfilter: Remove redundant goto in ebt_ulog_packet trivial: don't touch files that are removed in the staging tree lib/vsprintf: replace link to Draft by final RFC number doc: Kconfig: `to be' -> `be' doc: Kconfig: Typo: square -> squared doc: Konfig: Documentation/power/{pm => apm-acpi}.txt drivers/net: static should be at beginning of declaration drivers/media: static should be at beginning of declaration drivers/i2c: static should be at beginning of declaration XTENSA: static should be at beginning of declaration SH: static should be at beginning of declaration MIPS: static should be at beginning of declaration ARM: static should be at beginning of declaration rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check Update my e-mail address PCIe ASPM: forcedly -> forcibly gma500: push through device driver tree ... Fix up trivial conflicts: - arch/arm/mach-ep93xx/dma-m2p.c (deleted) - drivers/gpio/gpio-ep93xx.c (renamed and context nearby) - drivers/net/r8169.c (just context changes)	2011-07-25 13:56:39 -07:00
Chandra Seetharaman	c35a549c8b	xfs: Remove the macro XFS_BUFTARG_NAME Remove the definition and usages of the macro XFS_BUFTARG_NAME. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 15:03:37 -05:00
Chandra Seetharaman	49074c069c	xfs: Remove the macro XFS_BUF_TARGET Remove the definition and usages of the macro XFS_BUF_TARGET Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 15:03:31 -05:00
Chandra Seetharaman	e38c9b87e5	xfs: Remove the macro XFS_BUF_SET_TARGET Remove the macro XFS_BUF_SET_TARGET. hch: As all the buffer allocator already set ->b_target it should be safe to simply remove these calls. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 15:03:26 -05:00
Chandra Seetharaman	811e64c716	Replace the macro XFS_BUF_ISPINNED with helper xfs_buf_ispinned Replace the macro XFS_BUF_ISPINNED with an inline helper function xfs_buf_ispinned, and change all its usages. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 15:03:22 -05:00
Chandra Seetharaman	02fe03d909	xfs: Remove the macro XFS_BUF_SET_PTR Remove the definition and usages of the macro XFS_BUF_SET_PTR. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 15:03:17 -05:00
Chandra Seetharaman	6292604447	xfs: Remove the macro XFS_BUF_PTR Remove the definition and usages of the macro XFS_BUF_PTR. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 15:03:13 -05:00
Chandra Seetharaman	0095a21eb6	xfs: Remove macro XFS_BUF_SET_START Remove the definition and usage of the macro XFS_BUF_SET_START. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 15:03:09 -05:00
Chandra Seetharaman	72790aa119	xfs: Remove macro XFS_BUF_HOLD Remove the definition and usage of the macro XFS_BUF_HOLD Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 15:03:06 -05:00
Chandra Seetharaman	b75e40a419	xfs: Remove macro XFS_BUF_BUSY and family Remove the definitions and uses of the macros XFS_BUF_BUSY, XFS_BUF_UNBUSY, and XFS_BUF_ISBUSY. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 15:03:00 -05:00
Chandra Seetharaman	5a52c2a581	xfs: Remove the macro XFS_BUF_ERROR and family Remove the definitions and usage of the macros XFS_BUF_ERROR, XFS_BUF_GETERROR and XFS_BUF_ISERROR. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 14:57:46 -05:00
Chandra Seetharaman	ed43233be9	xfs: Remove the macro XFS_BUF_BFLAGS Remove the definition of the macro XFS_BUF_BFLAGS and its usage. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-25 14:57:36 -05:00
Linus Torvalds	0003230e82	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: fs: take the ACL checks to common code bury posix_acl_..._masq() variants kill boilerplates around posix_acl_create_masq() generic_acl: no need to clone acl just to push it to set_cached_acl() kill boilerplate around posix_acl_chmod_masq() reiserfs: cache negative ACLs for v1 stat format xfs: cache negative ACLs if there is no attribute fork 9p: do no return 0 from ->check_acl without actually checking vfs: move ACL cache lookup into generic code CIFS: Fix oops while mounting with prefixpath xfs: Fix wrong return value of xfs_file_aio_write fix devtmpfs race caam: don't pass bogus S_IFCHR to debugfs_create_...() get rid of create_proc_entry() abuses - proc_mkdir() is there for purpose asus-wmi: ->is_visible() can't return negative fix jffs2 ACLs on big-endian with 16bit mode_t 9p: close ACL leaks ocfs2_init_acl(): fix a leak VFS : mount lock scalability for internal mounts	2011-07-25 12:53:15 -07:00
Trond Myklebust	ed1e6211a0	NFSv4: Don't use the delegation->inode in nfs_mark_return_delegation() nfs_mark_return_delegation() is usually called without any locking, and so it is not safe to dereference delegation->inode. Since the inode is only used to discover the nfs_client anyway, it makes more sense to have the callers pass a valid pointer to the nfs_server as a parameter. Reported-by: Ian Kent <raven@themaw.net> Cc: stable@kernel.org Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-25 15:37:29 -04:00
Jeff Layton	73ca1001ed	nfs: don't use d_move in nfs_async_rename_done If the task that initiated the sillyrename ends up being killed by a fatal signal, then it will eventually return back to userspace and end up releasing the i_mutex. d_move however needs to be done while holding the i_mutex. Instead of using d_move here, just unhash the old and new dentries to prevent them from being found by lookups. With this change though, the dentries are now incorrect post-rename and do not reflect the actual name of the file on the server. I'm proceeding under the assumption that since they are unhashed that this isn't really a problem. In order for the sillydelete to still work though, the dname must be copied earlier when setting up the sillydelete info, and the name must be recopied if the sillydelete info has to be moved to a new dentry. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-25 15:00:21 -04:00
Stephen Rothwell	5f00bcb38e	Merge branch 'master' into devel and apply fixup from Stephen Rothwell: vfs/nfs: fixup for nfs_open_context change Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-25 14:53:52 -04:00
Christoph Hellwig	4e34e719e4	fs: take the ACL checks to common code Replace the ->check_acl method with a ->get_acl method that simply reads an ACL from disk after having a cache miss. This means we can replace the ACL checking boilerplate code with a single implementation in namei.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:30:23 -04:00
Al Viro	edde854e8b	bury posix_acl_..._masq() variants made static; no callers left outside of posix_acl.c. posix_acl_clone() also has lost all external callers and became static... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:27:32 -04:00
Al Viro	826cae2f2b	kill boilerplates around posix_acl_create_masq() new helper: posix_acl_create(&acl, gfp, mode_p). Replaces acl with modified clone, on failure releases acl and replaces with NULL. Returns 0 or -ve on error. All callers of posix_acl_create_masq() switched. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:27:32 -04:00
Al Viro	95203befa8	generic_acl: no need to clone acl just to push it to set_cached_acl() In-core acls are copy-on-write, so the reference taken by set_cached_acl() will do just fine. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:27:31 -04:00
Al Viro	bc26ab5f65	kill boilerplate around posix_acl_chmod_masq() new helper: posix_acl_chmod(&acl, gfp, mode). Replaces acl with modified clone or with NULL if that has failed; returns 0 or -ve on error. All callers of posix_acl_chmod_masq() switched to that - they'd been doing exactly the same thing. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:27:30 -04:00
Christoph Hellwig	4482a087d4	reiserfs: cache negative ACLs for v1 stat format Always set up a negative ACL cache entry if the inode can't have ACLs. That behaves much better than doing this check inside ->check_acl. Also remove the left over MAY_NOT_BLOCK check. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:25:38 -04:00
Christoph Hellwig	6311b10800	xfs: cache negative ACLs if there is no attribute fork Always set up a negative ACL cache entry if the inode doesn't have an attribute fork. That behaves much better than doing this check inside ->check_acl. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:25:38 -04:00
Christoph Hellwig	ebbb0ef287	9p: do no return 0 from ->check_acl without actually checking If we do not want to use ACLs we at least need to perform normal Unix permission checks. From the comment I'm not quite sure that's what is intended, but if 0p wants to do permission checks entirely on the server it needs to do so in ->permission, not in ->check_acl. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:25:38 -04:00
Linus Torvalds	e77819e57f	vfs: move ACL cache lookup into generic code This moves logic for checking the cached ACL values from low-level filesystems into generic code. The end result is a streamlined ACL check that doesn't need to load the inode->i_op->check_acl pointer at all for the common cached case. The filesystems also don't need to check for a non-blocking RCU walk case in their acl_check() functions, because that is all handled at a VFS layer. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:23:39 -04:00
Pavel Shilovsky	3ca30d40a9	CIFS: Fix oops while mounting with prefixpath commit `fec11dd9a0` caused a regression when we have already mounted //server/share/a and want to mount //server/share/a/b. The problem is that lookup_one_len calls __lookup_hash with nd pointer as NULL. Then __lookup_hash calls do_revalidate in the case when dentry exists and we end up with NULL pointer deference in cifs_d_revalidate: if (nd->flags & LOOKUP_RCU) return -ECHILD; Fix this by checking nd for NULL. Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:23:21 -04:00
Markus Trippelsdorf	340a0a01b9	xfs: Fix wrong return value of xfs_file_aio_write The fsync prototype change commit `02c24a8218` accidentally overwrote the ssize_t return value of xfs_file_aio_write with 0 for SYNC type writes. Fix this by checking if an error occured when calling xfs_file_fsync and only change the return value in this case. In addition xfs_file_fsync actually returns a normal negative error, so fix this, too. Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:23:21 -04:00
Linus Torvalds	096a705bbc	Merge branch 'for-3.1/core' of git://git.kernel.dk/linux-block * 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits) block: strict rq_affinity backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu block: fix patch import error in max_discard_sectors check block: reorder request_queue to remove 64 bit alignment padding CFQ: add think time check for group CFQ: add think time check for service tree CFQ: move think time check variables to a separate struct fixlet: Remove fs_excl from struct task. cfq: Remove special treatment for metadata rqs. block: document blk_plug list access block: avoid building too big plug list compat_ioctl: fix make headers_check regression block: eliminate potential for infinite loop in blkdev_issue_discard compat_ioctl: fix warning caused by qemu block: flush MEDIA_CHANGE from drivers on close(2) blk-throttle: Make total_nr_queued unsigned block: Add __attribute__((format(printf...) and fix fallout fs/partitions/check.c: make local symbols static block:remove some spare spaces in genhd.c block:fix the comment error in blkdev.h ...	2011-07-25 10:33:36 -07:00
Dave Kleikamp	3c2c226285	jfs: clean up some compiler warnings jfs has a few variables being set but never used. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2011-07-25 11:01:12 -05:00
Wengang Wang	5cffff9e29	ocfs2: Fix ocfs2_page_mkwrite() This patch address two shortcomings in ocfs2_page_mkwrite(): 1. Makes the function return better VM_FAULT_* errors. 2. It handles a error that is triggered when a page is dropped from the mapping due to memory pressure. This patch locks the page to prevent that. [Patch was cleaned up by Sunil Mushran.] Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>	2011-07-24 10:36:54 -07:00
Sunil Mushran	a035bff6b8	ocfs2: Add comment about orphan scanning Add a comment that explains the reason as to why orphan scan scans all the slots. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>	2011-07-24 10:35:54 -07:00
Sunil Mushran	619c200de1	ocfs2: Clean up messages in the fs Convert useful messages from ML_NOTICE to KERN_NOTICE to improve readability. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>	2011-07-24 10:34:54 -07:00
Sunil Mushran	6b27f62fc7	ocfs2/cluster: Cluster up now includes network connections too The cluster up check only checks to see if the node is heartbeating or not. If yes it continues assuming that the node is connected to all the nodes. But if that is not the case, the cluster join aborts with a stack of errors that are not easy to comprehend. This patch adds the network connect check upfront and prints the nodes that the node is not yet connected to, before aborting. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>	2011-07-24 10:33:54 -07:00
Sunil Mushran	3ba169ccec	ocfs2/cluster: Add new function o2net_fill_node_map() Patch adds function o2net_fill_node_map() to return the bitmap of nodes that it is connected to. This bitmap is also accessible by the user via the debugfs file, /sys/kernel/debug/o2net/connected_nodes. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>	2011-07-24 10:32:54 -07:00
Sunil Mushran	bb570a5d9e	ocfs2/cluster: Fix output in file elapsed_time_in_ms The o2hb debugfs file, elapsed_time_in_ms, should return values only after the timer is armed atleast once. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>	2011-07-24 10:31:54 -07:00
Sunil Mushran	a2c0cc1579	ocfs2/dlm: dlmlock_remote() needs to account for remastery In dlmlock_remote(), we wait for the resource to stop being active before setting the inprogress flag. Active includes recovery, migration, etc. The problem here is that if the resource was being recovered or migrated, the new owner could very well be that node itself (and thus not a remote node). This problem was observed in Oracle bug#12583620. The error messages observed were as follows: dlm_send_remote_lock_request:337 ERROR: Error -40 (ELOOP) when sending message 503 (key 0xd6d8c7) to node 2 dlmlock_remote:271 ERROR: dlm status = DLM_BADARGS dlmlock:751 ERROR: dlm status = DLM_BADARGS Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>	2011-07-24 10:30:54 -07:00
Sunil Mushran	ff0a522e7d	ocfs2/dlm: Take inflight reference count for remotely mastered resources too The inflight reference count, in the lock resource, is taken to pin the resource in memory. We take it when a new resource is created and release it after a lock is attached to it. We do this to prevent the resource from getting purged prematurely. Earlier this reference count was being taken for locally mastered resources only. This patch extends the same functionality for remotely mastered ones. We are doing this because the same premature purging could occur for remotely mastered resources if the remote node were to die before completion of the create lock. Fix for Oracle bug#12405575. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>	2011-07-24 10:29:54 -07:00
Sunil Mushran	ed8625c6fb	ocfs2/dlm: Cleanup dlm_wait_for_node_death() and dlm_wait_for_node_recovery() dlm_wait_for_node_death() and dlm_wait_for_node_recovery() needed a facelift. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>	2011-07-24 10:28:54 -07:00
Sunil Mushran	e9f0b6a623	ocfs2/dlm: Trace insert/remove of resource to/from hash Add mlog to trace adding and removing the resource from/to the hash table. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>	2011-07-24 10:27:54 -07:00
Sunil Mushran	8d400b81cc	ocfs2/dlm: Clean up refmap helpers Patch cleans up helpers that set/clear refmap bits and grab/drop inflight lock ref counts. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>	2011-07-24 10:26:54 -07:00
Sunil Mushran	0afbba1322	ocfs2/dlm: Cleanup up dlm_finish_local_lockres_recovery() dlm_finish_local_lockres_recovery() needed a facelift. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>	2011-07-24 10:25:54 -07:00
Sunil Mushran	394eb3d38a	ocfs2: Clean up messages in stack_o2cb.c o2cb messages needed a facelift. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>	2011-07-24 10:24:54 -07:00
Sunil Mushran	8decab3c8d	ocfs2/dlm: Clean up messages in o2dlm o2dlm messages needed a facelift. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>	2011-07-24 10:23:54 -07:00
Sunil Mushran	1dfecf810e	ocfs2/cluster: Clean up messages in o2net o2net messages needed a facelift. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>	2011-07-24 10:22:54 -07:00
Sunil Mushran	d2eece3766	ocfs2/cluster: Abort heartbeat start on hard-ro devices Currently if the heartbeat device is hard-ro, the o2hb thread keeps chugging along and dumping errors along the way. The user needs to manually stop the heartbeat. The patch addresses this shortcoming by adding a limit to the number of times the hb thread will iterate in an unsteady state. If the hb thread does not ready steady state in that many interation, the start is aborted. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>	2011-07-24 10:21:54 -07:00
Al Viro	963945bf93	fix jffs2 ACLs on big-endian with 16bit mode_t casting int * to mode_t * is not a good thing - on a lot of big-endian architectures mode_t happens to be smaller than int and there it breaks quite spectaculary... Fucked-up-by: commit `cfc8dc6f6f` Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-24 10:12:01 -04:00
Al Viro	1ec95bf34d	9p: close ACL leaks Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-24 10:10:18 -04:00
Al Viro	c0d960f038	ocfs2_init_acl(): fix a leak Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-24 10:10:09 -04:00
Tim Chen	423e0ab086	VFS : mount lock scalability for internal mounts For a number of file systems that don't have a mount point (e.g. sockfs and pipefs), they are not marked as long term. Therefore in mntput_no_expire, all locks in vfs_mount lock are taken instead of just local cpu's lock to aggregate reference counts when we release reference to file objects. In fact, only local lock need to have been taken to update ref counts as these file systems are in no danger of going away until we are ready to unregister them. The attached patch marks file systems using kern_mount without mount point as long term. The contentions of vfs_mount lock is now eliminated. Before un-registering such file system, kern_unmount should be called to remove the long term flag and make the mount point ready to be freed. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-24 10:08:32 -04:00
Wu Fengguang	fcc5c22218	writeback: don't busy retry writeback on new/freeing inodes Fix a system hang bug introduced by commit `b7a2441f99` ("writeback: remove writeback_control.more_io") and `e8dfc3058` ("writeback: elevate queue_io() into wb_writeback()") easily reproducible with high memory pressure and lots of file creation/deletions, for example, a kernel build in limited memory. It hangs when some inode is in the I_NEW, I_FREEING or I_WILL_FREE state, the flusher will get stuck busy retrying that inode, never releasing wb->list_lock. The lock in turn blocks all kinds of other tasks when they are trying to grab it. As put by Jan, it's a safe change regarding data integrity. I_FREEING or I_WILL_FREE inodes are written back by iput_final() and it is reclaim code that is responsible for eventually removing them. So writeback code can safely ignore them. I_NEW inodes should move out of this state when they are fully set up and in the writeback round following that, we will consider them for writeback. So the change makes sense. CC: Jan Kara <jack@suse.cz> Reported-by: Hugh Dickins <hughd@google.com> Tested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-07-24 10:46:51 +08:00
Robin Dong	b7ca1e8ec5	ext4: correct comment for ext4_ext_check_cache The comment for ext4_ext_check_cache has a litte mistake. Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-23 21:53:25 -04:00
Robin Dong	0737964bc9	ext4: correct the debug message in ext4_ext_insert_extent The debug message in ext4_ext_insert_extent before moving extent is incorrect (the "from xx to xx"). Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-23 21:51:07 -04:00
Robin Dong	5718789da5	ext4: remove unused argument in ext4_ext_next_leaf_block The argument "inode" in function ext4_ext_next_allocated_block looks useless, so clean it. Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-23 21:49:07 -04:00
Tao Ma	6a0fe49308	ext4: remove ac_repeats from ext4_allocation_context ac_repeats isn't referenced in the mballoc code. So remove it. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-23 16:18:55 -04:00
Tao Ma	ced156e464	ext4: don't increment s_mb_buddies_generated in ext4_mb_release In ext4_mb_release, we use s_mb_buddies_generated++. Although the output is OK, but I don't think we need this extra ++. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-23 16:18:05 -04:00
Tao Ma	529da704ad	ext4: remove unnecessary ext4_get_group_info in ext4_mb_load_buddy ext4_mb_load_buddy() calls ext4_get_group_info() for setting both "grp" and "e4b->bd_info", but it could do "e4b->bd_info = grp". Reported-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-23 16:07:26 -04:00
Casey Bodley	0c12eaffdf	nfsd: don't break lease on CLAIM_DELEGATE_CUR CLAIM_DELEGATE_CUR is used in response to a broken lease; allowing it to break the lease and return EAGAIN leaves the client unable to make progress in returning the delegation nfs4_get_vfs_file() now takes struct nfsd4_open for access to the claim type, and calls nfsd_open() with NFSD_MAY_NOT_BREAK_LEASE when claim type is CLAIM_DELEGATE_CUR Signed-off-by: Casey Bodley <cbodley@citi.umich.edu> Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-07-23 14:58:17 -04:00
Aneesh Kumar K.V	48e370ff93	fs/9p: add 9P2000.L unlinkat operation unlinkat - Remove a directory entry size[4] Tunlinkat tag[2] dirfid[4] name[s] flag[4] size[4] Runlinkat tag[2] older Tremove have the below request format size[4] Tremove tag[2] fid[4] The remove message is used to remove a directory entry either file or directory The remove opreation is actually a directory opertation and should ideally have dirfid, if not we cannot represent the fid on server with anything other than name. We will have to derive the directory name from fid in the Tremove request. NOTE: The operation doesn't clunk the unlink fid. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-07-23 09:32:52 -05:00
Aneesh Kumar K.V	9e8fb38e7d	fs/9p: add 9P2000.L renameat operation renameat - change name of file or directory size[4] Trenameat tag[2] olddirfid[4] oldname[s] newdirfid[4] newname[s] size[4] Rrenameat tag[2] older Trename have the below request format size[4] Trename tag[2] fid[4] newdirfid[4] name[s] The rename message is used to change the name of a file, possibly moving it to a new directory. The rename opreation is actually a directory opertation and should ideally have olddirfid, if not we cannot represent the fid on server with anything other than name. We will have to derive the old directory name from fid in the Trename request. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-07-23 09:32:51 -05:00
Aneesh Kumar K.V	ed80fcfac2	fs/9p: Always ask new inode in create This make sure we don't end up reusing the unlinked inode object. The ideal way is to use inode i_generation. But i_generation is not available in userspace always. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-07-23 09:32:50 -05:00
Prem Karat	a2dd43bb0d	fs/9p: Fix invalid mount options/args Without this fix, if any invalid mount options/args are passed while mouting the 9p fs, no error (-EINVAL) is returned and default arg value is assigned. This fix returns -EINVAL when an invalid arguement is found while parsing mount options. Signed-off-by: Prem Karat <prem.karat@linux.vnet.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-07-23 09:32:48 -05:00
Aneesh Kumar K.V	fd2421f544	fs/9p: When doing inode lookup compare qid details and inode mode bits. This make sure we don't use wrong inode from the inode hash. The inode number of the file deleted is reused by the next file system object created and if we only use inode number for inode hash lookup we could end up with wrong struct inode. Also compare inode generation number. Not all Linux file system provide st_gen in userspace. So it could be 0; Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-07-23 09:32:48 -05:00
Aneesh Kumar K.V	2053d67c54	fs/9p: remove rename work around in 9p Now that VFS does the right thing remove the work around. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-07-23 09:32:47 -05:00
Linus Torvalds	bbd9d6f7fb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits) vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp isofs: Remove global fs lock jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory fix IN_DELETE_SELF on overwriting rename() on ramfs et.al. mm/truncate.c: fix build for CONFIG_BLOCK not enabled fs:update the NOTE of the file_operations structure Remove dead code in dget_parent() AFS: Fix silly characters in a comment switch d_add_ci() to d_splice_alias() in "found negative" case as well simplify gfs2_lookup() jfs_lookup(): don't bother with . or .. get rid of useless dget_parent() in btrfs rename() and link() get rid of useless dget_parent() in fs/btrfs/ioctl.c fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers drivers: fix up various ->llseek() implementations fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek Ext4: handle SEEK_HOLE/SEEK_DATA generically Btrfs: implement our own ->llseek fs: add SEEK_HOLE and SEEK_DATA flags reiserfs: make reiserfs default to barrier=flush ... Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new shrinker callout for the inode cache, that clashed with the xfs code to start the periodic workers later.	2011-07-22 19:02:39 -07:00
Jan Kara	b22570d9ab	ext3: Fix data corruption in inodes with journalled data When journalling data for an inode (either because it is a symlink or because the filesystem is mounted in data=journal mode), ext3_evict_inode() can discard unwritten data by calling truncate_inode_pages(). This is because we don't mark the buffer / page dirty when journalling data but only add the buffer to the running transaction and thus mm does not know there are still unwritten data. Fix the problem by carefully tracking transaction containing inode's data, committing this transaction, and writing uncheckpointed buffers when inode should be reaped. Signed-off-by: Jan Kara <jack@suse.cz>	2011-07-23 01:49:00 +02:00
Konstantin Khlebnikov	5a9a43646c	vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp Replace unclear (struct dentry ) to (struct file ) typecast with ERR_CAST() macro. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-22 19:42:13 -04:00
Jan Kara	d769b3c2ab	isofs: Remove global fs lock sbi->s_mutex isn't needed for isofs at all so we can just remove it. Generally, since isofs is always mounted read-only, filesystem structure cannot change under us. So buffer_head contents stays constant after it's filled in. That leaves us with possible changes of global data structures. Superblock changes only during filesystem mount (even remount does not change it), inodes are only filled in during reading from disk. So there are no changes of these structures to bother about. Arguments why sbi->s_mutex can be removed at each place: isofs_readdir: Accesses sb, inode, filp, local variables => s_mutex not needed isofs_lookup: Protected by directory's i_mutex. Accesses sb, inode, dentry, local variables => s_mutex not needed rock_ridge_symlink_readpage: Protected by page lock. Accesses sb, inode, local variables => s_mutex not needed. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-22 19:42:12 -04:00
Al Viro	22ba747f66	jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory We don't generate IN_DELETE_SELF on victim of overwriting rename() if it happens to be a directory. Trivially fixed by doing to ->i_nlink what we do ->pino_nlink a couple of lines later in jffs2_rename(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-22 19:42:11 -04:00
Al Viro	841590ce16	fix IN_DELETE_SELF on overwriting rename() on ramfs et.al. On ramfs and other simple_rename() users IN_DELETE_SELF is not generated for victim of overwriting rename() if it's is a directory. Works on most of the local filesystems and really trivial to fix... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-22 19:42:11 -04:00
Matthew Garrett	dee28e72b6	pstore: Allow the user to explicitly choose a backend pstore only allows one backend to be registered at present, but the system may provide several. Add a parameter to allow the user to choose which backend will be used rather than just relying on load order. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2011-07-22 16:14:39 -07:00
Matthew Garrett	b94fdd077e	pstore: Make "part" unsigned We'll never have a negative part, so just make this an unsigned int. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2011-07-22 16:14:29 -07:00
Matthew Garrett	56280682ce	pstore: Add extra context for writes and erases EFI only provides small amounts of individual storage, and conventionally puts metadata in the storage variable name. Rather than add a metadata header to the (already limited) variable storage, it's easier for us to modify pstore to pass all the information we need to construct a unique variable name to the appropriate functions. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2011-07-22 16:14:20 -07:00
Matthew Garrett	638c1fd303	pstore: Extend API for more flexibility in new backends Some pstore implementations may not have a static context, so extend the API to pass the pstore_info struct to all calls and allow for a context pointer. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2011-07-22 16:14:06 -07:00
Linus Torvalds	8209f53d79	Merge branch 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc * 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (39 commits) ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever ptrace: fix ptrace_signal() && STOP_DEQUEUED interaction connector: add an event for monitoring process tracers ptrace: dont send SIGSTOP on auto-attach if PT_SEIZED ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task() ptrace_init_task: initialize child->jobctl explicitly has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/ ptrace: make former thread ID available via PTRACE_GETEVENTMSG after PTRACE_EVENT_EXEC stop ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/ ptrace: kill real_parent_is_ptracer() in in favor of ptrace_reparented() ptrace: ptrace_reparented() should check same_thread_group() redefine thread_group_leader() as exit_signal >= 0 do not change dead_task->exit_signal kill task_detached() reparent_leader: check EXIT_DEAD instead of task_detached() make do_notify_parent() __must_check, update the callers __ptrace_detach: avoid task_detached(), check do_notify_parent() kill tracehook_notify_death() make do_notify_parent() return bool ptrace: s/tracehook_tracer_task()/ptrace_parent()/ ...	2011-07-22 15:06:50 -07:00
Linus Torvalds	c1f792a5bf	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: (49 commits) xfs: add size update tracepoint to IO completion xfs: convert AIL cursors to use struct list_head xfs: remove confusing ail cursor wrapper xfs: use a cursor for bulk AIL insertion xfs: failure mapping nfs fh to inode should return ESTALE xfs: Remove the second parameter to xfs_sb_count() xfs: remove the dead XFS_DABUF_DEBUG code xfs: remove leftovers of the old btree tracing code xfs: remove the dead QUOTADEBUG code xfs: remove the unused xfs_buf_delwri_sort function xfs: remove wrappers around b_iodone xfs: remove wrappers around b_fspriv xfs: add a proper transaction pointer to struct xfs_buf xfs: factor out xfs_da_grow_inode_int xfs: factor out xfs_dir2_leaf_find_stale xfs: cleanup struct xfs_dir2_free xfs: reshuffle dir2 headers xfs: start periodic workers later Revert "xfs: fix filesystsem freeze race in xfs_trans_alloc" xfs: remove variables that serve no purpose in xfs_alloc_ag_vextent_exact() ...	2011-07-22 13:16:33 -07:00
Linus Torvalds	6aaf4404ab	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm: dlm: don't limit active work items dlm: use workqueue for callbacks dlm: remove deadlock debug print dlm: improve rsb searches dlm: keep lkbs in idr dlm: fix kmalloc args dlm: don't do pointless NULL check, use kzalloc and fix order of arguments dlm: dump address of unknown node dlm: use vmalloc for hash tables dlm: show addresses in configfs	2011-07-22 13:16:07 -07:00
Linus Torvalds	ba1f9db908	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus: hfsplus: ensure bio requests are not smaller than the hardware sectors hfsplus: Add additional range check to handle on-disk corruptions hfsplus: Add error propagation for hfsplus_ext_write_extent_locked hfsplus: add error checking for hfs_find_init() hfsplus: lift the 2TB size limit hfsplus: fix overflow in hfsplus_read_wrapper hfsplus: fix overflow in hfsplus_get_block hfsplus: assignments inside `if' condition clean-up	2011-07-22 13:12:17 -07:00
Linus Torvalds	49302baa64	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: GFS2: combine duplicated block freeing routines GFS2: Add S_NOSEC support GFS2: Automatically adjust glock min hold time GFS2: Cache dir hash table in a contiguous buffer	2011-07-22 13:10:41 -07:00
Linus Torvalds	59a7ac1211	Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6 * 'linux-next' of git://git.infradead.org/ubifs-2.6: (32 commits) MAINTAINERS: change e-mail of Adrian Hunter UBIFS: fix master node recovery UBIFS: improve power cut emulation testing UBIFS: rename recovery testing variables UBIFS: remove custom list of superblocks UBIFS: stop re-defining UBI operations UBIFS: switch to I/O helpers UBIFS: switch to ubifs_leb_write UBIFS: switch to ubifs_leb_read UBIFS: introduce more I/O helpers UBIFS: always print stacktrace when switching to R/O mode UBIFS: remove unused and unneeded debugging function UBIFS: add global debugfs knobs UBIFS: introduce debugfs helpers UBIFS: re-arrange debugging code a bit UBIFS: be more informative in failure mode UBIFS: switch self-check knobs to debugfs UBIFS: lessen amount of debugging check types UBIFS: introduce helper functions for debugging checks and tests UBIFS: amend debugging inode size check function prototype ...	2011-07-22 13:09:35 -07:00
Wang Sheng-Hui	03b5bb3429	ext2: check xattr name_len before acquiring xattr_sem in ext2_xattr_get In ext2_xattr_get(), the code will acquire xattr_sem first, later checks the length of xattr name_len > 255. It's unnecessarily time consuming and also ext2_xattr_set() checks the length before other checks. So move the check before acquiring xattr_sem to make these two functions consistent. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2011-07-22 19:41:16 +02:00
Jean Delvare	df2e301fee	fs: Merge split strings No idea why these were split in the first place... Signed-off-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-07-22 16:47:15 +02:00
Seth Forshee	6596528e39	hfsplus: ensure bio requests are not smaller than the hardware sectors Currently all bio requests are 512 bytes, which may fail for media whose physical sector size is larger than this. Ensure these requests are not smaller than the block device logical block size. BugLink: http://bugs.launchpad.net/bugs/734883 Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-07-22 16:37:44 +02:00
Naohiro Aota	aac4e4198e	hfsplus: Add additional range check to handle on-disk corruptions 'recoff' is read from disk and used for an argument to memcpy, so if the value read from disk is larger than the page size, it result to "general protection fault". This patch add additional range check for the value, so that disk fuzz won't cause such fault. Signed-off-by: Naohiro Aota <naota@elisp.net> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-07-22 16:36:56 +02:00
Oleg Nesterov	eac1b5e57d	ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever Test-case: void tfunc(void arg) { execvp("true", NULL); return NULL; } int main(void) { int pid; if (fork()) { pthread_t t; kill(getpid(), SIGSTOP); pthread_create(&t, NULL, tfunc, NULL); for (;;) pause(); } pid = getppid(); assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0); while (wait(NULL) > 0) ptrace(PTRACE_CONT, pid, 0,0); return 0; } It is racy, exit_notify() does __wake_up_parent() too. But in the likely case it triggers the problem: de_thread() does release_task() and the old leader goes away without the notification, the tracer sleeps in do_wait() without children/tracees. Change de_thread() to do __wake_up_parent(traced_leader->parent). Since it is already EXIT_DEAD we can do this without ptrace_unlink(), EXIT_DEAD threads do not exist from do_wait's pov. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org>	2011-07-22 15:10:49 +02:00
Phillip Lougher	cc6d349714	Squashfs: Make ZLIB compression support optional Squashfs now supports XZ and LZO compression in addition to ZLIB. As such it no longer makes sense to always include ZLIB support. In particular embedded systems may only use LZO or XZ compression, and the ability to exclude ZLIB support will reduce kernel size. Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>	2011-07-22 03:01:28 +01:00
Linus Torvalds	2bafc7a275	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: CIFS: Fix wrong length in cifs_iovec_read	2011-07-21 14:28:01 -07:00
Linus Torvalds	b91da88fed	vfs: drop conditional inode prefetch in __do_lookup_rcu It seems to hurt performance in real life. Yes, the inode will be used later, but the conditional doesn't seem to predict all that well (negative dentries are not uncommon) and it looks like the cost of prefetching is simply higher than depending on the cache doing the right thing. As usual. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-21 11:01:42 -07:00
Jan Beulich	b307d4655a	FS-Cache: Fix __fscache_uncache_all_inode_pages()'s outer loop The compiler, at least for ix86 and m68k, validly warns that the comparison: next <= (loff_t)-1 is always true (and it's always true also for x86-64 and probably all other arches - as long as pgoff_t isn't wider than loff_t). The intention appears to be to avoid wrapping of "next", so rather than eliminating the pointless comparison, fix the loop to indeed get exited when "next" would otherwise wrap. On m68k the following warning is observed: fs/fscache/page.c: In function '__fscache_uncache_all_inode_pages': fs/fscache/page.c:979: warning: comparison is always false due to limited range of data type Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Reported-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Suresh Jayaraman <sjayaraman@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-21 10:59:16 -07:00
Pavel Shilovsky	2cebaa58b7	CIFS: Fix wrong length in cifs_iovec_read Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-21 00:48:05 +00:00
Al Viro	86c98e8cdb	Remove dead code in dget_parent() ->d_parent is never NULL... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:48:04 -04:00
David Howells	e4b9f00581	AFS: Fix silly characters in a comment Fix silly characters in a comment in AFS code (some weird characters replaced the word 'flag' some point way back). Reported-by: viro@ZenIV.linux.org.uk Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:48:03 -04:00
Al Viro	4513d899c4	switch d_add_ci() to d_splice_alias() in "found negative" case as well Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:48:02 -04:00
Al Viro	6c673ab393	simplify gfs2_lookup() d_splice_alias() will DTRT when given NULL or ERR_PTR Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:48:02 -04:00
Al Viro	79ac5a46c5	jfs_lookup(): don't bother with . or .. they'll never be passed to ->lookup() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:48:01 -04:00
Al Viro	10d9f309d8	get rid of useless dget_parent() in btrfs rename() and link() ->d_parent is locked and stable there... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:48:00 -04:00
Al Viro	2fbe8c8ad1	get rid of useless dget_parent() in fs/btrfs/ioctl.c both callers there have dentry->d_parent stabilized by the fact that their caller had obtained dentry from lookup_one_len() and had not dropped ->i_mutex on parent since then. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:48:00 -04:00
Josef Bacik	02c24a8218	fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers Btrfs needs to be able to control how filemap_write_and_wait_range() is called in fsync to make it less of a painful operation, so push down taking i_mutex and the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some file systems can drop taking the i_mutex altogether it seems, like ext3 and ocfs2. For correctness sake I just pushed everything down in all cases to make sure that we keep the current behavior the same for everybody, and then each individual fs maintainer can make up their mind about what to do from there. Thanks, Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:59 -04:00
Josef Bacik	06222e491e	fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases we just return -EINVAL, in others we do the normal generic thing, and in others we're simply making sure that the properly due-dilligence is done. For example in NFS/CIFS we need to make sure the file size is update properly for the SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself that is all we have to do. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:58 -04:00
Josef Bacik	c334b1138b	Ext4: handle SEEK_HOLE/SEEK_DATA generically Since Ext4 has its own lseek we need to make sure it handles SEEK_HOLE/SEEK_DATA. For now just do the same thing that is done in the generic case, somebody else can come along and make it do fancy things later. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:57 -04:00
Josef Bacik	b26751575a	Btrfs: implement our own ->llseek In order to handle SEEK_HOLE/SEEK_DATA we need to implement our own llseek. Basically for the normal SEEK_*'s we will just defer to the generic helper, and for SEEK_HOLE/SEEK_DATA we will use our fiemap helper to figure out the nearest hole or data. Currently this helper doesn't check for delalloc bytes for prealloc space, so for now treat prealloc as data until that is fixed. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:56 -04:00
Josef Bacik	982d816581	fs: add SEEK_HOLE and SEEK_DATA flags This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags. Turns out using fiemap in things like cp cause more problems than it solves, so lets try and give userspace an interface that doesn't suck. We need to match solaris here, and the definitions are o If /whence/ is SEEK_HOLE, the offset of the start of the next hole greater than or equal to the supplied offset is returned. The definition of a hole is provided near the end of the DESCRIPTION. o If /whence/ is SEEK_DATA, the file pointer is set to the start of the next non-hole file region greater than or equal to the supplied offset. So in the generic case the entire file is data and there is a virtual hole at the end. That means we will just return i_size for SEEK_HOLE and will return the same offset for SEEK_DATA. This is how Solaris does it so we have to do it the same way. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:56 -04:00
Christoph Hellwig	b4d5b10fb2	reiserfs: make reiserfs default to barrier=flush Change the default reiserfs mount option to barrier=flush. Based on a patch from Jeff Mahoney in the SuSE tree. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:55 -04:00
Christoph Hellwig	00eacd66cd	ext3: make ext3 mount default to barrier=1 This patch turns on barriers by default for ext3. mount -o barrier=0 will turn them off. Based on a patch from Chris Mason in the SuSE tree. Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Jeff Mahoney <jeffm@suse.com> Acked-by: Ted Ts'o <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:54 -04:00
Al Viro	b85fd6bdc9	don't open-code parent_ino() in assorted ->readdir() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:54 -04:00
Al Viro	2def9e4ec7	minix_getattr(): don't bother with ->d_parent we can find superblock easier, TYVM... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:53 -04:00
Al Viro	ee60498f3e	coda_venus_readdir(): use offsetof() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:52 -04:00
Kay Sievers	f15146380d	fs: seq_file - add event counter to simplify poll() support Moving the event counter into the dynamically allocated 'struc seq_file' allows poll() support without the need to allocate its own tracking structure. All current users are switched over to use the new counter. Requested-by: Andrew Morton akpm@linux-foundation.org Acked-by: NeilBrown <neilb@suse.de> Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:50 -04:00
Christoph Hellwig	72c5052ddc	fs: move inode_dio_done to the end_io handler For filesystems that delay their end_io processing we should keep our i_dio_count until the the processing is done. Enable this by moving the inode_dio_done call to the end_io handler if one exist. Note that the actual move to the workqueue for ext4 and XFS is not done in this patch yet, but left to the filesystem maintainers. At least for XFS it's not needed yet either as XFS has an internal equivalent to i_dio_count. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:50 -04:00
Christoph Hellwig	aacfc19c62	fs: simplify the blockdev_direct_IO prototype Simple filesystems always pass inode->i_sb_bdev as the block device argument, and never need a end_io handler. Let's simply things for them and for my grepping activity by dropping these arguments. The only thing not falling into that scheme is ext4, which passes and end_io handler without needing special flags (yet), but given how messy the direct I/O code there is use of __blockdev_direct_IO in one instead of two out of three cases isn't going to make a large difference anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:49 -04:00
Christoph Hellwig	df2d6f2658	fs: always maintain i_dio_count Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING. This these filesystems to also protect truncate against direct I/O requests by using common code. Right now the only non-DIO_LOCKING filesystem that appears to do so is XFS, which uses an opencoded variant of the i_dio_count scheme. Behaviour doesn't change for filesystems never calling inode_dio_wait. For ext4 behaviour changes when using the dioread_nonlock option, which previously was missing any protection between truncate and direct I/O reads. For ocfs2 that handcrafted i_dio_count manipulations are replaced with the common code now enable. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:48 -04:00
Christoph Hellwig	562c72aa57	fs: move inode_dio_wait calls into ->setattr Let filesystems handle waiting for direct I/O requests themselves instead of doing it beforehand. This means filesystem-specific locks to prevent new dio referenes from appearing can be held. This is important to allow generalizing i_dio_count to non-DIO_LOCKING filesystems. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:47 -04:00
Christoph Hellwig	bd5fe6c5eb	fs: kill i_alloc_sem i_alloc_sem is a rather special rw_semaphore. It's the last one that may be released by a non-owner, and it's write side is always mirrored by real exclusion. It's intended use it to wait for all pending direct I/O requests to finish before starting a truncate. Replace it with a hand-grown construct: - exclusion for truncates is already guaranteed by i_mutex, so it can simply fall way - the reader side is replaced by an i_dio_count member in struct inode that counts the number of pending direct I/O requests. Truncate can't proceed as long as it's non-zero - when i_dio_count reaches non-zero we wake up a pending truncate using wake_up_bit on a new bit in i_flags - new references to i_dio_count can't appear while we are waiting for it to read zero because the direct I/O count always needs i_mutex (or an equivalent like XFS's i_iolock) for starting a new operation. This scheme is much simpler, and saves the space of a spinlock_t and a struct list_head in struct inode (typically 160 bits on a non-debug 64-bit system). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:46 -04:00
Christoph Hellwig	f9b5570d7f	fs: simplify handling of zero sized reads in __blockdev_direct_IO Reject zero sized reads as soon as we know our I/O length, and don't borther with locks or allocations that might have to be cleaned up otherwise. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:45 -04:00
Jan Kara	9ea7df534e	ext4: Rewrite ext4_page_mkwrite() to use generic helpers Rewrite ext4_page_mkwrite() to use __block_page_mkwrite() helper. This removes the need of using i_alloc_sem to avoid races with truncate which seems to be the wrong locking order according to lock ordering documented in mm/rmap.c. Also calling ext4_da_write_begin() as used by the old code seems to be problematic because we can decide to flush delay-allocated blocks which will acquire s_umount semaphore - again creating unpleasant lock dependency if not directly a deadlock. Also add a check for frozen filesystem so that we don't busyloop in page fault when the filesystem is frozen. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:45 -04:00
Christoph Hellwig	5826869158	fat: remove i_alloc_sem abuse Add a new rw_semaphore to protect bmap against truncate. Previous i_alloc_sem was abused for this, but it's going away in this series. Note that we can't simply use i_mutex, given that the swapon code calls ->bmap under it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:44 -04:00
Tobias Klauser	8c5dc70aae	VFS: Fixup kerneldoc for generic_permission() The flags parameter went away in d749519b444db985e40b897f73ce1898b11f997e Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:43 -04:00
Dave Chinner	8daaa83145	xfs: make use of new shrinker callout for the inode cache Convert the inode reclaim shrinker to use the new per-sb shrinker operations. This allows much bigger reclaim batches to be used, and allows the XFS inode cache to be shrunk in proportion with the VFS dentry and inode caches. This avoids the problem of the VFS caches being shrunk significantly before the XFS inode cache is shrunk resulting in imbalances in the caches during reclaim. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:42 -04:00
Dave Chinner	8ab47664d5	vfs: increase shrinker batch size Now that the per-sb shrinker is responsible for shrinking 2 or more caches, increase the batch size to keep econmies of scale for shrinking each cache. Increase the shrinker batch size to 1024 objects. To allow for a large increase in batch size, add a conditional reschedule to prune_icache_sb() so that we don't hold the LRU spin lock for too long. This mirrors the behaviour of the __shrink_dcache_sb(), and allows us to increase the batch size without needing to worry about problems caused by long lock hold times. To ensure that filesystems using the per-sb shrinker callouts don't cause problems, document that the object freeing method must reschedule appropriately inside loops. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:41 -04:00
Dave Chinner	0e1fdafd93	superblock: add filesystem shrinker operations Now we have a per-superblock shrinker implementation, we can add a filesystem specific callout to it to allow filesystem internal caches to be shrunk by the superblock shrinker. Rather than perpetuate the multipurpose shrinker callback API (i.e. nr_to_scan == 0 meaning "tell me how many objects freeable in the cache), two operations will be added. The first will return the number of objects that are freeable, the second is the actual shrinker call. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:41 -04:00
Dave Chinner	4f8c19fdf3	inode: remove iprune_sem Now that we have per-sb shrinkers with a lifecycle that is a subset of the superblock lifecycle and can reliably detect a filesystem being unmounted, there is not longer any race condition for the iprune_sem to protect against. Hence we can remove it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:40 -04:00
Dave Chinner	b0d40c92ad	superblock: introduce per-sb cache shrinker infrastructure With context based shrinkers, we can implement a per-superblock shrinker that shrinks the caches attached to the superblock. We currently have global shrinkers for the inode and dentry caches that split up into per-superblock operations via a coarse proportioning method that does not batch very well. The global shrinkers also have a dependency - dentries pin inodes - so we have to be very careful about how we register the global shrinkers so that the implicit call order is always correct. With a per-sb shrinker callout, we can encode this dependency directly into the per-sb shrinker, hence avoiding the need for strictly ordering shrinker registrations. We also have no need for any proportioning code for the shrinker subsystem already provides this functionality across all shrinkers. Allowing the shrinker to operate on a single superblock at a time means that we do less superblock list traversals and locking and reclaim should batch more effectively. This should result in less CPU overhead for reclaim and potentially faster reclaim of items from each filesystem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:10 -04:00
J. Bruce Fields	8fb47a4fbf	locks: rename lock-manager ops Both the filesystem and the lock manager can associate operations with a lock. Confusingly, one of them (fl_release_private) actually has the same name in both operation structures. It would save some confusion to give the lock-manager ops different names. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-07-20 20:23:19 -04:00
Dave Chinner	55fb25d5b3	xfs: add size update tracepoint to IO completion For improving insight into IO completion behaviour. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-20 18:38:04 -05:00
Dave Chinner	af3e40228f	xfs: convert AIL cursors to use struct list_head The list of active AIL cursors uses a roll-your-own linked list with special casing for the AIL push cursor. Simplify this code by replacing the list with standard struct list_head lists, and use a separate list_head to track the active cursors. This allows us to treat the AIL push cursor as a generic cursor rather than as a special case, further simplifying the code. Further, fix the duplicate push cursor initialisation that the special case handling was hiding, and clean up all the comments around the active cursor list handling. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-20 18:37:46 -05:00
Dave Chinner	16b5902943	xfs: remove confusing ail cursor wrapper xfs_trans_ail_cursor_set() doesn't set the cursor to the current log item, it sets it to the next item. There is already a function for doing this - xfs_trans_ail_cursor_next() - and the _set function is simply a two line wrapper. Remove it and open code the setting of the cursor in the two locations that call it to remove the confusion. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-20 18:37:37 -05:00
Dave Chinner	1d8c95a363	xfs: use a cursor for bulk AIL insertion Delayed logging can insert tens of thousands of log items into the AIL at the same LSN. When the committing of log commit records occur, we can get insertions occurring at an LSN that is not at the end of the AIL. If there are thousands of items in the AIL on the tail LSN, each insertion has to walk the AIL to find the correct place to insert the new item into the AIL. This can consume large amounts of CPU time and block other operations from occurring while the traversals are in progress. To avoid this repeated walk, use a AIL cursor to record where we should be inserting the new items into the AIL without having to repeat the walk. The cursor infrastructure already provides this functionality for push walks, so is a simple extension of existing code. While this will not avoid the initial walk, it will avoid repeating it tens of thousands of times during a single checkpoint commit. This version includes logic improvements from Christoph Hellwig. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-20 18:37:20 -05:00
J. Bruce Fields	ad1a2c878c	xfs: failure mapping nfs fh to inode should return ESTALE On xfs exports, nfsd is incorrectly returning ENOENT instead of ESTALE on attempts to use a filehandle of a deleted file (spotted with pynfs test PUTFH3). The ENOENT was coming from xfs_iget. (It's tempting to wonder whether we should just map all xfs_iget errors to ESTALE, but I don't believe so--xfs_iget can also return ENOMEM at least, which we wouldn't want mapped to ESTALE.) While we're at it, the other return of ENOENT in xfs_nfs_get_inode() also looks wrong. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-20 18:35:21 -05:00
Chandra Seetharaman	adab0f67d1	xfs: Remove the second parameter to xfs_sb_count() Remove the second parameter to xfs_sb_count() since all callers of the function set them. Also, fix the header comment regarding it being called periodically. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-20 18:35:03 -05:00
Bernd Schubert	c878c73f8d	ext3: Fix compilation with -DDX_DEBUG Compilation of ext3/namei.c brought up an error and warning messages when compiled with -DDX_DEBUG. Signed-off-by: Bernd Schubert<bernd.schubert@itwm.fraunhofer.de> Signed-off-by: Jan Kara <jack@suse.cz>	2011-07-20 23:16:33 +02:00
Dave Chinner	12ad3ab661	superblock: move pin_sb_for_writeback() to fs/super.c The per-sb shrinker has the same requirement as the writeback threads of ensuring that the superblock is usable and pinned for the time it takes to run the work. Both need to take a passive reference to the sb, take a read lock on the s_umount lock and then only continue if an unmount is not in progress. pin_sb_for_writeback() does this exactly, so move it to fs/super.c and rename it to grab_super_passive() and exporting it via fs/internal.h for all the VFS code to be able to use. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:38 -04:00
Dave Chinner	09cc9fc7a7	inode: move to per-sb LRU locks With the inode LRUs moving to per-sb structures, there is no longer a need for a global inode_lru_lock. The locking can be made more fine-grained by moving to a per-sb LRU lock, isolating the LRU operations of different filesytsems completely from each other. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:36 -04:00
Dave Chinner	98b745c647	inode: Make unused inode LRU per superblock The inode unused list is currently a global LRU. This does not match the other global filesystem cache - the dentry cache - which uses per-superblock LRU lists. Hence we have related filesystem object types using different LRU reclaimation schemes. To enable a per-superblock filesystem cache shrinker, both of these caches need to have per-sb unused object LRU lists. Hence this patch converts the global inode LRU to per-sb LRUs. The patch only does rudimentary per-sb propotioning in the shrinker infrastructure, as this gets removed when the per-sb shrinker callouts are introduced later on. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:35 -04:00
Dave Chinner	fcb94f72d3	inode: convert inode_stat.nr_unused to per-cpu counters Before we split up the inode_lru_lock, the unused inode counter needs to be made independent of the global inode_lru_lock. Convert it to per-cpu counters to do this. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:34 -04:00
Al Viro	a9049376ee	make d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err) ... and simplify the living hell out of callers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:26 -04:00
Al Viro	0c1aa9a952	deuglify squashfs_lookup() d_splice_alias(NULL, dentry) is equivalent to d_add(dentry, NULL), NULL so no need for that if (inode) ... in there (or ERR_PTR(0), for that matter) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:24 -04:00
Al Viro	5b4b299cc7	nfsd4_list_rec_dir(): don't bother with reopening rec_file just rewind it to the beginning before vfs_readdir() and be done with that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:23 -04:00
Al Viro	e7f5909707	kill useless checks for sb->s_op == NULL never is... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:21 -04:00
Al Viro	0ee5dc676a	btrfs: kill magical embedded struct superblock Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:20 -04:00
Al Viro	fb408e6ccc	get rid of pointless checks for dentry->sb == NULL it never is... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:19 -04:00
Al Viro	a4464dbc0c	Make ->d_sb assign-once and always non-NULL New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name). Allocates dentry, sets its ->d_sb to given superblock and sets ->d_op accordingly. Old d_alloc(NULL, name) callers are converted to that (all of them know what superblock they want). d_alloc() itself is left only for parent != NULl case; uses __d_alloc(), inserts result into the list of parent's children. Note that now ->d_sb is assign-once and never NULL and ->d_parent is never NULL either. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:17 -04:00
Al Viro	e3c3d9c838	unexport kern_path_parent() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:16 -04:00
Al Viro	e0a0124936	switch vfs_path_lookup() to struct path Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:14 -04:00
Al Viro	ed75e95de5	kill lookup_create() folded into the only caller (kern_path_create()) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:12 -04:00
Al Viro	dae6ad8f37	new helpers: kern_path_create/user_path_create combination of kern_path_parent() and lookup_create(). Does not expose struct nameidata to caller. Syscalls converted to that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:05 -04:00
Al Viro	49084c3bb2	kill LOOKUP_CONTINUE LOOKUP_PARENT is equivalent to it now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:03 -04:00
Al Viro	8aeb376ca0	nfs: LOOKUP_{OPEN,CREATE,EXCL} is set only on the last step Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:02 -04:00
Al Viro	4352780386	cifs_lookup(): LOOKUP_OPEN is set only on the last component Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:00 -04:00
Al Viro	a127e0af59	ceph: LOOKUP_OPEN is set only when it's the last component Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:59 -04:00
Al Viro	5c0f360b08	jfs_ci_revalidate() is safe from RCU mode Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:57 -04:00
Al Viro	407938e79e	LOOKUP_CREATE and LOOKUP_RENAME_TARGET can be set only on the last step Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:56 -04:00
Al Viro	dd7dd556e4	no need to check for LOOKUP_OPEN in ->create() instances ... it will be set in nd->flag for all cases with non-NULL nd (i.e. when called from do_last()). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:56 -04:00
Al Viro	bf6c7f6c7b	don't pass nameidata to vfs_create() from ecryptfs_create() Instead of playing with removal of LOOKUP_OPEN, mangling (and restoring) nd->path, just pass NULL to vfs_create(). The whole point of what's being done there is to suppress any attempts to open file by underlying fs, which is what nd == NULL indicates. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:54 -04:00
Al Viro	8a5e929dd2	don't transliterate lower bits of ->intent.open.flags to FMODE_... ->create() instances are much happier that way... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:52 -04:00
Al Viro	554a8b9f54	Don't pass nameidata when calling vfs_create() from mknod() All instances can cope with that now (and ceph one actually starts working properly). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:49 -04:00
Al Viro	f7c85868fc	fix mknod() on nfs4 (hopefully) a) check the right flags in ->create() (LOOKUP_OPEN, not LOOKUP_CREATE) b) default (!LOOKUP_OPEN) open_flags is O_CREAT\|O_EXCL\|FMODE_READ, not 0 c) lookup_instantiate_filp() should be done only with LOOKUP_OPEN; otherwise we need to issue CLOSE, lest we leak stateid on server. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:46 -04:00
Al Viro	511415980a	nameidata_to_nfs_open_context() doesn't need nameidata, actually... just open flags; switched to passing just those and renamed to create_nfs_open_context() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:45 -04:00
Al Viro	3d4ff43d89	nfs_open_context doesn't need struct path either just dentry, please... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:44 -04:00
Al Viro	82a2c1b77a	nfs4_opendata doesn't need struct path either Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:42 -04:00
Al Viro	643168c2dc	nfs4_closedata doesn't need to mess with struct path instead of path_get()/path_put(), we can just use nfs_sb_{,de}active() to pin the superblock down. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:41 -04:00
Al Viro	7c97c200e2	cifs: fix the type of cifs_demultiplex_thread() ... and get rid of a bogus typecast, while we are at it; it's not just that we want a function returning int and not void, but cast to pointer to function taking void * and returning void would be (void ()(void )) and not (void )(void ), TYVM... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:39 -04:00
Al Viro	beefebf1aa	ecryptfs_inode_permission() doesn't need to bail out on RCU ... now that inode_permission() can take MAY_NOT_BLOCK and handle it properly. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:38 -04:00
Al Viro	d2d9e9fbc2	merge do_revalidate() into its only caller Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:34 -04:00
Al Viro	4ad5abb3d0	no reason to keep exec_permission() separate now cache footprint alone makes it a bad idea... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:32 -04:00
Al Viro	d594e7ec4d	massage generic_permission() to treat directories on a separate path Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:30 -04:00
Al Viro	eecdd358b4	->permission() sanitizing: don't pass flags to exec_permission() pass mask instead; kill security_inode_exec_permission() since we can use security_inode_permission() instead. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:29 -04:00
Al Viro	10556cb21a	->permission() sanitizing: don't pass flags to ->permission() not used by the instances anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:24 -04:00
Al Viro	2830ba7f34	->permission() sanitizing: don't pass flags to generic_permission() redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of them removes that bit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:22 -04:00
Al Viro	7e40145eb1	->permission() sanitizing: don't pass flags to ->check_acl() not used in the instances anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:21 -04:00
Al Viro	9c2c703929	->permission() sanitizing: pass MAY_NOT_BLOCK to ->check_acl() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:19 -04:00
Al Viro	1fc0f78ca9	->permission() sanitizing: MAY_NOT_BLOCK Duplicate the flags argument into mask bitmap. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:18 -04:00
Al Viro	178ea73521	kill check_acl callback of generic_permission() its value depends only on inode and does not change; we might as well store it in ->i_op->check_acl and be done with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:16 -04:00
Al Viro	07b8ce1ee8	lockless get_write_access/deny_write_access new helpers: atomic_inc_unless_negative()/atomic_dec_unless_positive() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:14 -04:00
Al Viro	f4d6ff89d8	move exec_permission() up to the rest of permission-related functions ... and convert the comment before it into linuxdoc form. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:13 -04:00
Al Viro	3bfa784a65	kill file_permission() completely convert the last remaining caller to inode_permission() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:11 -04:00
Al Viro	1b5d783c94	consolidate BINPRM_FLAGS_ENFORCE_NONDUMP handling new helper: would_dump(bprm, file). Checks if we are allowed to read the file and if we are not - sets ENFORCE_NODUMP. Exported, used in places that previously open-coded the same logics. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:10 -04:00
Al Viro	78f32a9b47	switch path_init() to exec_permission() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:08 -04:00
Al Viro	6f28610974	switch udf_ioctl() to inode_permission() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:07 -04:00
Al Viro	4cf27141cb	make exec_permission(dir) really equivalent to inode_permission(dir, MAY_EXEC) capability overrides apply only to the default case; if fs has ->permission() that does _not_ call generic_permission(), we have no business doing them. Moreover, if it has ->permission() that does call generic_permission(), we have no need to recheck capabilities. Besides, the capability overrides should apply only if we got EACCES from acl_permission_check(); any other value (-EIO, etc.) should be returned to caller, capabilities or not capabilities. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:05 -04:00
Al Viro	43e15cdbef	new helper: iterate_supers_type() Call the given function for all superblocks of given type. Function gets a superblock (with s_umount locked shared) and (void *) argument supplied by caller of iterator. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:04 -04:00
Josef Bacik	44396f4b5c	fs: add a DCACHE_NEED_LOOKUP flag for d_flags Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order for readdir, but unfortunately in the case of anything that stats the files in order that readdir spits back (like oh say ls) that means we still have to do the normal lookup of the file, which means looking up our other index and then looking up the inode. What I want is a way to create dummy dentries when we find them in readdir so that when ls or anything else subsequently does a stat(), we already have the location information in the dentry and can go straight to the inode itself. The lookup stuff just assumes that if it finds a dentry it is done, it doesn't perform a lookup. So add a DCACHE_NEED_LOOKUP flag so that the lookup code knows it still needs to run i_op->lookup() on the parent to get the inode for the dentry. I have tested this with btrfs and I went from something that looks like this http://people.redhat.com/jwhiter/ls-noreada.png To this http://people.redhat.com/jwhiter/ls-good.png Thats a savings of 1300 seconds, or 22 minutes. That is a significant savings. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:03 -04:00
Akinobu Mita	f7b88631a8	fs/libfs.c: fix simple_attr_write() on 32bit machines Assume that /sys/kernel/debug/dummy64 is debugfs file created by debugfs_create_x64(). # cd /sys/kernel/debug # echo 0x1234567812345678 > dummy64 # cat dummy64 0x0000000012345678 # echo 0x80000000 > dummy64 # cat dummy64 0xffffffff80000000 A value larger than INT_MAX cannot be written to the debugfs file created by debugfs_create_u64 or debugfs_create_x64 on 32bit machine. Because simple_attr_write() uses simple_strtol() for the conversion. To fix this, use simple_strtoll() instead. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-19 22:09:30 -07:00
Linus Torvalds	e501f29c72	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: vfs: fix race in rcu lookup of pruned dentry Fix cifs_get_root() [ Edited the last commit to get rid of a 'unused variable "seq"' warning due to Al editing the patch. - Linus ]	2011-07-19 21:50:21 -07:00
Linus Torvalds	5943026240	vfs: fix race in rcu lookup of pruned dentry Don't update *inode in __follow_mount_rcu() until we'd verified that there is mountpoint there. Kudos to Hugh Dickins for catching that one in the first place and eventually figuring out the solution (and catching a braino in the earlier version of patch). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-19 21:49:01 -07:00
David Teigland	10d1459faf	dlm: don't limit active work items Allow multiple workqueue items (locks with callbacks) to be processed concurrently. There should be no reason not to take advantage of this workqueue feature. Signed-off-by: David Teigland <teigland@redhat.com>	2011-07-19 14:22:32 -05:00
Al Viro	fec11dd9a0	Fix cifs_get_root() Add missing ->i_mutex, convert to lookup_one_len() instead of (broken) open-coded analog, cope with getting something like a//b as relative pathname. Simplify the hell out of it, while we are there... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Jeff Layton <jlayton@redhat.com>	2011-07-18 13:51:58 -04:00
Mimi Zohar	975d294373	evm: imbed evm_inode_post_setattr Changing the inode's metadata may require the 'security.evm' extended attribute to be re-calculated and updated. Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>	2011-07-18 12:29:44 -04:00
Mimi Zohar	c7b87de23b	evm: evm_inode_post_removexattr When an EVM protected extended attribute is removed, update 'security.evm'. Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>	2011-07-18 12:29:43 -04:00
Mimi Zohar	1601fbad2b	xattr: define vfs_getxattr_alloc and vfs_xattr_cmp vfs_getxattr_alloc() and vfs_xattr_cmp() are two new kernel xattr helper functions. vfs_getxattr_alloc() first allocates memory for the requested xattr and then retrieves it. vfs_xattr_cmp() compares a given value with the contents of an extended attribute. Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>	2011-07-18 12:29:39 -04:00
Mimi Zohar	9d8f13ba3f	security: new security_inode_init_security API adds function callback This patch changes the security_inode_init_security API by adding a filesystem specific callback to write security extended attributes. This change is in preparation for supporting the initialization of multiple LSM xattrs and the EVM xattr. Initially the callback function walks an array of xattrs, writing each xattr separately, but could be optimized to write multiple xattrs at once. For existing security_inode_init_security() calls, which have not yet been converted to use the new callback function, such as those in reiserfs and ocfs2, this patch defines security_old_inode_init_security(). Signed-off-by: Mimi Zohar <zohar@us.ibm.com>	2011-07-18 12:29:38 -04:00
Linus Torvalds	d36c30181c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: hppfs_lookup(): don't open-code lookup_one_len() hppfs: fix dentry leak cramfs: get_cramfs_inode() returns ERR_PTR() on failure ufs should use d_splice_alias() fix exofs ->get_parent() ceph analog of cifs build_path_from_dentry() race fix cifs: build_path_from_dentry() race fix	2011-07-18 09:03:15 -07:00
J. Bruce Fields	1091006c5e	nfsd: turn on reply cache for NFSv4 It's sort of ridiculous that we've never had a working reply cache for NFSv4. On the other hand, we may still not: our current reply cache is likely not very good, especially in the TCP case (which is the only case that matters for v4). What we really need here is some serious testing. Anyway, here's a start. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-07-18 09:39:01 -04:00
J. Bruce Fields	3e98abffd1	nfsd4: call nfsd4_release_compoundargs from pc_release This simplifies cleanup a bit. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-07-18 09:38:02 -04:00
Robin Dong	d46203159e	ext4: avoid eh_entries overflow before insert extent_idx If eh_entries is equal to (or greater than) eh_max, the operation of inserting new extent_idx will make number of entries overflow. So check eh_entries before inserting the new extent_idx. Although there is no bug case according the code (function ext4_ext_insert_index is called by ext4_ext_split and ext4_ext_split is called only if the index block has free space), the right logic should be "lookup the capacity before insertion". Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-17 23:43:42 -04:00
Robin Dong	015861badd	ext4: avoid wasted extent cache lookup if !PUNCH_OUT_EXT This patch avoids an extraneous lookup of the extent cache in ext4_ext_map_blocks() when the flag EXT4_GET_BLOCKS_PUNCH_OUT_EXT is absent. The existing logic was performing the lookup but not making use of the result. The patch simply reverses the order of evaluation in the condition. Since ext4_ext_in_cache() does not initialize newex on misses, bypassing its invocation does not introduce any new issue in this regard. Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Eric Gouriou <egouriou@google.com>	2011-07-17 23:27:43 -04:00
Al Viro	0916a5e45f	hppfs_lookup(): don't open-code lookup_one_len() ... and it's getting it wrong, too - missing ->d_revalidate() calls when it's dealing with filesystem (procfs) that has non-trivial ->d_revalidate()... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-17 23:22:48 -04:00
Al Viro	3cc0658e35	hppfs: fix dentry leak Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-17 23:22:17 -04:00
Al Viro	0577d1ba41	cramfs: get_cramfs_inode() returns ERR_PTR() on failure ... and we want to report these failures in ->lookup() anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-17 23:22:02 -04:00
Al Viro	642c937b4e	ufs should use d_splice_alias() it's NFS-exportable, so... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-17 23:21:35 -04:00
Allison Henderson	c6a0371cbe	ext4: remove unneeded parameter to ext4_ext_remove_space() This patch removes the extra parameter in ext4_ext_remove_space() which is no longer needed. Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-17 23:21:03 -04:00
Al Viro	a803b8067e	fix exofs ->get_parent() NULL is not a possible return value for that method, TYVM... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-17 23:20:29 -04:00
Allison Henderson	f7d0d3797f	ext4: punch hole optimizations: skip un-needed extent lookup This patch optimizes the punch hole operation by skipping the tree walking code that is used by truncate. Since punch hole is done through map blocks, the path to the extent is already known in this function, so we do not need to look it up again. Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-17 23:17:02 -04:00
Dan Ehrenberg	3eb0865843	ext4: ignore a stripe width of 1 If the stripe width was set to 1, then this patch will ignore that stripe width and ext4 will act as if the stripe width were 0 with respect to optimizing allocations. Signed-off-by: Dan Ehrenberg <dehrenberg@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-17 21:18:51 -04:00
Dan Ehrenberg	d7a1fee135	ext4: make the preallocation size be a multiple of stripe size Previously, if a stripe width was provided, then it would be used as the preallocation granularity, with no santiy checking and no way to override this. Now, mb_prealloc_size defaults to the smallest multiple of stripe size that is greater than or equal to the old default mb_prealloc_size, and this can be overridden with the sysfs interface. Signed-off-by: Dan Ehrenberg <dehrenberg@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-17 21:11:30 -04:00
Linus Torvalds	f560f6697f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: [CIFS] update cifs to version 1.74 [CIFS] update limit for snprintf in cifs_construct_tcon cifs: Fix signing failure when server mandates signing for NTLMSSP	2011-07-17 12:49:55 -07:00
Al Viro	1b71fe2efa	ceph analog of cifs build_path_from_dentry() race fix ... unfortunately, cifs bug got copied. Fix is essentially the same. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-16 23:43:58 -04:00
Al Viro	dc137bf553	cifs: build_path_from_dentry() race fix deal with d_move() races properly; rename_lock read-retry loop, rcu_read_lock() held while walking to root, d_lock held over subtraction from namelen and copying the component to stabilize ->d_name. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-16 23:37:20 -04:00
Bernd Schubert	265c6a0f92	ext4: fix compilation with -DDX_DEBUG Compilation of ext4/namei.c brought up an error and warning messages when compiled with -DDX_DEBUG Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-16 19:41:23 -04:00
J. Bruce Fields	f85ef69ce0	pnfs: simplify pnfs files module autoloading Embed the necessary alias into the module rather than waiting for someone to add it to /etc/modprobe.conf Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-15 19:21:58 -04:00
J. Bruce Fields	674e405b8b	nfs: document nfsv4 sillyrename issues Somebody working on this code asked what the deal was with NFSv4, since this comment notes that it's v2/v3's statelessness that requires sillyrename. Shouldn't hurt to document the answer. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-15 19:14:00 -04:00
Mi Jinlong	ab1350b2b3	nfsd41: Deny new lock before RECLAIM_COMPLETE done Before nfs41 client's RECLAIM_COMPLETE done, nfs server should deny any new locks or opens. rfc5661: " Whenever a client establishes a new client ID and before it does the first non-reclaim operation that obtains a lock, it MUST send a RECLAIM_COMPLETE with rca_one_fs set to FALSE, even if there are no locks to reclaim. If non-reclaim locking operations are done before the RECLAIM_COMPLETE, an NFS4ERR_GRACE error will be returned. " Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-07-15 19:00:40 -04:00
Miklos Szeredi	ee19cc406d	fs: locks: remove init_once From: Miklos Szeredi <mszeredi@suse.cz> Remove SLAB initialization entirely, as suggested by Bruce and Linus. Allocate with __GFP_ZERO instead and only initialize list heads. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-07-15 19:00:39 -04:00
Mi Jinlong	ae82a8d06f	nfsd41: check the size of request Check in SEQUENCE that the request doesn't exceed maxreq_sz for the given session. Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-07-15 19:00:00 -04:00
Mi Jinlong	1b74c25bc1	nfsd41: error out when client sets maxreq_sz or maxresp_sz too small According to RFC5661, 18.36.3, "if the client selects a value for ca_maxresponsesize such that a replier on a channel could never send a response,the server SHOULD return NFS4ERR_TOOSMALL in the CREATE_SESSION reply." So, error out when the client sets a maxreq_sz less than the minimum possible SEQUENCE request size, or sets a maxresp_sz less than the minimum possible SEQUENCE reply size. Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-07-15 18:58:51 -04:00
J. Bruce Fields	f197c27196	nfsd4: fix file leak on open_downgrade Stateid's hold a read reference for a read open, a write reference for a write open, and an additional one of each for each read+write open. The latter wasn't getting put on a downgrade, so something like: open RW open R downgrade to R was resulting in a file leak. Also fix an imbalance in an error path. Regression from `7d94784293` "nfsd4: fix downgrade/lock logic". Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-07-15 18:58:49 -04:00
J. Bruce Fields	499f3edc23	nfsd4: remember to put RW access on stateid destruction Without this, for example, open read open read+write close will result in a struct file leak. Regression from `7d94784293` "nfsd4: fix downgrade/lock logic". Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-07-15 18:58:49 -04:00
Bryan Schumaker	1745680454	NFSD: Added TEST_STATEID operation This operation is used by the client to check the validity of a list of stateids. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-07-15 18:58:48 -04:00
Bryan Schumaker	e1ca12dfb1	NFSD: added FREE_STATEID operation This operation is used by the client to tell the server to free a stateid. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-07-15 18:58:47 -04:00
NeilBrown	49b28684fd	nfsd: Remove deprecated nfsctl system call and related code. As promised in feature-removal-schedule.txt it is time to remove the nfsctl system call. Userspace has perferred to not use this call throughout 2.6 and it has been excluded in the default configuration since 2.6.36 (9 months ago). So this patch removes all the code that was being compiled out. There are still references to sys_nfsctl in various arch systemcall tables and related code. These should be cleaned out too, probably in the next merge window. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-07-15 18:58:42 -04:00
Benny Halevy	094b5d74f4	NFSD: allow OP_DESTROY_CLIENTID to be only op in COMPOUND DESTROY_CLIENTID MAY be preceded with a SEQUENCE operation as long as the client ID derived from the session ID of SEQUENCE is not the same as the client ID to be destroyed. If the client IDs are the same, then the server MUST return NFS4ERR_CLIENTID_BUSY. (that's not implemented yet) If DESTROY_CLIENTID is not prefixed by SEQUENCE, it MUST be the only operation in the COMPOUND request (otherwise, the server MUST return NFS4ERR_NOT_ONLY_OP). This fixes the error return; before, we returned NFS4ERR_OP_NOT_IN_SESSION; after this patch, we return NFS4ERR_NOTSUPP. Signed-off-by: Benny Halevy <benny@tonian.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-07-15 18:58:41 -04:00
David Teigland	23e8e1aaac	dlm: use workqueue for callbacks Instead of creating our own kthread (dlm_astd) to deliver callbacks for all lockspaces, use a per-lockspace workqueue to deliver the callbacks. This eliminates complications and slowdowns from many lockspaces sharing the same thread. Signed-off-by: David Teigland <teigland@redhat.com>	2011-07-15 12:30:43 -05:00
Linus Torvalds	da1b001a2a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: fix loop checks in d_materialise_unique() Fix ->d_lock locking order in unlazy_walk()	2011-07-15 09:55:39 -07:00
Trond Myklebust	94b134ac8e	NFS: Convert nfs4_set_ds_client to EXPORT_SYMBOL_GPL This is not part of an external ABI... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-15 09:12:24 -04:00
Trond Myklebust	9e00abc3c2	SUNRPC: sunrpc should not explicitly depend on NFS config options Change explicit references to CONFIG_NFS_V4_1 to implicit ones Get rid of the unnecessary defines in backchannel_rqst.c and bc_svc.c: the Makefile takes care of those dependency. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-15 09:12:23 -04:00
Trond Myklebust	1f9453578f	NFS: Clean up - simplify the switch to read/write-through-MDS Use nfs_pageio_reset_read_mds and nfs_pageio_reset_write_mds instead of completely reinitialising the struct nfs_pageio_descriptor. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-15 09:12:22 -04:00
Trond Myklebust	dce81290ee	NFS: Move the pnfs write code into pnfs.c ...and ensure that we recoalese to take into account differences in differences in block sizes when falling back to write through the MDS. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-15 09:12:22 -04:00
Trond Myklebust	493292ddc7	NFS: Move the pnfs read code into pnfs.c ...and ensure that we recoalese to take into account differences in block sizes when falling back to read through the MDS. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-15 09:12:21 -04:00
Trond Myklebust	d9156f9f36	NFS: Allow the nfs_pageio_descriptor to signal that a re-coalesce is needed If an attempt to do pNFS fails, and we have to fall back to writing through the MDS, then we may want to re-coalesce the requests that we already have since the block size for the MDS read/writes may be different to that of the DS read/writes. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-15 09:12:21 -04:00
Trond Myklebust	d097971d8a	NFS: Use the nfs_pageio_descriptor->pg_bsize in the read/write request Instead of looking up the rsize and wsize, the routines that generate the RPC requests should really be using the pg_bsize, since that is what we use when deciding whether or not to coalesce write requests... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-15 09:12:20 -04:00
Trond Myklebust	50828d7e67	NFS: Cache rpc_ops in struct nfs_pageio_descriptor Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-15 09:12:20 -04:00
Trond Myklebust	275acaafd4	NFS: Clean up: split out the RPC transmission from nfs_pagein_multi/one ...and do the same for nfs_flush_multi/one. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-15 09:12:17 -04:00
Peng Tao	3b6091846d	NFS: fix return value of nfs_pagein_one/nfs_flush_one Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-15 09:11:28 -04:00
Eric Sandeen	46fcb2ed29	GFS2: combine duplicated block freeing routines __gfs2_free_data and __gfs2_free_meta are almost identical, and can be trivially combined. [This is as per Eric's original patch minus gfs2_free_data() which had no callers left and plus the conversion of the bmap.c calls to these functions. All in all, a nice clean up] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-07-15 09:32:52 +01:00
Steven Whitehouse	9964afbb79	GFS2: Add S_NOSEC support This adds S_NOSEC support to GFS2. We set/reset the flag either when a user calls setattr or when we have just regained the glock from another node. The flag is only set if there are no xattrs on the inode and there is no suid bit set. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: Al Viro <viro@ZenIV.linux.org.uk>	2011-07-15 09:32:35 +01:00
Bob Peterson	7cf8dcd3b6	GFS2: Automatically adjust glock min hold time This patch is a performance improvement for GFS2 in a clustered environment. It makes the glock hold time self-adjusting. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-07-15 09:32:11 +01:00
Steven Whitehouse	17d539f049	GFS2: Cache dir hash table in a contiguous buffer This patch adds a cache for the hash table to the directory code in order to help simplify the way in which the hash table is accessed. This is intended to be a first step towards introducing some performance improvements in the directory code. There are two follow ups that I'm hoping to see fairly shortly. One is to simplify the hash table reading code now that we always read the complete hash table, whether we want one entry or all of them. The other is to introduce readahead on the heads of the hash chains which are referred to from the table. The hash table is a maximum of 128k in size, so it is not worth trying to read it in small chunks. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-07-15 09:31:48 +01:00
Al Viro	1836750115	fix loop checks in d_materialise_unique() Both __d_unalias() and __d_materialise_dentry() need loop prevention. Grab rename_lock in caller, check for loops there... As a side benefit, we have dentry_lock_for_move() called only under rename_lock, which seriously reduces deadlock potential of the execrable "locking order" used for ->d_lock. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-14 21:33:41 -04:00
Mark Fasheh	17e9f796bd	btrfs: Don't BUG_ON alloc_path errors in btrfs_balance() Dealing with this seems trivial - the only caller of btrfs_balance() is btrfs_ioctl() which passes the error code directly back to userspace. There also isn't much state to unwind (if I'm wrong about this point, we can always safely move the allocation to the top of btrfs_balance() anyway). Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2011-07-14 14:14:45 -07:00
Mark Fasheh	1748f843a0	btrfs: Don't BUG_ON alloc_path errors in btrfs_read_locked_inode btrfs_iget() also needed an update so that errors from btrfs_locked_inode() are caught and bubbled back up. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2011-07-14 14:14:45 -07:00
Mark Fasheh	0eb0e19cde	btrfs: Don't BUG_ON alloc_path errors in btrfs_truncate_inode_items I moved the path allocation up a few lines to the top of the function so that we couldn't get into the state where we've dropped delayed items and the extent cache but fail due to -ENOMEM. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2011-07-14 14:14:45 -07:00
Mark Fasheh	1e5063d093	btrfs: Don't BUG_ON alloc_path errors in replay_one_buffer() The two ->process_func call sites in tree-log.c which were ignoring a return code have also been updated to gracefully exit as well. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2011-07-14 14:14:44 -07:00
Mark Fasheh	d8926bb3ba	btrfs: don't BUG_ON btrfs_alloc_path() errors This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation failure. All the sites that are fixed in this patch were checked by me to be fairly trivial to fix because of at least one of two criteria: - Callers of the function catch errors from it already so bubbling the error up will be handled. - Callers of the function might BUG_ON any nonzero return code in which case there is no behavior changed (but we still got to remove a BUG_ON) The following functions were updated: btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group, btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written, btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink, insert_reserved_file_extent, and run_delalloc_nocow Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2011-07-14 14:14:44 -07:00
David Teigland	883ba74f43	dlm: remove deadlock debug print gfs2 recently began using this feature heavily, creating more debug output than we want to see. Signed-off-by: David Teigland <teigland@redhat.com>	2011-07-14 12:31:49 -05:00
Linus Torvalds	5dcd07b9f3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: GFS2: Resolve inode eviction and ail list interaction bug GFS2: Fix race during filesystem mount GFS2: force a log flush when invalidating the rindex glock	2011-07-14 10:20:42 -07:00
Steven Whitehouse	380f7c65a7	GFS2: Resolve inode eviction and ail list interaction bug This patch contains a few misc fixes which resolve a recently reported issue. This patch has been a real team effort and has received a lot of testing. The first issue is that the ail lock needs to be held over a few more operations. The lock thats added into gfs2_releasepage() may possibly be a candidate for replacing with RCU at some future point, but at this stage we've gone for the obvious fix. The second issue is that gfs2_write_inode() can end up calling a glock recursively when called from gfs2_evict_inode() via the syncing code, so it needs a guard added. The third issue is that we either need to not truncate the metadata pages of inodes which have zero link count, but which we cannot deallocate due to them still being in use by other nodes, or we need to ensure that those pages have all made it through the journal and ail lists first. This patch takes the former approach, but the latter has also been tested and there is nothing to choose between them performance-wise. So again, we could revise that decision in the future. Also, the inode eviction process is now better documented. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Tested-by: Bob Peterson <rpeterso@redhat.com> Tested-by: Abhijith Das <adas@redhat.com> Reported-by: Barry J. Marson <bmarson@redhat.com> Reported-by: David Teigland <teigland@redhat.com>	2011-07-14 08:59:44 +01:00
Linus Torvalds	201f92e2ca	Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: SUNRPC: Fix use of static variable in rpcb_getport_async NFSv4.1: update nfs4_fattr_bitmap_maxsz SUNRPC: Fix a race between work-queue and rpc_killall_tasks pnfs: write: Set mds_offset in the generic layer - it is needed by all LDs	2011-07-13 14:34:08 -07:00
Christoph Hellwig	d0f9e8fb4c	xfs: remove the dead XFS_DABUF_DEBUG code Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:50 +02:00
Christoph Hellwig	c84470dda7	xfs: remove leftovers of the old btree tracing code Remove various bits left over from the old kdb-only btree tracing code, but leave the actual trace point stubs in place to ease adding new event based btree tracing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:50 +02:00
Christoph Hellwig	ea15ab3cdd	xfs: remove the dead QUOTADEBUG code Remove the dead hash table test rid which has been rotting away under QUOTADEBUG, including some code that was compiled for normal debug builds, but not actually called without QUOTADEBUG, and enable a few cheap debug checks that were hidden under QUOTADEBUG for normal debug builds. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:50 +02:00
Christoph Hellwig	54244fec67	xfs: remove the unused xfs_buf_delwri_sort function Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:49 +02:00
Christoph Hellwig	cb669ca570	xfs: remove wrappers around b_iodone Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:49 +02:00
Christoph Hellwig	adadbeefb3	xfs: remove wrappers around b_fspriv Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:49 +02:00
Christoph Hellwig	bf9d9013a2	xfs: add a proper transaction pointer to struct xfs_buf Replace the typeless b_fspriv2 and the ugly macros around it with a properly typed transaction pointer. As a fallout the log buffer state debug checks are also removed. We could have kept them using casts, but as they do not have a real purpose we can as well just remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:49 +02:00
Christoph Hellwig	77936d0280	xfs: factor out xfs_da_grow_inode_int xfs_da_grow_inode and xfs_dir2_grow_inode are mostly duplicate code. Factor the meat of those two functions into a new common helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:49 +02:00
Christoph Hellwig	a230a1df40	xfs: factor out xfs_dir2_leaf_find_stale Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:48 +02:00
Christoph Hellwig	a00b7745c6	xfs: cleanup struct xfs_dir2_free Change the bests array to be a proper variable sized entry. This is done easily as no one relies on the size of the structure. Also change XFS_DIR2_MAX_FREE_BESTS to an inline function while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:48 +02:00
Christoph Hellwig	5792664070	xfs: reshuffle dir2 headers Replace the current mess of dir2 headers with just three that have a clear purpose: - xfs_dir2_format.h for all format definitions, including the inline helpers to access our variable size structures - xfs_dir2_priv.h for all prototypes that are internal to the dir2 code and not needed by anything outside of the directory code. For this purpose xfs_da_btree.c, and phase6.c in xfs_repair are considered part of the directory code. - xfs_dir2.h for the public interface to the directory code In addition to the reshuffle I have also update the comments to not only match the new file structure, but also to describe the directory format better. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-13 13:43:48 +02:00
Christoph Hellwig	2bcf6e970f	xfs: start periodic workers later Start the periodic sync workers only after we have finished xfs_mountfs and thus fully set up the filesystem structures. Without this we can call into xfs_qm_sync before the quotainfo strucute is set up if the mount takes unusually long, and probably hit other incomplete states as well. Also clean up the xfs_fs_fill_super error path by using consistent label names, and removing an impossible to reach case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-07-13 13:43:48 +02:00
Al Viro	94c0d4ecbe	Fix ->d_lock locking order in unlazy_walk() Make sure that child is still a child of parent before nested locking of child->d_lock in unlazy_walk(); otherwise we are risking a violation of locking order and deadlocks. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-12 21:40:23 -04:00
David Teigland	3881ac04eb	dlm: improve rsb searches By pre-allocating rsb structs before searching the hash table, they can be inserted immediately. This avoids always having to repeat the search when adding the struct to hash list. This also adds space to the rsb struct for a max resource name, so an rsb allocation can be used by any request. The constant size also allows us to finally use a slab for the rsb structs. Signed-off-by: David Teigland <teigland@redhat.com>	2011-07-12 16:02:09 -05:00
Steve French	c2ec9471b5	[CIFS] update cifs to version 1.74 Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-12 19:15:02 +00:00
Steve French	ea1be1a3c3	[CIFS] update limit for snprintf in cifs_construct_tcon In `34c87901e1` "Shrink stack space usage in cifs_construct_tcon" we change the size of the username name buffer from MAX_USERNAME_SIZE (256) to 28. This call to snprintf() needs to be updated as well. Reported by Dan Carpenter. Reviewed-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-12 19:14:24 +00:00
Shirish Pargaonkar	62411ab2fe	cifs: Fix signing failure when server mandates signing for NTLMSSP When using NTLMSSP authentication mechanism, if server mandates signing, keep the flags in type 3 messages of the NTLMSSP exchange same as in type 1 messages (i.e. keep the indicated capabilities same). Some of the servers such as Samba, expect the flags such as Negotiate_Key_Exchange in type 3 message of NTLMSSP exchange as well. Some servers like Windows do not. https://bugzilla.samba.org/show_bug.cgi?id=8212 Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-12 19:14:23 +00:00
Trond Myklebust	6e4efd5685	NFS: Clean up nfs_read_rpcsetup and nfs_write_rpcsetup Split them up into two parts: one which sets up the struct nfs_read/write_data, the other which sets up the actual RPC call or pNFS call. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:42:02 -04:00
Trond Myklebust	87ed5eb44a	NFS: Don't use DATA_SYNC writes If we're writing back data, and the FLUSH_STABLE flag is set, then we always want to use NFS_FILE_SYNC, since we're always in a situation where we're doing page reclaim, and so we want to free up the page as quickly as possible. If we're in the FLUSH_COND_STABLE case, then we either want to use another unstable write (if we have to do a commit anyway) or again, we want to use NFS_FILE_SYNC because we know that we have no more pages to write out. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:42:01 -04:00
Andy Adamson	c47abcf8ff	NFSv4.1: do not use deviceids after MDS clientid invalidation Mark all deviceids established under an expired MDS clientid as invalid. Stop all new i/o through DS and send through the MDS. Don't use any new LAYOUTGETs that use the invalid deviceid. Purge all layouts established under the expired MDS clientid. Remove the MDS clientid deviceid and data servers reference Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:40:29 -04:00
Trond Myklebust	a56aaa02b1	NFSv4.1: Clean up layoutreturn Since we take a reference to it, we really ought to pass the a pointer to the layout header in the arguments instead of assuming that NFS_I(inode)->layout will forever point to the correct object. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:40:29 -04:00
Andy Adamson	7c24d9489f	NFSv4.1: File layout only supports whole file layouts Ask for whole file layouts. Until support for layout segments is fully supported in the file layout code, discard non-whole file layouts. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:40:28 -04:00
Trond Myklebust	47cb498e93	NFSv4.1: Clean ups for the device id cache The fact that the global device id cache holds a reference to the nfs4_deviceid_node until it is invisible to rcu lookups implies that we can always assume that the reference count is non-zero in _find_get_deviceid. Also clean up nfs4_put_deviceid_node and the removal of the device id from the cache. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:40:28 -04:00
Trond Myklebust	e885de1a5b	NFSv4.1: Fall back to ordinary i/o through the mds if we have no layout segment Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:40:28 -04:00
Trond Myklebust	d8007d4dd6	NFSv4.1: Add an initialisation callback for pNFS Ensure that we always get a layout before setting up the i/o request. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:40:28 -04:00
Trond Myklebust	1751c3638f	NFS: Cleanup of the nfs_pageio code in preparation for a pnfs bugfix We need to ensure that the layouts are set up before we can decide to coalesce requests. To do so, we want to further split up the struct nfs_pageio_descriptor operations into an initialisation callback, a coalescing test callback, and a 'do i/o' callback. This patch cleans up the existing callback methods before adding the 'initialisation' callback. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:40:28 -04:00
Bryan Schumaker	f062eb6ced	NFS: test and free stateids during recovery When recovering open files and locks, the stateid should be tested against the server and freed if it is invalid. This patch adds new recovery functions for NFS v4.1. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:40:28 -04:00
Bryan Schumaker	9aeda35fd6	NFS: added FREE_STATEID call FREE_STATEID is used to tell the server that we want to free a stateid that no longer has any locks associated with it. This allows the client to reclaim locks without encountering edge conditions documented in section 8.4.3 of RFC 5661. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:40:28 -04:00
Bryan Schumaker	7d9747947a	NFS: Added TEST_STATEID call This patch adds in the xdr for doing a TEST_STATEID call with a single stateid. RFC 5661 allows multiple stateids to be tested in a single call, but only testing one keeps things simpler for now. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:40:27 -04:00
Bryan Schumaker	fca78d6d2c	NFS: Add SECINFO_NO_NAME procedure If the client is using NFS v4.1, then we can use SECINFO_NO_NAME to find the secflavor for the initial mount. If the server doesn't support SECINFO_NO_NAME then I fall back on the "guess and check" method used for v4.0 mounts. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:40:27 -04:00
Weston Andros Adamson	6382a44138	NFS: move pnfs layouts to nfs_server structure Layouts should be tracked per nfs_server (aka superblock) instead of per struct nfs_client, which may have multiple FSIDs associated with it. Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:40:27 -04:00
Weston Andros Adamson	35dbbc99e9	NFS: fix comment We support IPv4 and IPv6 now. Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:40:27 -04:00
Weston Andros Adamson	78fe0f41d9	NFS: use scope from exchange_id to skip reclaim can be skipped if the "eir_server_scope" from the exchange_id proc differs from previous calls. Also, in the future server_scope will be useful for determining whether client trunking is available Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:40:27 -04:00
Weston Andros Adamson	7e574f0d39	NFS: pnfs: loop over multipath addrs on connect Don't just use the first addr in the multipath list - instead, loop over addresses when calling nfs4_set_ds_client() (which calls connect) until it is successful. Although this is not real multipath support, it's a quick fix to handle when an MDS sends a list of addresses for a DS and some of the addr families are unsupported or misconfigured (like no routable ipv6 addr assigned). This will attempt all paths to the DS before giving up, instead of immediately falling back to the MDS. As before, an error encountered after a successful connect() will cause all i/o to fall back to the MDS. Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:40:27 -04:00
Weston Andros Adamson	14f9a6076f	NFS: Parse and store all multipath DS addresses This parses and stores all addresses associated with each data server, laying the groundwork for supporting multipath to data servers. - Skips over addresses that cannot be parsed (ie IPv6 addrs if v6 is not enabled). Only fails if none of the addresses are recognizable - Currently only uses the first address that parsed cleanly - Tested against pynfs server (modified to support multipath) Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:40:27 -04:00
Weston Andros Adamson	c9895cb69b	NFS: pnfs IPv6 support Handle ipv6 remote addresses from GETDEVICEINFO - supports netid "tcp" for ipv4 and "tcp6" for ipv6 as rfc 5665 specifies - added ds_remotestr to avoid having to handle different AFs in every dprintk - tested against pynfs 4.1 server, submitting ipv6 support patch to pynfs - tested with IPv6 disabled, it compiles cleanly and relies on rpc_pton to refuse to accept IPv6 addresses Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:40:26 -04:00
Vasily Averin	82c2c8b861	lockd: properly convert be32 values in debug messages lockd: server returns status 50331648 it's quite hard to understand that number in this message is 3 in big endian Signed-off-by: Vasily Averin <vvs@sw.ru> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-12 13:40:26 -04:00
Steven Whitehouse	3942ae5319	GFS2: Fix race during filesystem mount There is a potential race during filesystem mounting which has recently been reported. It occurs when the userland gfs_controld is able to process requests fast enough that it tries to use the sysfs interface before the lock module is properly initialised. This is a pretty unusual case as normally the lock module initialisation is very quick compared with gfs_controld. This patch adds an interruptible completion which is used to ensure that userland will wait for the initialisation of the lock module to complete. There are other potential solutions to this problem, but this is the quickest at this stage and has been tested both with and without mount.gfs2 present in the system. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Reported-by: David Booher <dbooher@adams.net>	2011-07-12 09:15:46 +01:00
Benjamin Marzinski	1ce533686c	GFS2: force a log flush when invalidating the rindex glock Right now, there is nothing that forces the log to get flushed when a node drops its rindex glock so that another node can grow the filesystem. If the log doesn't get flushed, GFS2 can corrupt the sd_log_le_rg list in the following way. A node puts an rgd on the list in rg_lo_add(), and then the rindex glock is dropped so the other node can grow the filesystem. When the node reacquires the rindex glock, that rgd gets deleted in clear_rgrpdi() before ever being removed from the list by gfs2_log_flush(). This code simply forces a log flush when the rindex glock is invalidated, solving the problem. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-07-12 09:15:24 +01:00
Justin TerAvest	4aede84b33	fixlet: Remove fs_excl from struct task. fs_excl is a poor man's priority inheritance for filesystems to hint to the block layer that an operation is important. It was never clearly specified, not widely adopted, and will not prevent starvation in many cases (like across cgroups). fs_excl was introduced with the time sliced CFQ IO scheduler, to indicate when a process held FS exclusive resources and thus needed a boost. It doesn't cover all file systems, and it was never fully complete. Lets kill it. Signed-off-by: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-12 08:35:10 +02:00
Andy Adamson	e5012d1f38	NFSv4.1: update nfs4_fattr_bitmap_maxsz Attribute IDs assigned in RFC 5661 now require three bitmaps. Fixes hitting a BUG_ON in xdr_shrink_bufhead when getting ACLs. Signed-off-by: Andy Adamson <andros@netapp.com> Cc:stable@kernel.org [2.6.39] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-07-11 19:14:38 -04:00
Lukas Czerner	afb86178cb	ext4: remove unnecessary comments in ext4_orphan_add() The comment from Al Viro about possible race in the ext4_orphan_add() is not justified. There is no race possible as we always have either i_mutex locked, or the inode can not be referenced from outside hence the J_ASSERS should not be hit from the reason described in comment. This commit replaces it with notion that we are holding i_mutex so it should not be possible for i_nlink to be changed while waiting for s_orphan_lock. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-11 18:47:04 -04:00
Tao Ma	caaf7a29d3	ext4: Fix a double free of sbi->s_group_info in ext4_mb_init_backend If we meet with an error in ext4_mb_add_groupinfo, we kfree sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)], but fail to reset it to NULL. So the caller ext4_mb_init_backend will try to kfree it again and causes a double free. So fix it by resetting it to NULL. Some typo in comments of mballoc.c are also changed. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-11 18:42:42 -04:00
Tao Ma	823ba01fc0	ext4: fix a race which could leak memory in ext4_groupinfo_create_slab() In ext4_groupinfo_create_slab, we create ext4_groupinfo_caches within ext4_grpinfo_slab_create_mutex, but set it outside the lock, and there does exist some case that we may create it twice and causes a memory leak. So set it before we call mutex_unlock. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-11 18:26:01 -04:00
Robin Dong	598dbdf243	ext4: avoid unneeded ext4_ext_next_leaf_block() while inserting extents Optimize ext4_ext_insert_extent() by avoiding ext4_ext_next_leaf_block() when the result is not used/needed. Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-11 18:24:01 -04:00
Linus Torvalds	71a1b44b03	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: drop spinlock before calling cifs_put_tlink cifs: fix expand_dfs_referral cifs: move bdi_setup_and_register outside of CONFIG_CIFS_DFS_UPCALL cifs: factor smb_vol allocation out of cifs_setup_volume_info cifs: have cifs_cleanup_volume_info not take a double pointer cifs: fix build_unc_path_to_root to account for a prefixpath cifs: remove bogus call to cifs_cleanup_volume_info	2011-07-11 12:48:24 -07:00
Jeff Layton	f484b5d001	cifs: drop spinlock before calling cifs_put_tlink ...as that function can sleep. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-11 18:40:52 +00:00
Robin Dong	ffb505ff0f	ext4: remove redundant goto in ext4_ext_insert_extent() If eh->eh_entries is smaller than eh->eh_max, the routine will go to the "repeat" and then go to "has_space" directlly , since argument "depth" and "eh" are not even changed. Therefore, goto "has_space" directly and remove redundant "repeat" tag. Signed-off-by: Robin Dong <sanbai@taobao.com>	2011-07-11 11:43:59 -04:00
Alex Elder	b2ce397400	Revert "xfs: fix filesystsem freeze race in xfs_trans_alloc" This reverts commit `7a249cf83d`. That commit created a situation that could lead to a filesystem hang. As Dave Chinner pointed out, xfs_trans_alloc() could hold a reference to m_active_trans (i.e., keep it non-zero) and then wait for SB_FREEZE_TRANS to complete. Meanwhile a filesystem freeze request could set SB_FREEZE_TRANS and then wait for m_active_trans to drop to zero. Nobody benefits from this sequence of events... Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-11 10:21:03 -05:00
Josef Bacik	df98b6e2c5	Btrfs: fix how we merge extent states and deal with cached states First, we can sometimes free the state we're merging, which means anybody who calls merge_state() may have the state it passed in free'ed. This is problematic because we could end up caching the state, which makes caching useless as the state will no longer be part of the tree. So instead of free'ing the state we passed into merge_state(), set it's end to the other->end and free the other state. This way we are sure to cache the correct state. Also because we can merge states together, instead of only using the cache'd state if it's start == the start we are looking for, go ahead and use it if the start we are looking for is within the range of the cached state. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-07-11 10:00:48 -04:00
Josef Bacik	2f356126c5	Btrfs: use the normal checksumming infrastructure for free space cache We used to store the checksums of the space cache directly in the space cache, however that doesn't work out too well if we have more space than we can fit the checksums into the first page. So instead use the normal checksumming infrastructure. There were problems with doing this originally but those problems don't exist now so this works out fine. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-07-11 09:58:49 -04:00
Josef Bacik	fdb5effd5c	Btrfs: serialize flushers in reserve_metadata_bytes We keep having problems with early enospc, and that's because our method of making space is inherently racy. The problem is we can have one guy trying to make space for himself, and in the meantime people come in and steal his reservation. In order to stop this we make a waitqueue and put anybody who comes into reserve_metadata_bytes on that waitqueue if somebody is trying to make more space. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-07-11 09:58:48 -04:00
Josef Bacik	b5009945be	Btrfs: do transaction space reservation before joining the transaction We have to do weird things when handling enospc in the transaction joining code. Because we've already joined the transaction we cannot commit the transaction within the reservation code since it will deadlock, so we have to return EAGAIN and then make sure we don't retry too many times. Instead of doing this, just do the reservation the normal way before we join the transaction, that way we can do whatever we want to try and reclaim space, and then if it fails we know for sure we are out of space and we can return ENOSPC. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-07-11 09:58:47 -04:00
Josef Bacik	fa09200b83	Btrfs: try to only do one btrfs_search_slot in do_setxattr I've been watching how many btrfs_search_slot()'s we do and I noticed that when we create a file with selinux enabled we were doing 2 each time we initialize the security context. That's because we lookup the xattr first so we can delete it if we're setting a new value to an existing xattr. But in the create case we don't have any xattrs, so it is completely useless to have the extra lookup. So re-arrange things so that we only lookup first if we specifically have XATTR_REPLACE. That way in the basic case we only do 1 search, and in the more complicated case we do the normal 2 lookups. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-07-11 09:58:45 -04:00
David Teigland	3d6aa675ff	dlm: keep lkbs in idr This is simpler and quicker than the hash table, and avoids needing to search the hash list for every new lkid to check if it's used. Signed-off-by: David Teigland <teigland@redhat.com>	2011-07-11 08:43:45 -05:00
David Teigland	a22ca48068	dlm: fix kmalloc args The gfp and size args were switched. Signed-off-by: David Teigland <teigland@redhat.com>	2011-07-11 08:40:53 -05:00
Jesper Juhl	5d70828a77	dlm: don't do pointless NULL check, use kzalloc and fix order of arguments In fs/dlm/lock.c in the dlm_scan_waiters() function there are 3 small issues: 1) There's no need to test the return value of the allocation and do a memset if is succeedes. Just use kzalloc() to obtain zeroed memory. 2) Since kfree() handles NULL pointers gracefully, the test of 'warned' against NULL before the kfree() after the loop is completely pointless. Remove it. 3) The arguments to kmalloc() (now kzalloc()) were swapped. Thanks to Dr. David Alan Gilbert for pointing this out. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: David Teigland <teigland@redhat.com>	2011-07-11 08:39:42 -05:00
Jiri Kosina	b7e9c223be	Merge branch 'master' into for-next Sync with Linus' tree to be able to apply pending patches that are based on newer code already present upstream.	2011-07-11 14:15:55 +02:00
Tao Ma	22612283f7	ext4: Change the wrong param comment for ext4_trim_all_free at ext4_trim_all_free() comment, there is no longer an @e4b parameter, instead it is @group. Reported-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-11 00:04:34 -04:00
Tao Ma	3d56b8d2c7	ext4: Speed up FITRIM by recording flags in ext4_group_info In ext4, when FITRIM is called every time, we iterate all the groups and do trim one by one. It is a bit time wasting if the group has been trimmed and there is no change since the last trim. So this patch adds a new flag in ext4_group_info->bb_state to indicate that the group has been trimmed, and it will be cleared if some blocks is freed(in release_blocks_on_commit). Another trim_minlen is added in ext4_sb_info to record the last minlen we use to trim the volume, so that if the caller provide a small one, we will go on the trim regardless of the bb_state. A simple test with my intel x25m ssd: df -h shows: /dev/sdb1 40G 21G 17G 56% /mnt/ext4 Block size: 4096 run the FITRIM with the following parameter: range.start = 0; range.len = UINT64_MAX; range.minlen = 1048576; without the patch: [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a real 0m5.505s user 0m0.000s sys 0m1.224s [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a real 0m5.359s user 0m0.000s sys 0m1.178s [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a real 0m5.228s user 0m0.000s sys 0m1.151s with the patch: [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a real 0m5.625s user 0m0.000s sys 0m1.269s [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a real 0m0.002s user 0m0.000s sys 0m0.001s [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a real 0m0.002s user 0m0.000s sys 0m0.001s A big improvement for the 2nd and 3rd run. Even after I delete some big image files, it is still much faster than iterating the whole disk. [root@boyu-tm test]# time ./ftrim /mnt/ext4/a real 0m1.217s user 0m0.000s sys 0m0.196s Cc: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-11 00:03:38 -04:00
Tao Ma	b3d4c2b10b	ext4: Add new ext4 trim tracepoints Add ext4_trim_extent and ext4_trim_all_free. Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-11 00:01:52 -04:00
Tao Ma	169ddc3ec8	ext4: speed up group trim with the right free block count When we trim some free blocks in a group of ext4, we need to calculate the free blocks properly and check whether there are enough freed blocks left for us to trim. Current solution will only calculate free spaces if they are large for a trim which isn't appropriate. Let us see a small example: a group has 1.5M free which are 300k, 300k, 300k, 300k, 300k. And minblocks is 1M. With current solution, we have to iterate the whole group since these 300k will never be subtracted from 1.5M. But actually we should exit after we find the first 2 free spaces since the left 3 chunks only sum up to 900K if we subtract the first 600K although they can't be trimed. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-11 00:00:07 -04:00
Tao Ma	22f1045743	ext4: fix trim length underflow with small trim length In `0f0a25b`, we adjust 'len' with s_first_data_block - start, but it could underflow in case blocksize=1K, fstrim_range.len=512 and fstrim_range.start = 0. In this case, when we run the code: len -= first_data_blk - start; len will be underflow to -1ULL. In the end, although we are safe that last_group check later will limit the trim to the whole volume, but that isn't what the user really want. So this patch fix it. It also adds the check for 'start' like ext3 so that we can break immediately if the start is invalid. Cc: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-10 23:52:37 -04:00
Theodore Ts'o	12706394bc	ext4: add tracepoint for ext4_journal_start This will help debug who is responsible for starting a jbd2 transaction. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-10 22:37:50 -04:00
Theodore Ts'o	4862fd6047	jbd2: remove jbd2_dev_to_name() from jbd2 tracepoints Using function calls in TP_printk causes perf heartburn, so print the MAJOR/MINOR device numbers instead. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-10 22:05:08 -04:00
Jiaying Zhang	575a1d4bdf	ext4: free allocated and pre-allocated blocks when check_eofblocks_fl fails Upon corrupted inode or disk failures, we may fail after we already allocate some blocks from the inode or take some blocks from the inode's preallocation list, but before we successfully insert the corresponding extent to the extent tree. In this case, we should free any allocated blocks and discard the inode's preallocated blocks because the entries in the inode's preallocation list may be in an inconsistent state. Signed-off-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-07-10 20:07:25 -04:00
Maxim Patlasov	7132de744b	ext4: fix i_blocks/quota accounting when extent insertion fails The current implementation of ext4_free_blocks() always calls dquot_free_block This looks quite sensible in the most cases: blocks to be freed are associated with inode and were accounted in quota and i_blocks some time ago. However, there is a case when blocks to free were not accounted by the time calling ext4_free_blocks() yet: 1. delalloc is on, write_begin pre-allocated some space in quota 2. write-back happens, ext4 allocates some blocks in ext4_ext_map_blocks() 3. then ext4_ext_map_blocks() gets an error (e.g. ENOSPC) from ext4_ext_insert_extent() and calls ext4_free_blocks(). In this scenario, ext4_free_blocks() calls dquot_free_block() who, in turn, decrements i_blocks for blocks which were not accounted yet (due to delalloc) After clean umount, e2fsck reports something like: > Inode 21, i_blocks is 5080, should be 5128. Fix<y>? because i_blocks was erroneously decremented as explained above. The patch fixes the problem by passing the new flag EXT4_FREE_BLOCKS_NO_QUOT_UPDATE to ext4_free_blocks(), to request that the dquot_free_block() call be skipped. Signed-off-by: Maxim Patlasov <maxim.patlasov@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-07-10 19:37:48 -04:00
Wu Fengguang	1a12d8bd7b	writeback: scale IO chunk size up to half device bandwidth Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a concern of not holding I_SYNC for too long. (At least, that was the comment previously.) This doesn't make sense now because the only time we wait for I_SYNC is if we are calling sync or fsync, and in that case we need to write out all of the data anyway. Previously there may have been other code paths that waited on I_SYNC, but not any more. -- Theodore Ts'o So remove the MAX_WRITEBACK_PAGES constraint. The writeback pages will adapt to as large as the storage device can write within 500ms. XFS is observed to do IO completions in a batch, and the batch size is equal to the write chunk size. To avoid dirty pages to suddenly drop out of balance_dirty_pages()'s dirty control scope and create large fluctuations, the chunk size is also limited to half the control scope. The balance_dirty_pages() control scrope is [(background_thresh + dirty_thresh) / 2, dirty_thresh] which is by default [15%, 20%] of global dirty pages, whose range size is dirty_thresh / DIRTY_FULL_SCOPE. The adpative write chunk size will be rounded to the nearest 4MB boundary. http://bugzilla.kernel.org/show_bug.cgi?id=13930 CC: Theodore Ts'o <tytso@mit.edu> CC: Dave Chinner <david@fromorbit.com> CC: Chris Mason <chris.mason@oracle.com> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-07-09 22:09:03 -07:00
Wu Fengguang	c42843f2f0	writeback: introduce smoothed global dirty limit The start of a heavy weight application (ie. KVM) may instantly knock down determine_dirtyable_memory() if the swap is not enabled or full. global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi dirty thresholds that are _much_ lower than the global/bdi dirty pages. balance_dirty_pages() will then heavily throttle all dirtiers including the light ones, until the dirty pages drop below the new dirty thresholds. During this _deep_ dirty-exceeded state, the system may appear rather unresponsive to the users. About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty threshold to heavy dirtiers than light ones, and the dirty pages will be throttled around the heavy dirtiers' dirty threshold and reasonably below the light dirtiers' dirty threshold. In this state, only the heavy dirtiers will be throttled and the dirty pages are carefully controlled to not exceed the light dirtiers' dirty threshold. However if the threshold itself suddenly drops below the number of dirty pages, the light dirtiers will get heavily throttled. So introduce global_dirty_limit for tracking the global dirty threshold with policies - follow downwards slowly - follow up in one shot global_dirty_limit can effectively mask out the impact of sudden drop of dirtyable memory. It will be used in the next patch for two new type of dirty limits. Note that the new dirty limits are not going to avoid throttling the light dirtiers, but could limit their sleep time to 200ms. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-07-09 22:09:02 -07:00
Wu Fengguang	e98be2d599	writeback: bdi write bandwidth estimation The estimation value will start from 100MB/s and adapt to the real bandwidth in seconds. It tries to update the bandwidth only when disk is fully utilized. Any inactive period of more than one second will be skipped. The estimated bandwidth will be reflecting how fast the device can writeout when _fully utilized_, and won't drop to 0 when it goes idle. The value will remain constant at disk idle time. At busy write time, if not considering fluctuations, it will also remain high unless be knocked down by possible concurrent reads that compete for the disk time and bandwidth with async writes. The estimation is not done purely in the flusher because there is no guarantee for write_cache_pages() to return timely to update bandwidth. The bdi->avg_write_bandwidth smoothing is very effective for filtering out sudden spikes, however may be a little biased in long term. The overheads are low because the bdi bandwidth update only occurs at 200ms intervals. The 200ms update interval is suitable, because it's not possible to get the real bandwidth for the instance at all, due to large fluctuations. The NFS commits can be as large as seconds worth of data. One XFS completion may be as large as half second worth of data if we are going to increase the write chunk to half second worth of data. In ext4, fluctuations with time period of around 5 seconds is observed. And there is another pattern of irregular periods of up to 20 seconds on SSD tests. That's why we are not only doing the estimation at 200ms intervals, but also averaging them over a period of 3 seconds and then go further to do another level of smoothing in avg_write_bandwidth. CC: Li Shaohua <shaohua.li@intel.com> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-07-09 22:09:01 -07:00
Wu Fengguang	d46db3d582	writeback: make writeback_control.nr_to_write straight Pass struct wb_writeback_work all the way down to writeback_sb_inodes(), and initialize the struct writeback_control there. struct writeback_control is basically designed to control writeback of a single file, but we keep abuse it for writing multiple files in writeback_sb_inodes() and its callers. It immediately clean things up, e.g. suddenly wbc.nr_to_write vs work->nr_pages starts to make sense, and instead of saving and restoring pages_skipped in writeback_sb_inodes it can always start with a clean zero value. It also makes a neat IO pattern change: large dirty files are now written in the full 4MB writeback chunk size, rather than whatever remained quota in wbc->nr_to_write. Acked-by: Jan Kara <jack@suse.cz> Proposed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-07-09 22:09:01 -07:00
Jeff Layton	b9bce2e9f9	cifs: fix expand_dfs_referral Regression introduced in commit `724d9f1cfb`. Prior to that, expand_dfs_referral would regenerate the mount data string and then call cifs_parse_mount_options to re-parse it (klunky, but it worked). The above commit moved cifs_parse_mount_options out of cifs_mount, so the re-parsing of the new mount options no longer occurred. Fix it by making expand_dfs_referral re-parse the mount options. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-09 21:25:57 +00:00
Jeff Layton	20547490c1	cifs: move bdi_setup_and_register outside of CONFIG_CIFS_DFS_UPCALL This needs to be done regardless of whether that KConfig option is set or not. Reported-by: Sven-Haegar Koch <haegar@sdinet.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-09 20:29:51 +00:00
Linus Torvalds	1acc9309eb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: btrfs: fix oops when doing space balance Btrfs: don't panic if we get an error while balancing V2 btrfs: add missing options displayed in mount output	2011-07-08 23:25:45 -07:00
Chandra Seetharaman	81463b1ca8	xfs: remove variables that serve no purpose in xfs_alloc_ag_vextent_exact() Remove two variables that serve no purpose in xfs_alloc_ag_vextent_exact(). Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-08 11:32:51 -05:00
Eric Sandeen	c0e090ced2	xfs: consolidate & clarify mount sanity checks Pavol pointed out that there is one silent error case in the mount path, and that others are rather uninformative. I've taken Pavol's suggested patch and extended it a bit to also: * fix a message which says "turned off" but actually errors out * consolidate the vaguely differentiated "SB sanity check [12]" messages, and hexdump the superblock for analysis Original-patch-by: Pavol Gono <Pavol.Gono@siemens.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-08 11:32:51 -05:00
Linus Torvalds	54af2bd25c	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: unpin stale inodes directly in IOP_COMMITTED	2011-07-08 09:00:51 -07:00
Christoph Hellwig	e163cbde98	xfs: avoid a few disk cache flushes There is no need for a pre-flush when doing writing the second part of a split log buffer, and if we are using an external log there is no need to do a full cache flush of the log device at all given that all writes to it use the FUA flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:36:36 +02:00
Christoph Hellwig	1d5ae5dfee	xfs: cleanup I/O-related buffer flags Remove the unused and misnamed _XBF_RUN_QUEUES flag, rename XBF_LOG_BUFFER to the more fitting XBF_SYNCIO, and split XBF_ORDERED into XBF_FUA and XBF_FLUSH to allow more fine grained control over the bio flags. Also cleanup processing of the flags in _xfs_buf_ioapply to make more sense, and renumber the sparse flag number space to group flags by purpose. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:36:32 +02:00
Christoph Hellwig	c8da0faf6b	xfs: return the buffer locked from xfs_buf_get_uncached All other xfs_buf_get/read-like helpers return the buffer locked, make sure xfs_buf_get_uncached isn't different for no reason. Half of the callers already lock it directly after, and the others probably should also keep it locked if only for consistency and beeing able to use xfs_buf_rele, but I'll leave that for later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:36:25 +02:00
Christoph Hellwig	0c842ad46a	xfs: clean up buffer locking helpers Rename xfs_buf_cond_lock and reverse it's return value to fit most other trylock operations in the Kernel and XFS (with the exception of down_trylock, after which xfs_buf_cond_lock was modelled), and replace xfs_buf_lock_val with an xfs_buf_islocked for use in asserts, or and opencoded variant in tracing. remove the XFS_BUF_* wrappers for all the locking helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:36:19 +02:00
Christoph Hellwig	bbb4197c73	xfs: remove the unused xfs_bufhash structure Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:36:10 +02:00
Christoph Hellwig	69ef921b55	xfs: byteswap constants instead of variables Micro-optimize various comparisms by always byteswapping the constant instead of the variable, which allows to do the swap at compile instead of runtime. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:36:05 +02:00
Christoph Hellwig	218106a110	xfs: use generic get_unaligned_beXX helpers Switch the shortform directory code over to use the generic get_unaligned_beXX helpers instead of reinventing them. As a result kill off xfs_arch.h and move the setting of XFS_NATIVE_HOST into xfs_linux.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:58 +02:00
Christoph Hellwig	2282396d81	xfs: cleanup struct xfs_dir2_leaf Simplify the confusing xfs_dir2_leaf structure. It is supposed to describe an XFS dir2 leaf format btree block, but due to the variable sized nature of almost all elements in it it can't actuall do anything close to that job. Remove the members that are after the first variable sized array, given that they could only be used for sizeof expressions that can as well just use the underlying types directly, and make the ents array a real C99 variable sized array. Also factor out the xfs_dir2_leaf_size, to make the sizing of a leaf entry which already was convoluted somewhat readable after using the longer type names in the sizeof expressions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:53 +02:00
Christoph Hellwig	3ed8638f88	xfs: cleanup the definition of struct xfs_dir2_data_entry Remove the tag member which is at a variable offset after the actual name, and make name a real variable sized C99 array instead of the incorrect one-sized array which confuses (not only) gcc. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:50 +02:00
Christoph Hellwig	0ba9cd84ef	xfs: kill struct xfs_dir2_data Remove the confusing xfs_dir2_data structure. It is supposed to describe an XFS dir2 data btree block, but due to the variable sized nature of almost all elements in it it can't actuall do anything close to that job. In addition to accessing the fixed offset header structure it was only used to get a pointer to the first dir or unused entry after it, which can be trivially replaced by pointer arithmetics on the header pointer. For most users that is actually more natural anyway, as they don't use a typed pointer but rather a character pointer for further arithmetics. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:42 +02:00
Christoph Hellwig	c2066e2662	xfs: avoid usage of struct xfs_dir2_data In most places we can simply pass around and use the struct xfs_dir2_data_hdr, which is the first and most important member of struct xfs_dir2_data instead of the full structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:38 +02:00
Christoph Hellwig	a64b041797	xfs: kill struct xfs_dir2_block Remove the confusing xfs_dir2_block structure. It is supposed to describe an XFS dir2 block format btree block, but due to the variable sized nature of almost all elements in it it can't actuall do anything close to that job. In addition to accessing the fixed offset header structure it was only used to get a pointer to the first dir or unused entry after it, which can be trivially replaced by pointer arithmetics on the header pointer. For most users that is actually more natural anyway, as they don't use a typed pointer but rather a character pointer for further arithmetics. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:32 +02:00
Christoph Hellwig	4f6ae1a49e	xfs: avoid usage of struct xfs_dir2_block In most places we can simply pass around and use the struct xfs_dir2_data_hdr, which is the first and most important member of struct xfs_dir2_block instead of the full structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:27 +02:00
Christoph Hellwig	78f70cd7b7	xfs: cleanup the definition of struct xfs_dir2_sf_entry Remove the inumber member which is at a variable offset after the actual name, and make name a real variable sized C99 array instead of the incorrect one-sized array which confuses (not only) gcc. Based on this clean up the helpers to calculate the entry size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:19 +02:00
Christoph Hellwig	ac8ba50f6b	xfs: kill struct xfs_dir2_sf The list field of it is never cactually used, so all uses can simply be replaced with the xfs_dir2_sf_hdr_t type that it has as first member. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:13 +02:00
Christoph Hellwig	8bc3878758	xfs: cleanup shortform directory inode number handling Refactor the shortform directory helpers that deal with the 32-bit vs 64-bit wide inode numbers into more sensible helpers, and kill the xfs_intino_t typedef that is now superflous. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:35:03 +02:00
Christoph Hellwig	4fb44c8272	xfs: factor out xfs_dir2_leaf_find_entry Add a new xfs_dir2_leaf_find_entry helper to factor out some duplicate code from xfs_dir2_leaf_addname xfs_dir2_leafn_add. Found by Eric Sandeen using an automated code duplication checker. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:59 +02:00
Christoph Hellwig	29d104af0a	xfs: kill the unused struct xfs_sync_work Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:51 +02:00
Christoph Hellwig	f3ca87389d	xfs: remove i_transp Remove the transaction pointer in the inode. It's only used to avoid passing down an argument in the bmap code, and for a few asserts in the transaction code right now. Also use the local variable ip in a few more places in xfs_inode_item_unlock, so that it isn't only used for debug builds after the above change. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:47 +02:00
Christoph Hellwig	7a249cf83d	xfs: fix filesystsem freeze race in xfs_trans_alloc As pointed out by Jan xfs_trans_alloc can race with a concurrent filesystem freeze when it sleeps during the memory allocation. Fix this by moving the wait_for_freeze call after the memory allocation. This means moving the freeze into the low-level _xfs_trans_alloc helper, which thus grows a new argument. Also fix up some comments in that area while at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <david@fromorbit.com>	2011-07-08 14:34:42 +02:00
Christoph Hellwig	33b8f7c247	xfs: improve sync behaviour in the face of aggressive dirtying The following script from Wu Fengguang shows very bad behaviour in XFS when aggressively dirtying data during a sync on XFS, with sync times up to almost 10 times as long as ext4. A large part of the issue is that XFS writes data out itself two times in the ->sync_fs method, overriding the livelock protection in the core writeback code, and another issue is the lock-less xfs_ioend_wait call, which doesn't prevent new ioend from being queue up while waiting for the count to reach zero. This patch removes the XFS-internal sync calls and relies on the VFS to do it's work just like all other filesystems do. Note that the i_iocount wait which is rather suboptimal is simply removed here. We already do it in ->write_inode, which keeps the current supoptimal behaviour. We'll eventually need to remove that as well, but that's material for a separate commit. ------------------------------ snip ------------------------------ #!/bin/sh umount /dev/sda7 mkfs.xfs -f /dev/sda7 # mkfs.ext4 /dev/sda7 # mkfs.btrfs /dev/sda7 mount /dev/sda7 /fs echo $((50<<20)) > /proc/sys/vm/dirty_bytes pid= for i in `seq 10` do dd if=/dev/zero of=/fs/zero-$i bs=1M count=1000 & pid="$pid $!" done sleep 1 tic=$(date +'%s') sync tac=$(date +'%s') echo echo sync time: $((tac-tic)) egrep '(Dirty\|Writeback\|NFS_Unstable)' /proc/meminfo pidof dd > /dev/null && { kill -9 $pid; echo sync NOT livelocked; } ------------------------------ snip ------------------------------ Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:39 +02:00
Christoph Hellwig	8f04c47aa9	xfs: split xfs_itruncate_finish Split the guts of xfs_itruncate_finish that loop over the existing extents and calls xfs_bunmapi on them into a new helper, xfs_itruncate_externs. Make xfs_attr_inactive call it directly instead of xfs_itruncate_finish, which allows to simplify the latter a lot, by only letting it deal with the data fork. As a result xfs_itruncate_finish is renamed to xfs_itruncate_data to make its use case more obvious. Also remove the sync parameter from xfs_itruncate_data, which has been unessecary since the introduction of the busy extent list in 2002, and completely dead code since 2003 when the XFS_BMAPI_ASYNC parameter was made a no-op. I can't actually see why the xfs_attr_inactive needs to set the transaction sync, but let's keep this patch simple and without changes in behaviour. Also avoid passing a useless argument to xfs_isize_check, and make it private to xfs_inode.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:34 +02:00
Christoph Hellwig	857b9778d8	xfs: kill xfs_itruncate_start xfs_itruncate_start is a rather length wrapper that evaluates to a call to xfs_ioend_wait and xfs_tosspages, and only has two callers. Instead of using the complicated checks left over from IRIX where we can to truncate the pagecache just call xfs_tosspages (aka truncate_inode_pages) directly as we want to get rid of all data after i_size, and truncate_inode_pages handles incorrect alignments and too large offsets just fine. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:30 +02:00
Christoph Hellwig	681b120018	xfs: always log timestamp updates in xfs_setattr_size Get rid of the special case where we use unlogged timestamp updates for a truncate to the current inode size, and just call xfs_setattr_nonsize for it to treat it like a utimes calls. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:26 +02:00
Christoph Hellwig	c4ed4243c4	xfs: split xfs_setattr Split up xfs_setattr into two functions, one for the complex truncate handling, and one for the trivial attribute updates. Also move both new routines to xfs_iops.c as they are fairly Linux-specific. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:23 +02:00
Christoph Hellwig	dec58f1dfd	xfs: work around bogus gcc warning in xfs_allocbt_init_cursor GCC 4.6 complains about an array subscript is above array bounds when using the btree index to index into the agf_levels array. The only two indices passed in are 0 and 1, and we have an assert insuring that. Replace the trick of using the array index directly with using constants in the already existing branch for assigning the XFS_BTREE_LASTREC_UPDATE flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:18 +02:00
Christoph Hellwig	dbcdde3e76	xfs: re-enable non-blocking behaviour in xfs_map_blocks The non-blockig behaviour in xfs_vm_writepage currently is conditional on having both the WB_SYNC_NONE sync_mode and the nonblocking flag set. The latter used to be used by both pdflush, kswapd and a few other places in older kernels, but has been fading out starting with the introduction of the per-bdi flusher threads. Enable the non-blocking behaviour for all WB_SYNC_NONE calls to get back the behaviour we want. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:14 +02:00
Christoph Hellwig	680a647b49	xfs: PF_FSTRANS should never be set in ->writepage Now that we reject direct reclaim in addition to always using GFP_NOFS allocation there's no chance we'll ever end up in ->writepage with PF_FSTRANS set. Add a WARN_ON if we hit this case, and stop checking if we'd actually need to start a transaction. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-07-08 14:34:05 +02:00
Anatolij Gustschin	19495f70d1	UBIFS: fix master node recovery When the 1st LEB was unmapped and written but 2nd LEB not, the master node recovery doesn't succeed after power cut. We see following error when mounting UBIFS partition on NOR flash: UBIFS error (pid 1137): ubifs_recover_master_node: failed to recover master node Correct 2nd master node offset check is needed to fix the problem. If the 2nd master node is at the end in the 2nd LEB, first master node is used for recovery. When checking for this condition we should check whether the master node is exactly at the end of the LEB (without remaining empty space) or whether it is followed by an empty space less than the master node size. Artem: when the error happened, offs2 = 261120, sz = 512, c->leb_size = 262016. Signed-off-by: Anatolij Gustschin <agust@denx.de> Signed-off-by: Artem Bityutskiy <dedekind1@gmail.com>	2011-07-08 06:53:18 +03:00
Jeff Layton	04db79b015	cifs: factor smb_vol allocation out of cifs_setup_volume_info Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-08 03:51:23 +00:00
David Howells	c902ce1bfb	FS-Cache: Add a helper to bulk uncache pages on an inode Add an FS-Cache helper to bulk uncache pages on an inode. This will only work for the circumstance where the pages in the cache correspond 1:1 with the pages attached to an inode's page cache. This is required for CIFS and NFS: When disabling inode cookie, we were returning the cookie and setting cifsi->fscache to NULL but failed to invalidate any previously mapped pages. This resulted in "Bad page state" errors and manifested in other kind of errors when running fsstress. Fix it by uncaching mapped pages when we disable the inode cookie. This patch should fix the following oops and "Bad page state" errors seen during fsstress testing. ------------[ cut here ]------------ kernel BUG at fs/cachefiles/namei.c:201! invalid opcode: 0000 [#1] SMP Pid: 5, comm: kworker/u:0 Not tainted 2.6.38.7-30.fc15.x86_64 #1 Bochs Bochs RIP: 0010: cachefiles_walk_to_object+0x436/0x745 [cachefiles] RSP: 0018:ffff88002ce6dd00 EFLAGS: 00010282 RAX: ffff88002ef165f0 RBX: ffff88001811f500 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000282 RBP: ffff88002ce6dda0 R08: 0000000000000100 R09: ffffffff81b3a300 R10: 0000ffff00066c0a R11: 0000000000000003 R12: ffff88002ae54840 R13: ffff88002ae54840 R14: ffff880029c29c00 R15: ffff88001811f4b0 FS: 00007f394dd32720(0000) GS:ffff88002ef00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fffcb62ddf8 CR3: 000000001825f000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kworker/u:0 (pid: 5, threadinfo ffff88002ce6c000, task ffff88002ce55cc0) Stack: 0000000000000246 ffff88002ce55cc0 ffff88002ce6dd58 ffff88001815dc00 ffff8800185246c0 ffff88001811f618 ffff880029c29d18 ffff88001811f380 ffff88002ce6dd50 ffffffff814757e4 ffff88002ce6dda0 ffffffff8106ac56 Call Trace: cachefiles_lookup_object+0x78/0xd4 [cachefiles] fscache_lookup_object+0x131/0x16d [fscache] fscache_object_work_func+0x1bc/0x669 [fscache] process_one_work+0x186/0x298 worker_thread+0xda/0x15d kthread+0x84/0x8c kernel_thread_helper+0x4/0x10 RIP cachefiles_walk_to_object+0x436/0x745 [cachefiles] ---[ end trace 1d481c9af1804caa ]--- I tested the uncaching by the following means: (1) Create a big file on my NFS server (104857600 bytes). (2) Read the file into the cache with md5sum on the NFS client. Look in /proc/fs/fscache/stats: Pages : mrk=25601 unc=0 (3) Open the file for read/write ("bash 5<>/warthog/bigfile"). Look in proc again: Pages : mrk=25601 unc=25601 Reported-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de> cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-07 13:21:56 -07:00
Alexey Khoroshilov	dd7f3d5458	hfsplus: Add error propagation for hfsplus_ext_write_extent_locked Implement error propagation through the callers of hfsplus_ext_write_extent_locked(). Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-07-07 17:45:46 +02:00
Alexey Khoroshilov	5bd9d99d10	hfsplus: add error checking for hfs_find_init() hfs_find_init() may fail with ENOMEM, but there are places, where the returned value is not checked. The consequences can be very unpleasant, e.g. kfree uninitialized pointer and inappropriate mutex unlocking. The patch adds checks for errors in hfs_find_init(). Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-07-07 17:45:46 +02:00
Miao Xie	149e2d76b4	btrfs: fix oops when doing space balance We need to make sure the data relocation inode doesn't go through the delayed metadata updates, otherwise we get an oops during balance: kernel BUG at fs/btrfs/relocation.c:4303! [SNIP] Call Trace: [<ffffffffa03143fd>] ? update_ref_for_cow+0x22d/0x330 [btrfs] [<ffffffffa0314951>] __btrfs_cow_block+0x451/0x5e0 [btrfs] [<ffffffffa031355d>] ? read_block_for_search+0x14d/0x4d0 [btrfs] [<ffffffffa0314beb>] btrfs_cow_block+0x10b/0x240 [btrfs] [<ffffffffa031acae>] btrfs_search_slot+0x49e/0x7a0 [btrfs] [<ffffffffa032d8af>] btrfs_lookup_inode+0x2f/0xa0 [btrfs] [<ffffffff8147bf0e>] ? mutex_lock+0x1e/0x50 [<ffffffffa0380cf1>] btrfs_update_delayed_inode+0x71/0x160 [btrfs] [<ffffffffa037ff27>] ? __btrfs_release_delayed_node+0x67/0x190 [btrfs] [<ffffffffa0381cf8>] btrfs_run_delayed_items+0xe8/0x120 [btrfs] [<ffffffffa03365e0>] btrfs_commit_transaction+0x250/0x850 [btrfs] [<ffffffff810f91d9>] ? find_get_pages+0x39/0x130 [<ffffffffa0336cd5>] ? join_transaction+0x25/0x250 [btrfs] [<ffffffff81081de0>] ? wake_up_bit+0x40/0x40 [<ffffffffa03785fa>] prepare_to_relocate+0xda/0xf0 [btrfs] [<ffffffffa037f2bb>] relocate_block_group+0x4b/0x620 [btrfs] [<ffffffffa0334cf5>] ? btrfs_clean_old_snapshots+0x35/0x150 [btrfs] [<ffffffffa037fa43>] btrfs_relocate_block_group+0x1b3/0x2e0 [btrfs] [<ffffffffa0368ec0>] ? btrfs_tree_unlock+0x50/0x50 [btrfs] [<ffffffffa035e39b>] btrfs_relocate_chunk+0x8b/0x670 [btrfs] [<ffffffffa031303d>] ? btrfs_set_path_blocking+0x3d/0x50 [btrfs] [<ffffffffa03577d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs] [<ffffffffa031bea1>] ? btrfs_previous_item+0xb1/0x150 [btrfs] [<ffffffffa03577d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs] [<ffffffffa035f5aa>] btrfs_balance+0x21a/0x2b0 [btrfs] [<ffffffffa0368898>] btrfs_ioctl+0x798/0xd20 [btrfs] [<ffffffff8111e358>] ? handle_mm_fault+0x148/0x270 [<ffffffff814809e8>] ? do_page_fault+0x1d8/0x4b0 [<ffffffff81160d6a>] do_vfs_ioctl+0x9a/0x540 [<ffffffff811612b1>] sys_ioctl+0xa1/0xb0 [<ffffffff81484ec2>] system_call_fastpath+0x16/0x1b [SNIP] RIP [<ffffffffa037c1cc>] btrfs_reloc_cow_block+0x22c/0x270 [btrfs] Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-07-06 18:51:53 -04:00
Josef Bacik	508794eb5e	Btrfs: don't panic if we get an error while balancing V2 A user reported an error where if we try to balance an fs after a device has been removed it will blow up. This is because we get an EIO back and this is where BUG_ON(ret) bites us in the ass. To fix we just exit. Thanks, Reported-by: Anand Jain <Anand.Jain@oracle.com> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-07-06 18:46:43 -04:00
David Sterba	0942caa373	btrfs: add missing options displayed in mount output There are three missed mount options settable by user which are not currently displayed in mount output. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-07-06 18:46:43 -04:00
Masatake YAMATO	bcaadf5c1a	dlm: dump address of unknown node When the dlm fails to make a network connection to another node, include the address of the node in the error message. Signed-off-by: Masatake YAMATO <yamato@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2011-07-06 16:37:23 -05:00
Dave Chinner	1316d4da3f	xfs: unpin stale inodes directly in IOP_COMMITTED When inodes are marked stale in a transaction, they are treated specially when the inode log item is being inserted into the AIL. It tries to avoid moving the log item forward in the AIL due to a race condition with the writing the underlying buffer back to disk. The was "fixed" in commit `de25c18` ("xfs: avoid moving stale inodes in the AIL"). To avoid moving the item forward, we return a LSN smaller than the commit_lsn of the completing transaction, thereby trying to trick the commit code into not moving the inode forward at all. I'm not sure this ever worked as intended - it assumes the inode is already in the AIL, but I don't think the returned LSN would have been small enough to prevent moving the inode. It appears that the reason it worked is that the lower LSN of the inodes meant they were inserted into the AIL and flushed before the inode buffer (which was moved to the commit_lsn of the transaction). The big problem is that with delayed logging, the returning of the different LSN means insertion takes the slow, non-bulk path. Worse yet is that insertion is to a position -before- the commit_lsn so it is doing a AIL traversal on every insertion, and has to walk over all the items that have already been inserted into the AIL. It's expensive. To compound the matter further, with delayed logging inodes are likely to go from clean to stale in a single checkpoint, which means they aren't even in the AIL at all when we come across them at AIL insertion time. Hence these were all getting inserted into the AIL when they simply do not need to be as inodes marked XFS_ISTALE are never written back. Transactional/recovery integrity is maintained in this case by the other items in the unlink transaction that were modified (e.g. the AGI btree blocks) and committed in the same checkpoint. So to fix this, simply unpin the stale inodes directly in xfs_inode_item_committed() and return -1 to indicate that the AIL insertion code does not need to do any further processing of these inodes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-07-06 15:44:40 -05:00
Jeff Layton	f9e59bcba2	cifs: have cifs_cleanup_volume_info not take a double pointer ...as that makes for a cumbersome interface. Make it take a regular smb_vol pointer and rely on the caller to zero it out if needed. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-06 20:03:05 +00:00
Jeff Layton	b2a0fa1520	cifs: fix build_unc_path_to_root to account for a prefixpath Regression introduced by commit `f87d39d951`. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-06 20:03:05 +00:00
Jeff Layton	677d8537d8	cifs: remove bogus call to cifs_cleanup_volume_info This call to cifs_cleanup_volume_info is clearly wrong. As soon as it's called the following call to cifs_get_tcp_session will oops as the volume_info pointer will then be NULL. The caller of cifs_mount should clean up this data since it passed it in. There's no need for us to call this here. Regression introduced by commit `724d9f1cfb`. Reported-by: Adam Williamson <awilliam@redhat.com> Cc: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-06 20:03:04 +00:00
Davidlohr Bueso	bcb65a797e	FDPIC: Fix memory leak The shdr4extnum variable isn't being freed in the cleanup process of elf_fdpic_core_dump(). Signed-off-by: Davidlohr Bueso <dave@gnu.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-06 12:15:16 -07:00
Miklos Szeredi	a51cb91d81	fs: fix lock initialization locks_alloc_lock() assumed that the allocated struct file_lock is already initialized to zero members. This is only true for the first allocation of the structure, after reuse some of the members will have random values. This will for example result in passing random fl_start values to userspace in fuse for FL_FLOCK locks, which is an information leak at best. Fix by reinitializing those members which may be non-zero after freeing. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-06 10:41:13 -07:00
Linus Torvalds	121782a248	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: fix sync and dio writes across stripe boundaries libceph: fix page calculation for non-page-aligned io ceph: fix page alignment corrections	2011-07-05 13:15:57 -07:00
Linus Torvalds	a8728d3554	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus: hfsplus: Fix double iput of the same inode in hfsplus_fill_super() hfsplus: add missing call to bio_put()	2011-07-05 10:04:27 -07:00
Artem Bityutskiy	a7fa94a9fe	UBIFS: improve power cut emulation testing This patch cleans-up and improves the power cut testing: 1. Kill custom 'simple_random()' function and use 'random32()' instead. 2. Make timeout larger 3. When cutting the buffer - fill the end with random data sometimes, not only with 0xFFs. 4. Some times cut in the middle of the buffer, not always at the end. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:34 +03:00
Artem Bityutskiy	d27462a518	UBIFS: rename recovery testing variables Since the recovery testing is effectively about emulating power cuts by UBIFS, use "power cut" as the base term for all the related variables and name them correspondingly. This is just a minor clean-up for the sake of readability. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:34 +03:00
Artem Bityutskiy	f57cb188cc	UBIFS: remove custom list of superblocks This is a clean-up of the power-cut emulation code - remove the custom list of superblocks which we maintained to find the superblock by the UBI volume descriptor. We do not need that crud any longer, because now we can get the superblock as a function argument. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:33 +03:00
Artem Bityutskiy	0a541b14e8	UBIFS: stop re-defining UBI operations Now when we use UBIFS helpers for all the I/O, we can remove the horrible hack of re-defining UBI I/O functions. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:33 +03:00
Artem Bityutskiy	d3b2578f56	UBIFS: switch to I/O helpers Switch the rest of direct UBI calls to UBIFS helper functions. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:33 +03:00
Artem Bityutskiy	987226a5d3	UBIFS: switch to ubifs_leb_write Stop using 'ubi_leb_write()' directly and switch to the 'ubifs_leb_write()' helper. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:33 +03:00
Artem Bityutskiy	d304820a1f	UBIFS: switch to ubifs_leb_read Instead of using 'ubi_read()' function directly, used the 'ubifs_leb_read()' helper function instead. This allows to get rid of several redundant error messages and make sure that we always have a stack dump on read errors. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:33 +03:00
Artem Bityutskiy	83cef708c6	UBIFS: introduce more I/O helpers Introduce the following I/O helper functions: 'ubifs_leb_read()', 'ubifs_leb_write()', 'ubifs_leb_change()', 'ubifs_leb_unmap()', 'ubifs_leb_map()', 'ubifs_is_mapped(). The idea is to wrap all UBI I/O functions in order to encapsulate various assertions and error path handling (error message, stack dump, switching to R/O mode). And there are some other benefits of this which will be used in the following patches. This patch does not switch whole UBIFS to use these functions yet. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:33 +03:00
Artem Bityutskiy	d033c98b17	UBIFS: always print stacktrace when switching to R/O mode When switching to R/O mode due to an I/O error, always dump the stack, not only when debugging is enabled. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:33 +03:00
Artem Bityutskiy	891a54a153	UBIFS: remove unused and unneeded debugging function This patch contains several minor clean-up and preparational cahnges. 1. Remove 'dbg_read()', 'dbg_write()', 'dbg_change()', and 'dbg_leb_erase()' functions as they are not used. 2. Remove 'dbg_leb_read()' and 'dbg_is_mapped()' as they are not really needed, it is fine to let reads go through in failure mode. 3. Rename 'offset' argument to 'offs' to be consistent with the rest of UBIFS code. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:32 +03:00
Artem Bityutskiy	e7717060dd	UBIFS: add global debugfs knobs Now we have per-FS (superblock) debugfs knobs, but they have one drawback - you have to first mount the FS and only after this you can switch self-checks on/off. But often we want to have the checks enabled during the mount. Introduce global debugging knobs for this purpose. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:32 +03:00
Artem Bityutskiy	28488fc28a	UBIFS: introduce debugfs helpers Separate out pieces of code from the debugfs file read/write functions and create separate 'interpret_user_input()'/'provide_user_output()' helpers. These helpers will be needed in one of the following patches, so this is just a preparational change. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:32 +03:00
Artem Bityutskiy	7dae997de6	UBIFS: re-arrange debugging code a bit Move 'dbg_debugfs_init()' and 'dbg_debugfs_exit()' functions which initialize debugfs for whole UBIFS subsystem below the code which initializes debugfs for a particular UBIFS instance. And do the same for 'ubifs_debugging_init()' and 'ubifs_debugging_exit()' functions. This layout is a bit better for the next patches, so this is just a preparation. Also, rename 'open_debugfs_file()' into 'dfs_file_open()' for consistency. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:32 +03:00
Artem Bityutskiy	24a4f8009e	UBIFS: be more informative in failure mode When we are testing UBIFS recovery, it is better to print in which eraseblock we are going to fail. Currently UBIFS prints it only if recovery debugging messages are enabled, but this is not very practical. So change 'dbg_rcvry()' messages to 'ubifs_warn()' messages. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:30 +03:00
Artem Bityutskiy	81e79d38df	UBIFS: switch self-check knobs to debugfs UBIFS has many built-in self-check functions which can be enabled using the debug_chks module parameter or the corresponding sysfs file (/sys/module/ubifs/parameters/debug_chks). However, this is not flexible enough because it is not per-filesystem. This patch moves this to debugfs interfaces. We already have debugfs support, so this patch just adds more debugfs files. While looking at debugfs support I've noticed that it is racy WRT file-system unmount, and added a TODO entry for that. This problem has been there for long time and it is quite standard debugfs PITA. The plan is to fix this later. This patch is simple, but it is large because it changes many places where we check if a particular type of checks is enabled or disabled. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:28 +03:00
Artem Bityutskiy	8d7819b4af	UBIFS: lessen amount of debugging check types We have too many different debugging checks - lessen the amount by merging all index-related checks into one. At the same time, move the "force in-the-gap" test to the "index checks" class, because it is too heavy for the "general" class. This patch merges TNC, Old index, and Index size check and calles this just "index checks". Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:28 +03:00
Artem Bityutskiy	2b1844a8c9	UBIFS: introduce helper functions for debugging checks and tests This patch introduces helper functions for all debugging checks, so instead of doing if (!(ubifs_chk_flags & UBIFS_CHK_GEN)) we now do if (!dbg_is_chk_gen(c)) This is a preparation to further changes where the flags will go away, and we'll need to only change the helper functions, but the code which utilizes them won't be touched. At the same time this patch removes 'dbg_force_in_the_gaps()', 'dbg_force_in_the_gaps_enabled()', and dbg_failure_mode helpers for consistency. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:28 +03:00
Artem Bityutskiy	d808efb407	UBIFS: amend debugging inode size check function prototype Add 'const struct ubifs_info *c' parameter to 'dbg_check_synced_i_size()' function because we'll need it in the next patch when we switch to debugfs. So this patch is just a preparation. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:27 +03:00
Artem Bityutskiy	bb2615d4d1	UBIFS: amend debugging name check function prototype Add 'struct ubifs_info *c' parameter to the 'dbg_check_name()' debugging function - it will be needed in one of the following commits where we switch to debugfs. So this is just a preparation. Mark parameters as 'const' while on it. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:27 +03:00
Artem Bityutskiy	06b282a4cc	UBIFS: add few commentaries about TNC Add a couple of comments - while looking into TNC I could not easily figure out few facts, so it is a good idea to document them in the code. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:27 +03:00
Artem Bityutskiy	3766244769	UBIFS: use correct flags in lprops The UBIFS lpt tree is in many aspects similar to the TNC tree, and we have similar flags for these trees. And by mistake we use the COW_ZNODE flag for LPT in some places, instead of the right flag COW_CNODE. And this works only because these two constants have the same value. This patch makes all the LPT code to use COW_CNODE and also changes COW_CNODE constant value to make sure we do not misuse the flags any more. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:27 +03:00
Artem Bityutskiy	f42eed7cba	UBIFS: harmonize znode flag helpers We have 3 znode flags: cow, obsolete, dirty. For the last flag we have a 'ubifs_zn_dirty()' helper function, but for the other 2 flags we use 'test_bit()' directly. This patch makes the situation more consistent and introduces helpers for the other 2 flags: 'ubifs_zn_cow()' and 'ubifs_zn_obsolete()'. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:27 +03:00
Artem Bityutskiy	1f42596ec0	UBIFS: remove dead code Remove dead pieces of code under "if (c->min_io_size == 1)" statement - we never execute it because in UBIFS 'c->min_io_size' is always at least 8. This are leftovers from old pre-mainline prototype. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:27 +03:00
Artem Bityutskiy	12e776a088	UBIFS: remove unnecessary brackets Remove unnecessary brackets in "inode->i_flags \|= (S_NOCMTIME)" statement to make the code not look silly. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:27 +03:00
Artem Bityutskiy	a29fa9dfa4	UBIFS: minor cleanup: use S_ISREG helper Instead of using long "(inode->i_mode & S_IFMT) != S_IFREG" expression, use shorted "!S_ISREG(inode->i_mode)". Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:26 +03:00
Artem Bityutskiy	1b51e98365	UBIFS: rename dbg_check_dir_size function Since this function is not only about size checking, rename it to 'dbg_check_dir()'. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:26 +03:00
Artem Bityutskiy	4315fb4072	UBIFS: improve inode dumping function Teach 'dbg_dump_inode()' dump directory entries for directory inodes. This requires few additional changes: 1. The 'c' argument of 'dbg_dump_inode()' cannot be const any more. 2. Users of 'dbg_dump_inode()' should not have 'tnc_mutex' locked. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:26 +03:00
Artem Bityutskiy	bfcf677dec	UBIFS: dump stack when pnode or nnode reading fails When we fail to read a pnode or nnode - print stacktrace if debugging is enabled. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:26 +03:00
Artem Bityutskiy	ae380ce047	UBIFS: lessen the size of debugging info data structure This patch lessens the 'struct ubifs_debug_info' size by 90 bytes by allocating less bytes for the debugfs root directory name. It introduces macros for the name patter an length instead of hard-coding 100 bytes. It also makes UBIFS use 'snprintf()' and teaches it to gracefully catch situations when the name array is too short. Additionally, this patch makes 2 unrelated changes - I just thought they do not deserve separate commits: simplifies 'ubifs_assert()' for non-debugging case and makes 'dbg_debugfs_init()' properly verify debugfs return code which may be an error code or NULL, so we should you 'IS_ERR_OR_NULL()' instead of 'IS_ERR()'. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:26 +03:00
Artem Bityutskiy	549c999a76	UBIFS: return EROFS in case of broken commit If commit failed and it is in broken state, UBIFS switches to R/O mode. Most operations return -EROFS in this case, except of commit which returns -EINVAL. Make it return -EROFS too for consistency. This is also important for our power cut emulation testing. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-07-04 10:54:26 +03:00
Bryn M. Reeves	c282af4990	dlm: use vmalloc for hash tables Allocate dlm hash tables in the vmalloc area to allow a greater maximum size without restructuring of the hash table code. Signed-off-by: Bryn M. Reeves <bmr@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2011-07-01 15:49:23 -05:00
Johannes Stezenbach	390192b300	compat_ioctl: fix warning caused by qemu On Linux x86_64 host with 32bit userspace, running qemu or even just "qemu-img create -f qcow2 some.img 1G" causes a kernel warning: ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(00005326){t:'S';sz:0} arg(7fffffff) on some.img ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(801c0204){t:02;sz:28} arg(fff77350) on some.img ioctl 00005326 is CDROM_DRIVE_STATUS, ioctl 801c0204 is FDGETPRM. The warning appears because the Linux compat-ioctl handler for these ioctls only applies to block devices, while qemu also uses the ioctls on plain files. Signed-off-by: Johannes Stezenbach <js@sig21.net> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-01 22:32:26 +02:00
Denys Vlasenko	bb188d7e64	ptrace: make former thread ID available via PTRACE_GETEVENTMSG after PTRACE_EVENT_EXEC stop When multithreaded program execs under ptrace, all traced threads report WIFEXITED status, except for thread group leader and the thread which execs. Unless tracer tracks thread group relationship between tracees, which is a nontrivial task, it will not detect that execed thread no longer exists. This patch allows tracer to figure out which thread performed this exec, by requesting PTRACE_GETEVENTMSG in PTRACE_EVENT_EXEC stop. Another, samller problem which is solved by this patch is that tracer now can figure out which of the several concurrent execs in multithreaded program succeeded. Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-07-01 18:51:49 +02:00
Jeff Layton	ee1b3ea9e6	cifs: set socket send and receive timeouts before attempting connect Benjamin S. reported that he was unable to suspend his machine while it had a cifs share mounted. The freezer caused this to spew when he tried it: -----------------------[snip]------------------ PM: Syncing filesystems ... done. Freezing user space processes ... (elapsed 0.01 seconds) done. Freezing remaining freezable tasks ... Freezing of tasks failed after 20.01 seconds (1 tasks refusing to freeze, wq_busy=0): cifsd S ffff880127f7b1b0 0 1821 2 0x00800000 ffff880127f7b1b0 0000000000000046 ffff88005fe008a8 ffff8800ffffffff ffff880127cee6b0 0000000000011100 ffff880127737fd8 0000000000004000 ffff880127737fd8 0000000000011100 ffff880127f7b1b0 ffff880127736010 Call Trace: [<ffffffff811e85dd>] ? sk_reset_timer+0xf/0x19 [<ffffffff8122cf3f>] ? tcp_connect+0x43c/0x445 [<ffffffff8123374e>] ? tcp_v4_connect+0x40d/0x47f [<ffffffff8126ce41>] ? schedule_timeout+0x21/0x1ad [<ffffffff8126e358>] ? _raw_spin_lock_bh+0x9/0x1f [<ffffffff811e81c7>] ? release_sock+0x19/0xef [<ffffffff8123e8be>] ? inet_stream_connect+0x14c/0x24a [<ffffffff8104485b>] ? autoremove_wake_function+0x0/0x2a [<ffffffffa02ccfe2>] ? ipv4_connect+0x39c/0x3b5 [cifs] [<ffffffffa02cd7b7>] ? cifs_reconnect+0x1fc/0x28a [cifs] [<ffffffffa02cdbdc>] ? cifs_demultiplex_thread+0x397/0xb9f [cifs] [<ffffffff81076afc>] ? perf_event_exit_task+0xb9/0x1bf [<ffffffffa02cd845>] ? cifs_demultiplex_thread+0x0/0xb9f [cifs] [<ffffffffa02cd845>] ? cifs_demultiplex_thread+0x0/0xb9f [cifs] [<ffffffff810444a1>] ? kthread+0x7a/0x82 [<ffffffff81002d14>] ? kernel_thread_helper+0x4/0x10 [<ffffffff81044427>] ? kthread+0x0/0x82 [<ffffffff81002d10>] ? kernel_thread_helper+0x0/0x10 Restarting tasks ... done. -----------------------[snip]------------------ We do attempt to perform a try_to_freeze in cifs_reconnect, but the connection attempt itself seems to be taking longer than 20s to time out. The connect timeout is governed by the socket send and receive timeouts, so we can shorten that period by setting those timeouts before attempting the connect instead of after. Adam Williamson tested the patch and said that it seems to have fixed suspending on his laptop when a cifs share is mounted. Reported-by: Benjamin S <da_joind@gmx.net> Tested-by: Adam Williamson <awilliam@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-07-01 16:15:30 +00:00
Tejun Heo	85ef06d1d2	block: flush MEDIA_CHANGE from drivers on close(2) Currently, only open(2) is defined as the 'clearing' point. It has two roles - first, it's an acknowledgement from userland indicating that the event has been received and kernel can clear pending states and proceed to generate more events. Secondly, it's passed on to device drivers as a hint indicating that a synchronization point has been reached and it might want to take a deeper look at the device. The latter currently is only used by sr which uses two different mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY to discover events, where the former is lighter weight and safe to be used repeatedly but may not provide full coverage. Among other things, GET_EVENT can't detect media removal while TUR can. This patch makes close(2) - blkdev_put() - indicate clearing hint for MEDIA_CHANGE to drivers. disk_check_events() is renamed to disk_flush_events() and updated to take @mask for events to flush which is or'd to ev->clearing and will be passed to the driver on the next ->check_events() invocation. This change makes sr generate MEDIA_CHANGE when media is ejected from userland - e.g. with eject(1). Note: Given the current usage, it seems @clearing hint is needlessly complex. disk_clear_events() can simply clear all events and the hint can be boolean @flush. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-01 16:17:47 +02:00
Jens Axboe	04bf7869ca	Merge branch 'for-linus' into for-3.1/core Conflicts: block/blk-throttle.c block/cfq-iosched.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-01 16:17:13 +02:00
Masatake YAMATO	55b3286d3d	dlm: show addresses in configfs Display all addresses the dlm is using for the local node from the configfs file config/dlm/<cluster>/comms/<comm>/addr_list Also make the addr file write only. Signed-off-by: Masatake YAMATO <yamato@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2011-06-30 14:45:28 -05:00
Christoph Hellwig	c6d5f5fa65	hfsplus: lift the 2TB size limit Replace the hardcoded 2TB limit with a dynamic limit based on the block size now that we have fixed the few overflows preventing operation with large volumes. Signed-off-by: Christoph Hellwig <hch@tuxera.com>	2011-06-30 13:40:59 +02:00
Christoph Hellwig	4ba2d5fdcf	hfsplus: fix overflow in hfsplus_read_wrapper For partitions larger than 2TB or at such an offset the hfs wrapper code in hfsplus might overflow the range representable in a 32-bit data type. Make sure we use a sector_t for the arithmetics leading to it. I'm not sure this code can be readed at all as hfs itself never supported such large volumes. Signed-off-by: Christoph Hellwig <hch@tuxera.com>	2011-06-30 13:40:59 +02:00
Christoph Hellwig	bf1a1b31fa	hfsplus: fix overflow in hfsplus_get_block For filesystems larger than 2TB the final sector number passed to map_bh might overflow the range representable in a 32-bit data type. Make sure we use a sector_t for it and the arithmetics calculating it. Signed-off-by: Christoph Hellwig <hch@tuxera.com>	2011-06-30 13:40:58 +02:00
Anton Salikhmetov	2b4f9ca8a5	hfsplus: assignments inside `if' condition clean-up Make assignments outside `if' conditions for unicode.c module where the checkpatch.pl script reported this coding style error. Signed-off-by: Anton Salikhmetov <alexo@tuxera.com> Signed-off-by: Christoph Hellwig <hch@tuxera.com>	2011-06-30 13:40:58 +02:00
Alexey Khoroshilov	032016a56a	hfsplus: Fix double iput of the same inode in hfsplus_fill_super() There is a misprint in resource deallocation code on error path in hfsplus_fill_super(): the sbi->alloc_file inode is iput twice, while the root inode in not iput at all. Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-06-30 13:38:39 +02:00
Seth Forshee	50176ddefa	hfsplus: add missing call to bio_put() hfsplus leaks bio objects by failing to call bio_put() on the bios it allocates. Add the missing call to fix the leak. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Cc: <stable@kernel.org> # .38.x, .39.x Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-06-30 13:28:32 +02:00
James Morris	5b944a71a1	Merge branch 'linus' into next	2011-06-30 18:43:56 +10:00
Theodore Ts'o	275d3ba6b4	ext4: remove loop around bio_alloc() These days, bio_alloc() is guaranteed to never fail (as long as nvecs is less than BIO_MAX_PAGES), so we don't need the loop around the struct bio allocation. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-29 21:44:45 -04:00
Boaz Harrosh	2bea038c52	pnfs: write: Set mds_offset in the generic layer - it is needed by all LDs In current pnfs tree, all the layouts set mds_offset in their .write_pagelist member. mds_offset is only used by generic layer and should be handled by it. This patch is for upstream. It is needed in this -rc series to fix a bug in objects layout_commit. I'll send patches for objects and blocks to be squashed into current pnfs tree. TODO: It looks like the read path needs the same patch. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-28 14:12:11 -04:00
Vasiliy Kulikov	1d1221f375	proc: restrict access to /proc/PID/io /proc/PID/io may be used for gathering private information. E.g. for openssh and vsftpd daemons wchars/rchars may be used to learn the precise password length. Restrict it to processes being able to ptrace the target process. ptrace_may_access() is needed to prevent keeping open file descriptor of "io" file, executing setuid binary and gathering io information of the setuid'ed process. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-06-28 09:39:11 -07:00
Yongqiang Yang	9331b62610	ext4: quiet 'unused variables' compile warnings Unused variables was deleted. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-28 10:19:05 -04:00
Eric Sandeen	f86186b44b	ext4: refactor duplicated block placement code I found that ext4_ext_find_goal() and ext4_find_near() share the same code for returning a coloured start block based on i_block_group. We can refactor this into a common function so that they don't diverge in the future. Thanks to adilger for suggesting the new function name. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-28 10:01:31 -04:00
Jan Kara	08142579b6	mm: fix assertion mapping->nrpages == 0 in end_writeback() Under heavy memory and filesystem load, users observe the assertion mapping->nrpages == 0 in end_writeback() trigger. This can be caused by page reclaim reclaiming the last page from a mapping in the following race: CPU0 CPU1 ... shrink_page_list() __remove_mapping() __delete_from_page_cache() radix_tree_delete() evict_inode() truncate_inode_pages() truncate_inode_pages_range() pagevec_lookup() - finds nothing end_writeback() mapping->nrpages != 0 -> BUG page->mapping = NULL mapping->nrpages-- Fix the problem by doing a reliable check of mapping->nrpages under mapping->tree_lock in end_writeback(). Analyzed by Jay <jinshan.xiong@whamcloud.com>, lost in LKML, and dug out by Miklos Szeredi <mszeredi@suse.de>. Cc: Jay <jinshan.xiong@whamcloud.com> Cc: Miklos Szeredi <mszeredi@suse.de> Signed-off-by: Jan Kara <jack@suse.cz> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-06-27 18:00:13 -07:00
Bob Liu	2b4b2482e7	romfs: fix romfs_get_unmapped_area() argument check romfs_get_unmapped_area() checks argument `len' without considering PAGE_ALIGN which will cause do_mmap_pgoff() return -EINVAL error after commit `f67d9b1576` ("nommu: add page_align to mmap"). Fix the check by changing it in same way ramfs_nommu_get_unmapped_area() was changed in ramfs/file-nommu.c. Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Acked-by: Greg Ungerer <gerg@snapgear.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-06-27 18:00:12 -07:00
Amir Goldstein	dae1e52cb1	ext4: move ext4_ind_* functions from inode.c to indirect.c This patch moves functions from inode.c to indirect.c. The moved functions are ext4_ind_* functions and their helpers. Functions called from inode.c are declared extern. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-27 19:40:50 -04:00
Theodore Ts'o	9f125d641b	ext4: move common truncate functions to header file Move two functions that will be needed by the indirect functions to be moved to indirect.c as well as inode.c to truncate.h as inline functions, so that we can avoid having duplicate copies of the function (which can be a maintenance problem) without having to expose them as globally functions. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-27 19:16:04 -04:00
Theodore Ts'o	1f7d1e7741	ext4: move __ext4_check_blockref to block_validity.c In preparation for moving the indirect functions to a separate file, move __ext4_check_blockref() to block_validity.c and rename it to ext4_check_blockref() which is exported as globally visible function. Also, rename the cpp macro ext4_check_inode_blockref() to ext4_ind_check_inode(), to make it clear that it is only valid for use with non-extent mapped inodes. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-27 19:16:02 -04:00
Tao Ma	a212d1a71d	jbd: Use WRITE_SYNC in journal checkpoint. In journal checkpoint, we write the buffer and wait for its finish. But in cfq, the async queue has a very low priority, and in our test, if there are too many sync queues and every queue is filled up with requests, and the process will hang waiting for the log space. So this patch tries to use WRITE_SYNC in __flush_batch so that the request will be moved into sync queue and handled by cfq timely. We also use the new plug, sot that all the WRITE_SYNC requests can be given as a whole when we unplug it. Reported-by: Robin Dong <sanbai@taobao.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz>	2011-06-28 00:06:41 +02:00
Amir Goldstein	8bb2b24712	ext4: rename ext4_indirect_* funcs to ext4_ind_* We are going to move all ext4_ind_* functions to indirect.c. Before we do that, let's rename 2 functions called ext4_indirect_* to ext4_ind_*, to keep to the naming convention. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-27 17:10:28 -04:00
Amir Goldstein	ff9893dc8a	ext4: split ext4_ind_truncate from ext4_truncate We are about to move all indirect inode functions to a new file. Before we do that, let's split ext4_ind_truncate() out of ext4_truncate() leaving only generic code in the latter, so we will be able to move ext4_ind_truncate() to the new file. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-27 16:36:31 -04:00
Linus Torvalds	af4087e0e6	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: btrfs: fix inconsonant inode information Btrfs: make sure to update total_bitmaps when freeing cache V3 Btrfs: fix type mismatch in find_free_extent() Btrfs: make sure to record the transid in new inodes	2011-06-27 13:32:14 -07:00
Robin Dong	ed7a7e1672	ext4: fix incorrect error msg in ext4_ext_insert_index In function ext4_ext_insert_index when eh_entries of curp is bigger than eh_max, error messages will be printed out, but the content is about logical and ei_block, that's incorret. Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-27 15:35:53 -04:00
Oleg Nesterov	087806b128	redefine thread_group_leader() as exit_signal >= 0 Change de_thread() to set old_leader->exit_signal = -1. This is good for the consistency, it is no longer the leader and all sub-threads have exit_signal = -1 set by copy_process(CLONE_THREAD). And this allows us to micro-optimize thread_group_leader(), it can simply check exit_signal >= 0. This also makes sense because we should move ->group_leader from task_struct to signal_struct. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Tejun Heo <tj@kernel.org>	2011-06-27 20:30:10 +02:00
Tao Ma	d3ad8434aa	jbd2: use WRITE_SYNC in journal checkpoint In journal checkpoint, we write the buffer and wait for its finish. But in cfq, the async queue has a very low priority, and in our test, if there are too many sync queues and every queue is filled up with requests, the write request will be delayed for quite a long time and all the tasks which are waiting for journal space will end with errors like: INFO: task attr_set:3816 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. attr_set D ffff880028393480 0 3816 1 0x00000000 ffff8802073fbae8 0000000000000086 ffff8802140847c8 ffff8800283934e8 ffff8802073fb9d8 ffffffff8103e456 ffff8802140847b8 ffff8801ed728080 ffff8801db4bc080 ffff8801ed728450 ffff880028393480 0000000000000002 Call Trace: [<ffffffff8103e456>] ? __dequeue_entity+0x33/0x38 [<ffffffff8103caad>] ? need_resched+0x23/0x2d [<ffffffff814006a6>] ? thread_return+0xa2/0xbc [<ffffffffa01f6224>] ? jbd2_journal_dirty_metadata+0x116/0x126 [jbd2] [<ffffffffa01f6224>] ? jbd2_journal_dirty_metadata+0x116/0x126 [jbd2] [<ffffffff81400d31>] __mutex_lock_common+0x14e/0x1a9 [<ffffffffa021dbfb>] ? brelse+0x13/0x15 [ext4] [<ffffffff81400ddb>] __mutex_lock_slowpath+0x19/0x1b [<ffffffff81400b2d>] mutex_lock+0x1b/0x32 [<ffffffffa01f927b>] __jbd2_journal_insert_checkpoint+0xe3/0x20c [jbd2] [<ffffffffa01f547b>] start_this_handle+0x438/0x527 [jbd2] [<ffffffff8106f491>] ? autoremove_wake_function+0x0/0x3e [<ffffffffa01f560b>] jbd2_journal_start+0xa1/0xcc [jbd2] [<ffffffffa02353be>] ext4_journal_start_sb+0x57/0x81 [ext4] [<ffffffffa024a314>] ext4_xattr_set+0x6c/0xe3 [ext4] [<ffffffffa024aaff>] ext4_xattr_user_set+0x42/0x4b [ext4] [<ffffffff81145adb>] generic_setxattr+0x6b/0x76 [<ffffffff81146ac0>] __vfs_setxattr_noperm+0x47/0xc0 [<ffffffff81146bb8>] vfs_setxattr+0x7f/0x9a [<ffffffff81146c88>] setxattr+0xb5/0xe8 [<ffffffff81137467>] ? do_filp_open+0x571/0xa6e [<ffffffff81146d26>] sys_fsetxattr+0x6b/0x91 [<ffffffff81002d32>] system_call_fastpath+0x16/0x1b So this patch tries to use WRITE_SYNC in __flush_batch so that the request will be moved into sync queue and handled by cfq timely. We also use the new plug, sot that all the WRITE_SYNC requests can be given as a whole when we unplug it. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Reported-by: Robin Dong <sanbai@taobao.com>	2011-06-27 12:36:29 -04:00
Linus Torvalds	4699d4423c	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: prevent bogus assert when trying to remove non-existent attribute xfs: clear XFS_IDIRTY_RELEASE on truncate down xfs: reset inode per-lifetime state when recycling it	2011-06-27 09:01:29 -07:00
Miao Xie	2f7e33d432	btrfs: fix inconsonant inode information When iputting the inode, We may leave the delayed nodes if they have some delayed items that have not been dealt with. So when the inode is read again, we must look up the relative delayed node, and use the information in it to initialize the inode. Or we will get inconsonant inode information, it may cause that the same directory index number is allocated again, and hit the following oops: [ 5447.554187] err add delayed dir index item(name: pglog_0.965_0) into the insertion tree of the delayed node(root id: 262, inode id: 258, errno: -17) [ 5447.569766] ------------[ cut here ]------------ [ 5447.575361] kernel BUG at fs/btrfs/delayed-inode.c:1301! [SNIP] [ 5447.790721] Call Trace: [ 5447.793191] [<ffffffffa0641c4e>] btrfs_insert_dir_item+0x189/0x1bb [btrfs] [ 5447.800156] [<ffffffffa0651a45>] btrfs_add_link+0x12b/0x191 [btrfs] [ 5447.806517] [<ffffffffa0651adc>] btrfs_add_nondir+0x31/0x58 [btrfs] [ 5447.812876] [<ffffffffa0651d6a>] btrfs_create+0xf9/0x197 [btrfs] [ 5447.818961] [<ffffffff8111f840>] vfs_create+0x72/0x92 [ 5447.824090] [<ffffffff8111fa8c>] do_last+0x22c/0x40b [ 5447.829133] [<ffffffff8112076a>] path_openat+0xc0/0x2ef [ 5447.834438] [<ffffffff810c58e2>] ? __perf_event_task_sched_out+0x24/0x44 [ 5447.841216] [<ffffffff8103ecdd>] ? perf_event_task_sched_out+0x59/0x67 [ 5447.847846] [<ffffffff81121a79>] do_filp_open+0x3d/0x87 [ 5447.853156] [<ffffffff811e126c>] ? strncpy_from_user+0x43/0x4d [ 5447.859072] [<ffffffff8111f1f5>] ? getname_flags+0x2e/0x80 [ 5447.864636] [<ffffffff8111f179>] ? do_getname+0x14b/0x173 [ 5447.870112] [<ffffffff8111f1b7>] ? audit_getname+0x16/0x26 [ 5447.875682] [<ffffffff8112b1ab>] ? spin_lock+0xe/0x10 [ 5447.880882] [<ffffffff81112d39>] do_sys_open+0x69/0xae [ 5447.886153] [<ffffffff81112db1>] sys_open+0x20/0x22 [ 5447.891114] [<ffffffff813b9aab>] system_call_fastpath+0x16/0x1b Fix it by reusing the old delayed node. Reported-by: Jim Schutt <jaschut@sandia.gov> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Tested-by: Jim Schutt <jaschut@sandia.gov> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-27 11:34:27 -04:00
Roberto Sassu	1252cc3b23	eCryptfs: added support for the encrypted key type The function ecryptfs_keyring_auth_tok_for_sig() has been modified in order to search keys of both 'user' and 'encrypted' types. Signed-off-by: Roberto Sassu <roberto.sassu@polito.it> Acked-by: Gianluca Ramunno <ramunno@polito.it> Acked-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>	2011-06-27 09:11:21 -04:00
Roberto Sassu	f8f8527103	eCryptfs: export global eCryptfs definitions to include/linux/ecryptfs.h Some eCryptfs specific definitions, such as the current version and the authentication token structure, are moved to the new include file 'include/linux/ecryptfs.h', in order to be available for all kernel subsystems. Signed-off-by: Roberto Sassu <roberto.sassu@polito.it> Acked-by: Gianluca Ramunno <ramunno@polito.it> Acked-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>	2011-06-27 09:11:02 -04:00
Jan Kara	bb189247f3	jbd: Fix oops in journal_remove_journal_head() journal_remove_journal_head() can oops when trying to access journal_head returned by bh2jh(). This is caused for example by the following race: TASK1 TASK2 journal_commit_transaction() ... processing t_forget list __journal_refile_buffer(jh); if (!jh->b_transaction) { jbd_unlock_bh_state(bh); journal_try_to_free_buffers() journal_grab_journal_head(bh) jbd_lock_bh_state(bh) __journal_try_to_free_buffer() journal_put_journal_head(jh) journal_remove_journal_head(bh); journal_put_journal_head() in TASK2 sees that b_jcount == 0 and buffer is not part of any transaction and thus frees journal_head before TASK1 gets to doing so. Note that even buffer_head can be released by try_to_free_buffers() after journal_put_journal_head() which adds even larger opportunity for oops (but I didn't see this happen in reality). Fix the problem by making transactions hold their own journal_head reference (in b_jcount). That way we don't have to remove journal_head explicitely via journal_remove_journal_head() and instead just remove journal_head when b_jcount drops to zero. The result of this is that [__]journal_refile_buffer(), [__]journal_unfile_buffer(), and __journal_remove_checkpoint() can free journal_head which needs modification of a few callers. Also we have to be careful because once journal_head is removed, buffer_head might be freed as well. So we have to get our own buffer_head reference where it matters. Signed-off-by: Jan Kara <jack@suse.cz>	2011-06-27 11:44:37 +02:00
Linus Torvalds	258e43fdb0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: mark CONFIG_CIFS_NFSD_EXPORT as BROKEN cifs: free blkcipher in smbhash	2011-06-26 19:40:31 -07:00
Linus Torvalds	804a007f54	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: cifs: propagate errors from cifs_get_root() to mount(2) cifs: tidy cifs_do_mount() up a bit cifs: more breakage on mount failures cifs: close sget() races cifs: pull freeing mountdata/dropping nls/freeing cifs_sb into cifs_umount() cifs: move cifs_umount() call into ->kill_sb() cifs: pull cifs_mount() call up sanitize cifs_umount() prototype cifs: initialize ->tlink_tree in cifs_setup_cifs_sb() cifs: allocate mountdata earlier cifs: leak on mount if we share superblock cifs: don't pass superblock to cifs_mount() cifs: don't leak nls on mount failure cifs: double free on mount failure take bdi setup/destruction into cifs_mount/cifs_umount Acked-by: Steve French <smfrench@gmail.com>	2011-06-26 19:39:22 -07:00
Lukas Czerner	2c2ea9451f	ext3: Return -EINVAL when start is beyond the end of fs in ext3_trim_fs() We should return -EINVAL when the FITRIM parameters are not sane, but currently we are exiting silently if start is beyond the end of the file system. This commit fixes this so we return -EINVAL as other file systems do. Signed-off-by: Lukas Czerner <lczerner@redhat.com> CC: Jan Kara <jack@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz>	2011-06-25 17:29:53 +02:00
H Hartley Sweeten	81fe8c62fe	ext3/ioctl.c: silence sparse warnings about different address spaces The 'from' argument for copy_from_user and the 'to' argument for copy_to_user should both be tagged as __user address space. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz>	2011-06-25 17:29:53 +02:00
Jan Kara	ee3e77f180	ext3: Improve truncate error handling New truncate calling convention allows us to handle errors from ext3_block_truncate_page(). So reorganize the code so that ext3_block_truncate_page() is called before we change inode size. This also removes unnecessary block zeroing from error recovery after failed buffered writes (zeroing isn't needed because we could have never written non-zero data to disk). We have to be careful and keep zeroing in direct IO write error recovery because there we might have already overwritten end of the last file block. Signed-off-by: Jan Kara <jack@suse.cz>	2011-06-25 17:29:52 +02:00
Jan Kara	ad95c5e9bc	ext3: Fix oops in ext3_try_to_allocate_with_rsv() Block allocation is called from two places: ext3_get_blocks_handle() and ext3_xattr_block_set(). These two callers are not necessarily synchronized because xattr code holds only xattr_sem and i_mutex, and ext3_get_blocks_handle() may hold only truncate_mutex when called from writepage() path. Block reservation code does not expect two concurrent allocations to happen to the same inode and thus assertions can be triggered or reservation structure corruption can occur. Fix the problem by taking truncate_mutex in xattr code to serialize allocations. CC: Sage Weil <sage@newdream.net> CC: stable@kernel.org Reported-by: Fyodor Ustinov <ufm@ufm.su> Signed-off-by: Jan Kara <jack@suse.cz>	2011-06-25 17:29:52 +02:00
Ding Dinghua	bd5c9e1854	jbd: fix a bug of leaking jh->b_jcount journal_get_create_access should drop jh->b_jcount in error handling path Signed-off-by: Ding Dinghua <dingdinghua@nrchpc.ac.cn> Signed-off-by: Jan Kara <jack@suse.cz>	2011-06-25 17:29:51 +02:00
Jan Kara	05713082ab	jbd: remove dependency on __GFP_NOFAIL The callers of start_this_handle() (or better ext3_journal_start()) are not really prepared to handle allocation failures. Such failures can for example result in silent data loss when it happens in ext3_..._writepage(). OTOH __GFP_NOFAIL is going away so we just retry allocation in start_this_handle(). This loop is potentially dangerous because the oom killer cannot be invoked for GFP_NOFS allocation, so there is a potential for infinitely looping. But still this is better than silent data loss. Signed-off-by: Jan Kara <jack@suse.cz>	2011-06-25 17:29:51 +02:00
Jan Kara	40680f2fa4	ext3: Convert ext3 to new truncate calling convention Mostly trivial conversion. We fix a bug that IS_IMMUTABLE and IS_APPEND files could not be truncated during failed writes as we change the code. In fact the test is not needed at all because both IS_IMMUTABLE and IS_APPEND is tested in upper layers in do_sys_[f]truncate(), may_write(), etc. Signed-off-by: Jan Kara <jack@suse.cz>	2011-06-25 17:29:51 +02:00
Lukas Czerner	99cb1a318c	jbd: Add fixed tracepoints This commit adds fixed tracepoint for jbd. It has been based on fixed tracepoints for jbd2, however there are missing those for collecting statistics, since I think that it will require more intrusive patch so I should have its own commit, if someone decide that it is needed. Also there are new tracepoints in __journal_drop_transaction() and journal_update_superblock(). The list of jbd tracepoints: jbd_checkpoint jbd_start_commit jbd_commit_locking jbd_commit_flushing jbd_commit_logging jbd_drop_transaction jbd_end_commit jbd_do_submit_data jbd_cleanup_journal_tail jbd_update_superblock_end Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz>	2011-06-25 17:29:51 +02:00
Lukas Czerner	785c4bcc0d	ext3: Add fixed tracepoints This commit adds fixed tracepoints to the ext3 code. It is based on ext4 tracepoints, however due to the differences of both file systems, there are some tracepoints missing (those for delaloc and for multi-block allocator) and there are some ext3 specific as well (for reservation windows). Here is a list: ext3_free_inode ext3_request_inode ext3_allocate_inode ext3_evict_inode ext3_drop_inode ext3_mark_inode_dirty ext3_write_begin ext3_ordered_write_end ext3_writeback_write_end ext3_journalled_write_end ext3_ordered_writepage ext3_writeback_writepage ext3_journalled_writepage ext3_readpage ext3_releasepage ext3_invalidatepage ext3_discard_blocks ext3_request_blocks ext3_allocate_blocks ext3_free_blocks ext3_sync_file_enter ext3_sync_file_exit ext3_sync_fs ext3_rsv_window_add ext3_discard_reservation ext3_alloc_new_reservation ext3_reserved ext3_forget ext3_read_block_bitmap ext3_direct_IO_enter ext3_direct_IO_exit ext3_unlink_enter ext3_unlink_exit ext3_truncate_enter ext3_truncate_exit ext3_get_blocks_enter ext3_get_blocks_exit ext3_load_inode Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz>	2011-06-25 17:29:51 +02:00
Josef Bacik	9b90f51353	Btrfs: make sure to update total_bitmaps when freeing cache V3 A user reported this bug again where we have more bitmaps than we are supposed to. This is because we failed to load the free space cache, but don't update the ctl->total_bitmaps counter when we remove entries from the tree. This patch fixes this problem and we should be good to go again. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-25 09:31:06 -04:00
Ilya Dryomov	e0f5406727	Btrfs: fix type mismatch in find_free_extent() data parameter should be u64 because a full-sized chunk flags field is passed instead of 0/1 for distinguishing data from metadata. All underlying functions expect u64. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-25 09:31:06 -04:00
Al Viro	9403c9c598	cifs: propagate errors from cifs_get_root() to mount(2) ... instead of just failing with -EINVAL Acked-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-24 18:39:43 -04:00
Al Viro	5c4f1ad7c6	cifs: tidy cifs_do_mount() up a bit Acked-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-24 18:39:42 -04:00
Al Viro	fa18f1bdce	cifs: more breakage on mount failures if cifs_get_root() fails, we end up with ->mount() returning NULL, which is not what callers expect. Moreover, in case of superblock reuse we end up leaking a superblock reference... Acked-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-24 18:39:42 -04:00
Al Viro	ee01a14d9d	cifs: close sget() races have ->s_fs_info set by the set() callback passed to sget() Acked-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-24 18:39:42 -04:00
Al Viro	d757d71bfc	cifs: pull freeing mountdata/dropping nls/freeing cifs_sb into cifs_umount() all callers of cifs_umount() proceed to do the same thing; pull it into cifs_umount() itself. Acked-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-24 18:39:42 -04:00
Al Viro	98ab494dd1	cifs: move cifs_umount() call into ->kill_sb() instead of calling it manually in case if cifs_read_super() fails to set ->s_root, just call it from ->kill_sb(). cifs_put_super() is gone now and we have cifs_sb shutdown and destruction done after the superblock is gone from ->s_instances. Acked-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-24 18:39:42 -04:00
Al Viro	97d1152ace	cifs: pull cifs_mount() call up ... to the point prior to sget(). Now we have cifs_sb set up early enough. Acked-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-24 18:39:42 -04:00
Al Viro	2a9b99516c	sanitize cifs_umount() prototype a) superblock argument is unused b) it always returns 0 Acked-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-24 18:39:42 -04:00
Al Viro	2ced6f6935	cifs: initialize ->tlink_tree in cifs_setup_cifs_sb() no need to wait until cifs_read_super() and we need it done by the time cifs_mount() will be called. Acked-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-24 18:39:42 -04:00
Al Viro	5d3bc605ca	cifs: allocate mountdata earlier pull mountdata allocation up, so that it won't stand in the way when we lift cifs_mount() to location before sget(). Acked-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-24 18:39:41 -04:00
Al Viro	d687ca380f	cifs: leak on mount if we share superblock cifs_sb and nls end up leaked... Acked-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-24 18:39:41 -04:00
Al Viro	2c6292ae4b	cifs: don't pass superblock to cifs_mount() To close sget() races we'll need to be able to set cifs_sb up before we get the superblock, so we'll want to be able to do cifs_mount() earlier. Fortunately, it's easy to do - setting ->s_maxbytes can be done in cifs_read_super(), ditto for ->s_time_gran and as for putting MS_POSIXACL into ->s_flags, we can mirror it in ->mnt_cifs_flags until cifs_read_super() is called. Kill unused 'devname' argument, while we are at it... Acked-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-24 18:39:41 -04:00
Al Viro	ca171baaad	cifs: don't leak nls on mount failure if cifs_sb allocation fails, we still need to drop nls we'd stashed into volume_info - the one we would've copied to cifs_sb if we could allocate the latter. Acked-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-24 18:39:41 -04:00
Al Viro	6d6861757d	cifs: double free on mount failure if we get to out_super with ->s_root already set (e.g. with cifs_get_root() failure), we'll end up with cifs_put_super() called and ->mountdata freed twice. We'll also get cifs_sb freed twice and cifs_sb->local_nls dropped twice. The problem is, we can get to out_super both with and without ->s_root, which makes ->put_super() a bad place for such work. Switch to ->kill_sb(), have all that work done there after kill_anon_super(). Unlike ->put_super(), ->kill_sb() is called by deactivate_locked_super() whether we have ->s_root or not. Acked-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-24 18:39:41 -04:00
Al Viro	dd85446619	take bdi setup/destruction into cifs_mount/cifs_umount Acked-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-24 18:39:41 -04:00
Jeff Layton	9b8e072a31	cifs: mark CONFIG_CIFS_NFSD_EXPORT as BROKEN This does not work properly with CIFS as current servers do not enable support for the FILE_OPEN_BY_FILE_ID on SMB NTCreateX and not all NFS clients handle ESTALE. For now, it just plain doesn't work. Mark it BROKEN to discourage distros from enabling it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-06-24 17:33:30 +00:00
Chris Mason	1973f0faeb	Btrfs: make sure to record the transid in new inodes When we create a new inode, we aren't filling in the field that records the transaction that last changed this inode. If we then go to fsync that inode, it will be skipped because the field isn't filled in. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-24 13:13:29 -04:00
Jeff Layton	e4fb0edb7c	cifs: free blkcipher in smbhash This is currently leaked in the rc == 0 case. Reported-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-06-24 17:03:55 +00:00
Linus Torvalds	5220cc9382	Merge branch 'for-linus' of git://git.kernel.dk/linux-block * 'for-linus' of git://git.kernel.dk/linux-block: block: add REQ_SECURE to REQ_COMMON_MASK block: use the passed in @bdev when claiming if partno is zero block: Add __attribute__((format(printf...) and fix fallout block: make disk_block_events() properly wait for work cancellation block: remove non-syncing __disk_block_events() and fold it into disk_block_events() block: don't use non-syncing event blocking in disk_check_events() cfq-iosched: fix locking around ioc->ioc_data assignment	2011-06-24 08:42:35 -07:00
Linus Torvalds	143e859d05	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: fix wsize negotiation to respect max buffer size and active signing (try #4) CIFS: Fix problem with 3.0-rc1 null user mount failure	2011-06-24 08:35:04 -07:00
Jesper Juhl	46e4edbf7e	Remove unneeded version.h includes from fs/ It was pointed out by 'make versioncheck' that some includes of linux/version.h were not needed in fs/ (fs/btrfs/ctree.h and fs/omfs/file.c). This patch removes them. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-06-24 08:34:22 -07:00
Dave Chinner	4a33821236	xfs: prevent bogus assert when trying to remove non-existent attribute If the attribute fork on an inode is in btree format and has multiple levels (i.e node format rather than leaf format), then a lookup failure will trigger an assert failure in xfs_da_path_shift if the flag XFS_DA_OP_OKNOENT is not set. This flag is used to indicate to the directory btree code that not finding an entry is not a fatal error. In the case of doing a lookup for a directory name removal, this is valid as a user cannot insert an arbitrary name to remove from the directory btree. However, in the case of the attribute tree, a user has direct control over the attribute name and can ask for any random name to be removed without any validation. In this case, fsstress is asking for a non-existent user.selinux attribute to be removed, and that is causing xfs_da_path_shift() to fall off the bottom of the tree where it asserts that a lookup failure is allowed. Because the flag is not set, we die a horrible death on a debug enable kernel. Prevent this assert from firing on attribute removes by adding the op_flag XFS_DA_OP_OKNOENT to atribute removal operations. Discovered when testing on a SELinux enabled system by fsstress in test 070 by trying to remove a non-existent user.selinux attribute. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-06-23 22:13:51 -05:00
Dave Chinner	df4368a146	xfs: clear XFS_IDIRTY_RELEASE on truncate down When an inode is truncated down, speculative preallocation is removed from the inode. This should also reset the state bits for controlling whether preallocation is subsequently removed when the file is next closed. The flag is not being cleared, so repeated operations on a file that first involve a truncate (e.g. multiple repeated dd invocations on a file) give different file layouts for the second and subsequent invocations. Fix this by clearing the XFS_IDIRTY_RELEASE state bit when the XFS_ITRUNCATED bit is detected in xfs_release() and hence ensure that speculative delalloc is removed on files that have been truncated down. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-06-23 22:13:46 -05:00
Dave Chinner	778e24bb6d	xfs: reset inode per-lifetime state when recycling it XFS inodes has several per-lifetime state fields that determine the behaviour of the inode. These state fields are not all reset when an inode is reused from the reclaimable state. This can lead to unexpected behaviour of the new inode such as speculative preallocation not being truncated away in the expected manner for local files until the inode is subsequently truncated, freed or cycles out of the cache. It can also lead to an inode being considered to be a filestream inode or having been truncated when that is not the case. Rework the reinitialisation of the inode when it is recycled to ensure that it is pristine before it is reused. While there, also fix the resetting of state flags in the recycling error paths so the inode does not become unreclaimable. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-06-23 22:13:31 -05:00
Jeff Layton	1190f6a067	cifs: fix wsize negotiation to respect max buffer size and active signing (try #4 ) Hopefully last version. Base signing check on CAP_UNIX instead of tcon->unix_ext, also clean up the comments a bit more. According to Hongwei Sun's blog posting here: http://blogs.msdn.com/b/openspecification/archive/2009/04/10/smb-maximum-transmit-buffer-size-and-performance-tuning.aspx CAP_LARGE_WRITEX is ignored when signing is active. Also, the maximum size for a write without CAP_LARGE_WRITEX should be the maxBuf that the server sent in the NEGOTIATE request. Fix the wsize negotiation to take this into account. While we're at it, alter the other wsize definitions to use sizeof(WRITE_REQ) to allow for slightly larger amounts of data to potentially be written per request. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-06-23 17:54:39 +00:00
Linus Torvalds	bccaeafd7c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6: jfs: agstart field must be 64 bits JFS: Don't save agno in the inode jfs: Update agstart when resizing volume jfs: old_agsize should be 64 bits in jfs_extendfs	2011-06-22 21:49:07 -07:00
Pavel Shilovsky	446b23a758	CIFS: Fix problem with 3.0-rc1 null user mount failure Figured it out: it was broken by `b946845a9d` commit - "cifs: cifs_parse_mount_options: do not tokenize mount options in-place". So, as a quick fix I suggest to apply this patch. [PATCH] CIFS: Fix kfree() with constant string in a null user case Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-06-22 21:43:56 +00:00
Tejun Heo	06d984737b	ptrace: s/tracehook_tracer_task()/ptrace_parent()/ tracehook.h is on the way out. Rename tracehook_tracer_task() to ptrace_parent() and move it from tracehook.h to ptrace.h. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: John Johansen <john.johansen@canonical.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-06-22 19:26:29 +02:00
Tejun Heo	4b9d33e6d8	ptrace: kill clone/exec tracehooks At this point, tracehooks aren't useful to mainline kernel and mostly just add an extra layer of obfuscation. Although they have comments, without actual in-kernel users, it is difficult to tell what are their assumptions and they're actually trying to achieve. To mainline kernel, they just aren't worth keeping around. This patch kills the following clone and exec related tracehooks. tracehook_prepare_clone() tracehook_finish_clone() tracehook_report_clone() tracehook_report_clone_complete() tracehook_unsafe_exec() The changes are mostly trivial - logic is moved to the caller and comments are merged and adjusted appropriately. The only exception is in check_unsafe_exec() where LSM_UNSAFE_PTRACE* are OR'd to bprm->unsafe instead of setting it, which produces the same result as the field is always zero on entry. It also tests p->ptrace instead of (p->ptrace & PT_PTRACED) for consistency, which also gives the same result. This doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-06-22 19:26:29 +02:00
Tejun Heo	a288eecce5	ptrace: kill trivial tracehooks At this point, tracehooks aren't useful to mainline kernel and mostly just add an extra layer of obfuscation. Although they have comments, without actual in-kernel users, it is difficult to tell what are their assumptions and they're actually trying to achieve. To mainline kernel, they just aren't worth keeping around. This patch kills the following trivial tracehooks. * Ones testing whether task is ptraced. Replace with ->ptrace test. tracehook_expect_breakpoints() tracehook_consider_ignored_signal() tracehook_consider_fatal_signal() * ptrace_event() wrappers. Call directly. tracehook_report_exec() tracehook_report_exit() tracehook_report_vfork_done() * ptrace_release_task() wrapper. Call directly. tracehook_finish_release_task() * noop tracehook_prepare_release_task() tracehook_report_death() This doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-06-22 19:26:28 +02:00
Linus Torvalds	2992c4bd57	Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: NFS: Fix decode_secinfo_maxsz NFSv4.1: Fix an off-by-one error in pnfs_generic_pg_test NFSv4.1: Fix some issues with pnfs_generic_pg_test NFSv4.1: file layout must consider pg_bsize for coalescing pnfs-obj: No longer needed to take an extra ref at add_device SUNRPC: Ensure the RPC client only quits on fatal signals NFSv4: Fix a readdir regression nfs4.1: mark layout as bad on error path in _pnfs_return_layout nfs4.1: prevent race that allowed use of freed layout in _pnfs_return_layout NFSv4.1: need to put_layout_hdr on _pnfs_return_layout error path NFS: (d)printks should use %zd for ssize_t arguments NFSv4.1: fix break condition in pnfs_find_lseg nfs4.1: fix several problems with _pnfs_return_layout NFSv4.1: allow zero fh array in filelayout decode layout NFSv4.1: allow nfs_fhget to succeed with mounted on fileid NFSv4.1: Fix a refcounting issue in the pNFS device id cache NFSv4.1: deprecate headerpadsz in CREATE_SESSION NFS41: do not update isize if inode needs layoutcommit NLM: Don't hang forever on NLM unlock requests NFS: fix umount of pnfs filesystems	2011-06-21 18:20:55 -07:00
Linus Torvalds	890879cfa0	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: jbd2: Fix oops in jbd2_journal_remove_journal_head() jbd2: Remove obsolete parameters in the comments for some jbd2 functions ext4: fixed tracepoints cleanup ext4: use FIEMAP_EXTENT_LAST flag for last extent in fiemap ext4: Fix max file size and logical block counting of extent format file ext4: correct comments for ext4_free_blocks()	2011-06-21 10:22:35 -07:00
Bryan Schumaker	1650add235	NFS: Fix decode_secinfo_maxsz I initially did the calculation in bytes, and not words Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-21 11:54:07 -04:00
Trond Myklebust	19982ba856	NFSv4.1: Fix an off-by-one error in pnfs_generic_pg_test And document what is going on there... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-21 11:54:06 -04:00
Trond Myklebust	8f7d5efbef	NFSv4.1: Fix some issues with pnfs_generic_pg_test 1. If the intention is to coalesce requests 'prev' and 'req' then we have to ensure at least that we have a layout starting at req_offset(prev). 2. If we're only requesting a minimal layout of length desc->pg_count, we need to test the length actually returned by the server before we allow the coalescing to occur. 3. We need to deal correctly with (pgio->lseg == NULL) 4. Fixup the test guarding the pnfs_update_layout. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-21 11:54:05 -04:00
Linus Torvalds	eda0841094	Merge branch 'for-2.6.40' of git://linux-nfs.org/~bfields/linux * 'for-2.6.40' of git://linux-nfs.org/~bfields/linux: nfsd4: fix break_lease flags on nfsd open nfsd: link returns nfserr_delay when breaking lease nfsd: v4 support requires CRYPTO nfsd: fix dependency of nfsd on auth_rpcgss	2011-06-20 20:10:52 -07:00
Linus Torvalds	3669820650	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: devcgroup_inode_permission: take "is it a device node" checks to inlined wrapper fix comment in generic_permission() kill obsolete comment for follow_down() proc_sys_permission() is OK in RCU mode reiserfs_permission() doesn't need to bail out in RCU mode proc_fd_permission() is doesn't need to bail out in RCU mode nilfs2_permission() doesn't need to bail out in RCU mode logfs doesn't need ->permission() at all coda_ioctl_permission() is safe in RCU mode cifs_permission() doesn't need to bail out in RCU mode bad_inode_permission() is safe from RCU mode ubifs: dereferencing an ERR_PTR in ubifs_mount()	2011-06-20 20:09:15 -07:00
Dave Kleikamp	ecc90462b4	jfs: agstart field must be 64 bits The previous patch added the agstart field to jfs_ip, but declared it a long. We need to make sure its 64 bits on every platform. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2011-06-20 17:53:24 -05:00
Benny Halevy	19345cb299	NFSv4.1: file layout must consider pg_bsize for coalescing Otherwise we end up overflowing the rpc buffer size on the receive end. Signed-off-by: Benny Halevy <benny@tonian.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-20 16:12:26 -04:00
Linus Torvalds	90a800de0a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: avoid delayed metadata items during commits btrfs: fix uninitialized return value btrfs: fix wrong reservation when doing delayed inode operations btrfs: Remove unused sysfs code btrfs: fix dereference of ERR_PTR value Btrfs: fix relocation races Btrfs: set no_trans_join after trying to expand the transaction Btrfs: protect the pending_snapshots list with trans_lock Btrfs: fix path leakage on subvol deletion Btrfs: drop the delalloc_bytes check in shrink_delalloc Btrfs: check the return value from set_anon_super	2011-06-20 08:58:53 -07:00
Dave Kleikamp	d31b53e3cd	JFS: Don't save agno in the inode Resizing the file system can result in an in-memory inode being remapped to a different aggregate group (AG). A cached AG number can cause problems when trying to free or allocate inodes. Instead, save the IAG's agstart address and calculate the agno when we need it. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2011-06-20 10:53:46 -05:00
Dave Kleikamp	28e0fa894c	jfs: Update agstart when resizing volume A comment indicates that the IAG's agstart does not need to be updated since it will always point to a block in the same aggregate group, but jfs_fsck isn't so forgiving and reports it as an error. I'm fixing this in jfsutils as well, so either a new kernel or new utilities will be sufficient to fix the problem. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2011-06-20 10:32:46 -05:00
Dave Kleikamp	206b6310fd	jfs: old_agsize should be 64 bits in jfs_extendfs Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2011-06-20 10:30:04 -05:00
Al Viro	8e833fd2e1	fix comment in generic_permission() CAP_DAC_OVERRIDE is enough for MAY_EXEC on directory, even if no exec bits are set. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-20 10:45:56 -04:00
Al Viro	6291176bcd	kill obsolete comment for follow_down() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-20 10:45:49 -04:00
Al Viro	1aec7036d0	proc_sys_permission() is OK in RCU mode nothing blocking there, since all instances of sysctl ->permissions() method are non-blocking - both of them, that is. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-20 10:45:25 -04:00
Al Viro	1d29b5a2ed	reiserfs_permission() doesn't need to bail out in RCU mode nothing blocking other than generic_permission() (and check_acl callback does bail out in RCU mode). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-20 10:45:21 -04:00
Al Viro	cf12791116	proc_fd_permission() is doesn't need to bail out in RCU mode nothing blocking except generic_permission() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-20 10:44:50 -04:00
Al Viro	730e908f35	nilfs2_permission() doesn't need to bail out in RCU mode Nothing blocking except for generic_permission(). Which will DTRT. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-20 10:44:33 -04:00
Al Viro	a63ab94d67	logfs doesn't need ->permission() at all ... and never did, what with its ->permission() being what we do by default when ->permission is NULL... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-20 10:44:26 -04:00
Al Viro	6b419951f1	coda_ioctl_permission() is safe in RCU mode return (mask & MAY_EXEC) ? -EACCES : 0; is non-blocking... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-20 10:44:19 -04:00
Al Viro	ec12781f19	cifs_permission() doesn't need to bail out in RCU mode nothing potentially blocking except generic_permission(), which will DTRT Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-20 10:44:07 -04:00
Al Viro	1712c20dae	bad_inode_permission() is safe from RCU mode return -EIO; is not a blocking operation, thank you very much. Nick, what the hell have you been smoking? Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-20 10:44:00 -04:00
Dan Carpenter	185bf87393	ubifs: dereferencing an ERR_PTR in ubifs_mount() `d251ed271d` "ubifs: fix sget races" left out the goto from this error path so the static checkers complain that we're dereferencing "sb" when it's an ERR_PTR. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-20 10:42:34 -04:00
J. Bruce Fields	105f462210	nfsd4: fix break_lease flags on nfsd open Thanks to Casey Bodley for pointing out that on a read open we pass 0, instead of O_RDONLY, to break_lease, with the result that a read open is treated like a write open for the purposes of lease breaking! Reported-by: Casey Bodley <cbodley@citi.umich.edu> Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-06-20 10:38:01 -04:00
Vitaliy Ivanov	e44ba033c5	treewide: remove duplicate includes Many stupid corrections of duplicated includes based on the output of scripts/checkincludes.pl. Signed-off-by: Vitaliy Ivanov <vitalivanov@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-06-20 16:08:19 +02:00
Boaz Harrosh	df18d127f4	pnfs-obj: No longer needed to take an extra ref at add_device Andy's last device_cache patches, already take an extra reference on the newly inserted device_id. So we can remove it from obj-io. Without this patch the device_ids are leaked. Andy's patches are not in Linus tree yet. So I'm not sure if they are scheduled for this Kernel or the next. This patch should be added as part of these. CC: Andy Adamson <andros@netapp.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-19 14:49:51 -04:00
Linus Torvalds	8816ead9d8	Merge branches 'perf-urgent-for-linus', 'sched-urgent-for-linus', 'timers-urgent-for-linus' and 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tools/perf: Fix static build of perf tool tracing: Fix regression in printk_formats file * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: generic-ipi: Fix kexec boot crash by initializing call_single_queue before enabling interrupts * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clocksource: Make watchdog robust vs. interruption timerfd: Fix wakeup of processes when timer is cancelled on clock change * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, MAINTAINERS: Add x86 MCE people x86, efi: Do not reserve boot services regions within reserved areas	2011-06-19 09:00:18 -07:00
Linus Torvalds	c11760c6d8	isofs: fix bh leak in isofs_fill_super() error case In isofs_fill_super(), when an iso_primary_descriptor is found, it is kept in pri_bh. The error cases don't properly release it. Fix it. Reported-and-tested-by: 김원석 <stanley.will.kim@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-06-18 07:25:42 -07:00
Chris Mason	e999376f09	Btrfs: avoid delayed metadata items during commits Snapshot creation has two phases. One is the initial snapshot setup, and the second is done during commit, while nobody is allowed to modify the root we are snapshotting. The delayed metadata insertion code can break that rule, it does a delayed inode update on the inode of the parent of the snapshot, and delayed directory item insertion. This makes sure to run the pending delayed operations before we record the snapshot root, which avoids corruptions. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-17 16:38:47 -04:00
David Sterba	35a30d7ce5	btrfs: fix uninitialized return value When allocation fails in btrfs_read_fs_root_no_name, ret is not set although it is returned, holding a garbage value. Signed-off-by: David Sterba <dsterba@suse.cz> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-17 14:54:18 -04:00
Miao Xie	19fd294957	btrfs: fix wrong reservation when doing delayed inode operations We have migrated the space for the delayed inode items from trans_block_rsv to global_block_rsv, but we forgot to set trans->block_rsv to global_block_rsv when we doing delayed inode operations, and the following Oops happened: [ 9792.654889] ------------[ cut here ]------------ [ 9792.654898] WARNING: at fs/btrfs/extent-tree.c:5681 btrfs_alloc_free_block+0xca/0x27c [btrfs]() [ 9792.654899] Hardware name: To Be Filled By O.E.M. [ 9792.654900] Modules linked in: btrfs zlib_deflate libcrc32c ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables arc4 rt61pci rt2x00pci rt2x00lib snd_hda_codec_hdmi mac80211 snd_hda_codec_realtek cfg80211 snd_hda_intel edac_core snd_seq rfkill pcspkr serio_raw snd_hda_codec eeprom_93cx6 edac_mce_amd sp5100_tco i2c_piix4 k10temp snd_hwdep snd_seq_device snd_pcm floppy r8169 xhci_hcd mii snd_timer snd soundcore snd_page_alloc ipv6 firewire_ohci pata_acpi ata_generic firewire_core pata_via crc_itu_t radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan] [ 9792.654919] Pid: 2762, comm: rm Tainted: G W 2.6.39+ #1 [ 9792.654920] Call Trace: [ 9792.654922] [<ffffffff81053c4a>] warn_slowpath_common+0x83/0x9b [ 9792.654925] [<ffffffff81053c7c>] warn_slowpath_null+0x1a/0x1c [ 9792.654933] [<ffffffffa038e747>] btrfs_alloc_free_block+0xca/0x27c [btrfs] [ 9792.654945] [<ffffffffa03b8562>] ? map_extent_buffer+0x6e/0xa8 [btrfs] [ 9792.654953] [<ffffffffa038189b>] __btrfs_cow_block+0xfc/0x30c [btrfs] [ 9792.654963] [<ffffffffa0396aa6>] ? btrfs_buffer_uptodate+0x47/0x58 [btrfs] [ 9792.654970] [<ffffffffa0382e48>] ? read_block_for_search+0x94/0x368 [btrfs] [ 9792.654978] [<ffffffffa0381ba9>] btrfs_cow_block+0xfe/0x146 [btrfs] [ 9792.654986] [<ffffffffa03848b0>] btrfs_search_slot+0x14d/0x4b6 [btrfs] [ 9792.654997] [<ffffffffa03b8562>] ? map_extent_buffer+0x6e/0xa8 [btrfs] [ 9792.655022] [<ffffffffa03938e8>] btrfs_lookup_inode+0x2f/0x8f [btrfs] [ 9792.655025] [<ffffffff8147afac>] ? _cond_resched+0xe/0x22 [ 9792.655027] [<ffffffff8147b892>] ? mutex_lock+0x29/0x50 [ 9792.655039] [<ffffffffa03d41b1>] btrfs_update_delayed_inode+0x72/0x137 [btrfs] [ 9792.655051] [<ffffffffa03d4ea2>] btrfs_run_delayed_items+0x90/0xdb [btrfs] [ 9792.655062] [<ffffffffa039a69b>] btrfs_commit_transaction+0x228/0x654 [btrfs] [ 9792.655064] [<ffffffff8106e8da>] ? remove_wait_queue+0x3a/0x3a [ 9792.655075] [<ffffffffa03a2fa5>] btrfs_evict_inode+0x14d/0x202 [btrfs] [ 9792.655077] [<ffffffff81132bd6>] evict+0x71/0x111 [ 9792.655079] [<ffffffff81132de0>] iput+0x12a/0x132 [ 9792.655081] [<ffffffff8112aa3a>] do_unlinkat+0x106/0x155 [ 9792.655083] [<ffffffff81127b83>] ? path_put+0x1f/0x23 [ 9792.655085] [<ffffffff8109c53c>] ? audit_syscall_entry+0x145/0x171 [ 9792.655087] [<ffffffff81128410>] ? putname+0x34/0x36 [ 9792.655090] [<ffffffff8112b441>] sys_unlinkat+0x29/0x2b [ 9792.655092] [<ffffffff81482c42>] system_call_fastpath+0x16/0x1b [ 9792.655093] ---[ end trace 02b696eb02b3f768 ]--- This patch fix it by setting the reservation of the transaction handle to the correct one. Reported-by: Josef Bacik <josef@redhat.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-17 14:54:18 -04:00
Maarten Lankhorst	9fe6a50fb7	btrfs: Remove unused sysfs code Removes code no longer used. The sysfs file itself is kept, because the btrfs developers expressed interest in putting new entries to sysfs. Signed-off-by: Maarten Lankhorst <m.b.lankhorst@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-17 14:54:18 -04:00
David Sterba	3ed4498caf	btrfs: fix dereference of ERR_PTR value smatch reports: btrfs_recover_log_trees error: 'wc.replay_dest' dereferencing possible ERR_PTR() Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-17 14:54:17 -04:00
Chris Mason	e038dca803	Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into for-linus Conflicts: fs/btrfs/transaction.c Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-17 14:16:13 -04:00
Linus Torvalds	01eff85b09	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: make log devices with write back caches work xfs: fix ->mknod() return value on xfs_get_acl() failure	2011-06-17 10:37:41 -07:00
Chris Mason	7585717f30	Btrfs: fix relocation races The recent commit to get rid of our trans_mutex introduced some races with block group relocation. The problem is that relocation needs to do some record keeping about each root, and it was relying on the transaction mutex to coordinate things in subtle ways. This fix adds a mutex just for the relocation code and makes sure it doesn't have a big impact on normal operations. The race is really fixed in btrfs_record_root_in_trans, which is where we step back and wait for the relocation code to finish accounting setup. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-17 13:36:58 -04:00
David Howells	879669961b	KEYS/DNS: Fix ____call_usermodehelper() to not lose the session keyring ____call_usermodehelper() now erases any credentials set by the subprocess_inf::init() function. The problem is that commit `17f60a7da1` ("capabilites: allow the application of capability limits to usermode helpers") creates and commits new credentials with prepare_kernel_cred() after the call to the init() function. This wipes all keyrings after umh_keys_init() is called. The best way to deal with this is to put the init() call just prior to the commit_creds() call, and pass the cred pointer to init(). That means that umh_keys_init() and suchlike can modify the credentials _before_ they are published and potentially in use by the rest of the system. This prevents request_key() from working as it is prevented from passing the session keyring it set up with the authorisation token to /sbin/request-key, and so the latter can't assume the authority to instantiate the key. This causes the in-kernel DNS resolver to fail with ENOKEY unconditionally. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Eric Paris <eparis@redhat.com> Tested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-06-17 09:40:48 -07:00
Linus Torvalds	8b97b21e0f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd * git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd: proc: Fix Oops on stat of /proc/<zombie pid>/ns/net	2011-06-16 15:02:20 -07:00
Trond Myklebust	ee7b75fc4f	NFSv4: Fix a readdir regression Commit `7ebb9315` (NFS: use secinfo when crossing mountpoints) introduces a regression when decoding an NFSv4 readdir entry that sets the rdattr_error field. By treating the resulting value as if it is a decoding error, the current code may cause us to skip valid readdir entries. Reported-by: Andy Adamson <andros@netapp.com> Cc: stable@kernel.org [2.6.39] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-16 13:24:47 -04:00
Linus Torvalds	8dac6bee32	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: AFS: Use i_generation not i_version for the vnode uniquifier AFS: Set s_id in the superblock to the volume name vfs: Fix data corruption after failed write in __block_write_begin() afs: afs_fill_page reads too much, or wrong data VFS: Fix vfsmount overput on simultaneous automount fix wrong iput on d_inode introduced by `e6bc45d65d` Delay struct net freeing while there's a sysfs instance refering to it afs: fix sget() races, close leak on umount ubifs: fix sget races ubifs: split allocation of ubifs_info into a separate function fix leak in proc_set_super()	2011-06-16 10:21:59 -07:00
Christoph Hellwig	a27a263bae	xfs: make log devices with write back caches work There's no reason not to support cache flushing on external log devices. The only thing this really requires is flushing the data device first both in fsync and log commits. A side effect is that we also have to remove the barrier write test during mount, which has been superflous since the new FLUSH+FUA code anyway. Also use the chance to flush the RT subvolume write cache before the fsync commit, which is required for correct semantics. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-06-16 10:52:39 -05:00
David Howells	d6e43f751f	AFS: Use i_generation not i_version for the vnode uniquifier Store the AFS vnode uniquifier in the i_generation field, not the i_version field of the inode struct. i_version can then be given the AFS data version number. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-16 11:44:48 -04:00
David Howells	2e41ae225f	AFS: Set s_id in the superblock to the volume name Set s_id in the superblock to the name of the AFS volume that this superblock corresponds to. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-16 11:44:47 -04:00
Jan Kara	f9f07b6c13	vfs: Fix data corruption after failed write in __block_write_begin() I've got a report of a file corruption from fsxlinux on ext3. The important operations to the page were: mapwrite to a hole partial write to the page read - found the page zeroed from the end of the normal write The culprit seems to be that if get_block() fails in __block_write_begin() (e.g. transient ENOSPC in ext3), the function does ClearPageUptodate(page). Thus when we retry the write, the logic in __block_write_begin() thinks zeroing of the page is needed and overwrites old data. In fact, I don't see why we should ever need to zero the uptodate bit here - either the page was uptodate when we entered __block_write_begin() and it should stay so when we leave it, or it was not uptodate and noone had right to set it uptodate during __block_write_begin() so it remains !uptodate when we leave as well. So just remove clearing of the bit. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-16 11:44:46 -04:00
Anton Blanchard	5e7f23373b	afs: afs_fill_page reads too much, or wrong data afs_fill_page should read the page that is about to be written but the current implementation has a number of issues. If we aren't extending the file we always read PAGE_CACHE_SIZE at offset 0. If we are extending the file we try to read the entire file. Change afs_fill_page to read PAGE_CACHE_SIZE at the right offset, clamped to i_size. While here, avoid calling afs_fill_page when we are doing a PAGE_CACHE_SIZE write. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-16 11:44:46 -04:00
Al Viro	8aef188452	VFS: Fix vfsmount overput on simultaneous automount [Kudos to dhowells for tracking that crap down] If two processes attempt to cause automounting on the same mountpoint at the same time, the vfsmount holding the mountpoint will be left with one too few references on it, causing a BUG when the kernel tries to clean up. The problem is that lock_mount() drops the caller's reference to the mountpoint's vfsmount in the case where it finds something already mounted on the mountpoint as it transits to the mounted filesystem and replaces path->mnt with the new mountpoint vfsmount. During a pathwalk, however, we don't take a reference on the vfsmount if it is the same as the one in the nameidata struct, but do_add_mount() doesn't know this. The fix is to make sure we have a ref on the vfsmount of the mountpoint before calling do_add_mount(). However, if lock_mount() doesn't transit, we're then left with an extra ref on the mountpoint vfsmount which needs releasing. We can handle that in follow_managed() by not making assumptions about what we can and what we cannot get from lookup_mnt() as the current code does. The callers of follow_managed() expect that reference to path->mnt will be grabbed iff path->mnt has been changed. follow_managed() and follow_automount() keep track of whether such reference has been grabbed and assume that it'll happen in those and only those cases that'll have us return with changed path->mnt. That assumption is almost correct - it breaks in case of racing automounts and in even harder to hit race between following a mountpoint and a couple of mount --move. The thing is, we don't need to make that assumption at all - after the end of loop in follow_manage() we can check if path->mnt has ended up unchanged and do mntput() if needed. The BUG can be reproduced with the following test program: #include <stdio.h> #include <sys/types.h> #include <sys/stat.h> #include <unistd.h> #include <sys/wait.h> int main(int argc, char **argv) { int pid, ws; struct stat buf; pid = fork(); stat(argv[1], &buf); if (pid > 0) wait(&ws); return 0; } and the following procedure: (1) Mount an NFS volume that on the server has something else mounted on a subdirectory. For instance, I can mount / from my server: mount warthog:/ /mnt -t nfs4 -r On the server /data has another filesystem mounted on it, so NFS will see a change in FSID as it walks down the path, and will mark /mnt/data as being a mountpoint. This will cause the automount code to be triggered. !!! Do not look inside the mounted fs at this point !!! (2) Run the above program on a file within the submount to generate two simultaneous automount requests: /tmp/forkstat /mnt/data/testfile (3) Unmount the automounted submount: umount /mnt/data (4) Unmount the original mount: umount /mnt At this point the kernel should throw a BUG with something like the following: BUG: Dentry ffff880032e3c5c0{i=2,n=} still in use (1) [unmount of nfs4 0:12] Note that the bug appears on the root dentry of the original mount, not the mountpoint and not the submount because sys_umount() hasn't got to its final mntput_no_expire() yet, but this isn't so obvious from the call trace: [<ffffffff8117cd82>] shrink_dcache_for_umount+0x69/0x82 [<ffffffff8116160e>] generic_shutdown_super+0x37/0x15b [<ffffffffa00fae56>] ? nfs_super_return_all_delegations+0x2e/0x1b1 [nfs] [<ffffffff811617f3>] kill_anon_super+0x1d/0x7e [<ffffffffa00d0be1>] nfs4_kill_super+0x60/0xb6 [nfs] [<ffffffff81161c17>] deactivate_locked_super+0x34/0x83 [<ffffffff811629ff>] deactivate_super+0x6f/0x7b [<ffffffff81186261>] mntput_no_expire+0x18d/0x199 [<ffffffff811862a8>] mntput+0x3b/0x44 [<ffffffff81186d87>] release_mounts+0xa2/0xbf [<ffffffff811876af>] sys_umount+0x47a/0x4ba [<ffffffff8109e1ca>] ? trace_hardirqs_on_caller+0x1fd/0x22f [<ffffffff816ea86b>] system_call_fastpath+0x16/0x1b as do_umount() is inlined. However, you can see release_mounts() in there. Note also that it may be necessary to have multiple CPU cores to be able to trigger this bug. Tested-by: Jeff Layton <jlayton@redhat.com> Tested-by: Ian Kent <raven@themaw.net> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-16 11:28:16 -04:00
Török Edwin	50338b889d	fix wrong iput on d_inode introduced by `e6bc45d65d` Git bisection shows that commit `e6bc45d65d` causes BUG_ONs under high I/O load: kernel BUG at fs/inode.c:1368! [ 2862.501007] Call Trace: [ 2862.501007] [<ffffffff811691d8>] d_kill+0xf8/0x140 [ 2862.501007] [<ffffffff81169c19>] dput+0xc9/0x190 [ 2862.501007] [<ffffffff8115577f>] fput+0x15f/0x210 [ 2862.501007] [<ffffffff81152171>] filp_close+0x61/0x90 [ 2862.501007] [<ffffffff81152251>] sys_close+0xb1/0x110 [ 2862.501007] [<ffffffff814c14fb>] system_call_fastpath+0x16/0x1b A reliable way to reproduce this bug is: Login to KDE, run 'rsnapshot sync', and apt-get install openjdk-6-jdk, and apt-get remove openjdk-6-jdk. The buggy part of the patch is this: struct inode inode = NULL; ..... - if (nd.last.name[nd.last.len]) - goto slashes; inode = dentry->d_inode; - if (inode) - ihold(inode); + if (nd.last.name[nd.last.len] \|\| !inode) + goto slashes; + ihold(inode) ... if (inode) iput(inode); / truncate the inode here */ If nd.last.name[nd.last.len] is nonzero (and thus goto slashes branch is taken), and dentry->d_inode is non-NULL, then this code now does an additional iput on the inode, which is wrong. Fix this by only setting the inode variable if nd.last.name[nd.last.len] is 0. Reference: https://lkml.org/lkml/2011/6/15/50 Reported-by: Norbert Preining <preining@logic.at> Reported-by: Török Edwin <edwintorok@gmail.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Török Edwin <edwintorok@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-16 11:27:39 -04:00
Linus Torvalds	13fca640bb	Revert "fs/exec.c: use BUILD_BUG_ON for VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP" This reverts commit `7f81c8890c`. It turns out that it's not actually a build-time check on x86-64 UML, which does some seriously crazy stuff with VM_STACK_FLAGS. The VM_STACK_FLAGS define depends on the arch-supplied VM_STACK_DEFAULT_FLAGS value, and on x86-64 UML we have arch/um/sys-x86_64/shared/sysdep/vm-flags.h: #define VM_STACK_DEFAULT_FLAGS \ (test_thread_flag(TIF_IA32) ? vm_stack_flags32 : vm_stack_flags) #define VM_STACK_DEFAULT_FLAGS vm_stack_flags (yes, seriously: two different #define's for that thing, with the first one being inside an "#ifdef TIF_IA32") It's possible that it is UML that should just be fixed in this area, but for now let's just undo the (very small) optimization. Reported-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Richard Weinberger <richard@nod.at> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-06-15 21:53:52 -07:00
Michal Hocko	7f81c8890c	fs/exec.c: use BUILD_BUG_ON for VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP Commit `a8bef8ff6e` ("mm: migration: avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks") introduced a BUG_ON() to ensure that VM_STACK_FLAGS and VM_STACK_INCOMPLETE_SETUP do not overlap. The check is a compile time one, so BUILD_BUG_ON is more appropriate. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-06-15 20:03:59 -07:00
Eric W. Biederman	793925334f	proc: Fix Oops on stat of /proc/<zombie pid>/ns/net Don't call iput with the inode half setup to be a namespace filedescriptor. Instead rearrange the code so that we don't initialize ei->ns_ops until after I ns_ops->get succeeds, preventing us from invoking ns_ops->put when ns_ops->get failed. Reported-by: Ingo Saitz <Ingo.Saitz@stud.uni-hannover.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2011-06-15 14:35:29 -07:00
Fred Isaman	9e2dfdb308	nfs4.1: mark layout as bad on error path in _pnfs_return_layout Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-15 14:36:33 -04:00
Josef Bacik	ed0ca14021	Btrfs: set no_trans_join after trying to expand the transaction We can lockup if we try to allow new writers join the transaction and we have flushoncommit set or have a pending snapshot. This is because we set no_trans_join and then loop around and try to wait for ordered extents again. The problem is the ordered endio stuff needs to join the transaction, which it can't do because no_trans_join is set. So instead wait until after this loop to set no_trans_join and then make sure to wait for num_writers == 1 in case anybody got started in between us exiting the loop and setting no_trans_join. This could easily be reproduced by mounting -o flushoncommit and running xfstest 13. It cannot be reproduced with this patch. Thanks, Reported-by: Jim Schutt <jaschut@sandia.gov> Signed-off-by: Josef Bacik <josef@redhat.com>	2011-06-15 13:24:47 -04:00
Josef Bacik	8351583e3f	Btrfs: protect the pending_snapshots list with trans_lock Currently there is nothing protecting the pending_snapshots list on the transaction. We only hold the directory mutex that we are snapshotting and a read lock on the subvol_sem, so we could race with somebody else creating a snapshot in a different directory and end up with list corruption. So protect this list with the trans_lock. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-06-15 13:24:46 -04:00
Josef Bacik	71d7aed014	Btrfs: fix path leakage on subvol deletion The delayed ref patch accidently removed the btrfs_free_path in btrfs_unlink_subvol, this puts it back and means we don't leak a path. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-06-15 13:24:45 -04:00
Fred Isaman	ea0ded748b	nfs4.1: prevent race that allowed use of freed layout in _pnfs_return_layout mark_matching_lsegs_invalid could put the last ref to the layout, so the get_layout_hdr needs to be called first. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-15 12:39:23 -04:00
Benny Halevy	1ed3a8539a	NFSv4.1: need to put_layout_hdr on _pnfs_return_layout error path We always get a reference on the layout header and we rely on nfs4_layoutreturn_release to put it. If we hit an allocation error before starting the rpc proc we bail out early without dereferncing the layout header properly. Signed-off-by: Benny Halevy <benny@tonian.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-15 11:52:15 -04:00
David Howells	c7fd06228b	NFS: (d)printks should use %zd for ssize_t arguments (d)printks should use %zd for ssize_t arguments not %ld, otherwise they might get a warning. I see the following with MN10300. fs/nfs/objlayout/objlayout.c: In function 'objlayout_read_done': fs/nfs/objlayout/objlayout.c:294: warning: format '%ld' expects type 'long int', but argument 3 has type 'ssize_t' Signed-off-by: David Howells <dhowells@redhat.com> cc: Trond Myklebust <Trond.Myklebust@netapp.com> cc: linux-nfs@vger.kernel.org Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-15 11:24:31 -04:00
Benny Halevy	d771e3a43e	NFSv4.1: fix break condition in pnfs_find_lseg The break condition to skip out of the loop got broken when cmp_layout was change. Essentially, we want to stop looking once we know no layout on the remainder of the list can match the first byte of the looked-up range. Reported-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Benny Halevy <benny@tonian.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-15 11:24:31 -04:00
Fred Isaman	a2e1d4f2e5	nfs4.1: fix several problems with _pnfs_return_layout _pnfs_return_layout had the following problems: - it did not call pnfs_free_lseg_list on all paths - it unintentionally did a forgetful return when there was no outstanding io - it raced with concurrent LAYOUTGETS Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-15 11:24:30 -04:00
Andy Adamson	cec765cf58	NFSv4.1: allow zero fh array in filelayout decode layout Signed-off-by: Andy Adamson <andros@netapp.com> cc:stable@kernel.org [2.6.39] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-15 11:24:30 -04:00
Andy Adamson	533eb4611c	NFSv4.1: allow nfs_fhget to succeed with mounted on fileid Commit `28331a46d8` "Ensure we request the ordinary fileid when doing readdirplus" changed the meaning of NFS_ATTR_FATTR_FILEID which used to be set when FATTR4_WORD1_MOUNTED_ON_FILED was requested. Allow nfs_fhget to succeed with only a mounted on fileid when crossing a mountpoint or a referral. Ask for the fileid of the absent file system if mounted_on_fileid is not supported. Signed-off-by: Andy Adamson <andros@netapp.com> cc:stable@kernel.org [2.6.39] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-15 11:24:29 -04:00
Trond Myklebust	1d92a08da2	NFSv4.1: Fix a refcounting issue in the pNFS device id cache When we add something to the global device id cache, we need to bump the reference count, so that the cache itself holds a reference. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-15 11:24:29 -04:00
Benny Halevy	c9c30dd5f7	NFSv4.1: deprecate headerpadsz in CREATE_SESSION We don't support header padding yet so better off ditching it Reported-by: Sid Moore <learnmost@gmail.com> Signed-off-by: Benny Halevy <benny@tonian.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-15 11:24:28 -04:00
Peng Tao	0f66b5984d	NFS41: do not update isize if inode needs layoutcommit nfs_update_inode will update isize if there is no queued pages. For pNFS, layoutcommit is supposed to change file size on server, the same effect as queued pages. nfs_update_inode may be called when dirty pages are written back (nfsi->npages==0) but layoutcommit is not sent, and it will change client file size according to server file size. Then client ends up losing what it just writes back in pNFS path. So we should skip updating client file size if file needs layoutcommit. Signed-off-by: Peng Tao <peng_tao@emc.com> Cc: stable@kernel.org [2.6.39] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-15 11:24:28 -04:00
Trond Myklebust	0b760113a3	NLM: Don't hang forever on NLM unlock requests If the NLM daemon is killed on the NFS server, we can currently end up hanging forever on an 'unlock' request, instead of aborting. Basically, if the rpcbind request fails, or the server keeps returning garbage, we really want to quit instead of retrying. Tested-by: Vasily Averin <vvs@sw.ru> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org	2011-06-15 11:24:27 -04:00
Weston Andros Adamson	9e3bd4e24e	NFS: fix umount of pnfs filesystems Unmounting a pnfs filesystem hangs using filelayout and possibly others. This fixes the use of the rcu protected node by making use of a new 'tmpnode' for the temporary purge list. Also, the spinlock shouldn't be held when calling synchronize_rcu(). Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-06-15 11:23:02 -04:00
Steve French	1252b3013b	[CIFS] update cifs version to 1.73 Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-06-14 16:19:54 +00:00
Al Viro	c46a131c0c	xfs: fix ->mknod() return value on xfs_get_acl() failure ->mknod() should return negative on errors and PTR_ERR() gives already negative value... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-06-14 11:02:13 -05:00
Steve French	040d15c867	[CIFS] trivial cleanup fscache cFYI and cERROR messages ... for uniformity and cleaner debug logs. Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-06-14 15:51:18 +00:00
Max Asbock	1123d93963	timerfd: Fix wakeup of processes when timer is cancelled on clock change Currently processes waiting with poll on cancelable timerfd timers are not woken up when the timers are canceled. When the system time is set the clock_was_set() function calls timerfd_clock_was_set() to cancel and wake up processes waiting on potential cancelable timerfd timers. However the wake up currently has no effect because in the case of timerfd_read it is dependent on ctx->ticks not being 0. timerfd_poll also requires ctx->ticks being non zero. As a consequence processes waiting on cancelable timers only get woken up when the timers expire. This patch fixes this by incrementing ctx->ticks before calling wake_up. Signed-off-by: Max Asbock <masbock@linux.vnet.ibm.com> Cc: kay.sievers@vrfy.org Cc: virtuoso@slind.org Cc: johnstul <johnstul@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1307985512.4710.41.camel@w-amax.beaverton.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-06-14 11:46:14 +02:00
Sage Weil	d7f124f129	ceph: fix sync and dio writes across stripe boundaries We were iterating across stripe boundaries properly, but not moving the write buffer pointer forward. This caused us to rewrite the same data after the break. Fix by adjusting the data pointer forward, and recalculating the io and buffer alignment after the break. Signed-off-by: Sage Weil <sage@newdream.net>	2011-06-13 16:26:22 -07:00
Sage Weil	773e9b4426	ceph: fix page alignment corrections dd if=/dev/urandom of=/mnt/fs_depot/dd10 bs=500 seek=8388 count=1 dd if=/mnt/fs_depot/dd10 of=/root/dd10out bs=500 skip=8388 count=1 Reported-by: Henry C Chang <henry.cy.chang@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-06-13 16:26:10 -07:00
Jeff Layton	8d1bca328b	cifs: correctly handle NULL tcon pointer in CIFSTCon Long ago (in commit `00e485b0`), I added some code to handle share-level passwords in CIFSTCon. That code ignored the fact that it's legit to pass in a NULL tcon pointer when connecting to the IPC$ share on the server. This wasn't really a problem until recently as we only called CIFSTCon this way when the server returned -EREMOTE. With the introduction of commit `c1508ca2` however, it gets called this way on every mount, causing an oops when share-level security is in effect. Fix this by simply treating a NULL tcon pointer as if user-level security were in effect. I'm not aware of any servers that protect the IPC$ share with a specific password anyway. Also, add a comment to the top of CIFSTCon to ensure that we don't make the same mistake again. Cc: <stable@kernel.org> Reported-by: Martijn Uffing <mp3project@sarijopen.student.utwente.nl> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-06-13 20:34:34 +00:00
Jeff Layton	3e71551364	cifs: show sec= option in /proc/mounts Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-06-13 20:34:34 +00:00
Jeff Layton	7fdbaa1b8d	cifs: don't allow cifs_reconnect to exit with NULL socket pointer It's possible for the following set of events to happen: cifsd calls cifs_reconnect which reconnects the socket. A userspace process then calls cifs_negotiate_protocol to handle the NEGOTIATE and gets a reply. But, while processing the reply, cifsd calls cifs_reconnect again. Eventually the GlobalMid_Lock is dropped and the reply from the earlier NEGOTIATE completes and the tcpStatus is set to CifsGood. cifs_reconnect then goes through and closes the socket and sets the pointer to zero, but because the status is now CifsGood, the new socket is not created and cifs_reconnect exits with the socket pointer set to NULL. Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is CifsNeedNegotiate, and by making sure that generic_ip_connect is always called at least once in cifs_reconnect. Note that this is not a perfect fix for this issue. It's still possible that the NEGOTIATE reply is handled after the socket has been closed and reconnected. In that case, the socket state will look correct but it no NEGOTIATE was performed on it be for the wrong socket. In that situation though the server should just shut down the socket on the next attempted send, rather than causing the oops that occurs today. Cc: <stable@kernel.org> # .38.x: `fd88ce9`: [CIFS] cifs: clarify the meaning of tcpStatus == CifsGood Reported-and-Tested-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-06-13 20:34:33 +00:00
Pavel Shilovsky	cd51875d53	CIFS: Fix sparse error cifs_sb_master_tlink was declared as inline, but without a definition. Remove the declaration and move the definition up. Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-06-13 20:34:33 +00:00
Jan Kara	de1b794130	jbd2: Fix oops in jbd2_journal_remove_journal_head() jbd2_journal_remove_journal_head() can oops when trying to access journal_head returned by bh2jh(). This is caused for example by the following race: TASK1 TASK2 jbd2_journal_commit_transaction() ... processing t_forget list __jbd2_journal_refile_buffer(jh); if (!jh->b_transaction) { jbd_unlock_bh_state(bh); jbd2_journal_try_to_free_buffers() jbd2_journal_grab_journal_head(bh) jbd_lock_bh_state(bh) __journal_try_to_free_buffer() jbd2_journal_put_journal_head(jh) jbd2_journal_remove_journal_head(bh); jbd2_journal_put_journal_head() in TASK2 sees that b_jcount == 0 and buffer is not part of any transaction and thus frees journal_head before TASK1 gets to doing so. Note that even buffer_head can be released by try_to_free_buffers() after jbd2_journal_put_journal_head() which adds even larger opportunity for oops (but I didn't see this happen in reality). Fix the problem by making transactions hold their own journal_head reference (in b_jcount). That way we don't have to remove journal_head explicitely via jbd2_journal_remove_journal_head() and instead just remove journal_head when b_jcount drops to zero. The result of this is that [__]jbd2_journal_refile_buffer(), [__]jbd2_journal_unfile_buffer(), and __jdb2_journal_remove_checkpoint() can free journal_head which needs modification of a few callers. Also we have to be careful because once journal_head is removed, buffer_head might be freed as well. So we have to get our own buffer_head reference where it matters. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-13 15:38:22 -04:00
Linus Torvalds	ffdb8f1bfb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: unwind canceled flock state ceph: fix ENOENT logic in striped_read ceph: fix short sync reads from the OSD ceph: fix sync vs canceled write ceph: use ihold when we already have an inode ref	2011-06-13 11:21:50 -07:00
Linus Torvalds	c78a9b9b8e	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ftrace: Revert `8ab2b7efd` ftrace: Remove unnecessary disabling of irqs kprobes/trace: Fix kprobe selftest for gcc 4.6 ftrace: Fix possible undefined return code oprofile, dcookies: Fix possible circular locking dependency oprofile: Fix locking dependency in sync_start() oprofile: Free potentially owned tasks in case of errors oprofile, x86: Add comments to IBS LVT offset initialization	2011-06-13 10:45:49 -07:00
Linus Torvalds	33a538833f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: nilfs2: fix problem in setting checkpoint interval nilfs2: fix missing block address termination in btree node shrinking nilfs2: fix incorrect block address termination in node concatenation	2011-06-13 10:32:24 -07:00
Chris Mason	f4c4401621	Btrfs: drop the delalloc_bytes check in shrink_delalloc Even when delalloc_bytes is zero, we may need to sleep while waiting for delalloc space. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-13 11:30:47 -04:00
Chris Mason	ac08aedfa5	Btrfs: check the return value from set_anon_super Al Viro noticed we weren't checking for set_anon_super failures. This adds the required checks. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-13 11:28:50 -04:00
Tejun Heo	d4c208b86b	block: use the passed in @bdev when claiming if partno is zero `6b4517a791` (block: implement bd_claiming and claiming block) introduced claiming block to support O_EXCL blkdev opens properly. bd_start_claiming() looks up the part 0 bdev and starts claiming block. The function assumed that there is only one part 0 bdev and always used bdget_disk(disk, 0) to look it up; unfortunately, this isn't true for some drivers (floppy) which use multiple block devices to denote different operating parameters for the same physical device. There can be multiple part 0 bdev's for the same device number. This incorrect assumption caused the wrong bdev to be used during claiming leading to unbalanced bd_holders as reported in the following bug. https://bugzilla.kernel.org/show_bug.cgi?id=28522 This patch updates bd_start_claiming() such that it uses the bdev specified as argument if its partno is zero. Note that this means that different bdev's can be used for the same device and O_EXCL check can be effectively bypassed. It has always been broken that way and floppy is fortunately on its way out. Leave that breakage alone. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec> Tested-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec> Cc: stable@kernel.org # >= v2.6.36 Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-13 12:45:48 +02:00
H Hartley Sweeten	dd77409338	fs/partitions/check.c: make local symbols static The symbols part_ro_show, part_alignment_offset_show, and part_discard_alignment_show are not used outside this file and should be marked static. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-13 11:47:24 +02:00
Tao Ma	1fb74cda1b	jbd2: Remove obsolete parameters in the comments for some jbd2 functions credits isn't a parameter for jbd2_journal_get_write_access and jbd2_journal_get_undo_access. So remove the corresponding comments. Acked-by: Jan Kara <jack@suse.cz> Cc: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-12 22:44:10 -04:00
Al Viro	a685e08987	Delay struct net freeing while there's a sysfs instance refering to it * new refcount in struct net, controlling actual freeing of the memory * new method in kobj_ns_type_operations (->drop_ns()) * ->current_ns() semantics change - it's supposed to be followed by corresponding ->drop_ns(). For struct net in case of CONFIG_NET_NS it bumps the new refcount; net_drop_ns() decrements it and calls net_free() if the last reference has been dropped. Method renamed to ->grab_current_ns(). * old net_free() callers call net_drop_ns() instead. * sysfs_exit_ns() is gone, along with a large part of callchain leading to it; now that the references stored in ->ns[...] stay valid we do not need to hunt them down and replace them with NULL. That fixes problems in sysfs_lookup() and sysfs_readdir(), along with getting rid of sb->s_instances abuse. Note that struct net shutdown logics has not changed - net_cleanup() is called exactly when it used to be called. The only thing postponed by having a sysfs instance refering to that struct net is actual freeing of memory occupied by struct net. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-12 17:45:41 -04:00
Al Viro	dde194a64b	afs: fix sget() races, close leak on umount * set ->s_fs_info in set() callback passed to sget() * allocate the thing and set it up enough for afs_test_super() before making it visible * have it freed in ->kill_sb() (current tree simply leaks it) * have ->put_super() leave ->s_fs_info->volume alone; it's too early for dropping it; do that from ->kill_sb() after having called kill_anon_super(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-12 17:45:36 -04:00
Al Viro	d251ed271d	ubifs: fix sget races * allocate ubifs_info in ->mount(), fill it enough for sb_test() and set ->s_fs_info to it in set() callback passed to sget(). * do not free it in ->put_super(); do that in ->kill_sb() after we'd done kill_anon_super(). * don't free it in ubifs_fill_super() either - deactivate_locked_super() done by caller when ubifs_fill_super() returns an error will take care of that sucker. * get rid of kludge with passing ubi to ubifs_fill_super() in ->s_fs_info; we only need it in alloc_ubifs_info(), so ubifs_fill_super() will need only ubifs_info. Which it will find in ->s_fs_info just fine, no need to reassign anything... As the result, sb_test() becomes safe to apply to all superblocks that can be found by sget() (and a kludge with temporary use of ->s_fs_info to store a pointer to very different structure goes away). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-12 17:45:34 -04:00
Al Viro	b1c27ab3f9	ubifs: split allocation of ubifs_info into a separate function preparation to ubifs sget() race fixes Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-12 17:45:32 -04:00
Al Viro	ff78fca2a0	fix leak in proc_set_super() set_anon_super() can fail... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-12 17:45:28 -04:00
Linus Torvalds	3c25fa740e	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: use join_transaction in btrfs_evict_inode() Btrfs - use %pU to print fsid Btrfs: fix extent state leak on failed nodatasum reads btrfs: fix unlocked access of delalloc_inodes Btrfs: avoid stack bloat in btrfs_ioctl_fs_info() btrfs: remove 64bit alignment padding to allow extent_buffer to fit into one fewer cacheline Btrfs: clear current->journal_info on async transaction commit Btrfs: make sure to recheck for bitmaps in clusters btrfs: remove unneeded includes from scrub.c btrfs: reinitialize scrub workers btrfs: scrub: errors in tree enumeration Btrfs: don't map extent buffer if path->skip_locking is set Btrfs: unlock the trans lock properly Btrfs: don't map extent buffer if path->skip_locking is set Btrfs: fix duplicate checking logic Btrfs: fix the allocator loop logic Btrfs: fix bitmap regression Btrfs: don't commit the transaction if we dont have enough pinned bytes Btrfs: noinline the cluster searching functions Btrfs: cache bitmaps when searching for a cluster	2011-06-12 11:06:36 -07:00
Li Zefan	30b4caf5d7	Btrfs: use join_transaction in btrfs_evict_inode() The WARN_ON() in start_transaction() was triggered while balancing. The cause is btrfs_relocate_chunk() started a transaction and then called iput() on the inode that stores free space cache, and iput() called btrfs_start_transaction() again. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-11 08:31:55 -04:00
Ryusuke Konishi	071d73cfe5	nilfs2: fix problem in setting checkpoint interval Checkpoint generation interval of nilfs goes wrong after user has changed the interval parameter with nilfs-tune tool. segctord starting. Construction interval = 5 seconds, CP frequency < 30 seconds segctord starting. Construction interval = 0 seconds, CP frequency < 30 seconds This turned out to be caused by a trivial bug in initialization code of log writer. This will fix it. Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2011-06-11 15:51:15 +09:00
Ryusuke Konishi	d40990537c	nilfs2: fix missing block address termination in btree node shrinking nilfs_btree_delete function does not terminate part of virtual block addresses when shrinking the last remaining child node into the root node. The missing address termination causes that dead btree node blocks persist and chip away free disk space. This fixes the leak bug on the btree node deletion. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2011-06-11 15:51:15 +09:00
Ryusuke Konishi	fe744fdb74	nilfs2: fix incorrect block address termination in node concatenation nilfs_btree_delete function wrongly terminates virtual block address of the btree node held by its parent at index 0. When concatenating the index-0 node with its right sibling node, nilfs_btree_delete terminates the block address of index-0 node instead of the right sibling node which should be deleted. This bug not only wears disk space in the long run, but also causes file system corruption. This will fix it. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2011-06-11 15:51:15 +09:00
Ilya Dryomov	22b63a2971	Btrfs - use %pU to print fsid Get rid of FIXME comment. Uuids from dmesg are now the same as uuids given by btrfs-progs. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-10 19:02:04 -04:00
Jan Schmidt	08d2f347e8	Btrfs: fix extent state leak on failed nodatasum reads When encountering an EIO while reading from a nodatasum extent, we insert an error record into the inode's failure tree. btrfs_readpage_end_io_hook returns early for nodatasum inodes. We'd better clear the failure tree in that case, otherwise the kernel complains about BUG extent_state: Objects remaining on kmem_cache_close() on rmmod. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-10 19:00:53 -04:00
Chris Mason	0e735872fb	Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/arne/btrfs-unstable-arne into for-linus	2011-06-10 18:58:08 -04:00
David Sterba	5be76758f3	btrfs: fix unlocked access of delalloc_inodes list_splice_init will make delalloc_inodes empty, but without a spinlock around, this may produce corrupted list head, accessed in many placess, The race window is very tight and nobody seems to have hit it so far. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-10 18:57:11 -04:00
Li Zefan	027ed2f004	Btrfs: avoid stack bloat in btrfs_ioctl_fs_info() The size of struct btrfs_ioctl_fs_info_args is as big as 1KB, so don't declare the variable on stack. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-10 18:57:10 -04:00
richard kennedy	9eb9104c66	btrfs: remove 64bit alignment padding to allow extent_buffer to fit into one fewer cacheline Reorder extent_buffer to remove 8 bytes of alignment padding on 64 bit builds. This shrinks its size to 128 bytes allowing it to fit into one fewer cache lines and allows more objects per slab in its kmem_cache. slabinfo extent_buffer reports :- before:- Sizes (bytes) Slabs ---------------------------------- Object : 136 Total : 123 SlabObj: 136 Full : 121 SlabSiz: 4096 Partial: 0 Loss : 0 CpuSlab: 2 Align : 8 Objects: 30 after :- Object : 128 Total : 4 SlabObj: 128 Full : 2 SlabSiz: 4096 Partial: 0 Loss : 0 CpuSlab: 2 Align : 8 Objects: 32 Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-10 18:57:10 -04:00
Sage Weil	38e880540f	Btrfs: clear current->journal_info on async transaction commit Normally current->jouranl_info is cleared by commit_transaction. For an async snap or subvol creation, though, it runs in a work queue. Clear it in btrfs_commit_transaction_async() to avoid leaking a non-NULL journal_info when we return to userspace. When the actual commit runs in the other thread it won't care that it's current->journal_info is already NULL. Signed-off-by: Sage Weil <sage@newdream.net> Tested-by: Jim Schutt <jaschut@sandia.gov> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-10 16:42:29 -04:00
Chris Mason	38e8788066	Btrfs: make sure to recheck for bitmaps in clusters Josef recently changed the free extent cache to look in the block group cluster for any bitmaps before trying to add a new bitmap for the same offset. This avoids BUG_ON()s due covering duplicate ranges. But it didn't go quite far enough. A given free range might span between one or more bitmaps or free space entries. The code has looping to cover this, but it doesn't check for clustered bitmaps every time. This shuffles our gotos to check for a bitmap in the cluster for every new bitmap entry we try to add. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-10 16:36:57 -04:00
Arne Jansen	6eef312588	btrfs: remove unneeded includes from scrub.c Signed-off-by: Arne Jansen <sensille@gmx.net>	2011-06-10 14:59:52 +02:00
Arne Jansen	632dd772fc	btrfs: reinitialize scrub workers Scrub starts the workers each time a scrub starts and stops them after it finished. This patch adds an initialization for the workers before each start, otherwise the workers behave strangely. Signed-off-by: Arne Jansen <sensille@gmx.net>	2011-06-10 12:14:13 +02:00
Arne Jansen	8c51032f97	btrfs: scrub: errors in tree enumeration due to the semantics of btrfs_search_slot the path can point to an invalid slot when ret > 0. This condition went unnoticed, which in turn could have led to an incomplete scrubbing. Signed-off-by: Arne Jansen <sensille@gmx.net>	2011-06-10 12:14:13 +02:00
Josef Bacik	ad3e34bba4	Btrfs: don't map extent buffer if path->skip_locking is set Arne's scrub stuff exposed a problem with mapping the extent buffer in reada_for_search. He searches the commit root with multiple threads and with skip_locking set, so we can race and overwrite node->map_token since node isn't locked. So fix this so that we only map the extent buffer if we don't already have a map_token and skip_locking isn't set. Without this patch scrub would panic almost immediately, with the patch it doesn't panic anymore. Thanks, Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Josef Bacik <josef@redhat.com>	2011-06-10 12:14:12 +02:00
Mathias Krause	dac853ae89	exec: delay address limit change until point of no return Unconditionally changing the address limit to USER_DS and not restoring it to its old value in the error path is wrong because it prevents us using kernel memory on repeated calls to this function. This, in fact, breaks the fallback of hard coded paths to the init program from being ever successful if the first candidate fails to load. With this patch applied switching to USER_DS is delayed until the point of no return is reached which makes it possible to have a multi-arch rootfs with one arch specific init binary for each of the (hard coded) probed paths. Since the address limit is already set to USER_DS when start_thread() will be invoked, this redundancy can be safely removed. Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-06-09 12:50:05 -07:00
Josef Bacik	3473f3c06a	Btrfs: unlock the trans lock properly In btrfs_wait_for_commit if we came upon a transaction that had committed we just exited, but that's bad since we are holding the trans_lock. So break instead so that the lock is dropped. Thanks, Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <josef@redhat.com>	2011-06-09 10:15:17 -04:00
Josef Bacik	25b8b936ed	Btrfs: don't map extent buffer if path->skip_locking is set Arne's scrub stuff exposed a problem with mapping the extent buffer in reada_for_search. He searches the commit root with multiple threads and with skip_locking set, so we can race and overwrite node->map_token since node isn't locked. So fix this so that we only map the extent buffer if we don't already have a map_token and skip_locking isn't set. Without this patch scrub would panic almost immediately, with the patch it doesn't panic anymore. Thanks, Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Josef Bacik <josef@redhat.com>	2011-06-09 10:12:07 -04:00
Ingo Molnar	5f127133ee	Merge branch 'urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile into perf/urgent	2011-06-09 09:14:34 +02:00
Linus Torvalds	d21131bb0a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: trivial: add space in fsc error message cifs: silence printk when establishing first session on socket CIFS ACL support needs CONFIG_KEYS, so depend on it possible memory corruption in cifs_parse_mount_options() cifs: make CIFS depend on CRYPTO_ECB cifs: fix the kernel release version in the default security warning message	2011-06-08 13:54:29 -07:00
Josef Bacik	f6a398298d	Btrfs: fix duplicate checking logic When merging my code into the integration test the second check for duplicate entries got screwed up. This patch fixes it by dropping ret2 and just using ret for the return value, and checking if we got an error before adding the bitmap to the local list. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-06-08 16:37:29 -04:00
Josef Bacik	723bda2083	Btrfs: fix the allocator loop logic I was testing with empty_cluster = 0 to try and reproduce a problem and kept hitting early enospc panics. This was because our loop logic was a little confused. So this is what I did 1) Make the loop variable the ultimate decider on wether we should loop again isntead of checking to see if we had an uncached bg, empty size or empty cluster. 2) Increment loop before checking to see what we are on to make the loop definitions make more sense. 3) If we are on the chunk alloc loop don't set empty_size/empty_cluster to 0 unless we didn't actually allocate a chunk. If we did allocate a chunk we should be able to easily setup a new cluster so clearing empty_size/empty_cluster makes us less efficient. This kept me from hitting panics while trying to reproduce the other problem. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-06-08 16:37:29 -04:00
Josef Bacik	2cdc342c20	Btrfs: fix bitmap regression In cleaning up the clustering code I accidently introduced a regression by adding bitmap entries to the cluster rb tree. The problem is if we've maxed out the number of bitmaps we can have for the block group we can only add free space to the bitmaps, but since the bitmap is on the cluster we can't find it and we try to create another one. This would result in a panic because the total bitmaps was bigger than the max bitmaps that were allowed. This patch fixes this by checking to see if we have a cluster, and then looking at the cluster rb tree to see if it has a bitmap entry and if it does and that space belongs to that bitmap, go ahead and add it to that bitmap. I could hit this panic every time with an fs_mark test within a couple of minutes. With this patch I no longer hit the panic and fs_mark goes to completion. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-06-08 16:37:28 -04:00
Josef Bacik	f2bb8f5cfb	Btrfs: don't commit the transaction if we dont have enough pinned bytes I noticed when running an enospc test that we would get stuck committing the transaction in check_data_space even though we truly didn't have enough space. So check to see if bytes_pinned is bigger than num_bytes, if it's not don't commit the transaction. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-06-08 15:08:31 -04:00
Josef Bacik	3de85bb95c	Btrfs: noinline the cluster searching functions When profiling the find cluster code it's hard to tell where we are spending our time because the bitmap and non-bitmap functions get inlined by the compiler, so make that not happen. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-06-08 15:08:30 -04:00
Josef Bacik	86d4a77ba3	Btrfs: cache bitmaps when searching for a cluster If we are looking for a cluster in a particularly sparse or fragmented block group, we will do a lot of looping through the free space tree looking for various things, and if we need to look at bitmaps we will endup doing the whole dance twice. So instead add the bitmap entries to a temporary list so if we have to do the bitmap search we can just look through the list of entries we've found quickly instead of having to loop through the entire tree again. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-06-08 15:08:28 -04:00
Jeff Layton	83fb086e0e	cifs: trivial: add space in fsc error message Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-06-08 16:03:29 +00:00
Sage Weil	0c1f91f271	ceph: unwind canceled flock state If we request a lock and then abort (e.g., ^C), we need to send a matching unlock request to the MDS to unwind our lock attempt to avoid indefinitely blocking other clients. Reported-by: Brian Chrisman <brchrisman@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-06-07 21:36:45 -07:00
Sage Weil	0e98728fa3	ceph: fix ENOENT logic in striped_read Getting ENOENT is equivalent to reading 0 bytes. Make that correction before setting up the hit_stripe and was_short flags. Fixes the following case: dd if=/dev/zero of=/mnt/fs_depot/dd3 bs=1 seek=1048576 count=0 dd if=/mnt/fs_depot/dd3 of=/root/ddout1 skip=8 bs=500 count=2 iflag=direct Reported-by: Henry C Chang <henry.cy.chang@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-06-07 21:34:16 -07:00
Sage Weil	c3cd62839a	ceph: fix short sync reads from the OSD If we get a short read from the OSD because the object is small, we need to zero the remainder of the buffer. For O_DIRECT reads, the attempted range is not trimmed to i_size by the VFS, so we were actually looping indefinitely. Fix by trimming by i_size, and the unconditionally zeroing the trailing range. Reported-by: Jeff Wu <cpwu@tnsoft.com.cn> Signed-off-by: Sage Weil <sage@newdream.net>	2011-06-07 21:34:14 -07:00
Sage Weil	70b666c3b4	ceph: use ihold when we already have an inode ref We should use ihold whenever we already have a stable inode ref, even when we aren't holding i_lock. This avoids adding new and unnecessary locking dependencies. Signed-off-by: Sage Weil <sage@newdream.net>	2011-06-07 21:34:11 -07:00
Linus Torvalds	9c125d2af8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6: fat: Fix corrupt inode flags when remove ATTR_SYS flag	2011-06-07 19:04:21 -07:00
Linus Torvalds	d205df9955	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: GFS2: Processes waiting on inode glock that no processes are holding	2011-06-07 18:44:10 -07:00
Linus Torvalds	8397345172	Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: vfs: make unlink() and rmdir() return ENOENT in preference to EROFS lmLogOpen() broken failure exit usb: remove bad dput after dentry_unhash more conservative S_NOSEC handling	2011-06-07 18:36:59 -07:00
Wu Fengguang	e84d0a4f8e	writeback: trace event writeback_queue_io Note that it adds a little overheads to account the moved/enqueued inodes from b_dirty to b_io. The "moved" accounting may be later used to limit the number of inodes that can be moved in one shot, in order to keep spinlock hold time under control. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-06-08 08:25:23 +08:00
Wu Fengguang	251d6a471c	writeback: trace event writeback_single_inode It is valuable to know how the dirty inodes are iterated and their IO size. "writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC\|I_SYNC age=414 index=0 to_write=1024 wrote=0" - "state" reflects inode->i_state at the end of writeback_single_inode() - "index" reflects mapping->writeback_index after the ->writepages() call - "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode() - "wrote" is the number of pages actually written v2: add trace event writeback_single_inode_requeue as proposed by Dave. CC: Dave Chinner <david@fromorbit.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-06-08 08:25:23 +08:00
Wu Fengguang	846d5a091b	writeback: remove .nonblocking and .encountered_congestion Remove two unused struct writeback_control fields: .encountered_congestion (completely unused) .nonblocking (never set, checked/showed in XFS,NFS/btrfs) The .for_background check in nfs_write_inode() is also removed btw, as .for_background implies WB_SYNC_NONE. Reviewed-by: Jan Kara <jack@suse.cz> Proposed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-06-08 08:25:23 +08:00
Wu Fengguang	b7a2441f99	writeback: remove writeback_control.more_io When wbc.more_io was first introduced, it indicates whether there are at least one superblock whose s_more_io contains more IO work. Now with the per-bdi writeback, it can be replaced with a simple b_more_io test. Acked-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-06-08 08:25:23 +08:00
Wu Fengguang	e185dda89d	writeback: avoid extra sync work at enqueue time This removes writeback_control.wb_start and does more straightforward sync livelock prevention by setting .older_than_this to prevent extra inodes from being enqueued in the first place. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-06-08 08:25:22 +08:00
Wu Fengguang	e8dfc30582	writeback: elevate queue_io() into wb_writeback() Code refactor for more logical code layout. No behavior change. - remove the mis-named __writeback_inodes_sb() - wb_writeback()/writeback_inodes_wb() will decide when to queue_io() before calling __writeback_inodes_wb() Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-06-08 08:25:21 +08:00
Christoph Hellwig	f758eeabeb	writeback: split inode_wb_list_lock into bdi_writeback.list_lock Split the global inode_wb_list_lock into a per-bdi_writeback list_lock, as it's currently the most contended lock in the system for metadata heavy workloads. It won't help for single-filesystem workloads for which we'll need the I/O-less balance_dirty_pages, but at least we can dedicate a cpu to spinning on each bdi now for larger systems. Based on earlier patches from Nick Piggin and Dave Chinner. It reduces lock contentions to 1/4 in this test case: 10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram lock_stat version 0.3 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- vanilla 2.6.39-rc3: inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23 ------------------ inode_wb_list_lock 2 [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85 inode_wb_list_lock 34 [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49 inode_wb_list_lock 12893 [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0 inode_wb_list_lock 10702 [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a ------------------ inode_wb_list_lock 2 [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85 inode_wb_list_lock 19 [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49 inode_wb_list_lock 5550 [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0 inode_wb_list_lock 8511 [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157 2.6.39-rc3 + patch: &(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37 ------------------------ &(&wb->list_lock)->rlock 10 [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86 &(&wb->list_lock)->rlock 1493 [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150 &(&wb->list_lock)->rlock 3652 [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f &(&wb->list_lock)->rlock 1412 [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223 ------------------------ &(&wb->list_lock)->rlock 3 [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b &(&wb->list_lock)->rlock 6 [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86 &(&wb->list_lock)->rlock 2061 [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf &(&wb->list_lock)->rlock 2629 [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-06-08 08:25:21 +08:00
Wu Fengguang	424b351fe1	writeback: refill b_io iff empty There is no point to carry different refill policies between for_kupdate and other type of works. Use a consistent "refill b_io iff empty" policy which can guarantee fairness in an easy to understand way. A b_io refill will setup a _fixed_ work set with all currently eligible inodes and start a new round of walk through b_io. The "fixed" work set means no new inodes will be added to the work set during the walk. Only when a complete walk over b_io is done, new inodes that are eligible at the time will be enqueued and the walk be started over. This procedure provides fairness among the inodes because it guarantees each inode to be synced once and only once at each round. So all inodes will be free from starvations. This change relies on wb_writeback() to keep retrying as long as we made some progress on cleaning some pages and/or inodes. Without that ability, the old logic on background works relies on aggressively queuing all eligible inodes into b_io at every time. But that's not a guarantee. The below test script completes a slightly faster now: 2.6.39-rc3 2.6.39-rc3-dyn-expire+ ------------------------------------------------ all elapsed 256.043 252.367 stddev 24.381 12.530 tar elapsed 30.097 28.808 dd elapsed 13.214 11.782 #!/bin/zsh cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/ umount /dev/sda7 mkfs.xfs -f /dev/sda7 mount /dev/sda7 /fs echo 3 > /proc/sys/vm/drop_caches tic=$(cat /proc/uptime\|cut -d' ' -f2) cd /fs time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 & time dd if=/dev/zero of=/fs/zero bs=1M count=1000 & wait sync tac=$(cat /proc/uptime\|cut -d' ' -f2) echo elapsed: $((tac - tic)) It maintains roughly the same small vs. large file writeout shares, and offers large files better chances to be written in nice 4M chunks. Analyzes from Dave Chinner in great details: Let's say we have lots of inodes with 100 dirty pages being created, and one large writeback going on. We expire 8 new inodes for every 1024 pages we write back. With the old code, we do: b_more_io (large inode) -> b_io (1l) 8 newly expired inodes -> b_io (1l, 8s) writeback large inode 1024 pages -> b_more_io b_more_io (large inode) -> b_io (8s, 1l) 8 newly expired inodes -> b_io (8s, 1l, 8s) writeback 8 small inodes 800 pages 1 large inode 224 pages -> b_more_io b_more_io (large inode) -> b_io (8s, 1l) 8 newly expired inodes -> b_io (8s, 1l, 8s) ..... Your new code: b_more_io (large inode) -> b_io (1l) 8 newly expired inodes -> b_io (1l, 8s) writeback large inode 1024 pages -> b_more_io (b_io == 8s) writeback 8 small inodes 800 pages b_io empty: (1800 pages written) b_more_io (large inode) -> b_io (1l) 14 newly expired inodes -> b_io (1l, 14s) writeback large inode 1024 pages -> b_more_io (b_io == 14s) writeback 10 small inodes 1000 pages 1 small inode 24 pages -> b_more_io (1l, 1s(24)) writeback 5 small inodes 500 pages b_io empty: (2548 pages written) b_more_io (large inode) -> b_io (1l, 1s(24)) 20 newly expired inodes -> b_io (1l, 1s(24), 20s) ...... Rough progression of pages written at b_io refill: Old code: total large file % of writeback 1024 224 21.9% (fixed) New code: total large file % of writeback 1800 1024 ~55% 2550 1024 ~40% 3050 1024 ~33% 3500 1024 ~29% 3950 1024 ~26% 4250 1024 ~24% 4500 1024 ~22.7% 4700 1024 ~21.7% 4800 1024 ~21.3% 4800 1024 ~21.3% (pretty much steady state from here) Ok, so the steady state is reached with a similar percentage of writeback to the large file as the existing code. Ok, that's good, but providing some evidence that is doesn't change the shared of writeback to the large should be in the commit message ;) The other advantage to this is that we always write 1024 page chunks to the large file, rather than smaller "whatever remains" chunks. CC: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-06-08 08:25:21 +08:00
Wu Fengguang	ba9aa8399f	writeback: the kupdate expire timestamp should be a moving target Dynamically compute the dirty expire timestamp at queue_io() time. writeback_control.older_than_this used to be determined at entrance to the kupdate writeback work. This _static_ timestamp may go stale if the kupdate work runs on and on. The flusher may then stuck with some old busy inodes, never considering newly expired inodes thereafter. This has two possible problems: - It is unfair for a large dirty inode to delay (for a long time) the writeback of small dirty inodes. - As time goes by, the large and busy dirty inode may contain only _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks delaying the expired dirty pages to the end of LRU lists, triggering the evil pageout(). Nevertheless this patch merely addresses part of the problem. v2: keep policy changes inside wb_writeback() and keep the wbc.older_than_this visibility as suggested by Dave. CC: Dave Chinner <david@fromorbit.com> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-06-08 08:25:21 +08:00
Wu Fengguang	e6fb6da2e1	writeback: try more writeback as long as something was written writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that they only populate possibly a subset of eligible inodes into b_io at entrance time. When the queued set of inodes are all synced, they just return, possibly with all queued inode pages written but still wbc.nr_to_write > 0. For kupdate and background writeback, there may be more eligible inodes sitting in b_dirty when the current set of b_io inodes are completed. So it is necessary to try another round of writeback as long as we made some progress in this round. When there are no more eligible inodes, no more inodes will be enqueued in queue_io(), hence nothing could/will be synced and we may safely bail. For example, imagine 100 inodes i0, i1, i2, ..., i90, i91, i99 At queue_io() time, i90-i99 happen to be expired and moved to s_io for IO. When finished successfully, if their total size is less than MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will quit the background work (w/o this patch) while it's still over background threshold. This will be a fairly normal/frequent case I guess. Now that we do tagged sync and update inode->dirtied_when after the sync, this change won't livelock sync(1). I actually tried to write 1 page per 1ms with this command write-and-fsync -n10000 -S 1000 -c 4096 /fs/test and do sync(1) at the same time. The sync completes quickly on ext4, xfs, btrfs. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-06-08 08:25:20 +08:00
Wu Fengguang	cb9bd1159c	writeback: introduce writeback_control.inodes_written The flusher works on dirty inodes in batches, and may quit prematurely if the batch of inodes happen to be metadata-only dirtied: in this case wbc->nr_to_write won't be decreased at all, which stands for "no pages written" but also mis-interpreted as "no progress". So introduce writeback_control.inodes_written to count the inodes get cleaned from VFS POV. A non-zero value means there are some progress on writeback, in which case more writeback can be tried. Acked-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-06-08 08:25:20 +08:00
Wu Fengguang	94c3dcbb0b	writeback: update dirtied_when for synced inode to prevent livelock Explicitly update .dirtied_when on synced inodes, so that they are no longer considered for writeback in the next round. It can prevent both of the following livelock schemes: - while true; do echo data >> f; done - while true; do touch f; done (in theory) The exact livelock condition is, during sync(1): (1) no new inodes are dirtied (2) an inode being actively dirtied On (2), the inode will be tagged and synced with .nr_to_write=LONG_MAX. When finished, it will be redirty_tail()ed because it's still dirty and (.nr_to_write > 0). redirty_tail() won't update its ->dirtied_when on condition (1). The sync work will then revisit it on the next queue_io() and find it eligible again because its old ->dirtied_when predates the sync work start time. We'll do more aggressive "keep writeback as long as we wrote something" logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit `b9543dac5b` ("writeback: avoid livelocking WB_SYNC_ALL writeback") will no longer be enough to stop sync livelock. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-06-08 08:25:20 +08:00
Wu Fengguang	6e6938b6d3	writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage sync(2) is performed in two stages: the WB_SYNC_NONE sync and the WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and do livelock prevention for it, too. Jan's commit `f446daaea9` ("mm: implement writeback livelock avoidance using page tagging") is a partial fix in that it only fixed the WB_SYNC_ALL phase livelock. Although ext4 is tested to no longer livelock with commit `f446daaea9`, it may due to some "redirty_tail() after pages_skipped" effect which is by no means a guarantee for _all_ the file systems. Note that writeback_inodes_sb() is called by not only sync(), they are treated the same because the other callers also need livelock prevention. Impact: It changes the order in which pages/inodes are synced to disk. Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode until finished with the current inode. Acked-by: Jan Kara <jack@suse.cz> CC: Dave Chinner <david@fromorbit.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-06-08 08:25:20 +08:00
Theodore Ts'o	e6bc45d65d	vfs: make unlink() and rmdir() return ENOENT in preference to EROFS If user space attempts to remove a non-existent file or directory, and the file system is mounted read-only, return ENOENT instead of EROFS. Either error code is arguably valid/correct, but ENOENT is a more specific error message. Reported-by: Michael Tokarev <mjt@tls.msk.ru> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-07 08:51:14 -04:00
Al Viro	9054760ff5	lmLogOpen() broken failure exit Callers of lmLogOpen() expect it to return -E... on failure exits, which is what it returns, except for the case of blkdev_get_by_dev() failure. It that case lmLogOpen() return the error with the wrong sign... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2011-06-07 08:50:59 -04:00
Jeff Layton	9c4843ea57	cifs: silence printk when establishing first session on socket When signing is enabled, the first session that's established on a socket will cause a printk like this to pop: CIFS VFS: Unexpected SMB signature This is because the key exchange hasn't happened yet, so the signature field is bogus. Don't try to check the signature on the socket until the first session has been established. Also, eliminate the specific check for SMB_COM_NEGOTIATE since this check covers that case too. Cc: stable@kernel.org Cc: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-06-07 00:57:05 +00:00
Casey Bodley	7d751f6f8c	nfsd: link returns nfserr_delay when breaking lease fix for commit `4795bb37ef`, nfsd: break lease on unlink, link, and rename if the LINK operation breaks a delegation, it returns NFS4ERR_NOENT (which is not a valid error in rfc 5661) instead of NFS4ERR_DELAY. the return value of nfsd_break_lease() in nfsd_link() must be converted from host_err to err Signed-off-by: Casey Bodley <cbodley@citi.umich.edu> Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-06-06 18:46:56 -04:00
Randy Dunlap	be1f4084b4	nfsd: v4 support requires CRYPTO nfsd V4 support uses crypto interfaces, so select CRYPTO to fix build errors in 2.6.39: ERROR: "crypto_destroy_tfm" [fs/nfsd/nfsd.ko] undefined! ERROR: "crypto_alloc_base" [fs/nfsd/nfsd.ko] undefined! Reported-by: Wakko Warner <wakko@animx.eu.org> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-06-06 18:37:35 -04:00
J. Bruce Fields	b084f598df	nfsd: fix dependency of nfsd on auth_rpcgss Commit `b0b0c0a26e` "nfsd: add proc file listing kernel's gss_krb5 enctypes" added an nunnecessary dependency of nfsd on the auth_rpcgss module. It's a little ad hoc, but since the only piece of information nfsd needs from rpcsec_gss_krb5 is a single static string, one solution is just to share it with an include file. Cc: stable@kernel.org Reported-by: Michael Guntsche <mike@it-loops.com> Cc: Kevin Coffman <kwc@citi.umich.edu> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-06-06 15:07:15 -04:00
Darren Salt	243e2dd38e	CIFS ACL support needs CONFIG_KEYS, so depend on it Build fails if CONFIG_KEYS is not selected. Signed-off-by: Darren Salt <linux@youmustbejoking.demon.co.uk> Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-06-06 16:58:16 +00:00
Vasily Averin	957df4535d	possible memory corruption in cifs_parse_mount_options() error path after mountdata check frees uninitialized mountdata_copy Signed-off-by: Vasily Averin <vvs@sw.ru> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-06-06 15:31:29 +00:00
Lukas Czerner	a9c667f8f0	ext4: fixed tracepoints cleanup While creating fixed tracepoints for ext3, basically by porting them from ext4, I found a lot of useless retyping, wrong type usage, useless variable passing and other inconsistencies in the ext4 fixed tracepoint code. This patch cleans the fixed tracepoint code for ext4 and also simplify some of them. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-06 09:51:52 -04:00
Lukas Czerner	c03f8aa9ab	ext4: use FIEMAP_EXTENT_LAST flag for last extent in fiemap Currently we are not marking the extent as the last one (FIEMAP_EXTENT_LAST) if there is a hole at the end of the file. This is because we just do not check for it right now and continue searching for next extent. But at the point we hit the hole at the end of the file, it is too late. This commit adds check for the allocated block in subsequent extent and if there is no more extents (block = EXT_MAX_BLOCKS) just flag the current one as the last one. This behaviour has been spotted unintentionally by 252 xfstest, when the test hangs out, because of wrong loop condition. However on other filesystems (like xfs) it will exit anyway, because we notice the last extent flag and exit. With this patch xfstest 252 does not hang anymore, ext4 fiemap implementation still reports bad extent type in some cases, however this seems to be different issue. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-06 00:06:52 -04:00
Lukas Czerner	f17722f917	ext4: Fix max file size and logical block counting of extent format file Kazuya Mio reported that he was able to hit BUG_ON(next == lblock) in ext4_ext_put_gap_in_cache() while creating a sparse file in extent format and fill the tail of file up to its end. We will hit the BUG_ON when we write the last block (2^32-1) into the sparse file. The root cause of the problem lies in the fact that we specifically set s_maxbytes so that block at s_maxbytes fit into on-disk extent format, which is 32 bit long. However, we are not storing start and end block number, but rather start block number and length in blocks. It means that in order to cover extent from 0 to EXT_MAX_BLOCK we need EXT_MAX_BLOCK+1 to fit into len (because we counting block 0 as well) - and it does not. The only way to fix it without changing the meaning of the struct ext4_extent members is, as Kazuya Mio suggested, to lower s_maxbytes by one fs block so we can cover the whole extent we can get by the on-disk extent format. Also in many places EXT_MAX_BLOCK is used as length instead of maximum logical block number as the name suggests, it is all a bit messy. So this commit renames it to EXT_MAX_BLOCKS and change its usage in some places to actually be maximum number of blocks in the extent. The bug which this commit fixes can be reproduced as follows: dd if=/dev/zero of=/mnt/mp1/file bs=<blocksize> count=1 seek=$((232-2)) sync dd if=/dev/zero of=/mnt/mp1/file bs=<blocksize> count=1 seek=$((232-1)) Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-06 00:05:17 -04:00
Yongqiang Yang	5def136025	ext4: correct comments for ext4_free_blocks() metadata is not parameter of ext4_free_blocks() any more. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-05 23:26:40 -04:00
Linus Torvalds	e6ece70732	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits) btrfs: fix uninitialized variable warning btrfs: add helper for fs_info->closing Btrfs: add mount -o inode_cache btrfs: scrub: add explicit plugging btrfs: use btrfs_ino to access inode number Btrfs: don't save the inode cache if we are deleting this root btrfs: false BUG_ON when degraded Btrfs: don't save the inode cache in non-FS roots Btrfs: make sure we don't overflow the free space cache crc page Btrfs: fix uninit variable in the delayed inode code btrfs: scrub: don't reuse bios and pages Btrfs: leave spinning on lookup and map the leaf Btrfs: check for duplicate entries in the free space cache Btrfs: don't try to allocate from a block group that doesn't have enough space Btrfs: don't always do readahead Btrfs: try not to sleep as much when doing slow caching Btrfs: kill BTRFS_I(inode)->block_group Btrfs: don't look at the extent buffer level 3 times in a row Btrfs: map the node block when looking for readahead targets Btrfs: set range_start to the right start in count_range_bits ...	2011-06-05 06:17:23 +09:00
Tejun Heo	6dfca32984	job control: make task_clear_jobctl_pending() clear TRAPPING automatically JOBCTL_TRAPPING indicates that ptracer is waiting for tracee to (re)transit into TRACED. task_clear_jobctl_pending() must be called when either tracee enters TRACED or the transition is cancelled for some reason. The former is achieved by explicitly calling task_clear_jobctl_pending() in ptrace_stop() and the latter by calling it at the end of do_signal_stop(). Calling task_clear_jobctl_trapping() at the end of do_signal_stop() limits the scope TRAPPING can be used and is fragile in that seemingly unrelated changes to tracee's control flow can lead to stuck TRAPPING. We already have task_clear_jobctl_pending() calls on those cancelling events to clear JOBCTL_STOP_PENDING. Cancellations can be handled by making those call sites use JOBCTL_PENDING_MASK instead and updating task_clear_jobctl_pending() such that task_clear_jobctl_trapping() is called automatically if no stop/trap is pending. This patch makes the above changes and removes the fallback task_clear_jobctl_trapping() call from do_signal_stop(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-06-04 18:17:11 +02:00
Tejun Heo	3759a0d94c	job control: introduce JOBCTL_PENDING_MASK and task_clear_jobctl_pending() This patch introduces JOBCTL_PENDING_MASK and replaces task_clear_jobctl_stop_pending() with task_clear_jobctl_pending() which takes an extra @mask argument. JOBCTL_PENDING_MASK is currently equal to JOBCTL_STOP_PENDING but future patches will add more bits. recalc_sigpending_tsk() is updated to use JOBCTL_PENDING_MASK instead. task_clear_jobctl_pending() takes @mask which in subset of JOBCTL_PENDING_MASK and clears the relevant jobctl bits. If JOBCTL_STOP_PENDING is set, other STOP bits are cleared together. All task_clear_jobctl_stop_pending() users are updated to call task_clear_jobctl_pending() with JOBCTL_STOP_PENDING which is functionally identical to task_clear_jobctl_stop_pending(). This patch doesn't cause any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-06-04 18:17:10 +02:00
Tejun Heo	a8f072c1d6	job control: rename signal->group_stop and flags to jobctl and update them signal->group_stop currently hosts mostly group stop related flags; however, it's gonna be used for wider purposes and the GROUP_STOP_ flag prefix becomes confusing. Rename signal->group_stop to signal->jobctl and rename all GROUP_STOP_* flags to JOBCTL_. Bit position macros JOBCTL__BIT are defined and JOBCTL_* flags are defined in terms of them to allow using bitops later. While at it, reassign JOBCTL_TRAPPING to bit 22 to better accomodate future additions. This doesn't cause any functional change. -v2: JOBCTL_*_BIT macros added as suggested by Linus. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-06-04 18:17:09 +02:00
David Sterba	aa0467d8d2	btrfs: fix uninitialized variable warning With Linus' tree, today's linux-next build (powercp ppc64_defconfig) produced this warning: fs/btrfs/delayed-inode.c: In function 'btrfs_delayed_update_inode': fs/btrfs/delayed-inode.c:1598:6: warning: 'ret' may be used uninitialized in this function Introduced by commit `16cdcec736` ("btrfs: implement delayed inode items operation"). This fixes a bug in btrfs_update_inode(): if the returned value from btrfs_delayed_update_inode is a nonzero garbage, inode stat data are not updated and several call paths may hit a BUG_ON or fail with strange code. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David Sterba <dsterba@suse.cz>	2011-06-04 08:11:38 -04:00
David Sterba	7841cb2898	btrfs: add helper for fs_info->closing wrap checking of filesystem 'closing' flag and fix a few missing memory barriers. Signed-off-by: David Sterba <dsterba@suse.cz>	2011-06-04 08:11:22 -04:00
Chris Mason	4b9465cb9e	Btrfs: add mount -o inode_cache This makes the inode map cache default to off until we fix the overflow problem when the free space crcs don't fit inside a single page. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-04 08:03:47 -04:00
Arne Jansen	e7786c3ae5	btrfs: scrub: add explicit plugging With the removal of the implicit plugging scrub ends up doing more and smaller I/O than necessary. This patch adds explicit plugging per chunk. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-04 08:03:46 -04:00
David Sterba	a4689d2bd3	btrfs: use btrfs_ino to access inode number commit `4cb5300bc` ("Btrfs: add mount -o auto_defrag") accesses inode number directly while it should use the helper with the new inode number allocator. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-04 08:03:46 -04:00
Josef Bacik	d132a538d2	Btrfs: don't save the inode cache if we are deleting this root With xfstest 254 I can panic the box every time with the inode number caching stuff on. This is because we clean the inodes out when we delete the subvolume, but then we write out the inode cache which adds an inode to the subvolume inode tree, and then when it gets evicted again the root gets added back on the dead roots list and is deleted again, so we have a double free. To stop this from happening just return 0 if refs is 0 (and we're not the tree root since tree root always has refs of 0). With this fix 254 no longer panics. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-04 08:03:45 -04:00
Arne Jansen	5f3f302a6f	btrfs: false BUG_ON when degraded In degraded mode the struct btrfs_device of missing devs don't have device->name set. A kstrdup of NULL correctly returns NULL. Don't BUG in this case. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-04 08:03:44 -04:00
liubo	ca456ae280	Btrfs: don't save the inode cache in non-FS roots This adds extra checks to make sure the inode map we are caching really belongs to a FS root instead of a special relocation tree. It prevents crashes during balancing operations. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-04 08:03:44 -04:00
Chris Mason	211f96c24f	Btrfs: make sure we don't overflow the free space cache crc page The free space cache uses only one page for crcs right now, which means we can't have a cache file bigger than the crcs we can fit in the first page. This adds a check to enforce that restriction. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-04 08:03:43 -04:00
Chris Mason	17aca1c987	Btrfs: fix uninit variable in the delayed inode code The nitems counter needs to start at zero Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-04 08:03:43 -04:00
Arne Jansen	1bc8779349	btrfs: scrub: don't reuse bios and pages The current scrub implementation reuses bios and pages as often as possible, allocating them only on start and releasing them when finished. This leads to more problems with the block layer than it's worth. The elevator gets confused when there are more pages added to the bio than bi_size suggests. This patch completely rips out the reuse of bios and pages and allocates them freshly for each submit. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Maosn <chris.mason@oracle.com>	2011-06-04 08:03:17 -04:00
Linus Torvalds	4f1ba49efa	Merge branch 'for-linus' of git://git.kernel.dk/linux-block * 'for-linus' of git://git.kernel.dk/linux-block: block: Use hlist_entry() for io_context.cic_list.first cfq-iosched: Remove bogus check in queue_fail path xen/blkback: potential null dereference in error handling xen/blkback: don't call vbd_size() if bd_disk is NULL block: blkdev_get() should access ->bd_disk only after success CFQ: Fix typo and remove unnecessary semicolon block: remove unwanted semicolons Revert "block: Remove extra discard_alignment from hd_struct." nbd: adjust 'max_part' according to part_shift nbd: limit module parameters to a sane value nbd: pass MSG_* flags to kernel_recvmsg() block: improve the bio_add_page() and bio_add_pc_page() descriptions	2011-06-04 08:11:26 +09:00
Linus Torvalds	3af91a1256	Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6 * 'linux-next' of git://git.infradead.org/ubifs-2.6: UBIFS: fix-up free space earlier UBIFS: intialize LPT earlier UBIFS: assert no fixup when writing a node UBIFS: fix clean znode counter corruption in error cases UBIFS: fix memory leak on error path UBIFS: fix shrinker object count reports UBIFS: fix recovery broken by the previous recovery fix UBIFS: amend ubifs_recover_leb interface UBIFS: introduce a "grouped" journal head flag UBIFS: supress false error messages	2011-06-04 07:59:32 +09:00
Al Viro	9e1f1de02c	more conservative S_NOSEC handling Caching "we have already removed suid/caps" was overenthusiastic as merged. On network filesystems we might have had suid/caps set on another client, silently picked by this client on revalidate, all of that without clearing the S_NOSEC flag. AFAICS, the only reasonably sane way to deal with that is * new superblock flag; unless set, S_NOSEC is not going to be set. * local block filesystems set it in their ->mount() (more accurately, mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than local block ones clear it) * if any network filesystem (or a cluster one) wants to use S_NOSEC, it'll need to set MS_NOSEC in sb->s_flags AND take care to clear S_NOSEC when inode attribute changes are picked from other clients. It's not an earth-shattering hole (anybody that can set suid on another client will almost certainly be able to write to the file before doing that anyway), but it's a bug that needs fixing. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-03 18:24:58 -04:00
Suresh Jayaraman	5f0b23eeba	cifs: make CIFS depend on CRYPTO_ECB When CONFIG_CRYPTO_ECB is not set, trying to mount a CIFS share with NTLM security resulted in mount failure with the following error: "CIFS VFS: could not allocate des crypto API" Seems like a leftover from commit `43988d7`. Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> CC: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-06-03 17:38:44 +00:00
Suresh Jayaraman	c592a70737	cifs: fix the kernel release version in the default security warning message When ntlm security mechanim is used, the message that warns about the upgrade to ntlmv2 got the kernel release version wrong (Blame it on Linus :). Fix it. Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-06-03 15:31:23 +00:00
Ben Gardiner	098011940a	UBIFS: fix-up free space earlier The free space fixup is currently initiated during mount after the call to ubifs_write_master() which results in a write to PEBs; this has been observed with the patch 'assert no fixup when writing a node' applied: Move the free space fixup on mount to before the calls to ubifs_recover_inl_heads() and ubifs_write_master(). This results in no assertions with the previously mentioned patch applied. Artem: tweaked the patch a bit Signed-off-by: Ben Gardiner <bengardiner@nanometrics> Reviewed-by: Matthew L. Creech <mlcreech@gmail.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-06-03 18:12:31 +03:00
Ben Gardiner	781c5717a9	UBIFS: intialize LPT earlier The current 'mount_ubifs()' implementation does not initialize the LPT until the the master node is marked dirty. Move the LPT initialization to before marking the master node dirty. This is a preparation for the next patch which will move the free-space-fixup check to before marking the master node dirty, because we have to fix-up the free space before doing any writes. Artem: massaged the patch and commit message. Signed-off-by: Ben Gardiner <bengardiner@nanometrics.ca> Reviewed-by: Matthew L. Creech <mlcreech@gmail.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-06-03 18:12:31 +03:00
Ben Gardiner	4f1ab9b01d	UBIFS: assert no fixup when writing a node The current free space fixup can result in some writing to the UBI volume when the space_fixup flag is set. To catch instances where UBIFS is writing to the NAND while the space_fixup flag is set, add an assert to ubifs_write_node(). Artem: tweaked the patch, added similar assertion to the write buffer write path. Signed-off-by: Ben Gardiner <bengardiner@nanometrics.ca> Reviewed-by: Matthew L. Creech <mlcreech@gmail.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-06-03 18:12:31 +03:00
Artem Bityutskiy	8370723770	UBIFS: fix clean znode counter corruption in error cases UBIFS maintains per-filesystem and global clean znode counters ('c->clean_zn_cnt' and 'ubifs_clean_zn_cnt'). It is important to maintain correct values there since the shrinker relies on 'ubifs_clean_zn_cnt'. However, in case of failures during commit the counters were corrupted. E.g., if a failure happens in the middle of 'write_index()', then some nodes in the commit list ('c->cnext') are marked as clean, and some are marked as dirty. And the 'ubifs_destroy_tnc_subtree()' frees does not retrun correct count, and we end up with non-zero 'c->clean_zn_cnt' when unmounting. This means that if we have 2 file-sytem and one of them fails, and we unmount it, 'ubifs_clean_zn_cnt' stays incorrect and confuses the shrinker. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-06-03 18:12:31 +03:00
Artem Bityutskiy	812eb25831	UBIFS: fix memory leak on error path UBIFS leaks memory on error path in 'ubifs_jnl_update()' in case of write failure because it forgets to free the 'struct ubifs_dent_node *dent' object. Although the object is small, the alignment can make it large - e.g., 2KiB if the min. I/O unit is 2KiB. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Cc: stable@kernel.org	2011-06-03 18:12:31 +03:00
Artem Bityutskiy	cf610bf419	UBIFS: fix shrinker object count reports Sometimes VM asks the shrinker to return amount of objects it can shrink, and we return the ubifs_clean_zn_cnt in that case. However, it is possible that this counter is negative for a short period of time, due to the way UBIFS TNC code updates it. And I can observe the following warnings sometimes: shrink_slab: ubifs_shrinker+0x0/0x2b7 [ubifs] negative objects to delete nr=-8541616642706119788 This patch makes sure UBIFS never returns negative count of objects. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Cc: stable@kernel.org	2011-06-03 18:12:24 +03:00
Randy Dunlap	a2daff6803	fuse: fix non-ANSI void function notation Fix void function parameter list sparse warning: fs/fuse/inode.c:74:44: warning: non-ANSI function declaration of function 'fuse_alloc_forget' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2011-06-01 16:09:32 +02:00
Artem Bityutskiy	da8b94ea61	UBIFS: fix recovery broken by the previous recovery fix Unfortunately, the recovery fix d1606a59b6be4ea392eabd40d1250aa1eeb19efb (UBIFS: fix extremely rare mount failure) broke recovery. This commit make UBIFS drop the last min. I/O unit in all journal heads, but this is needed only for the GC head. And this does not work for non-GC heads. For example, if suppose we have min. I/O units A and B, and A contains a valid node X, which was fsynced, and then a group of nodes Y which spans the rest of A and B. In this case we'll drop not only Y, but also X, which is obviously incorrect. This patch fixes the issue and additionally makes recovery to drop last min. I/O unit only for the GC head, and leave things as they have been for ages for the other heads - this is safer. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-06-01 12:29:06 +03:00
Artem Bityutskiy	efcfde54ca	UBIFS: amend ubifs_recover_leb interface Instead of passing "grouped" parameter to 'ubifs_recover_leb()' which tells whether the nodes are grouped in the LEB to recover, pass the journal head number and let 'ubifs_recover_leb()' look at the journal head's 'grouped' flag. This patch is a preparation to a further fix where we'll need to know the journal head number for other purposes. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-06-01 12:29:06 +03:00
Artem Bityutskiy	1a0b06997c	UBIFS: introduce a "grouped" journal head flag Journal heads are different in a way how UBIFS writes nodes there. All normal journal heads receive grouped nodes, while the GC journal heads receives ungrouped nodes. This patch adds a 'grouped' flag to 'struct ubifs_jhead' which describes this property. This patch is a preparation to a further recovery fix. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-06-01 12:29:06 +03:00
Artem Bityutskiy	ab75950b11	UBIFS: supress false error messages Commit ab51afe05273741f72383529ef488aa1ea598ec6 was a good clean-up, but it introduced a regression - now UBIFS prints scary error messages during recovery on all corrupted nodes, even though the corruptions are expected (due to a power cut). This patch fixes the issue. Additionally fix a typo in a commentary introduced by the same commit. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-06-01 12:29:05 +03:00
Tejun Heo	4c49ff3fe1	block: blkdev_get() should access ->bd_disk only after success `d4dc210f69` (block: don't block events on excl write for non-optical devices) added dereferencing of bdev->bd_disk to test GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE; however, bdev->bd_disk can be %NULL if open failed which can lead to an oops. Test the flag after testing open was successful, not before. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: David Miller <davem@davemloft.net> Tested-by: David Miller <davem@davemloft.net> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-06-01 08:28:47 +02:00
Akinobu Mita	730e663bd8	ocfs2: use proper little-endian bitops Using __test_and_{set,clear}_bit_le() with ignoring its return value can be replaced with __{set,clear}_bit_le(). Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: ocfs2-devel@oss.oracle.com Signed-off-by: Joel Becker <jlbec@evilplan.org>	2011-05-31 19:03:45 -07:00
Dan Carpenter	87f0d5c8db	ocfs2: null deref on allocation error The original code had a null derefence in the error handling. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Joel Becker <jlbec@evilplan.org>	2011-05-31 19:03:45 -07:00

... 17 18 19 20 21 ...

24897 Commits