linux

Author	SHA1	Message	Date
Josef Bacik	60e7cd3a4b	Btrfs: fix transid verify errors when recovering log tree If we crash with a log, remount and recover that log, and then crash before we can commit another transaction we will get transid verify errors on the next mount. This is because we were not zero'ing out the log when we committed the transaction after recovery. This is ok as long as we commit another transaction at some point in the future, but if you abort or something else goes wrong you can end up in this weird state because the recovery stuff says that the tree log should have a generation+1 of the super generation, which won't be the case of the transaction that was started for recovery. Fix this by removing the check and _always_ zero out the log portion of the super when we commit a transaction. This fixes the transid verify issues I was seeing with my force errors tests. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-10-04 16:02:09 -04:00
Thierry Reding	b2a42f78ab	xfs: Use kmem_free() instead of free() This fixes a build failure caused by calling the free() function which does not exist in the Linux kernel. Signed-off-by: Thierry Reding <treding@nvidia.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `aaaae98022`)	2013-10-04 13:56:12 -05:00
tinguely@sgi.com	9b3b77fe66	xfs: fix memory leak in xlog_recover_add_to_trans Free the memory in error path of xlog_recover_add_to_trans(). Normally this memory is freed in recovery pass2, but is leaked in the error path. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `519ccb81ac`)	2013-10-04 13:56:03 -05:00
Dave Chinner	6d313498f0	xfs: dirent dtype presence is dependent on directory magic numbers The determination of whether a directory entry contains a dtype field originally was dependent on the filesystem having CRCs enabled. This meant that the format for dtype beign enabled could be determined by checking the directory block magic number rather than doing a feature bit check. This was useful in that it meant that we didn't need to pass a struct xfs_mount around to functions that were already supplied with a directory block header. Unfortunately, the introduction of dtype fields into the v4 structure via a feature bit meant this "use the directory block magic number" method of discriminating the dirent entry sizes is broken. Hence we need to convert the places that use magic number checks to use feature bit checks so that they work correctly and not by chance. The current code works on v4 filesystems only because the dirent size roundup covers the extra byte needed by the dtype field in the places where this problem occurs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `367993e7c6`)	2013-10-04 13:55:48 -05:00
Dave Chinner	89c6c89af2	xfs: lockdep needs to know about 3 dquot-deep nesting Michael Semon reported that xfs/299 generated this lockdep warning: ============================================= [ INFO: possible recursive locking detected ] 3.12.0-rc2+ #2 Not tainted --------------------------------------------- touch/21072 is trying to acquire lock: (&xfs_dquot_other_class){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64 but task is already holding lock: (&xfs_dquot_other_class){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&xfs_dquot_other_class); lock(&xfs_dquot_other_class); * DEADLOCK * May be due to missing lock nesting notation 7 locks held by touch/21072: #0: (sb_writers#10){++++.+}, at: [<c11185b6>] mnt_want_write+0x1e/0x3e #1: (&type->i_mutex_dir_key#4){+.+.+.}, at: [<c11078ee>] do_last+0x245/0xe40 #2: (sb_internal#2){++++.+}, at: [<c122c9e0>] xfs_trans_alloc+0x1f/0x35 #3: (&(&ip->i_lock)->mr_lock/1){+.+...}, at: [<c126cd1b>] xfs_ilock+0x100/0x1f1 #4: (&(&ip->i_lock)->mr_lock){++++-.}, at: [<c126cf52>] xfs_ilock_nowait+0x105/0x22f #5: (&dqp->q_qlock){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64 #6: (&xfs_dquot_other_class){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64 The lockdep annotation for dquot lock nesting only understands locking for user and "other" dquots, not user, group and quota dquots. Fix the annotations to match the locking heirarchy we now have. Reported-by: Michael L. Semon <mlsemon35@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit `f112a04971`)	2013-10-04 13:55:33 -05:00
Linus Torvalds	15c83d26e1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse bugfixes from Miklos Szeredi: "This contains two more fixes by Maxim for writeback/truncate races and fixes for RCU walk in fuse_dentry_revalidate()" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: no RCU mode in fuse_access() fuse: readdirplus: fix RCU walk fuse: don't check_submounts_and_drop() in RCU walk fuse: fix fallocate vs. ftruncate race fuse: wait for writeback in fuse_file_fallocate()	2013-10-04 09:06:13 -07:00
Tejun Heo	250f7c3fee	sysfs: introduce [__]sysfs_remove() Given a sysfs_dirent, there is no reason to have multiple versions of removal functions. A function which removes the specified sysfs_dirent and its descendants is enough. This patch intorduces [__}sysfs_remove() which replaces all internal variations of removal functions. This will be the only removal function in the planned new sysfs_dirent based interface. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-10-03 16:38:52 -07:00
Tejun Heo	bcdde7e221	sysfs: make __sysfs_remove_dir() recursive Currently, sysfs directory removal is inconsistent in that it would remove any files directly under it but wouldn't recurse into directories. Thanks to group subdirectories, this doesn't even match with kobject boundaries. sysfs is in the process of being separated out so that it can be used by multiple subsystems and we want to have a consistent behavior - either removal of a sysfs_dirent should remove every descendant entries or none instead of something inbetween. This patch implements proper recursive removal in __sysfs_remove_dir(). The function now walks its subtree in a post-order walk to remove all descendants. This is a behavior change but kobject / driver layer, which currently is the only consumer, has already been updated to handle duplicate removal attempts, so nothing should be broken after this change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-10-03 16:38:52 -07:00
Tejun Heo	26ea12dec0	kobject: grab an extra reference on kobject->sd to allow duplicate deletes sysfs currently has a rather weird behavior regarding removals. A directory removal would delete all files directly under it but wouldn't recurse into subdirectories, which, while a bit inconsistent, seems to make sense at the first glance as each directory is supposedly associated with a kobject and each kobject can take care of the directory deletion; however, this doesn't really hold as we have groups which can be directories without a kobject associated with it and require explicit deletions. We're in the process of separating out sysfs from kboject / driver core and want a consistent behavior. A removal should delete either only the specified node or everything under it. I think it is helpful to support recursive atomic removal and later patches will implement it. Such change means that a sysfs_dirent associated with kobject may be deleted before the kobject itself is removed if one of its ancestor gets removed before it. As sysfs_remove_dir() puts the base ref, we may end up with dangling pointer on descendants. This can be solved by holding an extra reference on the sd from kobject. Acquire an extra reference on the associated sysfs_dirent on directory creation and put it after removal. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-10-03 16:38:52 -07:00
Tejun Heo	d69ac5a0bb	sysfs: remove sysfs_addrm_cxt->parent_sd sysfs_addrm_start/finish() enclose sysfs_dirent additions and deletions and sysfs_addrm_cxt is used to record information necessary to finish the operations. Currently, sysfs_addrm_start() takes @parent_sd, records it in sysfs_addrm_cxt, and assumes that all operations in the block are performed under that @parent_sd. This assumption has been fine until now but we want to make some operations behave recursively and, while having @parent_sd recorded in sysfs_addrm_cxt doesn't necessarily prevents that, it becomes confusing. This patch removes sysfs_addrm_cxt->parent_sd and makes sysfs_add_one() take an explicit @parent_sd parameter. Note that sysfs_remove_one() doesn't need the extra argument as its parent is always known from the target @sd. While at it, add __acquires/releases() notations to sysfs_addrm_start/finish() respectively. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-10-03 16:16:43 -07:00
Linus Torvalds	981d901095	Merge git://git.kvack.org/~bcrl/aio-next Pull aio use-after-free fix from Ben LaHaise. * git://git.kvack.org/~bcrl/aio-next: aio: fix use-after-free in aio_migratepage	2013-10-02 09:38:17 -07:00
Linus Torvalds	517bf8fc21	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs lru leak fix from Al Viro: "The fix in "super: fix for destroy lrus" didn't - they need to be destroyed, all right, but that's the wrong place..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs/super.c: fix lru_list leak for real	2013-10-01 10:28:11 -07:00
Al Viro	c2d22ecd3c	fs/super.c: fix lru_list leak for real Freeing ->s_{inode,dentry}_lru in deactivate_locked_super() is wrong; the right place is destroy_super(). As it is, we leak them if sget() decides that new superblock it has allocated (and never shown to anybody) isn't needed and should be freed. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-10-01 13:11:21 -04:00
Miklos Szeredi	698fa1d163	fuse: no RCU mode in fuse_access() fuse_access() is never called in RCU walk, only on the final component of access(2) and chdir(2)... Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2013-10-01 16:41:23 +02:00
Miklos Szeredi	6314efee3c	fuse: readdirplus: fix RCU walk Doing dput(parent) is not valid in RCU walk mode. In RCU mode it would probably be okay to update the parent flags, but it's actually not necessary most of the time... So only set the FUSE_I_ADVISE_RDPLUS flag on the parent when the entry was recently initialized by READDIRPLUS. This is achieved by setting FUSE_I_INIT_RDPLUS on entries added by READDIRPLUS and only dropping out of RCU mode if this flag is set. FUSE_I_INIT_RDPLUS is cleared once the FUSE_I_ADVISE_RDPLUS flag is set in the parent. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org	2013-10-01 16:41:22 +02:00
Miklos Szeredi	3c70b8eeda	fuse: don't check_submounts_and_drop() in RCU walk If revalidate finds an invalid dentry in RCU walk mode, let the VFS deal with it instead of calling check_submounts_and_drop() which is not prepared for being called from RCU walk. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org	2013-10-01 16:41:22 +02:00
Linus Torvalds	f927318840	NFS client bugfixes for 3.12 - Stable fix for Oopses in the pNFS files layout driver - Fix a regression when doing a non-exclusive file create on NFSv4.x - NFSv4.1 security negotiation fixes when looking up the root filesystem - Fix a memory ordering issue in the pNFS files layout driver -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.14 (GNU/Linux) iQIcBAABAgAGBQJSSfNNAAoJEGcL54qWCgDybGYQAJGm4/vd7/rWZ49KIjGFGkFo sCt0UOK6Y6ALhUOIlIreXsQ+Iwn9aAoIIRgx8UwnB+hO6PGnSyFuJZZx1KE8V2kj 6JlE5FbsWV+3uFQzNJQsNcoj7NZMzIRZT7x+7QansBOdSQjgQc3ig2sAMWREZjn8 GxMOl8FNRrnP8gRom30ZScgMp1YDM8J1ql80S/nbxh2NOLBsvgg9VapzJhhqkMyl b7WKX4Qbg4AeSaxIAIrIwcZ7L2YS09JGC40VSybQARs0/7J8fjOZPs7CmrUCoB5F DmT5vfEC4+dqDf8PMyoFVfxK5ua5Sb/FGQmagYYa8bSgY7Uq03akYI++co+4PZU1 f3SN6CSvVffzGMdXAhUupOZQbkKvKFxR2MTGy8s7dxdkQudd4RioYPDmLfCHlbmb VY5kFh/Duqso1FCrcfvZoC88ElrWUz5yoVzZyECOEwCs1wjI6bjmGdSqCSbU75Lm Z0XOAn1cStwFvGwCbGZPUzlvueji3coDdCFPBXAOFHzisLYoo/Lxenw7l5D1qM5b 02iZllcIo340vw8wxHZxVebecFo33P90X1gjv0HQQkV/6EeNgq4D47SWTPxRq3Ai Dl9MFjTPl51oseDLrH6I/hBvcqjksB1M1+WjifT0bCIi3Y0HAea2U0wgweHS3vAd QHqIpIJxNHDjPBMDWEZW =ScfI -----END PGP SIGNATURE----- Merge tag 'nfs-for-3.12-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client bugfixes from Trond Myklebust: - Stable fix for Oopses in the pNFS files layout driver - Fix a regression when doing a non-exclusive file create on NFSv4.x - NFSv4.1 security negotiation fixes when looking up the root filesystem - Fix a memory ordering issue in the pNFS files layout driver * tag 'nfs-for-3.12-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: Give "flavor" an initial value to fix a compile warning NFSv4.1: try SECINFO_NO_NAME flavs until one works NFSv4.1: Ensure memory ordering between nfs4_ds_connect and nfs4_fl_prepare_ds NFSv4.1: nfs4_fl_prepare_ds - fix bugs when the connect attempt fails NFSv4: Honour the 'opened' parameter in the atomic_open() filesystem method	2013-09-30 17:10:26 -07:00
Linus Torvalds	522d6d38f8	Merge branch 'akpm' (fixes from Andrew Morton) Merge misc fixes from Andrew Morton. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits) pidns: fix free_pid() to handle the first fork failure ipc,msg: prevent race with rmid in msgsnd,msgrcv ipc/sem.c: update sem_otime for all operations mm/hwpoison: fix the lack of one reference count against poisoned page mm/hwpoison: fix false report on 2nd attempt at page recovery mm/hwpoison: fix test for a transparent huge page mm/hwpoison: fix traversal of hugetlbfs pages to avoid printk flood block: change config option name for cmdline partition parsing mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configuration mm: avoid reinserting isolated balloon pages into LRU lists arch/parisc/mm/fault.c: fix uninitialized variable usage include/asm-generic/vtime.h: avoid zero-length file nilfs2: fix issue with race condition of competition between segments for dirty blocks Documentation/kernel-parameters.txt: replace kernelcore with Movable mm/bounce.c: fix a regression where MS_SNAP_STABLE (stable pages snapshotting) was ignored kernel/kmod.c: check for NULL in call_usermodehelper_exec() ipc/sem.c: synchronize the proc interface ipc/sem.c: optimize sem_lock() ipc/sem.c: fix race in sem_lock() mm/compaction.c: periodically schedule when freeing pages ...	2013-09-30 14:32:32 -07:00
Vyacheslav Dubeyko	7f42ec3941	nilfs2: fix issue with race condition of competition between segments for dirty blocks Many NILFS2 users were reported about strange file system corruption (for example): NILFS: bad btree node (blocknr=185027): level = 0, flags = 0x0, nchildren = 768 NILFS error (device sda4): nilfs_bmap_last_key: broken bmap (inode number=11540) But such error messages are consequence of file system's issue that takes place more earlier. Fortunately, Jerome Poulin <jeromepoulin@gmail.com> and Anton Eliasson <devel@antoneliasson.se> were reported about another issue not so recently. These reports describe the issue with segctor thread's crash: BUG: unable to handle kernel paging request at 0000000000004c83 IP: nilfs_end_page_io+0x12/0xd0 [nilfs2] Call Trace: nilfs_segctor_do_construct+0xf25/0x1b20 [nilfs2] nilfs_segctor_construct+0x17b/0x290 [nilfs2] nilfs_segctor_thread+0x122/0x3b0 [nilfs2] kthread+0xc0/0xd0 ret_from_fork+0x7c/0xb0 These two issues have one reason. This reason can raise third issue too. Third issue results in hanging of segctor thread with eating of 100% CPU. REPRODUCING PATH: One of the possible way or the issue reproducing was described by Jermoe me Poulin <jeromepoulin@gmail.com>: 1. init S to get to single user mode. 2. sysrq+E to make sure only my shell is running 3. start network-manager to get my wifi connection up 4. login as root and launch "screen" 5. cd /boot/log/nilfs which is a ext3 mount point and can log when NILFS dies. 6. lscp \| xz -9e > lscp.txt.xz 7. mount my snapshot using mount -o cp=3360839,ro /dev/vgUbuntu/root /mnt/nilfs 8. start a screen to dump /proc/kmsg to text file since rsyslog is killed 9. start a screen and launch strace -f -o find-cat.log -t find /mnt/nilfs -type f -exec cat {} > /dev/null \; 10. start a screen and launch strace -f -o apt-get.log -t apt-get update 11. launch the last command again as it did not crash the first time 12. apt-get crashes 13. ps aux > ps-aux-crashed.log 13. sysrq+W 14. sysrq+E wait for everything to terminate 15. sysrq+SUSB Simplified way of the issue reproducing is starting kernel compilation task and "apt-get update" in parallel. REPRODUCIBILITY: The issue is reproduced not stable [60% - 80%]. It is very important to have proper environment for the issue reproducing. The critical conditions for successful reproducing: (1) It should have big modified file by mmap() way. (2) This file should have the count of dirty blocks are greater that several segments in size (for example, two or three) from time to time during processing. (3) It should be intensive background activity of files modification in another thread. INVESTIGATION: First of all, it is possible to see that the reason of crash is not valid page address: NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82 NILFS [nilfs_segctor_complete_write]:2101 segbuf->sb_segnum 6783 Moreover, value of b_page (0x1a82) is 6786. This value looks like segment number. And b_blocknr with b_size values look like block numbers. So, buffer_head's pointer points on not proper address value. Detailed investigation of the issue is discovered such picture: [-----------------------------SEGMENT 6783-------------------------------] NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111149024, segbuf->sb_segnum 6783 [-----------------------------SEGMENT 6784-------------------------------] NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824 NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff8802174a6798, bh->b_assoc_buffers.prev ffff880221cffee8 NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824 NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6784 NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50 NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111150080, segbuf->sb_segnum 6784, segbuf->sb_nbio 0 [----------] ditto NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111164416, segbuf->sb_segnum 6784, segbuf->sb_nbio 15 [-----------------------------SEGMENT 6785-------------------------------] NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824 NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff880219277e80, bh->b_assoc_buffers.prev ffff880221cffc88 NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824 NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6785 NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8 NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111165440, segbuf->sb_segnum 6785, segbuf->sb_nbio 0 [----------] ditto NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111177728, segbuf->sb_segnum 6785, segbuf->sb_nbio 12 NILFS [nilfs_segctor_do_construct]:2399 nilfs_segctor_wait NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6783 NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6784 NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6785 NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82 BUG: unable to handle kernel paging request at 0000000000001a82 IP: [<ffffffffa024d0f2>] nilfs_end_page_io+0x12/0xd0 [nilfs2] Usually, for every segment we collect dirty files in list. Then, dirty blocks are gathered for every dirty file, prepared for write and submitted by means of nilfs_segbuf_submit_bh() call. Finally, it takes place complete write phase after calling nilfs_end_bio_write() on the block layer. Buffers/pages are marked as not dirty on final phase and processed files removed from the list of dirty files. It is possible to see that we had three prepare_write and submit_bio phases before segbuf_wait and complete_write phase. Moreover, segments compete between each other for dirty blocks because on every iteration of segments processing dirty buffer_heads are added in several lists of payload_buffers: [SEGMENT 6784]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50 [SEGMENT 6785]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8 The next pointer is the same but prev pointer has changed. It means that buffer_head has next pointer from one list but prev pointer from another. Such modification can be made several times. And, finally, it can be resulted in various issues: (1) segctor hanging, (2) segctor crashing, (3) file system metadata corruption. FIX: This patch adds: (1) setting of BH_Async_Write flag in nilfs_segctor_prepare_write() for every proccessed dirty block; (2) checking of BH_Async_Write flag in nilfs_lookup_dirty_data_buffers() and nilfs_lookup_dirty_node_buffers(); (3) clearing of BH_Async_Write flag in nilfs_segctor_complete_write(), nilfs_abort_logs(), nilfs_forget_buffer(), nilfs_clear_dirty_page(). Reported-by: Jerome Poulin <jeromepoulin@gmail.com> Reported-by: Anton Eliasson <devel@antoneliasson.se> Cc: Paul Fertser <fercerpav@gmail.com> Cc: ARAI Shun-ichi <hermes@ceres.dti.ne.jp> Cc: Piotr Szymaniak <szarpaj@grubelek.pl> Cc: Juan Barry Manuel Canham <Linux@riotingpacifist.net> Cc: Zahid Chowdhury <zahid.chowdhury@starsolutions.com> Cc: Elmer Zhang <freeboy6716@gmail.com> Cc: Kenneth Langga <klangga@gmail.com> Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-30 14:31:02 -07:00
Dan Aloni	7202365696	fs/binfmt_elf.c: prevent a coredump with a large vm_map_count from Oopsing A high setting of max_map_count, and a process core-dumping with a large enough vm_map_count could result in an NT_FILE note not being written, and the kernel crashing immediately later because it has assumed otherwise. Reproduction of the oops-causing bug described here: https://lkml.org/lkml/2013/8/30/50 Rge ussue originated in commit `2aa362c49c` ("coredump: extend core dump note section to contain file names of mapped file") from Oct 4, 2012. This patch make that section optional in that case. fill_files_note() should signify the error, and also let the info struct in elf_core_dump() be zero-initialized so that we can check for the optionally written note. [akpm@linux-foundation.org: avoid abusing E2BIG, remove a couple of not-really-needed local variables] [akpm@linux-foundation.org: fix sparse warning] Signed-off-by: Dan Aloni <alonid@stratoscale.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Denys Vlasenko <vda.linux@googlemail.com> Reported-by: Martin MOKREJS <mmokrejs@gmail.com> Tested-by: Martin MOKREJS <mmokrejs@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-30 14:31:01 -07:00
Al Viro	13f3583892	afs: dget_parent() can't return a negative dentry Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-29 22:02:24 -04:00
Al Viro	7b9a2378b4	ocfs2: needs ->d_lock to poke in ->d_parent->d_inode from ->d_revalidate() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-29 22:02:20 -04:00
Lubomir Rintel	4947555584	sysv: Add forgotten superblock lock init for v7 fs Superblock lock was replaced with (un)lock_super() removal, but left uninitialized for Seventh Edition UNIX filesystem in the following commit (3.7): `c07cb01` sysv: drop lock/unlock super Signed-off-by: Lubomir Rintel <lkundrak@v3.sk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-29 22:02:02 -04:00
Greg Kroah-Hartman	88502b9c0a	Merge 3.12-rc3 into driver-core-next We want the driver core and sysfs fixes in here to make merges and development easier. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-09-29 18:29:23 -07:00
Anna Schumaker	367156d9a8	NFS: Give "flavor" an initial value to fix a compile warning The previous patch introduces a compile warning by not assigning an initial value to the "flavor" variable. This could only be a problem if the server returns a supported secflavor list of length zero, but it's better to fix this before it's ever hit. Signed-off-by: Anna Schumaker <bjschuma@netapp.com> Acked-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-29 16:03:34 -04:00
Weston Andros Adamson	58a8cf1212	NFSv4.1: try SECINFO_NO_NAME flavs until one works Call nfs4_lookup_root_sec for each flavor returned by SECINFO_NO_NAME until one works. One example of a situation this fixes: - server configured for krb5 - server principal somehow gets deleted from KDC - server still thinking krb is good, sends krb5 as first entry in SECINFO_NO_NAME response - client tries krb5, but this fails without even sending an RPC because gssd's requests to the KDC can't find the server's principal Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-29 16:03:34 -04:00
Trond Myklebust	acd65e5bc1	NFSv4.1: Ensure memory ordering between nfs4_ds_connect and nfs4_fl_prepare_ds We need to ensure that the initialisation of the data server nfs_client structure in nfs4_ds_connect is correctly ordered w.r.t. the read of ds->ds_clp in nfs4_fl_prepare_ds. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-29 15:58:35 -04:00
Trond Myklebust	52b26a3e1b	NFSv4.1: nfs4_fl_prepare_ds - fix bugs when the connect attempt fails - Fix an Oops when nfs4_ds_connect() returns an error. - Always check the device status after waiting for a connect to complete. Reported-by: Andy Adamson <andros@netapp.com> Reported-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: <stable@vger.kernel.org> # v3.10+	2013-09-29 15:56:35 -04:00
Linus Torvalds	ddd23eb182	xfs: bugfixes for 3.12-rc3 - fix for directory node collapse regression - fix for recovery over stale on disk structures - fix for eofblocks ioctl - fix asserts in xfs_inode_free - lock the ail before removing an item from it -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAABAgAGBQJSRvPvAAoJENaLyazVq6ZOoXAP/3/AD1iuqGWBy2wIjISNJupu ST4gW5FXgBlG/sr1zGOA/L6VCdAaQMFSnlOnGpOjAyCH8VJ+XVb+4WCammzQ9CEu YJsbjlra52V3cOhGxDsuE9uEDIAqxnyiZndF/Pk1gnyGzpt5da3CxwI9UVuR8Dlt 9O/dJXuKGdKgBtQKzOfrzIDXQZ2zE5NPvueHsDU6i6Hd7YECwG5j4fRqS8jf49jY sleZo0CVkYx1IcIB//oVXa+JbyelhiS5D8Ro3dUbW8lJTToHq9RaIiE/xrz2nz2c lUcaQdecxAnN4NKtHu/QR18u5HauA7FH26cV+PUGWWmbjHTP3oybsnHJWZr+08cX +NfC9dpLl3VTapVNCxVeuToE2DhgTQuWOODiIFSi+Ljt3P6zfIKKBMqwJ6FceaJD 4afVT2GDEtvZuFbfhyvXUzP9bm0PLoZRI68bLLi772W2Zc8aiprJyKAcm/GtlxTk biluUwoGn+zD60u0GwAtlQ2g+/jTReoeAjez+LW1dWNrUwfVdnZh8xlFGEXEd/dS scx6BHSlJtwVKEWWYMO+JLHJ077yNR5RPoRFFokS1XGOSGiEuuSkAkS0eAcC5WCX egfGHzkymOOQhcXb/qPjpCVBbHEnCuYw6b24eNyceyqfE8o6iZmsUcF3kOzsa8Up fOVpOP7UovMR5teKQxAa =AUDD -----END PGP SIGNATURE----- Merge tag 'xfs-for-linus-v3.12-rc3' of git://oss.sgi.com/xfs/xfs Pull xfs bugfixes from Ben Myers: - fix for directory node collapse regression - fix for recovery over stale on disk structures - fix for eofblocks ioctl - fix asserts in xfs_inode_free - lock the ail before removing an item from it * tag 'xfs-for-linus-v3.12-rc3' of git://oss.sgi.com/xfs/xfs: xfs: fix node forward in xfs_node_toosmall xfs: log recovery lsn ordering needs uuid check xfs: fix XFS_IOC_FREE_EOFBLOCKS definition xfs: asserting lock not held during freeing not valid xfs: lock the AIL before removing the buffer item	2013-09-28 13:52:05 -07:00
Linus Torvalds	e1f8826f51	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull reiserfs and UDF fixes from Jan Kara: "The contains fix of an UDF oops when mounting corrupted media and a fix of a race in reiserfs leading to oops" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: reiserfs: fix race with flush_used_journal_lists and flush_journal_list reiserfs: remove useless flush_old_journal_lists udf: Fortify LVID loading	2013-09-27 09:31:09 -07:00
Benjamin LaHaise	5e9ae2e5da	aio: fix use-after-free in aio_migratepage Dmitry Vyukov managed to trigger a case where aio_migratepage can cause a use-after-free during teardown of the aio ring buffer's mapping. This turns out to be caused by access to the ioctx's ring_pages via the migratepage operation which was not being protected by any locks during ioctx freeing. Use the address_space's private_lock to protect use and updates of the mapping's private_data, and make ioctx teardown unlink the ioctx from the address space. Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>	2013-09-26 20:34:51 -04:00
Tejun Heo	cfec0bc835	sysfs: @name comes before @ns Some internal sysfs functions which take explicit namespace argument are weird in that they place the optional @ns in front of @name which is contrary to the established convention. This is confusing and error-prone especially as @ns and @name may be interchanged without causing compilation warning. Swap the positions of @name and @ns in the following internal functions. sysfs_find_dirent() sysfs_rename() sysfs_hash_and_remove() sysfs_name_hash() sysfs_name_compare() create_dir() This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-09-26 15:34:38 -07:00
Tejun Heo	388975ccca	sysfs: clean up sysfs_get_dirent() The pre-existing sysfs interfaces which take explicit namespace argument are weird in that they place the optional @ns in front of @name which is contrary to the established convention. For example, we end up forcing vast majority of sysfs_get_dirent() users to do sysfs_get_dirent(parent, NULL, name), which is silly and error-prone especially as @ns and @name may be interchanged without causing compilation warning. This renames sysfs_get_dirent() to sysfs_get_dirent_ns() and swap the positions of @name and @ns, and sysfs_get_dirent() is now a wrapper around sysfs_get_dirent_ns(). This makes confusions a lot less likely. There are other interfaces which take @ns before @name. They'll be updated by following patches. This patch doesn't introduce any functional changes. v2: EXPORT_SYMBOL_GPL() wasn't updated leading to undefined symbol error on module builds. Reported by build test robot. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-09-26 15:33:18 -07:00
Tejun Heo	cb26a31157	sysfs: drop kobj_ns_type handling The way namespace tags are implemented in sysfs is more complicated than necessary. As each tag is a pointer value and required to be non-NULL under a namespace enabled parent, there's no need to record separately what type each tag is or where namespace is enabled. If multiple namespace types are needed, which currently aren't, we can simply compare the tag to a set of allowed tags in the superblock assuming that the tags, being pointers, won't have the same value across multiple types. Also, whether to filter by namespace tag or not can be trivially determined by whether the node has any tagged children or not. This patch rips out kobj_ns_type handling from sysfs. sysfs no longer cares whether specific type of namespace is enabled or not. If a sysfs_dirent has a non-NULL tag, the parent is marked as needing namespace filtering and the value is tested against the allowed set of tags for the superblock (currently only one but increasing this number isn't difficult) and the sysfs_dirent is ignored if it doesn't match. This removes most kobject namespace knowledge from sysfs proper which will enable proper separation and layering of sysfs. The namespace sanity checks in fs/sysfs/dir.c are replaced by the new sanity check in kobject_namespace(). As this is the only place ktype->namespace() is called for sysfs, this doesn't weaken the sanity check significantly. I omitted converting the sanity check in sysfs_do_create_link_sd(). While the check can be shifted to upper layer, mistakes there are well contained and should be easily visible anyway. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-09-26 15:30:22 -07:00
Tejun Heo	4b30ee58ee	sysfs: remove ktype->namespace() invocations in symlink code There's no reason for sysfs to be calling ktype->namespace(). It is backwards, obfuscates what's going on and unnecessarily tangles two separate layers. There are two places where symlink code calls ktype->namespace(). * sysfs_do_create_link_sd() calls it to find out the namespace tag of the target directory. Unless symlinking races with cross-namespace renaming, this equals @target_sd->s_ns. * sysfs_rename_link() uses it to find out the new namespace to rename to and the new namespace can be different from the existing one. The function is renamed to sysfs_rename_link_ns() with an explicit @ns argument and the ktype->namespace() invocation is shifted to the device layer. While this patch replaces ktype->namespace() invocation with the recorded result in @target_sd, this shouldn't result in any behvior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-09-26 15:30:22 -07:00
Tejun Heo	e34ff49061	sysfs: remove ktype->namespace() invocations in directory code For some unrecognizable reason, namespace information is communicated to sysfs through ktype->namespace() callback when there's nothing which needs the use of a callback. The whole sequence of operations is completely synchronous and sysfs operations simply end up calling back into the layer which just invoked it in order to find out the namespace information, which is completely backwards, obfuscates what's going on and unnecessarily tangles two separate layers. This patch doesn't remove ktype->namespace() but shifts its handling to kobject layer. We probably want to get rid of the callback in the long term. This patch adds an explicit param to sysfs_{create\|rename\|move}_dir() and renames them to sysfs_{create\|rename\|move}_dir_ns(), respectively. ktype->namespace() invocations are moved to the calling sites of the above functions. A new helper kboject_namespace() is introduced which directly tests kobj_ns_type_operations->type which should give the same result as testing sysfs_fs_type(parent_sd) and returns @kobj's namespace tag as necessary. kobject_namespace() is extern as it will be used from another file in the following patches. This patch should be an equivalent conversion without any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-09-26 15:30:22 -07:00
Tejun Heo	58292cbe66	sysfs: make attr namespace interface less convoluted sysfs ns (namespace) implementation became more convoluted than necessary while trying to hide ns information from visible interface. The relatively recent attr ns support is a good example. * attr ns tag is determined by sysfs_ops->namespace() callback while dir tag is determined by kobj_type->namespace(). The placement is arbitrary. * Instead of performing operations with explicit ns tag, the namespace callback is routed through sysfs_attr_ns(), sysfs_ops->namespace(), class_attr_namespace(), class_attr->namespace(). It's not simpler in any sense. The only thing this convolution does is traversing the whole stack backwards. The namespace callbacks are unncessary because the operations involved are inherently synchronous. The information can be provided in in straight-forward top-down direction and reversing that direction is unnecessary and against basic design principles. This backward interface is unnecessarily convoluted and hinders properly separating out sysfs from driver model / kobject for proper layering. This patch updates attr ns support such that * sysfs_ops->namespace() and class_attr->namespace() are dropped. * sysfs_{create\|remove}_file_ns(), which take explicit @ns param, are added and sysfs_{create\|remove}_file() are now simple wrappers around the ns aware functions. * ns handling is dropped from sysfs_chmod_file(). Nobody uses it at this point. sysfs_chmod_file_ns() can be added later if necessary. * Explicit @ns is propagated through class_{create\|remove}_file_ns() and netdev_class_{create\|remove}_file_ns(). * driver/net/bonding which is currently the only user of attr namespace is updated to use netdev_class_{create\|remove}_file_ns() with @bh->net as the ns tag instead of using the namespace callback. This patch should be an equivalent conversion without any functional difference. It makes the code easier to follow, reduces lines of code a bit and helps proper separation and layering. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-09-26 14:50:01 -07:00
Tejun Heo	bcac3769ca	sysfs: drop semicolon from to_sysfs_dirent() definition The expansion of to_sysfs_dirent() contains an unncessary trailing semicolon making it impossible to use in the middle of statements. Drop it. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-09-26 14:48:28 -07:00
Mark Tinguely	997def25e4	xfs: fix node forward in xfs_node_toosmall Commit `f5ea1100` cleans up the disk to host conversions for node directory entries, but because a variable is reused in xfs_node_toosmall() the next node is not correctly found. If the original node is small enough (<= 3/8 of the node size), this change may incorrectly cause a node collapse when it should not. That will cause an assert in xfstest generic/319: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: /root/newest/xfs/fs/xfs/xfs_trans_buf.c, line: 569 Keep the original node header to get the correct forward node. (When a node is considered for a merge with a sibling, it overwrites the sibling pointers of the original incore nodehdr with the sibling's pointers. This leads to loop considering the original node as a merge candidate with itself in the second pass, and so it incorrectly determines a merge should occur.) Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> [v3: added Dave Chinner's (slightly modified) suggestion to the commit header, cleaned up whitespace. -bpm]	2013-09-26 10:38:17 -05:00
Trond Myklebust	5bc2afc2b5	NFSv4: Honour the 'opened' parameter in the atomic_open() filesystem method Determine if we've created a new file by examining the directory change attribute and/or the O_EXCL flag. This fixes a regression when doing a non-exclusive create of a new file. If the FILE_CREATED flag is not set, the atomic_open() command will perform full file access permissions checks instead of just checking for MAY_OPEN. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-09-26 10:20:18 -04:00
Steve French	ffe67b5859	[CIFS] update cifs.ko version To 2.02 Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-25 19:01:27 -05:00
Steve French	05c715f2a9	[CIFS] Remove ext2 flags that have been moved to fs.h These flags were unused by cifs and since the EXT flags have been moved to common code in uapi/linux/fs.h we won't need to have a cifs specific copy. Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-25 18:58:13 -05:00
Linus Torvalds	a153e67bda	Merge branch 'akpm' (patches from Andrew Morton) Merge fixes from Andrew Morton: "Bunch of fixes. And a reversion of mhocko's "Soft limit rework" patch series. This is actually your fault for opening the merge window when I was off racing ;) I didn't read the email thread before sending everything off. Johannes Weiner raised significant issues: http://www.spinics.net/lists/cgroups/msg08813.html and we agreed to back it all out" I clearly need to be more aware of Andrew's racing schedule. * akpm: MAINTAINERS: update mach-bcm related email address checkpatch: make extern in .h prototypes quieter cciss: fix info leak in cciss_ioctl32_passthru() cpqarray: fix info leak in ida_locked_ioctl() kernel/reboot.c: re-enable the function of variable reboot_default audit: fix endless wait in audit_log_start() revert "memcg, vmscan: integrate soft reclaim tighter with zone shrinking code" revert "memcg: get rid of soft-limit tree infrastructure" revert "vmscan, memcg: do softlimit reclaim also for targeted reclaim" revert "memcg: enhance memcg iterator to support predicates" revert "memcg: track children in soft limit excess to improve soft limit" revert "memcg, vmscan: do not attempt soft limit reclaim if it would not scan anything" revert "memcg: track all children over limit in the root" revert "memcg, vmscan: do not fall into reclaim-all pass too quickly" fs/ocfs2/super.c: use a bigger nodestr in ocfs2_dismount_volume watchdog: update watchdog_thresh properly watchdog: update watchdog attributes atomically	2013-09-24 17:00:35 -07:00
Goldwyn Rodrigues	99d7a8824a	fs/ocfs2/super.c: use a bigger nodestr in ocfs2_dismount_volume While printing 32-bit node numbers, an 8-byte string is not enough. Increase the size of the string to 12 chars. This got left out in commit `49fa8140e4` ("fs/ocfs2/super.c: Use bigger nodestr to accomodate 32-bit node numbers"). Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-24 17:00:25 -07:00
Kent Overstreet	2f6cf0de02	block: Fix bio_copy_data() The memcpy() in bio_copy_data() was using the wrong offset vars, leading to data corruption in weird unusual setups. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-stable <stable@vger.kernel.org> # >= v3.9 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-24 14:41:42 -07:00
Dave Chinner	566055d33a	xfs: log recovery lsn ordering needs uuid check After a fair number of xfstests runs, xfs/182 started to fail regularly with a corrupted directory - a directory read verifier was failing after recovery because it found a block with a XARM magic number (remote attribute block) rather than a directory data block. The first time I saw this repeated failure I did /something/ and the problem went away, so I was never able to find the underlying problem. Test xfs/182 failed again today, and I found the root cause before I did /something else/ that made it go away. Tracing indicated that the block in question was being correctly logged, the log was being flushed by sync, but the buffer was not being written back before the shutdown occurred. Tracing also indicated that log recovery was also reading the block, but then never writing it before log recovery invalidated the cache, indicating that it was not modified by log recovery. More detailed analysis of the corpse indicated that the filesystem had a uuid of "a4131074-1872-4cac-9323-2229adbcb886" but the XARM block had a uuid of "8f32f043-c3c9-e7f8-f947-4e7f989c05d3", which indicated it was a block from an older filesystem. The reason that log recovery didn't replay it was that the LSN in the XARM block was larger than the LSN of the transaction being replayed, and so the block was not overwritten by log recovery. Hence, log recovery cant blindly trust the magic number and LSN in the block - it must verify that it belongs to the filesystem being recovered before using the LSN. i.e. if the UUIDs don't match, we need to unconditionally recovery the change held in the log. This patch was first tested on a block device that was repeatedly causing xfs/182 to fail with the same failure on the same block with the same directory read corruption signature (i.e. XARM block). It did not fail, and hasn't failed since. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-09-24 12:35:57 -05:00
Dave Chinner	b771af2fcb	xfs: fix XFS_IOC_FREE_EOFBLOCKS definition It uses a kernel internal structure in it's definition rather than the user visible structure that is passed to the ioctl. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-09-24 12:35:08 -05:00
Dave Chinner	b313a5f1cb	xfs: asserting lock not held during freeing not valid When we free an inode, we do so via RCU. As an RCU lookup can occur at any time before we free an inode, and that lookup takes the inode flags lock, we cannot safely assert that the flags lock is not held just before marking it dead and running call_rcu() to free the inode. We check on allocation of a new inode structre that the lock is not held, so we still have protection against locks being leaked and hence not correctly initialised when allocated out of the slab. Hence just remove the assert... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-09-24 12:32:57 -05:00
Dave Chinner	4885235806	xfs: lock the AIL before removing the buffer item Regression introduced by commit `46f9d2e` ("xfs: aborted buf items can be in the AIL") which fails to lock the AIL before removing the item. Spinlock debugging throws a warning about this. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-09-24 12:31:41 -05:00
Jeff Mahoney	721a769c03	reiserfs: fix race with flush_used_journal_lists and flush_journal_list There are two locks involved in managing the journal lists. The general reiserfs_write_lock and the journal->j_flush_mutex. While flush_journal_list is sleeping to acquire the j_flush_mutex or to submit a block for write, it will drop the write lock. This allows another thread to acquire the write lock and ultimately call flush_used_journal_lists to traverse the list of journal lists and select one for flushing. It can select the journal_list that has just had flush_journal_list called on it in the original thread and call it again with the same journal_list. The second thread then drops the write lock to acquire j_flush_mutex and the first thread reacquires it and continues execution and eventually clears and frees the journal list before dropping j_flush_mutex and returning. The second thread acquires j_flush_mutex and ends up operating on a journal_list that has already been released. If the memory hasn't been reused, we'll soon after hit a BUG_ON because the transaction id has already been cleared. If it's been reused, we'll crash in other fun ways. Since flush_journal_list will synchronize on j_flush_mutex, we can fix the race by taking a proper reference in flush_used_journal_lists and checking to see if it's still valid after the mutex is taken. It's safe to iterate the list of journal lists and pick a list with just the write lock as long as a reference is taken on the journal list before we drop the lock. We already have code to handle whether a transaction has been flushed already so we can use that to handle the race and get rid of the trans_id BUG_ON. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jan Kara <jack@suse.cz>	2013-09-24 11:24:21 +02:00
Jeff Mahoney	7bc9cc07ee	reiserfs: remove useless flush_old_journal_lists Commit `a3172027` introduced test_transaction as a requirement for flushing old lists -- but it can never return 1 unless the transaction has already been flushed. As a result, we have a routine that iterates the j_realblocks list but doesn't actually do anything. Since it's been this way since 2006 and the latency numbers were what Chris expected, let's just rip it out. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jan Kara <jack@suse.cz>	2013-09-24 11:24:21 +02:00
Jan Kara	69d75671d9	udf: Fortify LVID loading A user has reported an oops in udf_statfs() that was caused by numOfPartitions entry in LVID structure being corrupted. Fix the problem by verifying whether numOfPartitions makes sense at least to the extent that LVID fits into a single block as it should. Reported-by: Juergen Weigert <jw@suse.com> Signed-off-by: Jan Kara <jack@suse.cz>	2013-09-24 11:23:33 +02:00
Linus Torvalds	68cf8d0c72	Merge branch 'for-3.12/core' of git://git.kernel.dk/linux-block Pull block IO fixes from Jens Axboe: "After merge window, no new stuff this time only a collection of neatly confined and simple fixes" * 'for-3.12/core' of git://git.kernel.dk/linux-block: cfq: explicitly use 64bit divide operation for 64bit arguments block: Add nr_bios to block_rq_remap tracepoint If the queue is dying then we only call the rq->end_io callout. This leaves bios setup on the request, because the caller assumes when the blk_execute_rq_nowait/blk_execute_rq call has completed that the rq->bios have been cleaned up. bio-integrity: Fix use of bs->bio_integrity_pool after free blkcg: relocate root_blkg setting and clearing block: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...) block: trace all devices plug operation	2013-09-22 15:00:11 -07:00
Linus Torvalds	0fbf2cc983	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "These are mostly bug fixes and a two small performance fixes. The most important of the bunch are Josef's fix for a snapshotting regression and Mark's update to fix compile problems on arm" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits) Btrfs: create the uuid tree on remount rw btrfs: change extent-same to copy entire argument struct Btrfs: dir_inode_operations should use btrfs_update_time also btrfs: Add btrfs: prefix to kernel log output btrfs: refuse to remount read-write after abort Btrfs: btrfs_ioctl_default_subvol: Revert back to toplevel subvolume when arg is 0 Btrfs: don't leak transaction in btrfs_sync_file() Btrfs: add the missing mutex unlock in write_all_supers() Btrfs: iput inode on allocation failure Btrfs: remove space_info->reservation_progress Btrfs: kill delay_iput arg to the wait_ordered functions Btrfs: fix worst case calculator for space usage Revert "Btrfs: rework the overcommit logic to be based on the total size" Btrfs: improve replacing nocow extents Btrfs: drop dir i_size when adding new names on replay Btrfs: replay dir_index items before other items Btrfs: check roots last log commit when checking if an inode has been logged Btrfs: actually log directory we are fsync()'ing Btrfs: actually limit the size of delalloc range Btrfs: allocate the free space by the existed max extent size when ENOSPC ...	2013-09-22 14:58:49 -07:00
Josef Bacik	94aebfb2e7	Btrfs: create the uuid tree on remount rw Users have been complaining of the uuid tree stuff warning that there is no uuid root when trying to do snapshot operations. This is because if you mount -o ro we will not create the uuid tree. But then if you mount -o rw,remount we will still not create it and then any subsequent snapshot/subvol operations you try to do will fail gloriously. Fix this by creating the uuid_root on remount rw if it was not already there. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:50:43 -04:00
Jim McDonough	74d290da47	[CIFS] Provide sane values for nlink Since we don't get info about the number of links from the readdir linfo levels, stat() will return 0 for st_nlink, and in particular, samba re-exported shares will show directories as files (as samba is keying off st_nlink before evaluating how to set the dos modebits) when doing a dir or ls. Copy nlink to the inode, unless it wasn't provided. Provide sane values if we don't have an existing one and none was provided. Signed-off-by: Jim McDonough <jmcd@samba.org> Reviewed-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: David Disseldorp <ddiss@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-21 10:36:10 -05:00
Mark Fasheh	cbf8b8ca3e	btrfs: change extent-same to copy entire argument struct btrfs_ioctl_file_extent_same() uses __put_user_unaligned() to copy some data back to it's argument struct. Unfortunately, not all architectures provide __put_user_unaligned(), so compiles break on them if btrfs is selected. Instead, just copy the whole struct in / out at the start and end of operations, respectively. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:31 -04:00
Guangyu Sun	93fd63c2f0	Btrfs: dir_inode_operations should use btrfs_update_time also Commit `2bc5565286` (Btrfs: don't update atime on RO subvolumes) ensures that the access time of an inode is not updated when the inode lives in a read-only subvolume. However, if a directory on a read-only subvolume is accessed, the atime is updated. This results in a write operation to a read-only subvolume. I believe that access times should never be updated on read-only subvolumes. To reproduce: # mkfs.btrfs -f /dev/dm-3 (...) # mount /dev/dm-3 /mnt # btrfs subvol create /mnt/sub Create subvolume '/mnt/sub' # mkdir /mnt/sub/dir # echo "abc" > /mnt/sub/dir/file # btrfs subvol snapshot -r /mnt/sub /mnt/rosnap Create a readonly snapshot of '/mnt/sub' in '/mnt/rosnap' # stat /mnt/rosnap/dir File: `/mnt/rosnap/dir' Size: 8 Blocks: 0 IO Block: 4096 directory Device: 16h/22d Inode: 257 Links: 1 Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2013-09-11 07:21:49.389157126 -0400 Modify: 2013-09-11 07:22:02.330156079 -0400 Change: 2013-09-11 07:22:02.330156079 -0400 # ls /mnt/rosnap/dir file # stat /mnt/rosnap/dir File: `/mnt/rosnap/dir' Size: 8 Blocks: 0 IO Block: 4096 directory Device: 16h/22d Inode: 257 Links: 1 Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2013-09-11 07:22:56.797151670 -0400 Modify: 2013-09-11 07:22:02.330156079 -0400 Change: 2013-09-11 07:22:02.330156079 -0400 Reported-by: Koen De Wit <koen.de.wit@oracle.com> Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:30 -04:00
Frank Holton	5138cccf34	btrfs: Add btrfs: prefix to kernel log output The kernel log entries for device label %s and device fsid %pU are missing the btrfs: prefix. Add those here. Signed-off-by: Frank Holton <fholton@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:30 -04:00
David Sterba	6ef3de9c92	btrfs: refuse to remount read-write after abort It's still possible to flip the filesystem into RW mode after it's remounted RO due to an abort. There are lots of places that check for the superblock error bit and will not write data, but we should not let the filesystem appear read-write. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:30 -04:00
chandan	1cecf579d1	Btrfs: btrfs_ioctl_default_subvol: Revert back to toplevel subvolume when arg is 0 This patch makes it possible to set BTRFS_FS_TREE_OBJECTID as the default subvolume by passing a subvolume id of 0. Signed-off-by: chandan <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:29 -04:00
Filipe David Borba Manana	a0634be562	Btrfs: don't leak transaction in btrfs_sync_file() In btrfs_sync_file(), if the call to btrfs_log_dentry_safe() returns a negative error (for e.g. -ENOMEM via btrfs_log_inode()), we would return without ending/freeing the transaction. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:29 -04:00
Stefan Behrens	a724b43690	Btrfs: add the missing mutex unlock in write_all_supers() The BUG() was replaced by btrfs_error() and return -EIO with the patch "get rid of one BUG() in write_all_supers()", but the missing mutex_unlock() was overlooked. The 0-DAY kernel build service from Intel reported the missing unlock which was found by the coccinelle tool: fs/btrfs/disk-io.c:3422:2-8: preceding lock on line 3374 Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:28 -04:00
Josef Bacik	f4ab9ea706	Btrfs: iput inode on allocation failure We don't do the iput when we fail to allocate our delayed delalloc work in __start_delalloc_inodes, fix this. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:28 -04:00
Josef Bacik	363e4d354e	Btrfs: remove space_info->reservation_progress This isn't used for anything anymore, just remove it. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:27 -04:00
Josef Bacik	f0de181c9b	Btrfs: kill delay_iput arg to the wait_ordered functions This is a left over of how we used to wait for ordered extents, which was to grab the inode and then run filemap flush on it. However if we have an ordered extent then we already are holding a ref on the inode, and we just use btrfs_start_ordered_extent anyway, so there is no reason to have an extra ref on the inode to start work on the ordered extent. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:27 -04:00
Josef Bacik	c4fbb4300a	Btrfs: fix worst case calculator for space usage Forever ago I made the worst case calculator say that we could potentially split into 3 blocks for every level on the way down, which isn't right. If we split we're only going to get two new blocks, the one we originally cow'ed and the new one we're going to split. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:27 -04:00
Josef Bacik	14575aef42	Revert "Btrfs: rework the overcommit logic to be based on the total size" This reverts commit `70afa3998c`. It is causing performance issues and wasn't actually correct. There were problems with the way we flushed delalloc and that was the real cause of the early enospc. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:26 -04:00
Josef Bacik	652f25a292	Btrfs: improve replacing nocow extents Various people have hit a deadlock when running btrfs/011. This is because when replacing nocow extents we will take the i_mutex to make sure nobody messes with the file while we are replacing the extent. The problem is we are already holding a transaction open, which is a locking inversion, so instead we need to save these inodes we find and then process them outside of the transaction. Further we can't just lock the inode and assume we are good to go. We need to lock the extent range and then read back the extent cache for the inode to make sure the extent really still points at the physical block we want. If it doesn't we don't have to copy it. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:26 -04:00
Josef Bacik	d555438b6e	Btrfs: drop dir i_size when adding new names on replay So if we have dir_index items in the log that means we also have the inode item as well, which means that the inode's i_size is correct. However when we process dir_index'es we call btrfs_add_link() which will increase the directory's i_size for the new entry. To fix this we need to just set the dir items i_size to 0, and then as we find dir_index items we adjust the i_size. btrfs_add_link() will do it for new entries, and if the entry already exists we can just add the name_len to the i_size ourselves. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:25 -04:00
Josef Bacik	dd8e721773	Btrfs: replay dir_index items before other items A user reported a bug where his log would not replay because he was getting -EEXIST back. This was because he had a file moved into a directory that was logged. What happens is the file had a lower inode number, and so it is processed first when replaying the log, and so we add the inode ref in for the directory it was moved to. But then we process the directories DIR_INDEX item and try to add the inode ref for that inode and it fails because we already added it when we replayed the inode. To solve this problem we need to just process any DIR_INDEX items we have in the log first so this all is taken care of, and then we can replay the rest of the items. With this patch my reproducer can remount the file system properly instead of erroring out. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:25 -04:00
Josef Bacik	a5874ce6ce	Btrfs: check roots last log commit when checking if an inode has been logged Liu introduced a local copy of the last log commit for an inode to make sure we actually log an inode even if a log commit has already taken place. In order to make sure we didn't relog the same inode multiple times he set this local copy to the current trans when we log the inode, because usually we log the inode and then sync the log. The exception to this is during rename, we will relog an inode if the name changed and it is already in the log. The problem with this is then we go to sync the inode, and our check to see if the inode has already been logged is tripped and we don't sync the log. To fix this we need to _also_ check against the roots last log commit, because it could be less than what is in our local copy of the log commit. This fixes a bug where we rename a file into a directory and then fsync the directory and then on remount the directory is no longer there. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:24 -04:00
Josef Bacik	de2b530bfb	Btrfs: actually log directory we are fsync()'ing If you just create a directory and then fsync that directory and then pull the power plug you will come back up and the directory will not be there. That is because we won't actually create directories if we've logged files inside of them since they will be created on replay, but in this check we will set our logged_trans of our current directory if it happens to be a directory, making us think it doesn't need to be logged. Fix the logic to only do this to parent directories. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:24 -04:00
Josef Bacik	573aecafca	Btrfs: actually limit the size of delalloc range So forever we have had this thing to limit the amount of delalloc pages we'll setup to be written out to 128mb. This is because we have to lock all the pages in this range, so anything above this gets a bit unweildly, and also without a limit we'll happily allocate gigantic chunks of disk space. Turns out our check for this wasn't quite right, we wouldn't actually limit the chunk we wanted to write out, we'd just stop looking for more space after we went over the limit. So if you do a giant 20gb dd on my box with lots of ram I could get 2gig extents. This is fine normally, except when you go to relocate these extents and we can't find enough space to relocate these moster extents, since we have to be able to allocate exactly the same sized extent to move it around. So fix this by actually enforcing the limit. With this patch I'm no longer seeing giant 1.5gb extents. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:24 -04:00
Miao Xie	a482039889	Btrfs: allocate the free space by the existed max extent size when ENOSPC By the current code, if the requested size is very large, and all the extents in the free space cache are small, we will waste lots of the cpu time to cut the requested size in half and search the cache again and again until it gets down to the size the allocator can return. In fact, we can know the max extent size in the cache after the first search, so we needn't cut the size in half repeatedly, and just use the max extent size directly. This way can save lots of cpu time and make the performance grow up when there are only fragments in the free space cache. According to my test, if there are only 4KB free space extents in the fs, and the total size of those extents are 256MB, we can reduce the execute time of the following test from 5.4s to 1.4s. dd if=/dev/zero of=<testfile> bs=1MB count=1 oflag=sync Changelog v2 -> v3: - fix the problem that we skip the block group with the space which is less than we need. Changelog v1 -> v2: - address the problem that we return a wrong start position when searching the free space in a bitmap. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:23 -04:00
David Sterba	13fd8da98f	btrfs: add lockdep and tracing annotations for uuid tree Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 10:58:56 -04:00
Stefan Behrens	79556c3d88	btrfs: show compiled-in config features at module load time We want to know if there are debugging features compiled in, this may affect performance. The message is printed before the sanity checks. (This commit message is a copy of David Sterba's commit message when he introduced btrfs_print_info()). Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 10:58:56 -04:00
Filipe David Borba Manana	cef2193729	Btrfs: more efficient inode tree replace operation Instead of removing the current inode from the red black tree and then add the new one, just use the red black tree replace operation, which is more efficient. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 10:58:55 -04:00
Ilya Dryomov	55e50e458e	Btrfs: do not add replace target to the alloc_list If replace was suspended by the umount, replace target device is added to the fs_devices->alloc_list during a later mount. This is obviously wrong. ->is_tgtdev_for_dev_replace is supposed to guard against that, but ->is_tgtdev_for_dev_replace is (and can only ever be) initialized after everything is opened and fs_devices lists are populated. Fix this by checking the devid instead: for replace targets it's always equal to BTRFS_DEV_REPLACE_DEVID. Cc: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 10:58:55 -04:00
Josef Bacik	83d4cfd4da	Btrfs: fixup error handling in btrfs_reloc_cow If we failed to actually allocate the correct size of the extent to relocate we will end up in an infinite loop because we won't return an error, we'll just move on to the next extent. So fix this up by returning an error, and then fix all the callers to return an error up the stack rather than BUG_ON()'ing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 10:58:54 -04:00
Chris Mason	07f0e62e7f	Linux 3.11 -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iQEcBAABAgAGBQJSJPkeAAoJEHm+PkMAQRiGVWMH/jo5f01Ra7G4/CYS59K+AlBQ /oWL3W81r5MORlsMxwUwGtJ3sZ7UulKwiDrluWeOkz2+/9SmoHoUfkpbByq1bSIV y0eqhmjtkHQZz5radJIHeyz1gJIICBIgAM0l45j8SpK4n9EXRcjLSZjdjAkPzxZp qZpfxKhVSTu79m96bud7F+HrboHDQEyhD9zqdSi4xPQNnOmTc7K3tvui9AB3rMbV ablM3C+LqBYjZx+pKS/rOdfATxZvtU392HU53XTALt6VD1e8alMmhmpe0I9Zxvjv scsB6hfRkevfe7VaK3aVoDnQnLKd61yxs+/XdzTtkWPbVGp+kiuFUdDv/5y2r1g= =7Xf6 -----END PGP SIGNATURE----- Merge tag 'v3.11' into for-linus Linux 3.11	2013-09-21 10:44:55 -04:00
David Howells	509bf24d18	CacheFiles: Don't try to dump the index key if the cookie has been cleared Don't try to dump the index key that distinguishes an object if netfs data in the cookie the object refers to has been cleared (ie. the cookie has passed most of the way through __fscache_relinquish_cookie()). Since the netfs holds the index key, we can't get at it once the ->def and ->netfs_data pointers have been cleared - and a NULL pointer exception will ensue, usually just after a: CacheFiles: Error: Unexpected object collision error is reported. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-20 15:15:43 -07:00
Josh Boyer	607566aecc	CacheFiles: Fix memory leak in cachefiles_check_auxdata error paths In cachefiles_check_auxdata(), we allocate auxbuf but fail to free it if we determine there's an error or that the data is stale. Further, assigning the output of vfs_getxattr() to auxbuf->len gives problems with checking for errors as auxbuf->len is a u16. We don't actually need to set auxbuf->len, so keep the length in a variable for now. We shouldn't need to check the upper limit of the buffer as an overflow there should be indicated by -ERANGE. While we're at it, fscache_check_aux() returns an enum value, not an int, so assign it to an appropriately typed variable rather than to ret. Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: David Howells <dhowells@redhat.com> cc: Hongyi Jia <jiayisuse@gmail.com> cc: Milosz Tanski <milosz@adfin.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-20 15:15:42 -07:00
Linus Torvalds	e9ff04dd94	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph fixes from Sage Weil: "These fix several bugs with RBD from 3.11 that didn't get tested in time for the merge window: some error handling, a use-after-free, and a sequencing issue when unmapping and image races with a notify operation. There is also a patch fixing a problem with the new ceph + fscache code that just went in" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: fscache: check consistency does not decrement refcount rbd: fix error handling from rbd_snap_name() rbd: ignore unmapped snapshots that no longer exist rbd: fix use-after free of rbd_dev->disk rbd: make rbd_obj_notify_ack() synchronous rbd: complete notifies before cleaning up osd_client and rbd_dev libceph: add function to ensure notifies are complete	2013-09-19 12:50:37 -05:00
Linus Torvalds	3fe03debfc	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "atomic_open-related fixes (Miklos' series, with EEXIST-related parts replaced with fix in fs/namei.c:atomic_open() instead of messing with the instances) + race fix in autofs + leak on failure exit in 9p" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: 9p: don't forget to destroy inode cache if fscache registration fails atomic_open: take care of EEXIST in no-open case with O_CREAT\|O_EXCL in fs/namei.c vfs: don't set FILE_CREATED before calling ->atomic_open() nfs: set FILE_CREATED gfs2: set FILE_CREATED cifs: fix filp leak in cifs_atomic_open() vfs: improve i_op->atomic_open() documentation autofs4: close the races around autofs4_notify_daemon()	2013-09-18 19:22:22 -05:00
Linus Torvalds	9baa505948	Three pstore fixes related to compression: 1) Better adjustment of size of compression buffer (was too big for EFIVARS backend resulting in compression failure 2) Use zlib_inflateInit2 instead of zlib_inflateInit 3) Don't print messages about compression failure. They will waste space that may better be used to log console output leading to the crash. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJSOeAIAAoJEKurIx+X31iBq8wP/1MthA3CDTVFl2beFNXEo8G/ Sq3YAfTHj61f+UKT2489WSyYwc6Q3y4iEia+shCu28DkuQZMifH8KoDfsoJAKF1X SVsm5MkelhXEDlmt94AnEXmNIgQMnJ1c5uToTanNz/UbpUZdsdVzP+c4ifUC1mX3 m+uARA2oy7obVm0RihXEzRhMZAOdkq0TXxL4TVaZShjDPuxN5BSQGlNB13+6LAEM Q54HI/j9RHVFiIxT7INttyOMvDps2zDNJtsVgiphp0bBQBWzY1puJJykM/T64ZJV /UMsycoKLJdLi3pnwWtZ1USTk4EwkjjVWCtUHtan6wEt1rDbrkWaMU1RvTASBz9Z 418EUAob0FZuL0ZdaN4WgYc04xwgc748S/PcUtkFfvk8KqhQbmkgbdVu6cs/mJmQ Jbi+ATJda1zmCEQXZBLENfe7o4yiGgKjOWWy5/tbtMi8a6cpMIPUn9phNXNoRvBb II0iMKwZetuOkDDqJAtZwPUiYNdRHWLosn+66AjpYARXqrCnRfi87x4WMWYJ4CVR RMxrn6YQT3DIDxnBd00zVepdK9ee8It10t7k07f6Ve/EdvOJZK9lSg/FUp9MhL5a N6S9X2gQ0R2wDHjFNRyL8p0xIoe45zFXPICLYaqcDxEcC0G7bd1AxGZ5y9v+/qvK 76dJvg0f1E/TsoqhQw79 =E5IH -----END PGP SIGNATURE----- Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux Pull pstore/compression fixes from Tony Luck: "Three pstore fixes related to compression: 1) Better adjustment of size of compression buffer (was too big for EFIVARS backend resulting in compression failure 2) Use zlib_inflateInit2 instead of zlib_inflateInit 3) Don't print messages about compression failure. They will waste space that may better be used to log console output leading to the crash" * tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: pstore: Remove the messages related to compression failure pstore: Use zlib_inflateInit2 instead of zlib_inflateInit pstore: Adjust buffer size for compression for smaller registered buffers	2013-09-18 12:39:40 -05:00
Jeff Layton	9ae6cf606a	cifs: stop trying to use virtual circuits Currently, we try to ensure that we use vcnum of 0 on the first established session on a connection and then try to use a different vcnum on each session after that. This is a little odd, since there's no real reason to use a different vcnum for each SMB session. I can only assume there was some confusion between SMB sessions and VCs. That's somewhat understandable since they both get created during SESSION_SETUP, but the documentation indicates that they are really orthogonal. The comment on max_vcs in particular looks quite misguided. An SMB session is already uniquely identified by the SMB UID value -- there's no need to again uniquely ID with a VC. Furthermore, a vcnum of 0 is a cue to the server that it should release any resources that were previously held by the client. This sounds like a good thing, until you consider that: a) it totally ignores the fact that other programs on the box (e.g. smbclient) might have connections established to the server. Using a vcnum of 0 causes them to get kicked off. b) it causes problems with NAT. If several clients are connected to the same server via the same NAT'ed address, whenever one connects to the server it kicks off all the others, which then reconnect and kick off the first one...ad nauseum. I don't see any reason to ignore the advice in "Implementing CIFS" which has a comprehensive treatment of virtual circuits. In there, it states "...and contrary to the specs the client should always use a VcNumber of one, never zero." Have the client just use a hardcoded vcnum of 1, and stop abusing the special behavior of vcnum 0. Reported-by: Sauron99@gmx.de <sauron99@gmx.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Volker Lendecke <vl@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-18 10:23:44 -05:00
David Howells	54afa99057	CIFS: FS-Cache: Uncache unread pages in cifs_readpages() before freeing them In cifs_readpages(), we may decide we don't want to read a page after all - but the page may already have passed through fscache_read_or_alloc_pages() and thus have marks and reservations set. Thus we have to call fscache_readpages_cancel() or fscache_uncache_page() on the pages we're returning to clear the marks. NFS, AFS and 9P should be unaffected by this as they call read_cache_pages() which does the cleanup for you. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2013-09-18 10:17:03 -05:00
Maxim Patlasov	0ab08f576b	fuse: fix fallocate vs. ftruncate race A former patch introducing FUSE_I_SIZE_UNSTABLE flag provided detailed description of races between ftruncate and anyone who can extend i_size: > 1. As in the previous scenario fuse_dentry_revalidate() discovered that i_size > changed (due to our own fuse_do_setattr()) and is going to call > truncate_pagecache() for some 'new_size' it believes valid right now. But by > the time that particular truncate_pagecache() is called ... > 2. fuse_do_setattr() returns (either having called truncate_pagecache() or > not -- it doesn't matter). > 3. The file is extended either by write(2) or ftruncate(2) or fallocate(2). > 4. mmap-ed write makes a page in the extended region dirty. This patch adds necessary bits to fuse_file_fallocate() to protect from that race. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org	2013-09-18 14:19:59 +02:00
Maxim Patlasov	bde52788bd	fuse: wait for writeback in fuse_file_fallocate() The patch fixes a race between mmap-ed write and fallocate(PUNCH_HOLE): 1) An user makes a page dirty via mmap-ed write. 2) The user performs fallocate(2) with mode == PUNCH_HOLE\|KEEP_SIZE and <offset, size> covering the page. 3) Before truncate_pagecache_range call from fuse_file_fallocate, the page goes to write-back. The page is fully processed by fuse_writepage (including end_page_writeback on the page), but fuse_flush_writepages did nothing because fi->writectr < 0. 4) truncate_pagecache_range is called and fuse_file_fallocate is finishing by calling fuse_release_nowrite. The latter triggers processing queued write-back request which will write stale data to the hole soon. Changed in v2 (thanks to Brian for suggestion): - Do not truncate page cache until FUSE_FALLOCATE succeeded. Otherwise, we can end up in returning -ENOTSUPP while user data is already punched from page cache. Use filemap_write_and_wait_range() instead. Changed in v3 (thanks to Miklos for suggestion): - fuse_wait_on_writeback() is prone to livelocks; use fuse_set_nowrite() instead. So far as we need a dirty-page barrier only, fuse_sync_writes() should be enough. - rebased to for-linus branch of fuse.git Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org	2013-09-18 14:19:59 +02:00
Al Viro	8061a6fa56	9p: don't forget to destroy inode cache if fscache registration fails Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-17 22:31:01 -04:00
Al Viro	03da633aa7	atomic_open: take care of EEXIST in no-open case with O_CREAT\|O_EXCL in fs/namei.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-17 17:08:50 -04:00
Bjorn Helgaas	adbe6991ef	bio-integrity: Fix use of bs->bio_integrity_pool after free This fixes a copy and paste error introduced by `9f060e2231` ("block: Convert integrity to bvec_alloc_bs()"). Found by Coverity (CID 1020654). Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Kent Overstreet <koverstreet@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-09-17 12:46:24 -06:00
Miklos Szeredi	116cc02253	vfs: don't set FILE_CREATED before calling ->atomic_open() If O_CREAT\|O_EXCL are passed to open, then we know that either - the file is successfully created, or - the operation fails in some way. So previously we set FILE_CREATED before calling ->atomic_open() so the filesystem doesn't have to. This, however, led to bugs in the implementation that went unnoticed when the filesystem didn't check for existence, yet returned success. To prevent this kind of bug, require filesystems to always explicitly set FILE_CREATED on O_CREAT\|O_EXCL and verify this in the VFS. Also added a couple more verifications for the result of atomic_open(): - Warn if filesystem set FILE_CREATED despite the lack of O_CREAT. - Warn if filesystem set FILE_CREATED but gave a negative dentry. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-16 19:17:24 -04:00
Miklos Szeredi	01c919abaf	nfs: set FILE_CREATED Set FILE_CREATED on O_CREAT\|O_EXCL. If the NFS server honored our request for exclusivity then this must be correct. Currently this is a no-op, since the VFS sets FILE_CREATED anyway. The next patch will, however, require this flag to be always set by filesystems. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-16 19:17:24 -04:00
Miklos Szeredi	c5bf8fef52	gfs2: set FILE_CREATED In gfs2_create_inode() set FILE_CREATED in *opened. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-16 19:17:24 -04:00
Miklos Szeredi	dfb1d61b0e	cifs: fix filp leak in cifs_atomic_open() If an error occurs after having called finish_open() then fput() needs to be called on the already opened file. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Steve French <sfrench@samba.org> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-16 19:17:24 -04:00
Miklos Szeredi	0854d450e2	vfs: improve i_op->atomic_open() documentation Fix documentation of ->atomic_open() and related functions: finish_open() and finish_no_open(). Also add details that seem to be unclear and a source of bugs (some of which are fixed in the following series). Cc-ing maintainers of all filesystems implementing ->atomic_open(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Sage Weil <sage@inktank.com> Cc: Steve French <sfrench@samba.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-16 19:17:24 -04:00
Al Viro	606035e76e	autofs4: close the races around autofs4_notify_daemon() Don't drop ->wq_mutex before calling autofs4_notify_daemon() only to regain it there. Besides being pointless, that opens a race window where autofs4_wait_release() could've come and freed wq->name.name. And do the debugging printk in the "reused an existing wq" case before dropping ->wq_mutex - the same reason... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Ian Kent <raven@themaw.net>	2013-09-16 19:16:38 -04:00
Linus Torvalds	3369d11693	Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6 Pull CIFS fixes from Steve French: "Two minor cifs fixes and a minor documentation cleanup for cifs.txt" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: cifs: update cifs.txt and remove some outdated infos cifs: Avoid calling unlock_page() twice in cifs_readpage() when using fscache cifs: Do not take a reference to the page in cifs_readpage_worker()	2013-09-16 15:39:21 -04:00

1 2 3 4 5 ...

33580 Commits