Commit Graph

35092 Commits

Author SHA1 Message Date
Al Viro
d311d79de3 fix O_SYNC|O_APPEND syncing the wrong range on write()
It actually goes back to 2004 ([PATCH] Concurrent O_SYNC write support)
when sync_page_range() had been introduced; generic_file_write{,v}() correctly
synced
	pos_after_write - written .. pos_after_write - 1
but generic_file_aio_write() synced
	pos_before_write .. pos_before_write + written - 1
instead.  Which is not the same thing with O_APPEND, obviously.
A couple of years later correct variant had been killed off when
everything switched to use of generic_file_aio_write().

All users of generic_file_aio_write() are affected, and the same bug
has been copied into other instances of ->aio_write().

The fix is trivial; the only subtle point is that generic_write_sync()
ought to be inlined to avoid calculations useless for the majority of
calls.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-02-09 15:18:09 -05:00
Linus Torvalds
9c1db77981 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This is a small collection of fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix data corruption when reading/updating compressed extents
  Btrfs: don't loop forever if we can't run because of the tree mod log
  btrfs: reserve no transaction units in btrfs_ioctl_set_features
  btrfs: commit transaction after setting label and features
  Btrfs: fix assert screwup for the pending move stuff
2014-02-09 11:12:26 -08:00
Filipe David Borba Manana
a2aa75e18a Btrfs: fix data corruption when reading/updating compressed extents
When using a mix of compressed file extents and prealloc extents, it
is possible to fill a page of a file with random, garbage data from
some unrelated previous use of the page, instead of a sequence of zeroes.

A simple sequence of steps to get into such case, taken from the test
case I made for xfstests, is:

   _scratch_mkfs
   _scratch_mount "-o compress-force=lzo"
   $XFS_IO_PROG -f -c "pwrite -S 0x06 -b 18670 266978 18670" $SCRATCH_MNT/foobar
   $XFS_IO_PROG -c "falloc 26450 665194" $SCRATCH_MNT/foobar
   $XFS_IO_PROG -c "truncate 542872" $SCRATCH_MNT/foobar
   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar

This results in the following file items in the fs tree:

   item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
       inode generation 6 transid 6 size 542872 block group 0 mode 100600
   item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16
       inode ref index 2 namelen 6 name: foobar
   item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
       extent data disk byte 0 nr 0 gen 6
       extent data offset 0 nr 24576 ram 266240
       extent compression 0
   item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53
       prealloc data disk byte 12849152 nr 241664 gen 6
       prealloc data offset 0 nr 241664
   item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53
       extent data disk byte 12845056 nr 4096 gen 6
       extent data offset 0 nr 20480 ram 20480
       extent compression 2
   item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53
       prealloc data disk byte 13090816 nr 405504 gen 6
       prealloc data offset 0 nr 258048

The on disk extent at offset 266240 (which corresponds to 1 single disk block),
contains 5 compressed chunks of file data. Each of the first 4 compress 4096
bytes of file data, while the last one only compresses 3024 bytes of file data.
Therefore a read into the file region [285648 ; 286720[ (length = 4096 - 3024 =
1072 bytes) should always return zeroes (our next extent is a prealloc one).

The solution here is the compression code path to zero the remaining (untouched)
bytes of the last page it uncompressed data into, as the information about how
much space the file data consumes in the last page is not known in the upper layer
fs/btrfs/extent_io.c:__do_readpage(). In __do_readpage we were correctly zeroing
the remainder of the page but only if it corresponds to the last page of the inode
and if the inode's size is not a multiple of the page size.

This would cause not only returning random data on reads, but also permanently
storing random data when updating parts of the region that should be zeroed.
For the example above, it means updating a single byte in the region [285648 ; 286720[
would store that byte correctly but also store random data on disk.

A test case for xfstests follows soon.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-02-08 17:57:15 -08:00
Josef Bacik
27a377db74 Btrfs: don't loop forever if we can't run because of the tree mod log
A user reported a 100% cpu hang with my new delayed ref code.  Turns out I
forgot to increase the count check when we can't run a delayed ref because of
the tree mod log.  If we can't run any delayed refs during this there is no
point in continuing to look, and we need to break out.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-02-08 17:57:15 -08:00
David Sterba
8051aa1a3d btrfs: reserve no transaction units in btrfs_ioctl_set_features
Added in patch "btrfs: add ioctls to query/change feature bits online"
modifications to superblock don't need to reserve metadata blocks when
starting a transaction.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-02-08 17:57:15 -08:00
Jeff Mahoney
d0270aca88 btrfs: commit transaction after setting label and features
The set_fslabel ioctl uses btrfs_end_transaction, which means it's
possible that the change will be lost if the system crashes, same for
the newly set features. Let's use btrfs_commit_transaction instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-02-08 17:57:15 -08:00
Josef Bacik
6cc98d90f8 Btrfs: fix assert screwup for the pending move stuff
Wang noticed that he was failing btrfs/030 even though me and Filipe couldn't
reproduce.  Turns out this is because Wang didn't have CONFIG_BTRFS_ASSERT set,
which meant that a key part of Filipe's original patch was not being built in.
This appears to be a mess up with merging Filipe's patch as it does not exist in
his original patch.  Fix this by changing how we make sure del_waiting_dir_move
asserts that it did not error and take the function out of the ifdef check.
This makes btrfs/030 pass with the assert on or off.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-02-08 17:57:15 -08:00
Linus Torvalds
ec2e6cb24a Fix regression
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJS9mGFAAoJEDaohF61QIxkMhMP/0657lE8Wn+thff6kX7mNs1v
 Ctxc6V2723t2D4xGkIuOrYMqsziRkUeXKZOQTKvRYes95RkA1q3GUvv+xDwP1kVI
 f6Mqg3TZOiCTm/3z8uJUlYq1q6Hjp8E8QWg3Qs9sFEYPh7WFo2BWiDLjIA2iKkr8
 YJx7jiRE2lzWw2DPBSY7cv9IUuJkgaIRMaMI477adU0m/BvmG73waG9NAo0sjmdK
 RPTC4/j2p8+PooPzXS7dHUkbxbTnqSIGuXEu3XED1Hsj2vqzIbi3zw0cvRuK02D4
 LpsKIcyaEG6ApxK7RdZJeauSmpok4s/sPWwJIqn+1X4foQF28AwCZY0HIe1q/W25
 +i99soVoX04Uby3l02PKKlHZLAfQo0GTl/3itpG2nV84SB7i2T2Quwi8X7q6iqrK
 +2r2Ro7upTGmRZqyCTyZX+/rEFLMGCdSVUmPnz+NWISmTTIvaXg82A7eTJtvWdXv
 TFjPij++s4fJEH8D4RmpgDlCBlCoqrZvl2tR1YWdKyDOpFMcBEuVJ8mNGSadJT8v
 ekLejc3CVDyLj6xscwFmWXTQEwVOKtoSHBlpFqSIBORJnOcN+59Vb3EKv9IhD/O5
 oLdcDuBSPUsOw+PMe5n4xjjr7bMrIuf4PiIWiOPJ8Y1dlOmGZKiEWyS4devZM4KG
 PyhJKEO/JUw3FLT9Ni3M
 =QFjr
 -----END PGP SIGNATURE-----

Merge tag 'jfs-3.14-rc2' of git://github.com/kleikamp/linux-shaggy

Pull jfs fix from David Kleikamp:
 "Fix regression"

* tag 'jfs-3.14-rc2' of git://github.com/kleikamp/linux-shaggy:
  jfs: fix generic posix ACL regression
2014-02-08 10:13:47 -08:00
Dave Kleikamp
c18f7b5120 jfs: fix generic posix ACL regression
I missed a couple errors in reviewing the patches converting jfs
to use the generic posix ACL function. Setting ACL's currently
fails with -EOPNOTSUPP.

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-02-08 10:50:58 -06:00
Linus Torvalds
34a9bff4ab Driver core fix for 3.14-rc2
Here is a single kernfs fix to resolve a much-reported lockdep issue with the
 removal of entries in sysfs.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iEYEABECAAYFAlL1WqIACgkQMUfUDdst+ykauACeMlmM0Ro0nCjU5e9Dq9qFw0ZJ
 u2oAn3qxgNIRKIjZTxDfXmXgFr0sTfTW
 =UyO3
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fix from Greg KH:
 "Here is a single kernfs fix to resolve a much-reported lockdep issue
  with the removal of entries in sysfs"

* tag 'driver-core-3.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  kernfs: make kernfs_deactivate() honor KERNFS_LOCKDEP flag
2014-02-07 14:17:18 -08:00
KOSAKI Motohiro
227d53b397 mm: __set_page_dirty uses spin_lock_irqsave instead of spin_lock_irq
To use spin_{un}lock_irq is dangerous if caller disabled interrupt.
During aio buffer migration, we have a possibility to see the following
call stack.

aio_migratepage  [disable interrupt]
  migrate_page_copy
    clear_page_dirty_for_io
      set_page_dirty
        __set_page_dirty_buffers
          __set_page_dirty
            spin_lock_irq

This mean, current aio migration is a deadlockable.  spin_lock_irqsave
is a safer alternative and we should use it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: David Rientjes rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-06 13:48:51 -08:00
Zongxun Wang
fb951eb5e1 ocfs2: free allocated clusters if error occurs after ocfs2_claim_clusters
Even if using the same jbd2 handle, we cannot rollback a transaction.
So once some error occurs after successfully allocating clusters, the
allocated clusters will never be used and it means they are lost.  For
example, call ocfs2_claim_clusters successfully when expanding a file,
but failed in ocfs2_insert_extent.  So we need free the allocated
clusters if they are not used indeed.

Signed-off-by: Zongxun Wang <wangzongxun@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-06 13:48:51 -08:00
Linus Torvalds
c4ad8f98be execve: use 'struct filename *' for executable name passing
This changes 'do_execve()' to get the executable name as a 'struct
filename', and to free it when it is done.  This is what the normal
users want, and it simplifies and streamlines their error handling.

The controlled lifetime of the executable name also fixes a
use-after-free problem with the trace_sched_process_exec tracepoint: the
lifetime of the passed-in string for kernel users was not at all
obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize
the pathname allocation lifetime with the execve() having finished,
which in turn meant that the trace point that happened after
mm_release() of the old process VM ended up using already free'd memory.

To solve the kernel string lifetime issue, this simply introduces
"getname_kernel()" that works like the normal user-space getname()
function, except with the source coming from kernel memory.

As Oleg points out, this also means that we could drop the tcomm[] array
from 'struct linux_binprm', since the pathname lifetime now covers
setup_new_exec().  That would be a separate cleanup.

Reported-by: Igor Zhbanov <i.zhbanov@samsung.com>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-05 12:54:53 -08:00
Tejun Heo
da9846ae15 kernfs: make kernfs_deactivate() honor KERNFS_LOCKDEP flag
kernfs_deactivate() forgot to check whether KERNFS_LOCKDEP is set
before performing lockdep annotations and ends up feeding
uninitialized lockdep_map to lockdep triggering warning like the
following on USB stick hotunplug.

 usb 1-2: USB disconnect, device number 2
 INFO: trying to register non-static key.
 the code is fine but needs lockdep annotation.
 turning off the locking correctness validator.
 CPU: 1 PID: 62 Comm: khubd Not tainted 3.13.0-work+ #82
 Hardware name: empty empty/S3992, BIOS 080011  10/26/2007
  ffff880065ca7f60 ffff88013a4ffa08 ffffffff81cfb6bd 0000000000000002
  ffff88013a4ffac8 ffffffff810f8530 ffff88013a4fc710 0000000000000002
  ffff880100000000 ffffffff82a3db50 0000000000000001 ffff88013a4fc710
 Call Trace:
  [<ffffffff81cfb6bd>] dump_stack+0x4e/0x7a
  [<ffffffff810f8530>] __lock_acquire+0x1910/0x1e70
  [<ffffffff810f931a>] lock_acquire+0x9a/0x1d0
  [<ffffffff8127c75e>] kernfs_deactivate+0xee/0x130
  [<ffffffff8127d4c8>] kernfs_addrm_finish+0x38/0x60
  [<ffffffff8127d701>] kernfs_remove_by_name_ns+0x51/0xa0
  [<ffffffff8127b4f1>] remove_files.isra.1+0x41/0x80
  [<ffffffff8127b7e7>] sysfs_remove_group+0x47/0xa0
  [<ffffffff8127b873>] sysfs_remove_groups+0x33/0x50
  [<ffffffff8177d66d>] device_remove_attrs+0x4d/0x80
  [<ffffffff8177e25e>] device_del+0x12e/0x1d0
  [<ffffffff819722c2>] usb_disconnect+0x122/0x1a0
  [<ffffffff819749b5>] hub_thread+0x3c5/0x1290
  [<ffffffff810c6a6d>] kthread+0xed/0x110
  [<ffffffff81d0a56c>] ret_from_fork+0x7c/0xb0

Fix it by making kernfs_deactivate() perform lockdep annotations only
if KERNFS_LOCKDEP is set.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fabio Estevam <festevam@gmail.com>
Reported-by: Alan Stern <stern@rowland.harvard.edu>
Reported-by: Jiri Kosina <jkosina@suse.cz>
Reported-by: Dave Jones <davej@redhat.com>
Tested-by: Fabio Estevam <fabio.estevam@freescale.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-05 11:44:04 -08:00
Linus Torvalds
878a876b2e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Filipe is fixing compile and boot problems with our crc32c rework, and
  Josef has disabled snapshot aware defrag for now.

  As the number of snapshots increases, we're hitting OOM.  For the
  short term we're disabling things until a bigger fix is ready"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: use late_initcall instead of module_init
  Btrfs: use btrfs_crc32c everywhere instead of libcrc32c
  Btrfs: disable snapshot aware defrag for now
2014-02-04 12:26:56 -08:00
Linus Torvalds
d7512f79fd NFS client bugfixes for Linux 3.14
Highlights:
 
 - Fix NFSv3 acl regressions
 - Fix NFSv4 memory corruption due to slot table abuse in nfs4_proc_open_confirm
 - nfs4_destroy_session must call rpc_destroy_waitqueue
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJS8RmYAAoJEGcL54qWCgDynYEP/1Ip6/IoaqOfT5kAYo05tw30
 Wqk0vNZWqdkuK/T8kUQmcbaN6Gt29QRhhJzDBhIrQ+7n8LsWwiZ3zoi9m3p2Ry7U
 M+xWcE6P7Cc44w3XLfll/xKb3ktfhXFQUeTHxsQ3X7uBsBP8TcfYabWHFrrkSe2B
 aubj/JqebLm2Uz7Zb03T4JF6LKd+3uHBr7SDK03zTUaPtn+bcVGBwS9MlFTt17Ma
 2zMFMM4TNNPDGJWJMvI460BYB4nySINCA4I+78sSLOQbiDc6yZd1Zq/Ngtvs+AlW
 /+m3eEY3ljcawCSAHMkoef4G5ljHV8cndtr3WBtAUjGq/H6V8eGdW4PB/A2UGYRH
 Vdcxrm6Bn7IDRJ6baq72DuX1ZyTDPBllRz0m0l08w8p2cuK7lFDSF7/HS2MNOb2i
 8Ai02CG9+aWZCz1uuF0vngKViP+hfmJqrLr1OgLuSggPd1kbqU/+HnNGtLmDfi/0
 K0ceuMvIuMlahlYgb7XbGFN7RIcbIYXXhBM2dUQo+/TtGvfjHVI8EmmlC7CtEXdW
 TeVc9gxWgKIlF1p06aoWFzdbBkK3CtNCXgh8vQpFRSqvoAQC/mBVQW9Iank77hRA
 94wiNZVLKaEO+l2WmfuGLhZ9kkLb/Iq5NVA5IUGWgjyT1yJsR5faTFfRhM+yjABE
 IwQF5ntXqxz82nWCRbSl
 =kKEV
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights:

   - Fix NFSv3 acl regressions
   - Fix NFSv4 memory corruption due to slot table abuse in
     nfs4_proc_open_confirm
   - nfs4_destroy_session must call rpc_destroy_waitqueue"

* tag 'nfs-for-3.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  fs: get_acl() must be allowed to return EOPNOTSUPP
  NFSv3: Fix return value of nfs3_proc_setacls
  NFSv3: Remove unused function nfs3_proc_set_default_acl
  NFSv4.1: nfs4_destroy_session must call rpc_destroy_waitqueue
  NFSv4: Fix memory corruption in nfs4_proc_open_confirm
  nfs: fix setting of ACLs on file creation.
2014-02-04 12:26:16 -08:00
Trond Myklebust
88a78a912e Merge branch 'acl_fixes' into linux-next 2014-02-03 17:13:45 -05:00
Trond Myklebust
789b663ae3 fs: get_acl() must be allowed to return EOPNOTSUPP
posix_acl_xattr_get requires get_acl() to return EOPNOTSUPP if the
filesystem cannot support acls. This is needed for NFS, which can't
know whether or not the server supports acls until it tries to get/set
one.
This patch converts posix_acl_chmod and posix_acl_create to deal with
EOPNOTSUPP return values from get_acl().

Reported-by: Russell King <linux@arm.linux.org.uk>
Link: http://lkml.kernel.org/r/20140130140834.GW15937@n2100.arm.linux.org.uk
Cc: Al Viro viro@zeniv.linux.org.uk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-02-03 17:12:37 -05:00
Trond Myklebust
8f493b9cfc NFSv3: Fix return value of nfs3_proc_setacls
nfs3_proc_setacls is used internally by the NFSv3 create operations
to set the acl after the file has been created. If the operation
fails because the server doesn't support acls, then it must return '0',
not -EOPNOTSUPP.

Reported-by: Russell King <linux@arm.linux.org.uk>
Link: http://lkml.kernel.org/r/20140201010328.GI15937@n2100.arm.linux.org.uk
Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-02-03 13:14:23 -05:00
Trond Myklebust
d4c42fb493 NFSv3: Remove unused function nfs3_proc_set_default_acl
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-02-03 13:13:50 -05:00
Filipe David Borba Manana
60efa5eb2e Btrfs: use late_initcall instead of module_init
It seems that when init_btrfs_fs() is called, crc32c/crc32c-intel might
not always be already initialized, which results in the call to crypto_alloc_shash()
returning -ENOENT, as experienced by Ahmet who reported this.

Therefore make sure init_btrfs_fs() is called after crc32c is initialized (which
is at initialization level 6, module_init), by using late_initcall (which is at
initialization level 7) instead of module_init for btrfs.

Reported-and-Tested-by: Ahmet Inan <ainan@mathematik.uni-freiburg.de>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-02-03 09:01:28 -08:00
Filipe David Borba Manana
0b947aff15 Btrfs: use btrfs_crc32c everywhere instead of libcrc32c
After the commit titled "Btrfs: fix btrfs boot when compiled as built-in",
LIBCRC32C requirement was removed from btrfs' Kconfig. This made it not
possible to build a kernel with btrfs enabled (either as module or built-in)
if libcrc32c is not enabled as well. So just replace all uses of libcrc32c
with the equivalent function in btrfs hash.h - btrfs_crc32c.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-02-03 09:01:27 -08:00
Josef Bacik
8101c8dbf6 Btrfs: disable snapshot aware defrag for now
It's just broken and it's taking a lot of effort to fix it, so for now just
disable it so people can defrag in peace.  Thanks,

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-02-03 09:01:27 -08:00
Mikulas Patocka
1c0b8a7a62 hpfs: optimize quad buffer loading
HPFS needs to load 4 consecutive 512-byte sectors when accessing the
directory nodes or bitmaps.  We can't switch to 2048-byte block size
because files are allocated in the units of 512-byte sectors.

Previously, the driver would allocate a 2048-byte area using kmalloc,
copy the data from four buffers to this area and eventually copy them
back if they were modified.

In the current implementation of the buffer cache, buffers are allocated
in the pagecache.  That means that 4 consecutive 512-byte buffers are
stored in consecutive areas in the kernel address space.  So, we don't
need to allocate extra memory and copy the content of the buffers there.

This patch optimizes the code to avoid copying the buffers.  It checks
if the four buffers are stored in contiguous memory - if they are not,
it falls back to allocating a 2048-byte area and copying data there.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-02 16:24:07 -08:00
Mikulas Patocka
2cbe5c76fc hpfs: remember free space
Previously, hpfs scanned all bitmaps each time the user asked for free
space using statfs.  This patch changes it so that hpfs scans the
bitmaps only once, remembes the free space and on next invocation of
statfs it returns the value instantly.

New versions of wine are hammering on the statfs syscall very heavily,
making some games unplayable when they're stored on hpfs, with load
times in minutes.

This should be backported to the stable kernels because it fixes
user-visible problem (excessive level load times in wine).

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-02 16:24:07 -08:00
Trond Myklebust
20b9a90245 NFSv4.1: nfs4_destroy_session must call rpc_destroy_waitqueue
There may still be timers active on the session waitqueues. Make sure
that we kill them before freeing the memory.

Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-02-01 15:13:39 -05:00
Trond Myklebust
17ead6c85c NFSv4: Fix memory corruption in nfs4_proc_open_confirm
nfs41_wake_and_assign_slot() relies on the task->tk_msg.rpc_argp and
task->tk_msg.rpc_resp always pointing to the session sequence arguments.

nfs4_proc_open_confirm tries to pull a fast one by reusing the open
sequence structure, thus causing corruption of the NFSv4 slot table.

Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-02-01 15:13:39 -05:00
Pali Rohár
1bda2ac071 afs: proc cells and rootcell are writeable
Both proc files are writeable and used for configuring cells. But
there is missing correct mode flag for writeable files. Without
this patch both proc files are read only.

[ It turns out they aren't really read-only, since root can write to
  them even if the write bit isn't set due to CAP_DAC_OVERRIDE ]

Signed-off-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-01 10:59:39 -08:00
Linus Torvalds
0f44bc36ba Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
 "A set of cifs fixes (mostly for symlinks, and SMB2 xattrs) and
  cleanups"

* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Fix check for regular file in couldbe_mf_symlink()
  [CIFS] Fix SMB2 mounts so they don't try to set or get xattrs via cifs
  CIFS: Cleanup cifs open codepath
  CIFS: Remove extra indentation in cifs_sfu_type
  CIFS: Cleanup cifs_mknod
  CIFS: Cleanup CIFSSMBOpen
  cifs: Add support for follow_link on dfs shares under posix extensions
  cifs: move unix extension call to cifs_query_symlink()
  cifs: Re-order M-F Symlink code
  cifs: Add create MFSymlinks to protocol ops struct
  cifs: use protocol specific call for query_mf_symlink()
  cifs: Rename MF symlink function names
  cifs: Rename and cleanup open_query_close_cifs_symlink()
  cifs: Fix memory leak in cifs_hardlink()
2014-02-01 10:52:45 -08:00
Linus Torvalds
efc518eb31 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Several obvious fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  Fix mountpoint reference leakage in linkat
  hfsplus: use xattr handlers for removexattr
  Typo in compat_sys_lseek() declaration
  fs/super.c: sync ro remount after blocking writers
  vfs: unexport the getname() symbol
2014-02-01 10:43:45 -08:00
Noah Massey
718360c59f nfs: fix setting of ACLs on file creation.
nfs3_get_acl() tries to skip posix equivalent ACLs, but misinterprets
the return value of posix_acl_equiv_mode(). Fix it.

This is a regression introduced by
"nfs: use generic posix ACL infrastructure for v3 Posix ACLs"

CC: Christoph Hellwig <hch@infradead.org>
CC: linux-nfs@vger.kernel.org
CC: linux-fsdevel@vger.kernel.org

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-31 20:36:07 -05:00
Linus Torvalds
8a1f006ad3 NFS client bugfixes for Linux 3.14
Highlights:
 
 - Fix several races in nfs_revalidate_mapping
 - NFSv4.1 slot leakage in the pNFS files driver
 - Stable fix for a slot leak in nfs40_sequence_done
 - Don't reject NFSv4 servers that support ACLs with only ALLOW aces
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJS7Bb+AAoJEGcL54qWCgDyDuQP/17nKR5e6MLhixcAbvlcH+pN
 8CGolAM3HmRXDWUW/PkBH3UguG8Tzx1Ex26vIxipPeTSwZabf6194Twj6L97DEGZ
 2SouD158BW1TkAbhEN/alKB/4ZCPos05iXjZkrL7MRff+8FD0UvWR2pBT1F2jQdY
 ZftG76Q72qhZHfH07ZMxM/v4Oy2Ge98RDD35gfuuqMSjHpmN9tiB55PeheW33LVY
 fu6I/JEwmlJpgy2qUcDv7v0V4mDpjC7XbcjjHpMHL8zp/C5Rx/rdgt9OQPlwmjdV
 FD8MWNXLc5TWxIouLDFPVUv3WZPjyu449QHS9Wc95fSqsHcdl4j4SwLAoSvUIdHt
 vDI5PtWhw3WAezbtiuCQnT0xcoIOn5bLjOVP13taDcV9vlZLcFlyOpZ5gVE4/Yju
 zm4sCW2+imDc74ERGa4fukF6QhzzAVmD8RlCJwuNzwCfXiZ36+xSanMYiPoUiwLL
 OVNgymrm0fe7GVFQKWN2D+Vr68OQEmARO+KfA3UzP5rQV+9CU8zSHjbcoRWZ59QG
 VahOS5WDLQSrMp8W37yAHH9IiAWveAAKJJTHlOniRqH90QYPgyW18fTo7YcpW313
 AQGFgr/1n4t27MWRLu5rdoN5v8+kwNi0UV6oboNIPoP1v15NkEMvc7HKFj5M883R
 qEYfe5wqN/eRNj68NT/+
 =B7f0
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights:

   - Fix several races in nfs_revalidate_mapping
   - NFSv4.1 slot leakage in the pNFS files driver
   - Stable fix for a slot leak in nfs40_sequence_done
   - Don't reject NFSv4 servers that support ACLs with only ALLOW aces"

* tag 'nfs-for-3.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: initialize the ACL support bits to zero.
  NFSv4.1: Cleanup
  NFSv4.1: Clean up nfs41_sequence_done
  NFSv4: Fix a slot leak in nfs40_sequence_done
  NFSv4.1 free slot before resending I/O to MDS
  nfs: add memory barriers around NFS_INO_INVALID_DATA and NFS_INO_INVALIDATING
  NFS: Fix races in nfs_revalidate_mapping
  sunrpc: turn warn_gssd() log message into a dprintk()
  NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mapping
  nfs: handle servers that support only ALLOW ACE type.
2014-01-31 15:39:07 -08:00
Oleg Drokin
d22e6338db Fix mountpoint reference leakage in linkat
Recent changes to retry on ESTALE in linkat
(commit 442e31ca5a)
introduced a mountpoint reference leak and a small memory
leak in case a filesystem link operation returns ESTALE
which is pretty normal for distributed filesystems like
lustre, nfs and so on.
Free old_path in such a case.

[AV: there was another missing path_put() nearby - on the previous
goto retry]

Signed-off-by: Oleg Drokin: <green@linuxhacker.ru>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-31 17:33:13 -05:00
Christoph Hellwig
b168fff721 hfsplus: use xattr handlers for removexattr
hfsplus was already using the handlers for get and set operations,
and with the removal of can_set_xattr we've now allow operations that
wouldn't otherwise be allowed.

With this we can also centralize the special-casing of the osx.
attrs that don't have prefixes on disk in the osx xattr handlers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-31 14:44:39 -05:00
Andrew Ruder
807612db2f fs/super.c: sync ro remount after blocking writers
Move sync_filesystem() after sb_prepare_remount_readonly().  If writers
sneak in anywhere from sync_filesystem() to sb_prepare_remount_readonly()
it can cause inodes to be dirtied and writeback to occur well after
sys_mount() has completely successfully.

This was spotted by corrupted ubifs filesystems on reboot, but appears
that it can cause issues with any filesystem using writeback.

Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Richard Weinberger <richard@nod.at>
Co-authored-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Andrew Ruder <andrew.ruder@elecsyscorp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-31 14:29:36 -05:00
Jeff Layton
9115eac2c7 vfs: unexport the getname() symbol
Leaving getname() exported when putname() isn't is a bad idea.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-31 14:28:56 -05:00
Linus Torvalds
e30b82bbe0 Minor bug fix for linux-3.14
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJS67pQAAoJEDaohF61QIxkfkcQAKAuNhBtaGkx8Dd9gPM98DrW
 F4NnitUNzcjHTSq23oRIe7rLanXTMHKMhq4S83YWISLTWCAmekOVF4xs8WTEm1Px
 5/X3an65m3mvW/1V7oxixA+GLUk2LwPqPt7N+jWApkR9KB5LydQCucdbq4Vza9gd
 SkoK5RJe8bTE+43k962vsDJM7XCVeUOGOSXO7MyTaipRZ+fyuEPwwbQ0u7Bxl9JG
 HkfhCJUu3V79nruIgmP1Z2llN2XIv/fsCo3F42yBspSBUxYqnpnAdo6Zr5srnEQC
 kHm67VFk5sPnLPnm6AxaoonE9GX+TMzy5eUk6Lb5x5EwRlH694FFovlquFyr3RjN
 cmh6VWGv1jcYnEmRhKjT8Vs0GQ+Q5L2L6dEdDbck2WkVWPDdLb83wbPJRZ/g5kcF
 GAUI8EjjSR4ezuZqmv6tbWHpn/q7p6NekC8DVMmwtoBZgIuFmITyrqpFJUG4acnO
 yuGdZtYePnMHZlQMNTdzUjuOu/Wyqit5OrM/ztmrGlYCI8eiE17OInWikZzSFFcf
 +uDodCg5tokG8AlNgNZ+thcpjO7aAR9GCOD2INZYKlFRkJANpQ+OOK+AOafmWoqO
 UjmL0rH3FXRu0uhD8VLKRaUfuyMaBHJ/9+ndwmyOa+7x0BPv03hw3bGGhy6tsTHg
 XbxlExsTrtpKEXwznWqG
 =F/GX
 -----END PGP SIGNATURE-----

Merge tag 'jfs-3.14' of git://github.com/kleikamp/linux-shaggy

Pull jfs fix from David Kleikamp:
 "Minor bug fix for linux-3.14"

* tag 'jfs-3.14' of git://github.com/kleikamp/linux-shaggy:
  jfs: fix xattr value size overflow in __jfs_setxattr
2014-01-31 08:14:35 -08:00
Sage Weil
77516dc92a ceph: fix missing dput in ceph_set_acl
Add matching dput() for d_find_alias().  Move d_find_alias() down a bit
at Julia's suggestion.

[ Introduced by commit 72466d0b92: "ceph: fix posix ACL hooks" ]

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-31 08:14:06 -08:00
Sachin Prabhu
a9a315d414 cifs: Fix check for regular file in couldbe_mf_symlink()
MF Symlinks are regular files containing content in a specified format.

The function couldbe_mf_symlink() checks the mode for a set S_IFREG bit
as a test to confirm that it is a regular file. This bit is also set for
other filetypes and simply checking for this bit being set may return
false positives.

We ensure that we are actually checking for a regular file by using the
S_ISREG macro to test instead.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reported-by: Neil Brown <neilb@suse.de>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-01-31 09:06:43 -06:00
Malahal Naineni
a1800acaf7 nfs: initialize the ACL support bits to zero.
Avoid returning incorrect acl mask attributes when the server doesn't
support ACLs.

Signed-off-by: Malahal Naineni <malahal@us.ibm.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-31 08:28:16 -05:00
Linus Torvalds
e7651b819e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This is a pretty big pull, and most of these changes have been
  floating in btrfs-next for a long time.  Filipe's properties work is a
  cool building block for inheriting attributes like compression down on
  a per inode basis.

  Jeff Mahoney kicked in code to export filesystem info into sysfs.

  Otherwise, lots of performance improvements, cleanups and bug fixes.

  Looks like there are still a few other small pending incrementals, but
  I wanted to get the bulk of this in first"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (149 commits)
  Btrfs: fix spin_unlock in check_ref_cleanup
  Btrfs: setup inode location during btrfs_init_inode_locked
  Btrfs: don't use ram_bytes for uncompressed inline items
  Btrfs: fix btrfs_search_slot_for_read backwards iteration
  Btrfs: do not export ulist functions
  Btrfs: rework ulist with list+rb_tree
  Btrfs: fix memory leaks on walking backrefs failure
  Btrfs: fix send file hole detection leading to data corruption
  Btrfs: add a reschedule point in btrfs_find_all_roots()
  Btrfs: make send's file extent item search more efficient
  Btrfs: fix to catch all errors when resolving indirect ref
  Btrfs: fix protection between walking backrefs and root deletion
  btrfs: fix warning while merging two adjacent extents
  Btrfs: fix infinite path build loops in incremental send
  btrfs: undo sysfs when open_ctree() fails
  Btrfs: fix snprintf usage by send's gen_unique_name
  btrfs: fix defrag 32-bit integer overflow
  btrfs: sysfs: list the NO_HOLES feature
  btrfs: sysfs: don't show reserved incompat feature
  btrfs: call permission checks earlier in ioctls and return EPERM
  ...
2014-01-30 20:08:20 -08:00
Linus Torvalds
271bf66d4c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull some further ceph acl cleanups from Sage Weil:
 "I do have a couple patches on top of what's in your tree, though, that
  clean up a couple duplicated lines in your fix and apply Christoph's
  cleanup"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: simplify ceph_{get,init}_acl
  ceph: remove duplicate declaration of ceph_setattr
2014-01-30 20:02:51 -08:00
Christoph Hellwig
7585823619 ceph: simplify ceph_{get,init}_acl
- ->get_acl only gets called after we checked for a cached ACL, so no
   need to call get_cached_acl again.
 - no need to check IS_POSIXACL in ->get_acl, without that it should
   never get set as all the callers that set it already have the check.
 - you should be able to use the full posix_acl_create in CEPH

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Sage Weil <sage@inktank.com>
2014-01-30 19:26:17 -08:00
Linus Torvalds
53d8ab29f8 Merge branch 'for-3.14/drivers' of git://git.kernel.dk/linux-block
Pull block IO driver changes from Jens Axboe:

 - bcache update from Kent Overstreet.

 - two bcache fixes from Nicholas Swenson.

 - cciss pci init error fix from Andrew.

 - underflow fix in the parallel IDE pg_write code from Dan Carpenter.
   I'm sure the 1 (or 0) users of that are now happy.

 - two PCI related fixes for sx8 from Jingoo Han.

 - floppy init fix for first block read from Jiri Kosina.

 - pktcdvd error return miss fix from Julia Lawall.

 - removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from
   Michael Opdenacker.

 - comment typo fix for the loop driver from Olaf Hering.

 - potential oops fix for null_blk from Raghavendra K T.

 - two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing
   an OOM problem and a problem with handling security locked conditions

* 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits)
  mg_disk: Spelling s/finised/finished/
  null_blk: Null pointer deference problem in alloc_page_buffers
  mtip32xx: Correctly handle security locked condition
  mtip32xx: Make SGL container per-command to eliminate high order dma allocation
  drivers/block/loop.c: fix comment typo in loop_config_discard
  drivers/block/cciss.c:cciss_init_one(): use proper errnos
  drivers/block/paride/pg.c: underflow bug in pg_write()
  drivers/block/sx8.c: remove unnecessary pci_set_drvdata()
  drivers/block/sx8.c: use module_pci_driver()
  floppy: bail out in open() if drive is not responding to block0 read
  bcache: Fix auxiliary search trees for key size > cacheline size
  bcache: Don't return -EINTR when insert finished
  bcache: Improve bucket_prio() calculation
  bcache: Add bch_bkey_equal_header()
  bcache: update bch_bkey_try_merge
  bcache: Move insert_fixup() to btree_keys_ops
  bcache: Convert sorting to btree_keys
  bcache: Convert debug code to btree_keys
  bcache: Convert btree_iter to struct btree_keys
  bcache: Refactor bset_tree sysfs stats
  ...
2014-01-30 11:40:10 -08:00
Linus Torvalds
f568849eda Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block
Pull core block IO changes from Jens Axboe:
 "The major piece in here is the immutable bio_ve series from Kent, the
  rest is fairly minor.  It was supposed to go in last round, but
  various issues pushed it to this release instead.  The pull request
  contains:

   - Various smaller blk-mq fixes from different folks.  Nothing major
     here, just minor fixes and cleanups.

   - Fix for a memory leak in the error path in the block ioctl code
     from Christian Engelmayer.

   - Header export fix from CaiZhiyong.

   - Finally the immutable biovec changes from Kent Overstreet.  This
     enables some nice future work on making arbitrarily sized bios
     possible, and splitting more efficient.  Related fixes to immutable
     bio_vecs:

        - dm-cache immutable fixup from Mike Snitzer.
        - btrfs immutable fixup from Muthu Kumar.

  - bio-integrity fix from Nic Bellinger, which is also going to stable"

* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
  xtensa: fixup simdisk driver to work with immutable bio_vecs
  block/blk-mq-cpu.c: use hotcpu_notifier()
  blk-mq: for_each_* macro correctness
  block: Fix memory leak in rw_copy_check_uvector() handling
  bio-integrity: Fix bio_integrity_verify segment start bug
  block: remove unrelated header files and export symbol
  blk-mq: uses page->list incorrectly
  blk-mq: use __smp_call_function_single directly
  btrfs: fix missing increment of bi_remaining
  Revert "block: Warn and free bio if bi_end_io is not set"
  block: Warn and free bio if bi_end_io is not set
  blk-mq: fix initializing request's start time
  block: blk-mq: don't export blk_mq_free_queue()
  block: blk-mq: make blk_sync_queue support mq
  block: blk-mq: support draining mq queue
  dm cache: increment bi_remaining when bi_end_io is restored
  block: fixup for generic bio chaining
  block: Really silence spurious compiler warnings
  block: Silence spurious compiler warnings
  block: Kill bio_pair_split()
  ...
2014-01-30 11:19:05 -08:00
Linus Torvalds
d9894c228b Merge branch 'for-3.14' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
 - Handle some loose ends from the vfs read delegation support.
   (For example nfsd can stop breaking leases on its own in a
    fewer places where it can now depend on the vfs to.)
 - Make life a little easier for NFSv4-only configurations
   (thanks to Kinglong Mee).
 - Fix some gss-proxy problems (thanks Jeff Layton).
 - miscellaneous bug fixes and cleanup

* 'for-3.14' of git://linux-nfs.org/~bfields/linux: (38 commits)
  nfsd: consider CLAIM_FH when handing out delegation
  nfsd4: fix delegation-unlink/rename race
  nfsd4: delay setting current_fh in open
  nfsd4: minor nfs4_setlease cleanup
  gss_krb5: use lcm from kernel lib
  nfsd4: decrease nfsd4_encode_fattr stack usage
  nfsd: fix encode_entryplus_baggage stack usage
  nfsd4: simplify xdr encoding of nfsv4 names
  nfsd4: encode_rdattr_error cleanup
  nfsd4: nfsd4_encode_fattr cleanup
  minor svcauth_gss.c cleanup
  nfsd4: better VERIFY comment
  nfsd4: break only delegations when appropriate
  NFSD: Fix a memory leak in nfsd4_create_session
  sunrpc: get rid of use_gssp_lock
  sunrpc: fix potential race between setting use_gss_proxy and the upcall rpc_clnt
  sunrpc: don't wait for write before allowing reads from use-gss-proxy file
  nfsd: get rid of unused function definition
  Define op_iattr for nfsd4_open instead using macro
  NFSD: fix compile warning without CONFIG_NFSD_V3
  ...
2014-01-30 10:18:43 -08:00
Christoph Hellwig
5f13ee9c1c nfs: fix xattr inode op pointers when disabled
Chris Mason reported a NULL pointer derefernence in generic_getxattr()
that was due to sb->s_xattr being NULL.

The reason is that the nfs #ifdef's for ACL support were misplaced, and
the nfs3 inode operations had the xattr operation pointers set up, even
though xattrs were not actually supported.  As a result, the xattr code
was being called without the infrastructure having been set up.

Move the #ifdef's appropriately.

Reported-and-tested-by: Chris Mason <clm@fb.com>
Acked-by: Al Viro viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 09:37:49 -08:00
Peter Rosin
32d35d44d0 ceph: remove duplicate declaration of ceph_setattr
Signed-off-by: Peter Rosin <peda@lysator.liu.se>
Signed-off-by: Sage Weil <sage@inktank.com>
2014-01-30 08:38:00 -08:00
Linus Torvalds
13293115d1 Merge branch 'akpm' (patches from Andrew Morton)
Merge random fixes from Andrew Morton:
 "Random fixes.

  I have one batch remaining for -rc1, mainly zram changes which await a
  merge of Jens's trees"

* emailed patches fron Andrew Morton akpm@linux-foundation.org>:
  MAINTAINERS: ADI Linux development mailing lists: change to the new server
  Documentation: fix multiple typo occurences s/KenelVersion/KernelVersion/
  dma-debug: fix overlap detection
  memblock: add limit checking to memblock_virt_alloc
  mm/readahead.c: fix do_readahead() for no readpage(s)
  mm/slub.c: do not VM_BUG_ON_PAGE() for temporary on-stack pages
  slab: fix wrong retval on kmem_cache_create_memcg error path
  s390/compat: change parameter types from unsigned long to compat_ulong_t
  fs/compat: fix lookup_dcookie() parameter handling
  fs/compat: fix parameter handling for compat readv/writev syscalls
  mm/mempolicy.c: convert to pr_foo()
  mm: numa: initialise numa balancing after jump label initialisation
  mm/page-writeback.c: do not count anon pages as dirtyable memory
  mm/page-writeback.c: fix dirty_balance_reserve subtraction from dirtyable memory
  mm: document improved handling of swappiness==0
  lib/genalloc.c: add check gen_pool_dma_alloc() if dma pointer is not NULL
2014-01-29 16:22:54 -08:00
Heiko Carstens
d8d14bd09c fs/compat: fix lookup_dcookie() parameter handling
Commit d5dc77bfee ("consolidate compat lookup_dcookie()") coverted all
architectures to the new compat_sys_lookup_dcookie() syscall.

The "len" paramater of the new compat syscall must have the type
compat_size_t in order to enforce zero extension for architectures where
the ABI requires that the caller of a function performed zero and/or
sign extension to 64 bit of all parameters.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <stable@vger.kernel.org>	[v3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-29 16:22:40 -08:00
Heiko Carstens
dfd948e32a fs/compat: fix parameter handling for compat readv/writev syscalls
We got a report that the pwritev syscall does not work correctly in
compat mode on s390.

It turned out that with commit 72ec35163f ("switch compat readv/writev
variants to COMPAT_SYSCALL_DEFINE") we lost the zero extension of a
couple of syscall parameters because the some parameter types haven't
been converted from unsigned long to compat_ulong_t.

This is needed for architectures where the ABI requires that the caller
of a function performed zero and/or sign extension to 64 bit of all
parameters.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <stable@vger.kernel.org>	[v3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-29 16:22:39 -08:00
Linus Torvalds
1ecd7450c0 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fanotify use-after-free fixes from Jan Kara:
 "Three fixes for the fanotify use after free problems guys were
  reporting.

  I have ended up with different lifetime rules for struct
  fanotify_event_info depending on whether it is for permission event or
  normal event which isn't ideal.  My plan is to split these into two
  different structures (as permission events need larger struct anyway)
  which will make the rules trivial again.  But that can wait for later
  I guess (but I can add the patch to the pile if you want), now I
  wanted to make -rc1 boot for these guys"

[ "These guys" being Jiri Kosina and Dave Jones that reported the slab
  corruption issues due to incorrect object lifetimes ]

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fanotify: Fix use after free for permission events
  fsnotify: Do not return merged event from fsnotify_add_notify_event()
  fanotify: Fix use after free in mask checking
2014-01-29 16:09:34 -08:00
Sage Weil
72466d0b92 ceph: fix posix ACL hooks
The merge of commit 7221fe4c2e ("ceph: add acl for cephfs") raced with
upstream changes in the generic POSIX ACL code (eg commit 2aeccbe957
"fs: add generic xattr_acl handlers" and others).

Some of the fallout was fixed in commit 4db658ea0c ("ceph: Fix up after
semantic merge conflict"), but it was incomplete: the set_acl
inode_operation wasn't getting set, and the prototype needed to be
adjusted a bit (it doesn't take a dentry anymore).

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-29 16:05:28 -08:00
Trond Myklebust
905e7dafbe NFSv4.1: Cleanup
It is now completely safe to call nfs41_sequence_free_slot with a NULL
slot.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-29 12:26:57 -05:00
Trond Myklebust
a13ce7c629 NFSv4.1: Clean up nfs41_sequence_done
Move the test for res->sr_slot == NULL out of the nfs41_sequence_free_slot
helper and into the main function for efficiency.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-29 12:24:03 -05:00
Trond Myklebust
cab92c1982 NFSv4: Fix a slot leak in nfs40_sequence_done
The check for whether or not we sent an RPC call in nfs40_sequence_done
is insufficient to decide whether or not we are holding a session slot,
and thus should not be used to decide when to free that slot.

This patch replaces the RPC_WAS_SENT() test with the correct test for
whether or not slot == NULL.

Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-29 12:12:15 -05:00
Andy Adamson
f9c96fcc50 NFSv4.1 free slot before resending I/O to MDS
Fix a dynamic session slot leak where a slot is preallocated and I/O is
resent through the MDS.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-29 11:54:55 -05:00
Chris Mason
cf93da7bcf Btrfs: fix spin_unlock in check_ref_cleanup
Our goto out should have gone a little farther.

Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:31 -08:00
Chris Mason
90d3e592e9 Btrfs: setup inode location during btrfs_init_inode_locked
We have a race during inode init because the BTRFS_I(inode)->location is setup
after the inode hash table lock is dropped.  btrfs_find_actor uses the location
field, so our search might not find an existing inode in the hash table if we
race with the inode init code.

This commit changes things to setup the location field sooner.  Also the find actor now
uses only the location objectid to match inodes.  For inode hashing, we just
need a unique and stable test, it doesn't have to reflect the inode numbers we
show to userland.

Signed-off-by: Chris Mason <clm@fb.com>
CC: stable@vger.kernel.org
2014-01-29 07:06:30 -08:00
Chris Mason
514ac8ad87 Btrfs: don't use ram_bytes for uncompressed inline items
If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect
the new size.  The fixe uses the size directly from the item header when
reading uncompressed inlines, and also fixes truncate to update the
size as it goes.

Reported-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
CC: stable@vger.kernel.org
2014-01-29 07:06:29 -08:00
Filipe David Borba Manana
23c6bf6a91 Btrfs: fix btrfs_search_slot_for_read backwards iteration
If the current path's leaf slot is 0, we do search for the previous
leaf (via btrfs_prev_leaf) and set the new path's leaf slot to a
value corresponding to the number of items - 1 of the former leaf.
Fix this by using the slot set by btrfs_prev_leaf, decrementing it
by 1 if it's equal to the leaf's number of items.

Use of btrfs_search_slot_for_read() for backward iteration is used in
particular by the send feature, which could miss items when the input
leaf has less items than its previous leaf.

This could be reproduced by running btrfs/007 from xfstests in a loop.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:28 -08:00
Wang Shilong
49fc647a2c Btrfs: do not export ulist functions
There are not any users that use ulist except Btrfs,don't
export them.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:27 -08:00
Wang Shilong
4c7a6f74ce Btrfs: rework ulist with list+rb_tree
We are really suffering from now ulist's implementation, some developers
gave their try, and i just gave some of my ideas for things:

 1. use list+rb_tree instead of arrary+rb_tree

 2. add cur_list to iterator rather than ulist structure.

 3. add seqnum into every node when they are added, this is
 used to do selfcheck when iterating node.

I noticed Zach Brown's comments before, long term is to kick off
ulist implementation, however, for now, we need at least avoid
arrary from ulist.

Cc: Liu Bo <bo.li.liu@oracle.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Zach Brown <zab@redhat.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:27 -08:00
Wang Shilong
f05c474688 Btrfs: fix memory leaks on walking backrefs failure
When walking backrefs, we may iterate every inode's extent
and add/merge them into ulist, and the caller will free memory
from ulist.

However, if we fail to allocate inode's extents element
memory or ulist_add() fail to allocate memory, we won't
add allocated memory into ulist, and the caller won't
free some allocated memory thus memory leaks happen.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:26 -08:00
Filipe David Borba Manana
bf54f412f0 Btrfs: fix send file hole detection leading to data corruption
There was a case where file hole detection was incorrect and it would
cause an incremental send to override a section of a file with zeroes.

This happened in the case where between the last leaf we processed which
contained a file extent item for our current inode and the leaf we're
currently are at (and has a file extent item for our current inode) there
are only leafs containing exclusively file extent items for our current
inode, and none of them was updated since the previous send operation.
The file hole detection code would incorrectly consider the file range
covered by these leafs as a hole.

A test case for xfstests follows soon.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:25 -08:00
Wang Shilong
bca1a29003 Btrfs: add a reschedule point in btrfs_find_all_roots()
I can easily trigger the following warnings when enabling quota
in my virtual machine(running Opensuse), Steps are firstly creating
a subvolume full of fragment extents, and then create many snapshots
(500 in my test case).

[ 2362.808459] BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-qgroup-re:1970]

[ 2362.809023] task: e4af8450 ti: e371c000 task.ti: e371c000
[ 2362.809026] EIP: 0060:[<fa38f4ae>] EFLAGS: 00000246 CPU: 0
[ 2362.809049] EIP is at __merge_refs+0x5e/0x100 [btrfs]
[ 2362.809051] EAX: 00000000 EBX: cfadbcf0 ECX: 00000000 EDX: cfadbcb0
[ 2362.809052] ESI: dd8d3370 EDI: e371dde0 EBP: e371dd6c ESP: e371dd5c
[ 2362.809054]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 2362.809055] CR0: 80050033 CR2: ac454d50 CR3: 009a9000 CR4: 001407d0
[ 2362.809099] Stack:
[ 2362.809100]  00000001 e371dde0 dfcc6890 f29f8000 e371de28 fa39016d 00000011 00000001
[ 2362.809105]  99bfc000 00000000 93928000 00000000 00000001 00000050 e371dda8 00000001
[ 2362.809109]  f3a31000 f3413000 00000001 e371ddb8 000040a8 00000202 00000000 00000023
[ 2362.809113] Call Trace:
[ 2362.809136]  [<fa39016d>] find_parent_nodes+0x34d/0x1280 [btrfs]
[ 2362.809156]  [<fa391172>] btrfs_find_all_roots+0xb2/0x110 [btrfs]
[ 2362.809174]  [<fa3934a8>] btrfs_qgroup_rescan_worker+0x358/0x7a0 [btrfs]
[ 2362.809180]  [<c024d0ce>] ? lock_timer_base.isra.39+0x1e/0x40
[ 2362.809199]  [<fa3648df>] worker_loop+0xff/0x470 [btrfs]
[ 2362.809204]  [<c027a88a>] ? __wake_up_locked+0x1a/0x20
[ 2362.809221]  [<fa3647e0>] ? btrfs_queue_worker+0x2b0/0x2b0 [btrfs]
[ 2362.809225]  [<c025ebbc>] kthread+0x9c/0xb0
[ 2362.809229]  [<c06b487b>] ret_from_kernel_thread+0x1b/0x30
[ 2362.809233]  [<c025eb20>] ? kthread_create_on_node+0x110/0x110

By adding a reschedule point at the end of btrfs_find_all_roots(), i no longer
hit these warnings.

Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:25 -08:00
Filipe David Borba Manana
7fdd29d02e Btrfs: make send's file extent item search more efficient
Instead of looking for a file extent item, process it, release the path
and do a btree search for the next file extent item, just process all
file extent items in a leaf without intermediate btree searches. This way
we save cpu and we're not blocking other tasks or affecting concurrency on
the btree, because send's paths use the commit root and skip btree node/leaf
locking.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:24 -08:00
Wang Shilong
95def2ede1 Btrfs: fix to catch all errors when resolving indirect ref
We can only tolerate ENOENT here, for other errors, we should
return directly.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:23 -08:00
Wang Shilong
538f72cdf0 Btrfs: fix protection between walking backrefs and root deletion
There is a race condition between resolving indirect ref and root deletion,
and we should gurantee that root can not be destroyed to avoid accessing
broken tree here.

Here we fix it by holding @subvol_srcu, and we will release it as soon
as we have held root node lock.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:23 -08:00
Gui Hecheng
3c9665df0c btrfs: fix warning while merging two adjacent extents
When we have two adjacent extents in relink_extent_backref,
we try to merge them. When we use btrfs_search_slot to locate the
slot for the current extent, we shouldn't set "ins_len = 1",
because we will merge it into the previous extent rather than
insert a new item. Otherwise, we may happen to create a new leaf
in btrfs_search_slot and path->slot[0] will be 0. Then we try to
fetch the previous item using "path->slots[0]--", and it will cause
a warning as follows:

	[  145.713385] WARNING: CPU: 3 PID: 1796 at fs/btrfs/extent_io.c:5043 map_private_extent_buffer+0xd4/0xe0
	[  145.713387] btrfs bad mapping eb start 5337088 len 4096, wanted 167772306 8
	...
	[  145.713462]  [<ffffffffa034b1f4>] map_private_extent_buffer+0xd4/0xe0
	[  145.713476]  [<ffffffffa030097a>] ? btrfs_free_path+0x2a/0x40
	[  145.713485]  [<ffffffffa0340864>] btrfs_get_token_64+0x64/0xf0
	[  145.713498]  [<ffffffffa033472c>] relink_extent_backref+0x41c/0x820
	[  145.713508]  [<ffffffffa0334d69>] btrfs_finish_ordered_io+0x239/0xa80

I encounter this warning when running defrag having mkfs.btrfs
with option -M. At the same time there are read/writes & snapshots
running at background.

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:22 -08:00
Filipe David Borba Manana
9f03740a95 Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.

This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:

  WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
  Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
  CPU: 1 PID: 5705 Comm: btrfs Tainted: G           O 3.13.0-rc7-fdm-btrfs-next-18+ #3
  Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
  [ 5381.660441]  00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
  [ 5381.660447]  0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
  [ 5381.660452]  0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
  [ 5381.660457] Call Trace:
  [ 5381.660464]  [<ffffffff81777434>] dump_stack+0x4e/0x68
  [ 5381.660471]  [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
  [ 5381.660476]  [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
  [ 5381.660480]  [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
  [ 5381.660487]  [<ffffffff8108313f>] ? local_clock+0x4f/0x60
  [ 5381.660491]  [<ffffffff811430e8>] ? free_one_page+0x98/0x440
  [ 5381.660495]  [<ffffffff8108313f>] ? local_clock+0x4f/0x60
  [ 5381.660502]  [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
  [ 5381.660508]  [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
  [ 5381.660515]  [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
  [ 5381.660520]  [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
  [ 5381.660524]  [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
  [ 5381.660530]  [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
  [ 5381.660536]  [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
  [ 5381.660560]  [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
  [ 5381.660564]  [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
  [ 5381.660569]  [<ffffffff811580ef>] krealloc+0x6f/0xb0
  [ 5381.660586]  [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
  [ 5381.660601]  [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
  [ 5381.660615]  [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
  [ 5381.660628]  [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]

Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:

  $ mkfs.btrfs -f /dev/sdb3
  $ mount /dev/sdb3 /mnt/btrfs
  $ mkdir -p /mnt/btrfs/a/b/c/d
  $ mkdir /mnt/btrfs/a/b/c2
  $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
  $ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
  $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
  $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
  $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send

The structure of the filesystem when the first snapshot is taken is:

	 .                       (ino 256)
	 |-- a                   (ino 257)
	     |-- b               (ino 258)
	         |-- c           (ino 259)
	         |   |-- d       (ino 260)
                 |
	         |-- c2          (ino 261)

And its structure when the second snapshot is taken is:

	 .                       (ino 256)
	 |-- a                   (ino 257)
	     |-- b               (ino 258)
	         |-- c2          (ino 261)
	             |-- d2      (ino 260)
	                 |-- cc  (ino 259)

Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.

A test case for xfstests, with a more complex scenario, will follow soon.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:22 -08:00
Jan Kara
8581679424 fanotify: Fix use after free for permission events
Currently struct fanotify_event_info has been destroyed immediately
after reporting its contents to userspace. However that is wrong for
permission events because those need to stay around until userspace
provides response which is filled back in fanotify_event_info. So change
to code to free permission events only after we have got the response
from userspace.

Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz>
Reported-and-tested-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-01-29 13:57:17 +01:00
Jan Kara
83c0e1b442 fsnotify: Do not return merged event from fsnotify_add_notify_event()
The event returned from fsnotify_add_notify_event() cannot ever be used
safely as the event may be freed by the time the function returns (after
dropping notification_mutex). So change the prototype to just return
whether the event was added or merged into some existing event.

Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz>
Reported-and-tested-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-01-29 13:57:10 +01:00
Jan Kara
13116dfd13 fanotify: Fix use after free in mask checking
We cannot use the event structure returned from
fsnotify_add_notify_event() because that event can be freed by the time
that function returns. Use the mask argument passed into the event
handler directly instead. This also fixes a possible problem when we
could unnecessarily wait for permission response for a normal fanotify
event which got merged with a permission event.

We also disallow merging of permission event with any other event so
that we know the permission event which we just created is the one on
which we should wait for permission response.

Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz>
Reported-and-tested-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-01-29 13:57:04 +01:00
Linus Torvalds
0e47c969c6 MTD updates for 3.14:
- Add me (Brian Norris) as an additional MTD maintainer (it'd be nice to get
    David's "ack" for this; I'm sure he approves, but he's been pretty silent
    lately)
  - Add Ezequiel Garcie as maintainer for the pxa3xx NAND driver
  - Last (?) round of pxa3xx improvements for supporting Armada 370/XP
  - Typical churn in driver boilerplate (OOM messages, printk()'s, devm_*, etc.)
  - Quad read mode support for SPI NOR driver (m25p80)
  - Update Davinci NAND driver to prepare for use on new platforms
  - Begin to kill off NAND_MAX_{PAGE,OOB}SIZE macros; more work is pending
  - Miscellaneous NAND device support (new IDs)
  - Add READ RETRY support for Micron MLC NAND
  - Support new GPMI NAND ECC layout device-tree binding
  - Avoid mapping stack/vmalloc() memory for GPMI NAND DMA
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJS50wMAAoJEFySrpd9RFgt1I0P/3fxzmHMmTRPjNRmtVqE5buW
 L3CjI1sPypA5PgCjjwABQdiw8WztBt6KVlMv5Gs6gaoi/WYAyAWz40PfwuBWFklD
 L8+cuKuuslsFfqh+gdkrBkNKnrFfMJpTDx57E5+dciInmWWB1w14sHMAYk5A/ZHK
 MBfXCcgfIECeQp1zSKmBtQxkhqn20hOS232gsCRWu2NWWu4UCzEcRLHGCBP1qqi3
 6joQhrICiQJwH09D+I9YfyS/oisM+75Df+79xephcjMaJYAAE3tLVrKM84CBso+R
 R/wjOvX+xFVyvf3n+CftxbmbFVXtRAWqCEEHu2yKffiF9oipp9xOm6gqJb9SYX/7
 n54fexx5cM1AE8OeDCxgbVCNfgoqDS6hk03d5oVMRmASShr0Ye4irG77Dy/KTMGC
 UdB2gb5HoKPC2N2IWWRlpwRoyLR9Xhf/+iCHjac0kB7oYQ0IdMyC0d4h7Prp5r/4
 zaz/pBNNr1lL8DOFrIog02U1DL7jOBtMmlaCUnfrPkXOiwG04Inkfy7Gfug0uTIh
 o1JhwCWmrY/CLYmpMqKFK9V4gr72CZTGIjMeNpb2NUAM0XlRbf3AQ8x7P7Q9UjK6
 ARZC/3lPrkSE8BDgSP8cT/oXB5zsIAU0jJbu3yIixD/v9WwMyKjoS4E8crl7qVB4
 KDAzQNR3rjwmE9lOjtsx
 =3nT9
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20140127' of git://git.infradead.org/linux-mtd

Pull MTD updates from Brian Norris:
 - Add me (Brian Norris) as an additional MTD maintainer (it'd be nice to get
   David's "ack" for this; I'm sure he approves, but he's been pretty silent
   lately)
 - Add Ezequiel Garcie as maintainer for the pxa3xx NAND driver
 - Last (?) round of pxa3xx improvements for supporting Armada 370/XP
 - Typical churn in driver boilerplate (OOM messages, printk()'s, devm_*, etc.)
 - Quad read mode support for SPI NOR driver (m25p80)
 - Update Davinci NAND driver to prepare for use on new platforms
 - Begin to kill off NAND_MAX_{PAGE,OOB}SIZE macros; more work is pending
 - Miscellaneous NAND device support (new IDs)
 - Add READ RETRY support for Micron MLC NAND
 - Support new GPMI NAND ECC layout device-tree binding
 - Avoid mapping stack/vmalloc() memory for GPMI NAND DMA

* tag 'for-linus-20140127' of git://git.infradead.org/linux-mtd: (151 commits)
  mtd: gpmi: add sanity check when mapping DMA for read_buf/write_buf
  mtd: gpmi: allocate a proper buffer for non ECC read/write
  mtd: m25p80: Set rx_nbits for Quad SPI transfers
  mtd: m25p80: Enable Quad SPI read transfers for s25fl512s
  mtd: s3c2410: Merge plat/regs-nand.h into s3c2410.c
  mtd: mtdram: add missing 'const'
  mtd: m25p80: assign default read command
  mtd: nuc900_nand: remove redundant return value check of platform_get_resource()
  mtd: plat_nand: remove redundant return value check of platform_get_resource()
  mtd: nand: add Intel manufacturer ID
  mtd: nand: add SanDisk manufacturer ID
  mtd: nand: add support for Samsung K9LCG08U0B
  mtd: nand: pxa3xx: Add support for 2048 bytes page size devices
  mtd: m25p80: Use OPCODE_QUAD_READ_4B for 4-byte addressing
  mtd: nand: don't use {read,write}_buf for 8-bit transfers
  mtd: nand: use __packed shorthand
  mtd: nand: support Micron READ RETRY
  mtd: nand: add generic READ RETRY support
  mtd: nand: add ONFI vendor block for Micron
  mtd: nand: localize ECC failures per page
  ...
2014-01-28 18:56:37 -08:00
Linus Torvalds
f1499382f1 xfs: update #2 for v3.14-rc1
- allow logical sector sized direct io on 'advanced format' 4k/512 disk.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJS6C5lAAoJENaLyazVq6ZOFFMP/Ag3IzyGv9zQ1I1ebqxdGNmU
 8bNVvurN+Az1ISgXkMdcbqNgpVxcWaAcpbJwxWsRBaKBxtBmBWa0TF33XEmrFdmO
 Qqw3qLSsPzIDP5l518/bpn1Q0XhMWO2zss9HtHogw/7WBkB5qQYn94DAwIF1Zrga
 cimpuiVv7u73BnlMJeQl+mqONkuEUl9Oc9lOyDP7lqbAlp3mBDylTs0q99kug8fF
 LSQ+f+06PFKshPzTtcZHhqwMtK7Pd+wPp3B3i0iGvv1GcWWL1adM5HfWPLLQ/85l
 CvdTnKnIWpNvmaEbPtpB+kuz7V3CtnLFIBM1oRiYZoNRNEUG3SkkqH+xSkXY+Wfr
 Wiuplt50vHUvn1tiL6n2GE0QAyTNhcy2RiWzpkIim6S2LUXCmdOKvT9hEEs5ymVl
 Y+VSXwdPEbCHSfhqkg+PXUbWxk+WTkDe9rOMFK4VcTjDuxiWUAiFgVpEHrxjPV4A
 rFPwNgfUpaRMou5SzTZHLuM8YQ75DckDH+O3XpWD4boEfeTGeV3LhlRzLvIkL0I2
 8VB24kV9TPUj+rphh9AckdU3kfN1wIb7ZOjwr008I5udfgPwjoj+kSgnd+k2yPtr
 r/ryHTV+RchbsiYEdriibSgUOtJNdNWnboba1DEDZDSvqALsq5eKWmpm3InE85Ix
 2bN0SdjsTphp/Mc1Mllg
 =c0tN
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-v3.14-rc1-2' of git://oss.sgi.com/xfs/xfs

Pull second xfs update from Ben Myers:
 "Allow logical sector sized direct io on 'advanced format' 4k/512 disk"

* tag 'xfs-for-linus-v3.14-rc1-2' of git://oss.sgi.com/xfs/xfs:
  xfs: allow logical-sector sized O_DIRECT
  xfs: rename xfs_buftarg structure members
  xfs: clean up xfs_buftarg
2014-01-28 18:21:22 -08:00
Linus Torvalds
4db658ea0c ceph: Fix up after semantic merge conflict
The previous ceph-client merge resulted in ceph not even building,
because there was a merge conflict that wasn't visible as an actual data
conflict: commit 7221fe4c2e ("ceph: add acl for cephfs") added support
for POSIX ACL's into Ceph, but unluckily we also had the VFS tree change
a lot of the POSIX ACL helper functions to be much more helpful to
filesystems (see for example commits 2aeccbe957 "fs: add generic
xattr_acl handlers", 5bf3258fd2 "fs: make posix_acl_chmod more useful"
and 37bc15392a "fs: make posix_acl_create more useful")

The reason this conflict wasn't obvious was many-fold: because it was a
semantic conflict rather than a data conflict, it wasn't visible in the
git merge as a conflict.  And because the VFS tree hadn't been in
linux-next, people hadn't become aware of it that way.  And because I
was at jury duty this morning, I was using my laptop and as a result not
doing constant "allmodconfig" builds.

Anyway, this fixes the build and generally removes a fair chunk of the
Ceph POSIX ACL support code, since the improved helpers seem to match
really well for Ceph too.  But I don't actually have any way to *test*
the end result, and I was really hoping for some ACK's for this.  Oh,
well.

Not compiling certainly doesn't make things easier to test, so I'm
committing this without the acks after having waited for four hours...
Plus it's what I would have done for the merge had I noticed the
semantic conflict..

Reported-by: Dave Jones <davej@redhat.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Guangliang Zhao <lucienchao@gmail.com>
Cc: Li Wang <li.wang@ubuntykylin.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-28 18:06:18 -08:00
Anand Jain
2365dd3ca0 btrfs: undo sysfs when open_ctree() fails
reproducer:
mkfs.btrfs -f /dev/sdb &&\
mount /dev/sdb /btrfs &&\
btrfs dev add -f /dev/sdc /btrfs &&\
umount /btrfs &&\
wipefs -a /dev/sdc &&\
mount -o degraded /dev/sdb /btrfs
//above mount fails so try with RO
mount -o degraded,ro /dev/sdb /btrfs

------
sysfs: cannot create duplicate filename '/fs/btrfs/3f48c79e-5ed0-4e87-b189-86e749e503f4'
::

dump_stack+0x49/0x5e
warn_slowpath_common+0x87/0xb0
warn_slowpath_fmt+0x41/0x50
strlcat+0x69/0x80
sysfs_warn_dup+0x87/0xa0
sysfs_add_one+0x40/0x50
create_dir+0x76/0xc0
sysfs_create_dir_ns+0x7a/0xc0
kobject_add_internal+0xad/0x220
kobject_add_varg+0x38/0x60
kobject_init_and_add+0x53/0x70
mutex_lock+0x11/0x40
__free_pages+0x25/0x30
free_pages+0x41/0x50
selinux_sb_copy_data+0x14e/0x1e0
mount_fs+0x3e/0x1a0
vfs_kern_mount+0x71/0x120
do_mount+0x3f7/0x980
SyS_mount+0x8b/0xe0
system_call_fastpath+0x16/0x1b
------

further 'modprobe -r btrfs' fails as well

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:44 -08:00
Filipe David Borba Manana
f74b86d855 Btrfs: fix snprintf usage by send's gen_unique_name
The buffer size argument passed to snprintf must account for the
trailing null byte added by snprintf, and it returns a value >= then
sizeof(buffer) when the string can't fit in the buffer.

Since our buffer has a size of 64 characters, and the maximum orphan
name we can generate is 63 characters wide, we must pass 64 as the
buffer size to snprintf, and not 63.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:43 -08:00
Justin Maggard
c41570c9d2 btrfs: fix defrag 32-bit integer overflow
When defragging a very large file, the cluster variable can wrap its 32-bit
signed int type and become negative, which eventually gets passed to
btrfs_force_ra() as a very large unsigned long value.  On 32-bit platforms,
this eventually results in an Oops from the SLAB allocator.

Change the cluster and max_cluster signed int variables to unsigned long to
match the readahead functions.  This also allows the min() comparison in
btrfs_defrag_file() to work as intended.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:43 -08:00
David Sterba
c736c095de btrfs: sysfs: list the NO_HOLES feature
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:42 -08:00
David Sterba
66b4bbd4f5 btrfs: sysfs: don't show reserved incompat feature
The COMPRESS_LZOv2 incompat featue is currently not implemented, the bit
is only reserved, no point to list it in sysfs.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:41 -08:00
David Sterba
bd60ea0fe9 btrfs: call permission checks earlier in ioctls and return EPERM
The owner and capability checks in IOC_SUBVOL_SETFLAGS and
SET_RECEIVED_SUBVOL should be called before any other checks are done.

Also unify the error code to EPERM.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:41 -08:00
David Sterba
d024206133 btrfs: restrict snapshotting to own subvolumes
Currently, any user can snapshot any subvolume if the path is accessible and
thus indirectly create and keep files he does not own under his direcotries.
This is not possible with traditional directories.

In security context, a user can snapshot root filesystem and pin any
potentially buggy binaries, even if the updates are applied.

All the snapshots are visible to the administrator, so it's possible to
verify if there are suspicious snapshots.

Another more practical problem is that any user can pin the space used
by eg. root and cause ENOSPC.

Original report:
https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/484786

CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:40 -08:00
Miao Xie
89d4346a36 Btrfs: fix wrong block group in trace during the free space allocation
We allocate the free space from the former block group, not the current
one, so should use the former one to output the trace information.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:40 -08:00
Miao Xie
215a63d139 Btrfs: cleanup the code of used_block_group in find_free_extent()
used_block_group is just used for the space cluster which doesn't
belong to the current block group, the other place needn't use it.
Or the logic of code seems unclear.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:39 -08:00
Miao Xie
920e4a58d2 Btrfs: cleanup the redundant code for the block group allocation and init
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:38 -08:00
Miao Xie
26b47ff65b Btrfs: change the members' order of btrfs_space_info structure to reduce the cache miss
It is better that the position of the lock is close to the data which is
protected by it, because they may be in the same cache line, we will load
less cache lines when we access them. So we rearrange the members' position
of btrfs_space_info structure to make the lock be closer to the its data.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:38 -08:00
Wang Shilong
ffcfaf8179 Btrfs: fix wrong search path initialization before searching tree root
To search tree root without transaction protection, we should neither search commit
root nor skip locking here, fix it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:37 -08:00
Miao Xie
23c671a588 Btrfs: flush the dirty pages of the ordered extent aggressively during logging csum
The performance of fsync dropped down suddenly sometimes, the main reason
of this problem was that we might only flush part dirty pages in a ordered
extent, then got that ordered extent, wait for the csum calcucation. But if
no task flushed the left part, we would wait until the flusher flushed them,
sometimes we need wait for several seconds, it made the performance drop
down suddenly. (On my box, it drop down from 56MB/s to 4-10MB/s)

This patch improves the above problem by flushing left dirty pages aggressively.

Test Environment:
CPU:		2CPU * 2Cores
Memory:		4GB
Partition:	20GB(HDD)

Test Command:
 # sysbench --num-threads=8 --test=fileio --file-num=1 \
 > --file-total-size=8G --file-block-size=32768 \
 > --file-io-mode=sync --file-fsync-freq=100 \
 > --file-fsync-end=no --max-requests=10000 \
 > --file-test-mode=rndwr run

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:37 -08:00
Wang Shilong
2c21b4d733 Btrfs: fix transaction abortion when remounting btrfs from RW to RO
Steps to reproduce:
 # mkfs.btrfs -f /dev/sda8
 # mount /dev/sda8 /mnt -o flushoncommit
 # dd if=/dev/zero of=/mnt/data bs=4k count=102400 &
 # mount /dev/sda8 /mnt -o remount, ro

When remounting RW to RO, the logic is to firstly set flag
to RO and then commit transaction, however with option
flushoncommit enabled,we will do RO check within committing
transaction, so we get a transaction abortion here.

Actually,here check is wrong, we should check if FS_STATE_ERROR
is set, fix it.

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Suggested-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:36 -08:00
Filipe David Borba Manana
e4355f34ef Btrfs: faster file extent item search in clone ioctl
When we are looking for file extent items that intersect the cloning
range, for each one that falls completely outside the range, don't
release the path and do another full tree search - just move on
to the next slot and copy the file extent item into our buffer only
if the item intersects the cloning range.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:35 -08:00
Liu Bo
1a4319cc3c Btrfs: fix extent state leak on transaction abortion
When transaction is aborted, we fail to commit transaction, instead we do
cleanup work.  After that when we umount btrfs, we get to free fs roots' log
trees respectively, but that happens after we unpin extents, so those extents
pinned by freeing log trees will remain in memory and lead to the leak.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:35 -08:00
Qu Wenruo
078025347c btrfs: Cleanup the btrfs_parse_options for remount.
Since remount will pending the new mount options to the original mount
options, which will make btrfs_parse_options check the old options then
new options, causing some stupid output like "enabling XXX" following by
"disable XXX".

This patch will add extra check before every btrfs_info to skip the
output from old options checking.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:34 -08:00
Qu Wenruo
3818aea275 btrfs: Add noinode_cache mount option
Add noinode_cache mount option for btrfs.

Since inode map cache involves all the btrfs_find_free_ino/return_ino
things and if just trigger the mount_opt,
an inode number get from inode map cache will not returned to inode map
cache.

To keep the find and return inode both in the same behavior,
a new bit in mount_opt, CHANGE_INODE_CACHE, is introduced for this idea.
CHANGE_INODE_CACHE is set/cleared in remounting, and the original
INODE_MAP_CACHE is set/cleared according to CHANGE_INODE_CACHE after a
success transaction.
Since find/return inode is all done between btrfs_start_transaction and
btrfs_commit_transaction, this will keep consistent behavior.

Also noinode_cache mount option will not stop the caching_kthread.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:33 -08:00
Wang Shilong
ade2e0b3ee Btrfs: fix to search previous metadata extent item since skinny metadata
There is a bug that using btrfs_previous_item() to search metadata extent item.
This is because in btrfs_previous_item(), we need type match, however, since
skinny metada was introduced by josef, we may mix this two types. So just
use btrfs_previous_item() is not working right.

To keep btrfs_previous_item() like normal tree search, i introduce another
function btrfs_previous_extent_item().

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:33 -08:00
Wang Shilong
7c76edb77c Btrfs: fix missing skinny metadata check in scrub_stripe()
Check if we support skinny metadata firstly and fix to use
right type to search.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:32 -08:00
Filipe David Borba Manana
28e5dd8f35 Btrfs: fix send to not send non-aligned clone operations
It is possible for the send feature to send clone operations that
request a cloning range (offset + length) that is not aligned with
the block size. This makes the btrfs receive command send issue a
clone ioctl call that will fail, as the ioctl will return an -EINVAL
error because of the unaligned range.

Fix this by not sending clone operations for non block aligned ranges,
and instead send regular write operation for these (less common) cases.

The following xfstest reproduces this issue, which fails on the second
btrfs receive command without this change:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  tmp=`mktemp -d`

  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -fr $tmp
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _need_to_be_root

  rm -f $seqres.full

  _scratch_mkfs >/dev/null 2>&1
  _scratch_mount

  $XFS_IO_PROG -f -c "truncate 819200" $SCRATCH_MNT/foo | _filter_xfs_io
  $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch

  $XFS_IO_PROG -c "falloc -k 819200 667648" $SCRATCH_MNT/foo | _filter_xfs_io
  $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch

  $XFS_IO_PROG -f -c "pwrite 1482752 2978" $SCRATCH_MNT/foo | _filter_xfs_io
  $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch

  $BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1 | \
      _filter_scratch

  $XFS_IO_PROG -f -c "truncate 883305" $SCRATCH_MNT/foo | _filter_xfs_io
  $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch

  $BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2 | \
      _filter_scratch

  $BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap 2>&1 | _filter_scratch
  $BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \
      -f $tmp/2.snap 2>&1 | _filter_scratch

  md5sum $SCRATCH_MNT/foo | _filter_scratch
  md5sum $SCRATCH_MNT/mysnap1/foo | _filter_scratch
  md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch

  _scratch_unmount
  _check_btrfs_filesystem $SCRATCH_DEV
  _scratch_mkfs >/dev/null 2>&1
  _scratch_mount

  $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap
  md5sum $SCRATCH_MNT/mysnap1/foo | _filter_scratch

  $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap
  md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch

  _scratch_unmount
  _check_btrfs_filesystem $SCRATCH_DEV

  status=0
  exit

The tests expected output is:

  QA output created by 025
  FSSync 'SCRATCH_MNT'
  FSSync 'SCRATCH_MNT'
  wrote 2978/2978 bytes at offset 1482752
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  FSSync 'SCRATCH_MNT'
  Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap1'
  FSSync 'SCRATCH_MNT'
  Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap2'
  At subvol SCRATCH_MNT/mysnap1
  At subvol SCRATCH_MNT/mysnap2
  129b8eaee8d3c2bcad49bec596591cb3  SCRATCH_MNT/foo
  42b6369eae2a8725c1aacc0440e597aa  SCRATCH_MNT/mysnap1/foo
  129b8eaee8d3c2bcad49bec596591cb3  SCRATCH_MNT/mysnap2/foo
  At subvol mysnap1
  42b6369eae2a8725c1aacc0440e597aa  SCRATCH_MNT/mysnap1/foo
  At snapshot mysnap2
  129b8eaee8d3c2bcad49bec596591cb3  SCRATCH_MNT/mysnap2/foo

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:32 -08:00
Filipe David Borba Manana
14a958e678 Btrfs: fix btrfs boot when compiled as built-in
After the change titled "Btrfs: add support for inode properties", if
btrfs was built-in the kernel (i.e. not as a module), it would cause a
kernel panic, as reported recently by Fengguang:

[    2.024722] BUG: unable to handle kernel NULL pointer dereference at           (null)
[    2.027814] IP: [<ffffffff81501594>] crc32c+0xc/0x6b
[    2.028684] PGD 0
[    2.028684] Oops: 0000 [#1] SMP
[    2.028684] Modules linked in:
[    2.028684] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc7-04795-ga7b57c2 #1
[    2.028684] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    2.028684] task: ffff88000edba100 ti: ffff88000edd6000 task.ti: ffff88000edd6000
[    2.028684] RIP: 0010:[<ffffffff81501594>]  [<ffffffff81501594>] crc32c+0xc/0x6b
[    2.028684] RSP: 0000:ffff88000edd7e58  EFLAGS: 00010246
[    2.028684] RAX: 0000000000000000 RBX: ffffffff82295550 RCX: 0000000000000000
[    2.028684] RDX: 0000000000000011 RSI: ffffffff81efe393 RDI: 00000000fffffffe
[    2.028684] RBP: ffff88000edd7e60 R08: 0000000000000003 R09: 0000000000015d20
[    2.028684] R10: ffffffff81ef225e R11: ffffffff811b0222 R12: ffffffffffffffff
[    2.028684] R13: 0000000000000239 R14: 0000000000000000 R15: 0000000000000000
[    2.028684] FS:  0000000000000000(0000) GS:ffff88000fa00000(0000) knlGS:0000000000000000
[    2.028684] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[    2.028684] CR2: 0000000000000000 CR3: 000000000220c000 CR4: 00000000000006f0
[    2.028684] Stack:
[    2.028684]  ffffffff82295550 ffff88000edd7e80 ffffffff8238af62 ffffffff8238ac05
[    2.028684]  0000000000000000 ffff88000edd7e98 ffffffff8238ac0f ffffffff8238ac05
[    2.028684]  ffff88000edd7f08 ffffffff810002ba ffff88000edd7f00 ffffffff810e2404
[    2.028684] Call Trace:
[    2.028684]  [<ffffffff8238af62>] btrfs_props_init+0x4f/0x96
[    2.028684]  [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
[    2.028684]  [<ffffffff8238ac0f>] init_btrfs_fs+0xa/0xf0
[    2.028684]  [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
[    2.028684]  [<ffffffff810002ba>] do_one_initcall+0xa4/0x13a
[    2.028684]  [<ffffffff810e2404>] ? parse_args+0x25f/0x33d
[    2.028684]  [<ffffffff8234cf75>] kernel_init_freeable+0x1aa/0x230
[    2.028684]  [<ffffffff8234c785>] ? do_early_param+0x88/0x88
[    2.028684]  [<ffffffff819f61b5>] ? rest_init+0x89/0x89
[    2.028684]  [<ffffffff819f61c3>] kernel_init+0xe/0x109

The issue here is that the initialization function of btrfs (super.c:init_btrfs_fs)
started using crc32c (from lib/libcrc32c.c). But when it needs to call crc32c (as
part of the properties initialization routine), the libcrc32c is not yet initialized,
so crc32c derreferenced a NULL pointer (lib/libcrc32c.c:tfm), causing the kernel
panic on boot.

The approach to fix this is to use crypto component directly to use its crc32c (which
is basically what lib/libcrc32c.c is, a wrapper around crypto). This is what ext4 is
doing as well, it uses crypto directly to get crc32c functionality.

Verified this works both when btrfs is built-in and when it's loadable kernel module.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:31 -08:00
Filipe David Borba Manana
c57c2b3ed2 Btrfs: unlock inodes in correct order in clone ioctl
In the clone ioctl, when the source and target inodes are different,
we can acquire their mutexes in 2 possible different orders. After
we're done cloning, we were releasing the mutexes always in the same
order - the most correct way of doing it is to release them by the
reverse order they were acquired.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:30 -08:00