Commit Graph

27584 Commits

Author SHA1 Message Date
Josef Bacik
f4c738c2e7 Btrfs: rework shrink_delalloc
So shrink_delalloc has grown all sorts of cruft over the years thanks to
many reworkings of how we track enospc.  What happens now as we fill up the
disk is we will loop for freaking ever hoping to reclaim a arbitrary amount
of space of metadata, this was from when everybody flushed at the same time.
Now we only have people flushing one at a time.  So instead of trying to
reclaim a huge amount of space, just try to flush a decent chunk of space,
and stop looping as soon as we have enough free space to satisfy our
reservation.  This makes xfstests 224 go much faster.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:58 -04:00
Liu Bo
b9ca0664dc Btrfs: do not set subvolume flags in readonly mode
$ mkfs.btrfs /dev/sdb7
$ btrfstune -S1 /dev/sdb7
$ mount /dev/sdb7 /mnt/btrfs
mount: block device /dev/sdb7 is write-protected, mounting read-only
$ btrfs dev add /dev/sdb8 /mnt/btrfs/

Now we get a btrfs in which mnt flags has readonly but sb flags does
not.  So for those ioctls that only check sb flags with MS_RDONLY, it
is going to be a problem.
Setting subvolume flags is such an ioctl, we should use mnt_want_write_file()
to check RO flags.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:58 -04:00
Liu Bo
e54bfa3104 Btrfs: use mnt_want_write_file instead of mnt_want_write
mnt_want_write_file is faster when file has been opened for write.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:57 -04:00
Liu Bo
768e9dfe82 Btrfs: remove redundant r/o check for superblock
mnt_want_write() and mnt_want_write_file() will check sb->s_flags with
MS_RDONLY, and we don't need to do it ourselves.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:56 -04:00
Liu Bo
a874a63e13 Btrfs: check write access to mount earlier while creating snapshots
Move check of write access to mount into upper functions so that we can
use mnt_want_write_file instead, which is faster than mnt_want_write.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:56 -04:00
Liu Bo
287082b0bd Btrfs: fix typo in cow_file_range_async and async_cow_submit
It should be 10 * 1024 * 1024.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-07-23 16:27:55 -04:00
Josef Bacik
0e72110692 Btrfs: change how we indicate we're adding csums
There is weird logic I had to put in place to make sure that when we were
adding csums that we'd used the delalloc block rsv instead of the global
block rsv.  Part of this meant that we had to free up our transaction
reservation before we ran the delayed refs since csum deletion happens
during the delayed ref work.  The problem with this is that when we release
a reservation we will add it to the global reserve if it is not full in
order to keep us going along longer before we have to force a transaction
commit.  By releasing our reservation before we run delayed refs we don't
get the opportunity to drain down the global reserve for the work we did, so
we won't refill it as often.  This isn't a problem per-se, it just results
in us possibly committing transactions more and more often, and in rare
cases could cause those WARN_ON()'s to pop in use_block_rsv because we ran
out of space in our block rsv.

This also helps us by holding onto space while the delayed refs run so we
don't end up with as many people trying to do things at the same time, which
again will help us not force commits or hit the use_block_rsv warnings.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:55 -04:00
Tsutomu Itoh
b995929515 Btrfs: return error of btrfs_update_inode() to caller
We didn't check error of btrfs_update_inode(), but that error looks
easy to bubble back up.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:54 -04:00
Dan Carpenter
23291a044c Btrfs: fix error handling in __add_reloc_root()
We dereferenced "node" in the error message after freeing it.  Also
btrfs_panic() can return so we should return an error code instead of
continuing.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-07-23 16:27:53 -04:00
Ilya Dryomov
44c44af2f4 Btrfs: do not ignore errors from btrfs_cleanup_fs_roots() when mounting
There used to be a BUG_ON(ret) there before EH patch (79787eaa) went in.
Bail out with EINVAL.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-07-23 16:27:53 -04:00
Ilya Dryomov
fed425c742 Btrfs: do not return EINVAL instead of ENOMEM from open_ctree()
When bailing from open_ctree() err is returned, not ret.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-07-23 16:27:52 -04:00
Josef Bacik
02db0844be Btrfs: add DEVICE_READY ioctl
This will be used in conjunction with btrfs device ready <dev>.  This is
needed for initrd's to have a nice and lightweight way to tell if all of the
devices needed for a file system are in the cache currently.  This keeps
them from having to do mount+sleep loops waiting for devices to show up.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:42 -04:00
Josef Bacik
96c3f4331a Btrfs: flush delayed inodes if we're short on space
Those crazy gentoo guys have been complaining about ENOSPC errors on their
portage volumes.  This is because doing things like untar tends to create
lots of new files which will soak up all the reservation space in the
delayed inodes.  Usually this gets papered over by the fact that we will try
and commit the transaction, however if this happens in the wrong spot or we
choose not to commit the transaction you will be screwed.  So add the
ability to expclitly flush delayed inodes to free up space.  Please test
this out guys to make sure it works since as usual I cannot reproduce.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 15:41:40 -04:00
David Sterba
b27f7c0c15 btrfs: join DEV_STATS ioctls to one
Commit c11d2c236c (Btrfs: add ioctl to get and reset the device
stats) introduced two ioctls doing almost the same thing distinguished
by just the ioctl number which encodes "do reset after read". I have
suggested

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg16604.html

to implement it via the ioctl args. This hasn't happen, and I think we
should use a more clean way to pass flags and should not waste ioctl
numbers.

CC: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: David Sterba <dsterba@suse.cz>
2012-07-23 15:41:40 -04:00
Andrew Mahone
a43a211133 btrfs: ignore unfragmented file checks in defrag when compression enabled - rebased
Rebased on btrfs-next and retested.

Inform should_defrag_range if BTRFS_DEFRAG_RANGE_COMPRESS is set. If so, skip
checks for adjacent extents and extent size when deciding whether to defrag,
as these can prevent an uncompressed and unfragmented file from being
compressed as requested.

Signed-off-by: Andrew Mahone <andrew.mahone@gmail.com>
2012-07-23 15:41:39 -04:00
Dan Carpenter
e4b50e14c8 Btrfs: small naming cleanup in join_transaction()
"root->fs_info" and "fs_info" are the same, but "fs_info" is prefered
because it is shorter and that's what is used in the rest of the
function.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-07-23 15:41:39 -04:00
Alexander Block
2bc5565286 Btrfs: don't update atime on RO subvolumes
Before the update_time inode operation was indroduced, it was
not possible to prevent updates of atime on RO subvolumes. VFS
was only able to check for RO on the mount, but did not know
anything about btrfs subvolumes.

btrfs_update_time does now check if the root is RO and skip
updating of times.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-07-23 15:41:38 -04:00
Arnd Hannemann
063849eafd Btrfs: allow mount -o remount,compress=no
Btrfs allows to turn on compression on a mounted and used filesystem
by issuing mount -o remount,compress=lzo.
This patch allows to turn compression off again
while the filesystem is mounted. As suggested by David Sterba
if the compress-force option was set, it is implicitly cleared
if compression is turned off.

Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Arnd Hannemann <arnd@arndnet.de>
2012-07-23 15:41:38 -04:00
Josef Bacik
c5c3c5f31e Btrfs: remove ->dirty_inode
We do all of our inode updating when we change it, and now that we do
->update_time we don't need ->dirty_inode for atime updates anymore, so just
remove it.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-07-23 15:41:38 -04:00
Chris Mason
cbea5ac1ee Btrfs: reduce calls to wake_up on uncontended locks
The btrfs locks were unconditionally calling wake_up as the
locks were released.  This lead to extra thrashing on the waitqueue,
especially for locks that were dominated by readers.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-23 15:36:18 -04:00
Chris Mason
e39e64ac0c Btrfs: don't wait around for new log writers on an SSD
Waiting on spindles improves performance, but ssds want all the
IO as quickly as we can push it down.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-23 15:36:17 -04:00
Linus Torvalds
ce9f8d6b39 Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
Pull pnfs/ore fixes from Boaz Harrosh:
 "These are catastrophic fixes to the pnfs objects-layout that were just
  discovered.  They are also destined for @stable.

  I have found these and worked on them at around RC1 time but
  unfortunately went to the hospital for kidney stones and had a very
  slow recovery.  I refrained from sending them as is, before proper
  testing, and surly I have found a bug just yesterday.

  So now they are all well tested, and have my sign-off.  Other then
  fixing the problem at hand, and assuming there are no bugs at the new
  code, there is low risk to any surrounding code.  And in anyway they
  affect only these paths that are now broken.  That is RAID5 in pnfs
  objects-layout code.  It does also affect exofs (which was not broken)
  but I have tested exofs and it is lower priority then objects-layout
  because no one is using exofs, but objects-layout has lots of users."

* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  pnfs-obj: Fix __r4w_get_page when offset is beyond i_size
  pnfs-obj: don't leak objio_state if ore_write/read fails
  ore: Unlock r4w pages in exact reverse order of locking
  ore: Remove support of partial IO request (NFS crash)
  ore: Fix NFS crash by supporting any unaligned RAID IO
2012-07-20 11:43:53 -07:00
Linus Torvalds
1793416287 Fix a bug in UBIFS free space fix-up reported already twice recently:
http://lists.infradead.org/pipermail/linux-mtd/2012-May/041408.html
 http://lists.infradead.org/pipermail/linux-mtd/2012-June/042422.html
 
 and we finally have the fix. I am quite confident the fix is correct
 because I could reproduce the problem with nandsim and verify the
 fix. It was also verified by Iwo (the reporter).
 
 I am also confident that this is OK to merge the fix so late because
 this patch affects only the fixup functionality, which is not used by
 most users.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQCQZ1AAoJECmIfjd9wqK0fosP/RD3Ruo5ILvTtBThKJPUoeld
 kihD9w3rk26cILlpGA3Cs/kaoOj/wPtjMVKGVkw50cWKRQemFLMh4ZcbCepfae+b
 g+YsH+ihkINgjdpKM351lgSCS+NEPJ695zmxNJ+/zjM5+ewfP6vK0qivnjF7w81k
 jLAVt80a1nhjNPyDMeQVr69HxBegYuX927LL4onJULYqvmrSiX/5tXzI+02emjDf
 9gA99fyc4pLNAJzzQyr44pogNaSME+Q90p4PAd11tlaVfn1kXgCXA3Ybv2cy7cer
 ipQfHQzfMjiCMO7Kpt5Ja3necuTarZsHV4UtmXhc4uIOr5p57dJX7RfBzA3j4RmV
 2ZFynqjl7n6ZT0pAM/0F9h9FyjZrCcgg1BGcEsqfJv2Yu7txOX1Qo2gkEvYJl8Sx
 Q2G6xNdzyib8MXClm4L2Zix16WqAF7CyUZo+szUTpdO8PPzgJ/vNpAk+3yqoVeep
 0Dr0HmTMRuP6tJGa9TH58QlvhClkXGSb7ukk1UlV4RVXtvvYtjVBwUXoHSUHNDJO
 HB9B+7ViTIjm9fdILqCX5wtrnZZQgFd1hBiQ/13/ZFrtB1hz5WfOdgfLRIBifjbq
 hGkwQyb5zsWTm7KGTOV0Yncmbnkut4zSJpMCbjZvcPJ2r5zwNwImKdvLBJ7oCKmd
 nPZ2dJmJYdKw2L00SzGZ
 =bUPG
 -----END PGP SIGNATURE-----

Merge tag 'upstream-3.5-rc8' of git://git.infradead.org/linux-ubifs

Pull UBIFS free space fix-up bugfix from Artem Bityutskiy:
 "It's been reported already twice recently:

    http://lists.infradead.org/pipermail/linux-mtd/2012-May/041408.html
    http://lists.infradead.org/pipermail/linux-mtd/2012-June/042422.html

  and we finally have the fix.  I am quite confident the fix is correct
  because I could reproduce the problem with nandsim and verify the fix.
  It was also verified by Iwo (the reporter).

  I am also confident that this is OK to merge the fix so late because
  this patch affects only the fixup functionality, which is not used by
  most users."

* tag 'upstream-3.5-rc8' of git://git.infradead.org/linux-ubifs:
  UBIFS: fix a bug in empty space fix-up
2012-07-20 11:42:30 -07:00
Boaz Harrosh
c999ff6802 pnfs-obj: Fix __r4w_get_page when offset is beyond i_size
It is very common for the end of the file to be unaligned on
stripe size. But since we know it's beyond file's end then
the XOR should be preformed with all zeros.

Old code used to just read zeros out of the OSD devices, which is a great
waist. But what scares me more about this situation is that, we now have
pages attached to the file's mapping that are beyond i_size. I don't
like the kind of bugs this calls for.

Fix both birds, by returning a global zero_page, if offset is beyond
i_size.

TODO:
	Change the API to ->__r4w_get_page() so a NULL can be
	returned without being considered as error, since XOR API
	treats NULL entries as zero_pages.

[Bug since 3.2. Should apply the same way to all Kernels since]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-07-20 11:50:31 +03:00
Boaz Harrosh
9909d45a85 pnfs-obj: don't leak objio_state if ore_write/read fails
[Bug since 3.2 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-07-20 11:50:30 +03:00
Boaz Harrosh
537632e0a5 ore: Unlock r4w pages in exact reverse order of locking
The read-4-write pages are locked in address ascending order.
But where unlocked in a way easiest for coding. Fix that,
locks should be released in opposite order of locking, .i.e
descending address order.

I have not hit this dead-lock. It was found by inspecting the
dbug print-outs. I suspect there is an higher lock at caller that
protects us, but fix it regardless.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-07-20 11:49:25 +03:00
Boaz Harrosh
62b62ad873 ore: Remove support of partial IO request (NFS crash)
Do to OOM situations the ore might fail to allocate all resources
needed for IO of the full request. If some progress was possible
it would proceed with a partial/short request, for the sake of
forward progress.

Since this crashes NFS-core and exofs is just fine without it just
remove this contraption, and fail.

TODO:
	Support real forward progress with some reserved allocations
	of resources, such as mem pools and/or bio_sets

[Bug since 3.2 Kernel]
CC: Stable Tree <stable@kernel.org>
CC: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-07-20 11:47:43 +03:00
Boaz Harrosh
9ff19309a9 ore: Fix NFS crash by supporting any unaligned RAID IO
In RAID_5/6 We used to not permit an IO that it's end
byte is not stripe_size aligned and spans more than one stripe.
.i.e the caller must check if after submission the actual
transferred bytes is shorter, and would need to resubmit
a new IO with the remainder.

Exofs supports this, and NFS was supposed to support this
as well with it's short write mechanism. But late testing has
exposed a CRASH when this is used with none-RPC layout-drivers.

The change at NFS is deep and risky, in it's place the fix
at ORE to lift the limitation is actually clean and simple.
So here it is below.

The principal here is that in the case of unaligned IO on
both ends, beginning and end, we will send two read requests
one like old code, before the calculation of the first stripe,
and also a new site, before the calculation of the last stripe.
If any "boundary" is aligned or the complete IO is within a single
stripe. we do a single read like before.

The code is clean and simple by splitting the old _read_4_write
into 3 even parts:
1._read_4_write_first_stripe
2. _read_4_write_last_stripe
3. _read_4_write_execute

And calling 1+3 at the same place as before. 2+3 before last
stripe, and in the case of all in a single stripe then 1+2+3
is preformed additively.

Why did I not think of it before. Well I had a strike of
genius because I have stared at this code for 2 years, and did
not find this simple solution, til today. Not that I did not try.

This solution is much better for NFS than the previous supposedly
solution because the short write was dealt  with out-of-band after
IO_done, which would cause for a seeky IO pattern where as in here
we execute in order. At both solutions we do 2 separate reads, only
here we do it within a single IO request. (And actually combine two
writes into a single submission)

NFS/exofs code need not change since the ORE API communicates the new
shorter length on return, what will happen is that this case would not
occur anymore.

hurray!!

[Stable this is an NFS bug since 3.2 Kernel should apply cleanly]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-07-20 11:45:28 +03:00
Artem Bityutskiy
c6727932cf UBIFS: fix a bug in empty space fix-up
UBIFS has a feature called "empty space fix-up" which is a quirk to work-around
limitations of dumb flasher programs. Namely, of those flashers that are unable
to skip NAND pages full of 0xFFs while flashing, resulting in empty space at
the end of half-filled eraseblocks to be unusable for UBIFS. This feature is
relatively new (introduced in v3.0).

The fix-up routine (fixup_free_space()) is executed only once at the very first
mount if the superblock has the 'space_fixup' flag set (can be done with -F
option of mkfs.ubifs). It basically reads all the UBIFS data and metadata and
writes it back to the same LEB. The routine assumes the image is pristine and
does not have anything in the journal.

There was a bug in 'fixup_free_space()' where it fixed up the log incorrectly.
All but one LEB of the log of a pristine file-system are empty. And one
contains just a commit start node. And 'fixup_free_space()' just unmapped this
LEB, which resulted in wiping the commit start node. As a result, some users
were unable to mount the file-system next time with the following symptom:

UBIFS error (pid 1): replay_log_leb: first log node at LEB 3:0 is not CS node
UBIFS error (pid 1): replay_log_leb: log error detected while replaying the log at LEB 3:0

The root-cause of this bug was that 'fixup_free_space()' wrongly assumed
that the beginning of empty space in the log head (c->lhead_offs) was known
on mount. However, it is not the case - it was always 0. UBIFS does not store
in it the master node and finds out by scanning the log on every mount.

The fix is simple - just pass commit start node size instead of 0 to
'fixup_leb()'.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@linux.intel.com>
Cc: stable@vger.kernel.org [v3.0+]
Reported-by: Iwo Mergler <Iwo.Mergler@netcommwireless.com>
Tested-by: Iwo Mergler <Iwo.Mergler@netcommwireless.com>
Reported-by: James Nute <newten82@gmail.com>
2012-07-20 10:13:27 +03:00
Linus Torvalds
a9866ba47c Merge git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French.

* git://git.samba.org/sfrench/cifs-2.6:
  cifs: always update the inode cache with the results from a FIND_*
  cifs: when CONFIG_HIGHMEM is set, serialize the read/write kmaps
  cifs: on CONFIG_HIGHMEM machines, limit the rsize/wsize to the kmap space
  Initialise mid_q_entry before putting it on the pending queue
2012-07-18 09:28:11 -07:00
Al Viro
331ae4962b ext4: fix duplicated mnt_drop_write call in EXT4_IOC_MOVE_EXT
Caused, AFAICS, by mismerge in commit ff9cb1c4ee ("Merge branch
'for_linus' into for_linus_merged")

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org  # 3.3+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-18 08:59:46 -07:00
Linus Torvalds
a5e135122c Last-minute PM update for 3.5
This renames CAP_EPOLLWAKEUP to CAP_BLOCK_SUSPEND to encourage future
 reuse of the capability in question in related cases.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIcBAABAgAGBQJQBcRhAAoJEKhOf7ml8uNsnIoP/2XhSul9N/AWC5jfEAh4Af07
 QdhfJmYXnXC1Irndh/IoAITu+vHQecm0XjbvAy/9QOBn9oSkM7kNilvOLrCrdzzQ
 j9/BRMRCJRcu/vMyJmt37z0OIgfiktgDoOBaE6nC5t+1nHotcByAMWdy/AGwqqaL
 q3lbYcoRtDDQpDr9XPm68cyRdddvWnq81gXb90gNovvfgCjNFVvscshXmMGv3Luy
 Dx29zROJHJNOWG3kV1Xq7PdNffZj1ChCgIsBRKkzKWROcVEGPEuH5O0wjf4I4rCV
 PW6nRV9WOykqJI5CAnrWzr9bf8AvpclXtGYWFiwPvUF0kMggSoNFb5xQyRy45SBC
 nC+daLZNO123yU8xKb3qXaotsKPJ0qRTKAWUqWaGkRkQ0Mg90VmanyYkmP5PkeUX
 ZABNS4QlxnLGDtZuhSBioUO5pf0iDdzSrYkIOuYD81DGM8yKWWmUyxupOoVW5Kmu
 QD0d34+ZgEndv9znZzBF8DdGxkwjwljJW6sIBw7PGDq3qXcYdzd4awgtPlnGEOh/
 oi6iG24r8oysB8w5IJpwj20/zCvJyYVR+m+eHXxEs373xIGpbAfJbHYRKHqkYgTo
 nYkZyLgE0g46Izqbb42yrN7y5dUhSsrbImTI8L5xaLVkBYhspEuSO/eSLgoklWiw
 VgbmreU3R0apj0hwPcA5
 =oZrz
 -----END PGP SIGNATURE-----

Merge tag 'pm-post-3.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull a last-minute PM update from Rafael J. Wysocki:
 "This renames CAP_EPOLLWAKEUP to CAP_BLOCK_SUSPEND to encourage future
  reuse of the capability in question in related cases."

* tag 'pm-post-3.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: Rename CAP_EPOLLWAKEUP to CAP_BLOCK_SUSPEND
2012-07-17 14:15:43 -07:00
Michael Kerrisk
d9914cf661 PM: Rename CAP_EPOLLWAKEUP to CAP_BLOCK_SUSPEND
As discussed in
http://thread.gmane.org/gmane.linux.kernel/1249726/focus=1288990,
the capability introduced in 4d7e30d989
to govern EPOLLWAKEUP seems misnamed: this capability is about governing
the ability to suspend the system, not using a particular API flag
(EPOLLWAKEUP). We should make the name of the capability more general
to encourage reuse in related cases. (Whether or not this capability
should also be used to govern the use of /sys/power/wake_lock is a
question that needs to be separately resolved.)

This patch renames the capability to CAP_BLOCK_SUSPEND. In order to ensure
that the old capability name doesn't make it out into the wild, could you
please apply and push up the tree to ensure that it is incorporated
for the 3.5 release.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-07-17 21:37:27 +02:00
Jeff Layton
cd60042cc1 cifs: always update the inode cache with the results from a FIND_*
When we get back a FIND_FIRST/NEXT result, we have some info about the
dentry that we use to instantiate a new inode. We were ignoring and
discarding that info when we had an existing dentry in the cache.

Fix this by updating the inode in place when we find an existing dentry
and the uniqueid is the same.

Cc: <stable@vger.kernel.org> # .31.x
Reported-and-Tested-by: Andrew Bartlett <abartlet@samba.org>
Reported-by: Bill Robertson <bill_robertson@debortoli.com.au>
Reported-by: Dion Edwards <dion_edwards@debortoli.com.au>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-16 23:57:23 -05:00
Jeff Layton
3cf003c08b cifs: when CONFIG_HIGHMEM is set, serialize the read/write kmaps
Jian found that when he ran fsx on a 32 bit arch with a large wsize the
process and one of the bdi writeback kthreads would sometimes deadlock
with a stack trace like this:

crash> bt
PID: 2789   TASK: f02edaa0  CPU: 3   COMMAND: "fsx"
 #0 [eed63cbc] schedule at c083c5b3
 #1 [eed63d80] kmap_high at c0500ec8
 #2 [eed63db0] cifs_async_writev at f7fabcd7 [cifs]
 #3 [eed63df0] cifs_writepages at f7fb7f5c [cifs]
 #4 [eed63e50] do_writepages at c04f3e32
 #5 [eed63e54] __filemap_fdatawrite_range at c04e152a
 #6 [eed63ea4] filemap_fdatawrite at c04e1b3e
 #7 [eed63eb4] cifs_file_aio_write at f7fa111a [cifs]
 #8 [eed63ecc] do_sync_write at c052d202
 #9 [eed63f74] vfs_write at c052d4ee
#10 [eed63f94] sys_write at c052df4c
#11 [eed63fb0] ia32_sysenter_target at c0409a98
    EAX: 00000004  EBX: 00000003  ECX: abd73b73  EDX: 012a65c6
    DS:  007b      ESI: 012a65c6  ES:  007b      EDI: 00000000
    SS:  007b      ESP: bf8db178  EBP: bf8db1f8  GS:  0033
    CS:  0073      EIP: 40000424  ERR: 00000004  EFLAGS: 00000246

Each task would kmap part of its address array before getting stuck, but
not enough to actually issue the write.

This patch fixes this by serializing the marshal_iov operations for
async reads and writes. The idea here is to ensure that cifs
aggressively tries to populate a request before attempting to fulfill
another one. As soon as all of the pages are kmapped for a request, then
we can unlock and allow another one to proceed.

There's no need to do this serialization on non-CONFIG_HIGHMEM arches
however, so optimize all of this out when CONFIG_HIGHMEM isn't set.

Cc: <stable@vger.kernel.org>
Reported-by: Jian Li <jiali@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-16 23:57:14 -05:00
Jeff Layton
3ae629d98b cifs: on CONFIG_HIGHMEM machines, limit the rsize/wsize to the kmap space
We currently rely on being able to kmap all of the pages in an async
read or write request. If you're on a machine that has CONFIG_HIGHMEM
set then that kmap space is limited, sometimes to as low as 512 slots.

With 512 slots, we can only support up to a 2M r/wsize, and that's
assuming that we can get our greedy little hands on all of them. There
are other users however, so it's possible we'll end up stuck with a
size that large.

Since we can't handle a rsize or wsize larger than that currently, cap
those options at the number of kmap slots we have. We could consider
capping it even lower, but we currently default to a max of 1M. Might as
well allow those luddites on 32 bit arches enough rope to hang
themselves.

A more robust fix would be to teach the send and receive routines how
to contend with an array of pages so we don't need to marshal up a kvec
array at all. That's a fairly significant overhaul though, so we'll need
this limit in place until that's ready.

Cc: <stable@vger.kernel.org>
Reported-by: Jian Li <jiali@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-16 23:57:09 -05:00
Sachin Prabhu
ffc61ccbb9 Initialise mid_q_entry before putting it on the pending queue
A user reported a crash in cifs_demultiplex_thread() caused by an
incorrectly set mid_q_entry->callback() function. It appears that the
callback assignment made in cifs_call_async() was not flushed back to
memory suggesting that a memory barrier was required here. Changing the
code to make sure that the mid_q_entry structure was completely
initialised before it was added to the pending queue fixes the problem.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-07-16 23:57:02 -05:00
Linus Torvalds
fce667c574 xfs: regression fixes for 3.5-rc7
- Really fix a cursor leak in xfs_alloc_ag_vextent_near
  - Fix a performance regression related to doing allocation in workqueues
  - Prevent recursion in xfs_buf_iorequest which is causing stack overflows
  - Don't call xfs_bdstrat_cb in xfs_buf_iodone callbacks
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJQAGW0AAoJENaLyazVq6ZOOyEP/0xLuQKFF71eB/VL8BFylEHY
 H22/aDEWTo9pWDxxwBirVioBeCU07ByEv7zeQM1nqEm9pXESTSzsBUYKX2tWSRzY
 YclXYA0rpdCK2cdXpuz+0kWBFr9Y1Q1BIYNll6C3ZqhADgubAMHa13rKVUQlQqpD
 EZhvGrh42ujRhckwmi1E3+g3Ll79fty47WzyzEOa18ij3LI3q5Dm7WZpZlhv/MVW
 Fj975Q+LdJVchoQZ7gTiddMkZ936TwjLxM4EtWZd46CMG2/YRcPHx2YaItI0k6Xa
 Q34pbUHidZjqzng28iO6Y6BaB/rPLX/f7KcoZib+rc85zt8sShoDFXS9eq+DdTeH
 f5AgkPzpFfr3QpK3Fv5ZjICj5SkC6KhI14qnxLZhVGIWgJLGD8lb0fYDwN2y5BHq
 HmA7G4hALBaZ+TOpXQ2+XfxFyrffkQcM/Ja57yfDj37aOKYxMubAco4JlADBzBkg
 m1gF2QebQrbZ49k9x85vpIvoNFXcAg6FeuesYXciMcORCpMXp4f0oLjGlfwf13Fo
 upZ2AGfcpIcMl3fZzliw64VuJpX4QswTzUZFqXH8+Vn9eB6cwFqA8PJpLxBSshCk
 Y6E+LE6oYbsN+T6R+6Xj0DsPl4OIE0eWlGPaZZu9eWku02e0vZMj57RxUWpNcOmH
 6lq8TJaS/N5Qri51NPVs
 =TP1A
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v3.5-rc7' of git://oss.sgi.com/xfs/xfs

Pull xfs regression fixes from Ben Myers:
 - Really fix a cursor leak in xfs_alloc_ag_vextent_near
 - Fix a performance regression related to doing allocation in
   workqueues
 - Prevent recursion in xfs_buf_iorequest which is causing stack
   overflows
 - Don't call xfs_bdstrat_cb in xfs_buf_iodone callbacks

* tag 'for-linus-v3.5-rc7' of git://oss.sgi.com/xfs/xfs:
  xfs: do not call xfs_bdstrat_cb in xfs_buf_iodone_callbacks
  xfs: prevent recursion in xfs_buf_iorequest
  xfs: don't defer metadata allocation to the workqueue
  xfs: really fix the cursor leak in xfs_alloc_ag_vextent_near
2012-07-16 10:02:36 -07:00
Anders Kaseorg
05d290d66b fifo: Do not restart open() if it already found a partner
If a parent and child process open the two ends of a fifo, and the
child immediately exits, the parent may receive a SIGCHLD before its
open() returns.  In that case, we need to make sure that open() will
return successfully after the SIGCHLD handler returns, instead of
throwing EINTR or being restarted.  Otherwise, the restarted open()
would incorrectly wait for a second partner on the other end.

The following test demonstrates the EINTR that was wrongly thrown from
the parent’s open().  Change .sa_flags = 0 to .sa_flags = SA_RESTART
to see a deadlock instead, in which the restarted open() waits for a
second reader that will never come.  (On my systems, this happens
pretty reliably within about 5 to 500 iterations.  Others report that
it manages to loop ~forever sometimes; YMMV.)

  #include <sys/stat.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <fcntl.h>
  #include <signal.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define CHECK(x) do if ((x) == -1) {perror(#x); abort();} while(0)

  void handler(int signum) {}

  int main()
  {
      struct sigaction act = {.sa_handler = handler, .sa_flags = 0};
      CHECK(sigaction(SIGCHLD, &act, NULL));
      CHECK(mknod("fifo", S_IFIFO | S_IRWXU, 0));
      for (;;) {
          int fd;
          pid_t pid;
          putc('.', stderr);
          CHECK(pid = fork());
          if (pid == 0) {
              CHECK(fd = open("fifo", O_RDONLY));
              _exit(0);
          }
          CHECK(fd = open("fifo", O_WRONLY));
          CHECK(close(fd));
          CHECK(waitpid(pid, NULL, 0));
      }
  }

This is what I suspect was causing the Git test suite to fail in
t9010-svn-fe.sh:

	http://bugs.debian.org/678852

Signed-off-by: Anders Kaseorg <andersk@mit.edu>
Reviewed-by: Jonathan Nieder <jrnieder@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-16 08:33:14 -07:00
Christoph Hellwig
1632dcc93f xfs: do not call xfs_bdstrat_cb in xfs_buf_iodone_callbacks
xfs_bdstrat_cb only adds a check for a shutdown filesystem over
xfs_buf_iorequest, but xfs_buf_iodone_callbacks just checked for a shut down
filesystem a little earlier.  In addition the shutdown handling in
xfs_bdstrat_cb is not very suitable for this caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-07-13 13:09:49 -05:00
Christoph Hellwig
40a9b7963d xfs: prevent recursion in xfs_buf_iorequest
If the b_iodone handler is run in calling context in xfs_buf_iorequest we
can run into a recursion where xfs_buf_iodone_callbacks keeps calling back
into xfs_buf_iorequest because an I/O error happened, which keeps calling
back into xfs_buf_iorequest.  This chain will usually not take long
because the filesystem gets shut down because of log I/O errors, but even
over a short time it can cause stack overflows if run on the same context.

As a short term workaround make sure we always call the iodone handler in
workqueue context.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-07-13 13:09:39 -05:00
Dave Chinner
aa292847b9 xfs: don't defer metadata allocation to the workqueue
Almost all metadata allocations come from shallow stack usage
situations. Avoid the overhead of switching the allocation to a
workqueue as we are not in danger of running out of stack when
making these allocations. Metadata allocations are already marked
through the args that are passed down, so this is trivial to do.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reported-by: Mel Gorman <mgorman@suse.de>
Tested-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-07-13 13:09:27 -05:00
Dave Chinner
e3a746f5aa xfs: really fix the cursor leak in xfs_alloc_ag_vextent_near
The current cursor is reallocated when retrying the allocation, so
the existing cursor needs to be destroyed in both the restart and
the failure cases.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-07-13 13:09:18 -05:00
Linus Torvalds
4264e6a263 NFS client bugfixes for Linux 3.5
- Fix an NFSv4 mount regression
 - Fix O_DIRECT list manipulation snafus
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQADo8AAoJEGcL54qWCgDyu8wP/2entLv4jYgM7zp+/kOBhps6
 RpAv57YYv86pLk5gxoZg/JJs3z4QzS3CjMRXK11CEzRMt1NUqJfI6587TIl11DYW
 0UJcrAQo4X3EgNR7CCKAJaHUmYLayjUgquL8OjdBTuXCrpJzGdHcnBa71oF3xaEb
 jJhsA0W+qqBpa0295HH8sdHkURX1+46OdSd47+Dg+PZurkr0CRhjZo+DX+AkaSsU
 blM+pYUgu20NFaNUEfBphib3XMnnCGXc84g3gqjeOldRMijJGv1VcmhfsmJwmyLA
 1mImwZ2HDzySVLrbA/D7yeddZfXuTaf8krFmyP07U6I7hIEXCAEr16DmqCQCF1UB
 ppzZM9lu8f6df9tIlmjdA/hGOfs5OSQzD1L9Irn8xpRIWYmg+mrxXSzIG6wZE8zu
 n1EURW5CrLrlz2d7rdhqA+Y/GIkAvIO0/iBbJnvkMUdMPsd9/ikFPkHGzMhsK1KN
 AxEi2r+n0e8Hwaueu0EP685QjrEjOyLDdKIsAzNy/UEGN4v297UoW8ZyYnrzEIJG
 mQg6l4ke34zuw3AxoR3X4SuW329rGpG7x0pXNwqJG292T9jki334dPzJCy1LbaGl
 o1g+Z0BcyMe9VC7tRwOFEiGsbcd/OYPRbVVhu/3RlrWuVvj46QDvVpQ7+1pOr2s/
 4Xu70AlKw3B02ge621+p
 =bX2K
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 - Fix an NFSv4 mount regression
 - Fix O_DIRECT list manipulation snafus

* tag 'nfs-for-3.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: Fix an NFSv4 mount regression
  NFS: Fix list manipulation snafus in fs/nfs/direct.c
2012-07-13 10:58:45 -07:00
Dave Jones
8d657eb3b4 Remove easily user-triggerable BUG from generic_setlease
This can be trivially triggered from userspace by passing in something unexpected.

    kernel BUG at fs/locks.c:1468!
    invalid opcode: 0000 [#1] SMP
    RIP: 0010:generic_setlease+0xc2/0x100
    Call Trace:
      __vfs_setlease+0x35/0x40
      fcntl_setlease+0x76/0x150
      sys_fcntl+0x1c6/0x810
      system_call_fastpath+0x1a/0x1f

Signed-off-by: Dave Jones <davej@redhat.com>
Cc: stable@kernel.org # 3.2+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-13 10:50:23 -07:00
Jeff Moyer
91f68c89d8 block: fix infinite loop in __getblk_slow
Commit 080399aaaf ("block: don't mark buffers beyond end of disk as
mapped") exposed a bug in __getblk_slow that causes mount to hang as it
loops infinitely waiting for a buffer that lies beyond the end of the
disk to become uptodate.

The problem was initially reported by Torsten Hilbrich here:

    https://lkml.org/lkml/2012/6/18/54

and also reported independently here:

    http://www.sysresccd.org/forums/viewtopic.php?f=13&t=4511

and then Richard W.M.  Jones and Marcos Mello noted a few separate
bugzillas also associated with the same issue.  This patch has been
confirmed to fix:

    https://bugzilla.redhat.com/show_bug.cgi?id=835019

The main problem is here, in __getblk_slow:

        for (;;) {
                struct buffer_head * bh;
                int ret;

                bh = __find_get_block(bdev, block, size);
                if (bh)
                        return bh;

                ret = grow_buffers(bdev, block, size);
                if (ret < 0)
                        return NULL;
                if (ret == 0)
                        free_more_memory();
        }

__find_get_block does not find the block, since it will not be marked as
mapped, and so grow_buffers is called to fill in the buffers for the
associated page.  I believe the for (;;) loop is there primarily to
retry in the case of memory pressure keeping grow_buffers from
succeeding.  However, we also continue to loop for other cases, like the
block lying beond the end of the disk.  So, the fix I came up with is to
only loop when grow_buffers fails due to memory allocation issues
(return value of 0).

The attached patch was tested by myself, Torsten, and Rich, and was
found to resolve the problem in call cases.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reported-and-Tested-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Josh Boyer <jwboyer@redhat.com>
Cc: Stable <stable@vger.kernel.org>  # 3.0+
[ Jens is on vacation, taking this directly  - Linus ]
--
Stable Notes: this patch requires backport to 3.0, 3.2 and 3.3.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-13 08:36:35 -07:00
Steven J. Magnani
5d8ecbbc28 fat: fix non-atomic NFS i_pos read
fat_encode_fh() can fetch an invalid i_pos value on systems where 64-bit
accesses are not atomic.  Make it use the same accessor as the rest of the
FAT code.

Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11 16:04:47 -07:00
Bob Liu
fea9f718b3 fs: ramfs: file-nommu: add SetPageUptodate()
There is a bug in the below scenario for !CONFIG_MMU:

 1. create a new file
 2. mmap the file and write to it
 3. read the file can't get the correct value

Because

  sys_read() -> generic_file_aio_read() -> simple_readpage() -> clear_page()

which causes the page to be zeroed.

Add SetPageUptodate() to ramfs_nommu_expand_for_mapping() so that
generic_file_aio_read() do not call simple_readpage().

Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11 16:04:47 -07:00
Luis Henriques
a4e08d001f ocfs2: fix NULL pointer dereference in __ocfs2_change_file_space()
As ocfs2_fallocate() will invoke __ocfs2_change_file_space() with a NULL
as the first parameter (file), it may trigger a NULL pointer dereferrence
due to a missing check.

Addresses http://bugs.launchpad.net/bugs/1006012

Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Reported-by: Bret Towe <magnade@gmail.com>
Tested-by: Bret Towe <magnade@gmail.com>
Cc: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11 16:04:43 -07:00
Trond Myklebust
f1daf666dd NFSv4: Fix an NFSv4 mount regression
The helper nfs_fs_mount() will always call nfs4_try_mount with the
mount_info->fill_super argument pointing to nfs_fill_super, which is
NFSv2/v3 only.
Fix is to have nfs4_try_mount replace it with nfs4_fill_super.

The regression was introduced by commit c40f8d1d (NFS: Create a common
fs_mount() function)

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-10 13:25:39 -04:00