Commit Graph

37490 Commits

Author SHA1 Message Date
Anna Schumaker
24bab49122 NFSD: Implement SEEK
This patch adds server support for the NFS v4.2 operation SEEK, which
returns the position of the next hole or data segment in a file.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-29 14:35:20 -04:00
Anna Schumaker
87a15a8090 NFSD: Add generic v4.2 infrastructure
It's cleaner to introduce everything at once and have the server reply
with "not supported" than it would be to introduce extra operations when
implementing a specific one in the middle of the list.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-29 14:35:19 -04:00
Christoph Hellwig
0162ac2b97 nfsd: introduce nfsd4_callback_ops
Add a higher level abstraction than the rpc_ops for callback operations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26 16:29:29 -04:00
Christoph Hellwig
f0b5de1b6b nfsd: split nfsd4_callback initialization and use
Split out initializing the nfs4_callback structure from using it.  For
the NULL callback this gets rid of tons of pointless re-initializations.

Note that I don't quite understand what protects us from running multiple
NULL callbacks at the same time, but at least this chance doesn't make
it worse..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26 16:29:28 -04:00
Christoph Hellwig
326129d02a nfsd: introduce a generic nfsd4_cb
Add a helper to queue up a callback.  CB_NULL has a bit of special casing
because it is special in the specification, but all other new callback
operations will be able to share code with this and a few more changes
to refactor the callback code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26 16:29:27 -04:00
Christoph Hellwig
2faf3b4350 nfsd: remove nfsd4_callback.cb_op
We can always get at the private data by using container_of, no need for
a void pointer.  Also introduce a little to_delegation helper to avoid
opencoding the container_of everywhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26 16:29:26 -04:00
Benny Halevy
341b51df1f nfsd: do not clear rpc_resp in nfsd4_cb_done_sequence
This is incorrect when a callback is has to be restarted, in which case
the XDR decoding of the second iteration will see a NULL cb argument.

[hch: updated description]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26 16:29:25 -04:00
Christoph Hellwig
444b6e910d nfsd: fix nfsd4_cb_recall_done error handling
For any error that is not EBADHANDLE or NFS4ERR_BAD_STATEID,
nfsd4_cb_recall_done first marks the connection down, then
retries until dl_retries hits zero, then marks the connection down
again and sets cb_done.  This changes the code to only retry
for EBADHANDLE or NFS4ERR_BAD_STATEID, and factors setting
cb_done into a single point in the function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26 16:29:25 -04:00
J. Bruce Fields
70b2823535 nfsd4: clarify how grace period ends
The grace period is ended in two steps--first userland is notified that
the grace period is now long enough that any clients who have not yet
reclaimed can be safely forgotten, then we flip the switch that forbids
reclaims and allows new opens.  I had to think a bit to convince myself
that the ordering was right here.  Document it.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-17 16:33:19 -04:00
J. Bruce Fields
bea57fe45b nfsd4: stop grace_time update at end of grace period
The attempt to automatically set a new grace period time at the end of
the grace period isn't really helpful.  We'll probably shut down and
reboot before we actually make use of the new grace period time anyway.
So may as well leave it up to the init system to get this right.

This just confuses people when they see /proc/fs/nfsd/nfsv4gracetime
change from what they set it to.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-17 16:33:18 -04:00
Jeff Layton
65decb650a nfsd: skip subsequent UMH "create" operations after the first one for v4.0 clients
In the case of v4.0 clients, we may call into the "create" client
tracking operation multiple times (once for each openowner). Upcalling
for each one of those is wasteful and slow however. We can skip doing
further "create" operations after the first one if we know that one has
already been done.

v4.1+ clients generally only call into this function once (on
RECLAIM_COMPLETE), and we can't skip upcalling on the create even if the
STABLE bit is set. Doing so would make it impossible for nfsdcltrack to
lift the grace period early since the timestamp has a different meaning
in the case where the client is expected to issue a RECLAIM_COMPLETE.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17 16:33:17 -04:00
Jeff Layton
788a7914ad nfsd: set and test NFSD4_CLIENT_STABLE bit to reduce nfsdcltrack upcalls
The nfsdcltrack upcall doesn't utilize the NFSD4_CLIENT_STABLE flag,
which basically results in an upcall every time we call into the client
tracking ops.

Change it to set this bit on a successful "check" or "create" request,
and clear it on a "remove" request.  Also, check to see if that bit is
set before upcalling on a "check" or "remove" request, and skip
upcalling appropriately, depending on its state.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17 16:33:17 -04:00
Jeff Layton
d682e750ce nfsd: serialize nfsdcltrack upcalls for a particular client
In a later patch, we want to add a flag that will allow us to reduce the
need for upcalls. In order to handle that correctly, we'll need to
ensure that racing upcalls for the same client can't occur. In practice
it should be rare for this to occur with a well-behaved client, but it
is possible.

Convert one of the bits in the cl_flags field to be an upcall bitlock,
and use it to ensure that upcalls for the same client are serialized.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17 16:33:16 -04:00
Jeff Layton
d4318acd5d nfsd: pass extra info in env vars to upcalls to allow for early grace period end
In order to support lifting the grace period early, we must tell
nfsdcltrack what sort of client the "create" upcall is for. We can't
reliably tell if a v4.0 client has completed reclaiming, so we can only
lift the grace period once all the v4.1+ clients have issued a
RECLAIM_COMPLETE and if there are no v4.0 clients.

Also, in order to lift the grace period, we have to tell userland when
the grace period started so that it can tell whether a RECLAIM_COMPLETE
has been issued for each client since then.

Since this is all optional info, we pass it along in environment
variables to the "init" and "create" upcalls. By doing this, we don't
need to revise the upcall format. The UMH upcall can simply make use of
this info if it happens to be present. If it's not then it can just
avoid lifting the grace period early.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17 16:33:15 -04:00
Jeff Layton
7f5ef2e900 nfsd: add a v4_end_grace file to /proc/fs/nfsd
Allow a privileged userland process to end the v4 grace period early.
Writing "Y", "y", or "1" to the file will cause the v4 grace period to
be lifted.  The basic idea with this will be to allow the userland
client tracking program to lift the grace period once it knows that no
more clients will be reclaiming state.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17 16:33:14 -04:00
Jeff Layton
d68e3c4aa4 lockd: add a /proc/fs/lockd/nlm_end_grace file
Add a new procfile that will allow a (privileged) userland process to
end the NLM grace period early. The basic idea here will be to have
sm-notify write to this file, if it sent out no NOTIFY requests when
it runs. In that situation, we can generally expect that there will be
no reclaim requests so the grace period can be lifted early.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17 16:33:13 -04:00
Jeff Layton
3b3e7b7223 nfsd: reject reclaim request when client has already sent RECLAIM_COMPLETE
As stated in RFC 5661, section 18.51.3:

    Once a RECLAIM_COMPLETE is done, there can be no further reclaim
    operations for locks whose scope is defined as having completed
    recovery.  Once the client sends RECLAIM_COMPLETE, the server will
    not allow the client to do subsequent reclaims of locking state for
    that scope and, if these are attempted, will return
    NFS4ERR_NO_GRACE.

Ensure that we enforce that requirement.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17 16:33:13 -04:00
Jeff Layton
919b8049f0 nfsd: remove redundant boot_time parm from grace_done client tracking op
Since it's stored in nfsd_net, we don't need to pass it in separately.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17 16:33:12 -04:00
Jeff Layton
f779002965 lockd: move lockd's grace period handling into its own module
Currently, all of the grace period handling is part of lockd. Eventually
though we'd like to be able to build v4-only servers, at which point
we'll need to put all of this elsewhere.

Move the code itself into fs/nfs_common and have it build a grace.ko
module. Then, rejigger the Kconfig options so that both nfsd and lockd
enable it automatically.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17 16:33:11 -04:00
Christoph Hellwig
f0c63124a6 nfsd: update mtime on truncate
This fixes a failure in xfstests generic/313 because nfs doesn't update
mtime on a truncate.  The protocol requires this to be done implicity
for a size changing setattr.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-11 11:12:16 -04:00
Kinglong Mee
027bc41a3e NFSD: Put export if prepare_creds() fail
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-03 17:43:04 -04:00
Kinglong Mee
13c82e8eb5 NFSD: Full checking of authentication name
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-03 17:43:03 -04:00
Kinglong Mee
48c348b09c NFSD: Fix bad using of return value from qword_get
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-03 17:43:02 -04:00
Kinglong Mee
15d176c195 NFSD: Fix a memory leak if nfsd4_recdir_load fail
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-03 17:43:01 -04:00
Kinglong Mee
c2236f141e NFSD: Reset creds after mnt_want_write_file() fail
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-03 17:43:01 -04:00
Kinglong Mee
8519f994e5 NFSD: Put file after ima_file_check fail in nfsd_open()
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-03 17:43:00 -04:00
Trond Myklebust
66f09ca717 nfs: do not start the callback thread until we set rqstp->rq_task
This fixes an Oopsable race when starting up the callback server.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-02 17:53:30 -04:00
Trond Myklebust
d4e8990299 lockd: Do not start the lockd thread before we've set nlmsvc_rqst->rq_task
This fixes an Oopsable race when starting lockd.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-02 17:49:17 -04:00
J. Bruce Fields
ccad7dad86 nfsd4: remove labeled NFS warning from config help
The working group appears committed to keeping the protocol stable, the
code has gotten some use and seems to work OK.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-28 16:00:07 -04:00
Anna Schumaker
2b8941b962 NFSD: Update some as-yet unused 4.2 error codes
Recent NFS v4.2 drafts have removed NFS4ERR_METADATA_NOTSUPP and
reassigned the error code to NFS4ERR_UNION_NOTSUPP.

I also add in the NFS4ERR_OFFLOAD_NO_REQS error code.

We're not using any of these yet, so there's no harm done.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-28 16:00:01 -04:00
Kinglong Mee
6cd906627b NFSD: Remove duplicate initialization of file_lock
locks_alloc_lock() has initialized struct file_lock, no need to
re-initialize it here.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-28 15:58:35 -04:00
Rajesh Ghanekar
18c01ab302 nfsd: allow turning off nfsv3 readdir_plus
One of our customer's application only needs file names, not file
attributes. With directories having 10K+ inodes (assuming buffer cache
has directory blocks cached having file names, but inode cache is
limited and hence need eviction of older cached inodes), older inodes
are evicted periodically. So if they keep on doing readdir(2) from NSF
client on multiple directories, some directory's files are periodically
removed from inode cache and hence new readdir(2) on same directory
requires disk access to bring back inodes again to inode cache.

As READDIRPLUS request fetches attributes also, doing getattr on each
file on server, it causes unnecessary disk accesses. If READDIRPLUS on
NFS client is returned with -ENOTSUPP, NFS client uses READDIR request
which just gets the names of the files in a directory, not attributes,
hence avoiding disk accesses on server.

There's already a corresponding client-side mount option, but an export
option reduces the need for configuration across multiple clients.

This flag affects NFSv3 only.  If it turns out it's needed for NFSv4 as
well then we may have to figure out how to extend the behavior to NFSv4,
but it's not currently obvious how to do that.

Signed-off-by: Rajesh Ghanekar <rajesh_ghanekar@symantec.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-18 15:12:14 -04:00
J. Bruce Fields
f7b43d0c99 nfsd4: reserve adequate space for LOCK op
As of  8c7424cff6 "nfsd4: don't try to encode conflicting owner if low
on space", we permit the server to process a LOCK operation even if
there might not be space to return the conflicting lockowner, because
we've made returning the conflicting lockowner optional.

However, the rpc server still wants to know the most we might possibly
return, so we need to take into account the possible conflicting
lockowner in the svc_reserve_space() call here.

Symptoms were log messages like "RPC request reserved 88 but used 108".

Fixes: 8c7424cff6 "nfsd4: don't try to encode conflicting owner if low on space"
Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:14 -04:00
J. Bruce Fields
1383bf37ce nfsd4: remove obsolete comment
We do what Neil suggests now.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:14 -04:00
Ross Lagerwall
63bab0651b nfsd3: Check write permission after checking existence
When creating a file that already exists in a read-only directory with
O_EXCL, the NFSv3 server returns EACCES rather than EEXIST (which local
files and the NFSv4 server return).  Fix this by checking the MAY_CREATE
permission only if the file does not exist.  Since this already happens
in do_nfsd_create, the check in nfsd3_proc_create can simply be removed.

Signed-off-by: Ross Lagerwall <rosslagerwall@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:14 -04:00
Jeff Layton
afbda402a0 nfsd: call nfs4_put_deleg_lease outside of state_lock
Currently, we hold the state_lock when releasing the lease. That's
potentially problematic in the future if we allow for setlease methods
that can sleep. Move the nfs4_put_deleg_lease call out of the delegation
unhashing routine (which was always a bit goofy anyway), and into the
unlocked sections of the callers of unhash_delegation_locked.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:14 -04:00
Jeff Layton
6bcc034eac nfsd: protect lease-related nfs4_file fields with fi_lock
Currently these fields are protected with the state_lock, but that
doesn't really make a lot of sense. These fields are "private" to the
nfs4_file, and can be protected with the more granular fi_lock.

The fi_lock is already held when setting these fields. Make the code
hold the fp->fi_lock when clearing the lease-related fields in the
nfs4_file, and no longer require that the state_lock be held when
calling into this function.

To prevent lock inversion with the i_lock, we also move the vfs_setlease
and fput calls outside of the fi_lock. This also sets us up for allowing
vfs_setlease calls to block in the future.

Finally, remove a redundant NULL pointer check. unhash_delegation_locked
locks the fp->fi_lock prior to that check, so fp in that function must
never be NULL.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:13 -04:00
Trond Myklebust
ef9b16dc6d nfsd: Reorder nfsd_cache_match to check more powerful discriminators first
We would normally expect the xid and the checksum to be the best
discriminators. Check them before looking at the procedure number,
etc.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:13 -04:00
Trond Myklebust
89a26b3d29 nfsd: split DRC global spinlock into per-bucket locks
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:13 -04:00
Trond Myklebust
31e60f5222 nfsd: convert num_drc_entries to an atomic_t
...so we can remove the spinlocking around it.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:12 -04:00
Trond Myklebust
11acf6ef3b nfsd: Remove the cache_hash list
Now that the lru list is per-bucket, we don't need a second list for
searches.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:12 -04:00
Trond Myklebust
bedd4b61a4 nfsd: convert the lru list into a per-bucket thing
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:12 -04:00
Trond Myklebust
7142b98d9f nfsd: Clean up drc cache in preparation for global spinlock elimination
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:12 -04:00
Trond Myklebust
887999774a nfs: Ensure that nfs_callback_start_svc sets the server rq_task...
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:10 -04:00
Trond Myklebust
d6a7ce424f lockd: Ensure that lockd_start_svc sets the server rq_task...
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:10 -04:00
Linus Torvalds
e64df3ebe8 Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "These are all fixes I'd like to get out to a broader audience.

  The biggest of the bunch is Mark's quota fix, which is also in the
  SUSE kernel, and makes our subvolume quotas dramatically more
  accurate.

  I've been running xfstests with these against your current git
  overnight, but I'm queueing up longer tests as well"

* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: disable strict file flushes for renames and truncates
  Btrfs: fix csum tree corruption, duplicate and outdated checksums
  Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch
  Btrfs: fix compressed write corruption on enospc
  btrfs: correctly handle return from ulist_add
  btrfs: qgroup: account shared subtrees during snapshot delete
  Btrfs: read lock extent buffer while walking backrefs
  Btrfs: __btrfs_mod_ref should always use no_quota
  btrfs: adjust statfs calculations according to raid profiles
2014-08-16 09:06:55 -06:00
Linus Torvalds
53b95d6341 File locking related bugfixes for v3.17 (pile #2)
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJT7ey6AAoJEAAOaEEZVoIVgdkQAJfToVhgddVMOweeGo4wqPqM
 lmS35dYVEy+gPfYhCU2Zytgk9yLlmNJeDT7Y+XGe4l15Ax/PDNuWLiRnUKRy1FrU
 l7cbKRAn1L6TqBO2ang5t68Bw7ojUpRoKWKnDyjVAj80mZYRZPWvQOurLeTtra2o
 XLnHbK54F36s07OXwSTZgbh/ffVQ1RMUBU8fy+0Ws3mTAbzO1KuB5Ws4pdt2ysjI
 14pBHO553X0VXAJxGkH66xxblt1o+Q9aBZp7RL0VNtR4bGU4FMvXy+0D5G6h8AGt
 rhl2PWTNKGg8HFgUK+7nCheH4j/0maXo541D3q+sJhbknRhD3n3x7IBvjkm9tdjB
 OgTkAp1mwL21mJP21MOrAil8uwGoSr7qTCngZY6nNWH4L3EkVl27+LlGDtkgBp/n
 BJyhcPbFSh+K4TTD3dg2rEx7wF/npQ6yPOljjvWXKJEm5lx3ZPkK1l1xS3QnVTMe
 pLq+wTZ9v1cU7+9JFWICQwclv4unjIgqxLo7/op5P5KvTWOHFW4cjdwCBPVE1g3a
 WC2c5jdB0Up6Z59aXAN3p5dk8MCy6NA41lkMfJN4AUAIzb6NvNeDBhrcFaHwXowm
 jgCtEPqFN/HqsEwJmhJ7fcIEKYrOCuzeaR5tGuwJ11re/oULGXTgMkrFwgm/YWwu
 esRBAc53/hZg6oo3ipUH
 =09BX
 -----END PGP SIGNATURE-----

Merge tag 'locks-v3.17-2' of git://git.samba.org/jlayton/linux

Pull file locking bugfixes from Jeff Layton:
 "Most of these patches are to fix a long-standing regression that crept
  in when the BKL was removed from the file-locking code.  The code was
  converted to use a conventional spinlock, but some fl_release_private
  ops can block and you can end up sleeping inside the lock.

  There's also a patch to make /proc/locks show delegations as 'DELEG'"

* tag 'locks-v3.17-2' of git://git.samba.org/jlayton/linux:
  locks: update Locking documentation to clarify fl_release_private behavior
  locks: move locks_free_lock calls in do_fcntl_add_lease outside spinlock
  locks: defer freeing locks in locks_delete_lock until after i_lock has been dropped
  locks: don't reuse file_lock in __posix_lock_file
  locks: don't call locks_release_private from locks_copy_lock
  locks: show delegations as "DELEG" in /proc/locks
2014-08-16 08:58:47 -06:00
Linus Torvalds
da06df548e Merge git://git.kvack.org/~bcrl/aio-next
Pull aio updates from Ben LaHaise.

* git://git.kvack.org/~bcrl/aio-next:
  aio: use iovec array rather than the single one
  aio: fix some comments
  aio: use the macro rather than the inline magic number
  aio: remove the needless registration of ring file's private_data
  aio: remove no longer needed preempt_disable()
  aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx()
  aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()
2014-08-16 08:56:27 -06:00
Chris Mason
8d875f95da btrfs: disable strict file flushes for renames and truncates
Truncates and renames are often used to replace old versions of a file
with new versions.  Applications often expect this to be an atomic
replacement, even if they haven't done anything to make sure the new
version is fully on disk.

Btrfs has strict flushing in place to make sure that renaming over an
old file with a new file will fully flush out the new file before
allowing the transaction commit with the rename to complete.

This ordering means the commit code needs to be able to lock file pages,
and there are a few paths in the filesystem where we will try to end a
transaction with the page lock held.  It's rare, but these things can
deadlock.

This patch removes the ordered flushes and switches to a best effort
filemap_flush like ext4 uses. It's not perfect, but it should fix the
deadlocks.

Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:42 -07:00
Filipe Manana
27b9a8122f Btrfs: fix csum tree corruption, duplicate and outdated checksums
Under rare circumstances we can end up leaving 2 versions of a checksum
for the same file extent range.

The reason for this is that after calling btrfs_next_leaf we process
slot 0 of the leaf it returns, instead of processing the slot set in
path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
btrfs_next_leaf() releases the path and before it searches for the next
leaf, another task might cause a split of the next leaf, which migrates
some of its keys to the leaf we were processing before calling
btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
same leaf but with path->slots[0] having a slot number corresponding
to the first new key it got, that is, a slot number that didn't exist
before calling btrfs_next_leaf(), as the leaf now has more keys than
it had before. So we must really process the returned leaf starting at
path->slots[0] always, as it isn't always 0, and the key at slot 0 can
have an offset much lower than our search offset/bytenr.

For example, consider the following scenario, where we have:

sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472

  Leaf N:

    slot = 0                           slot = btrfs_header_nritems() - 1
  |-------------------------------------------------------------------|
  | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
  |-------------------------------------------------------------------|

  Leaf N + 1:

      slot = 0                          slot = btrfs_header_nritems() - 1
  |--------------------------------------------------------------------|
  | [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 |
  |--------------------------------------------------------------------|

Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
find the next highest key, which releases the current path and then searches
for that next key. However after releasing the path and before finding that
next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
btrfs_next_leaf() will returns us a path again with leaf N but with the slot
pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
is then:

    slot = 0                        slot = btrfs_header_nritems() - 2  slot = btrfs_header_nritems() - 1
  |----------------------------------------------------------------------------------------------------|
  | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4]  [(CSUM CSUM 40161280), size 32] |
  |----------------------------------------------------------------------------------------------------|

And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump
into the "insert:" label, which will set tmp to:

    tmp = min((sums->len - total_bytes) >> blocksize_bits,
        (next_offset - file_key.offset) >> blocksize_bits) =
    min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
    min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4

and

   ins_size = csum_size * tmp = 4 * 4 = 16 bytes.

In other words, we insert a new csum item in the tree with key
(CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
because the item with key (CSUM CSUM 40161280) (the one that was moved from
leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
bytes of our data and won't get those old checksums removed.

So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
and breaks the logical rule:

   Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover

An obvious bad effect of this is that a subsequent csum tree lookup to get
the checksum of any of the blocks with logical offset of 40161280, 40165376
or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:40 -07:00