linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-13 06:32:50 +00:00

Author	SHA1	Message	Date
Stanislav Kinsbursky	29dcc16a8e	NFS: make nfs_callback_tcpport6 per network context Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:25:51 -07:00
Stanislav Kinsbursky	bbe0a3aa4e	NFS: make nfs_callback_tcpport per network context Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:25:47 -07:00
Stanislav Kinsbursky	23c20ecd44	NFS: callback up - users counting cleanup Usage coutner now increased only is the service was started sccessfully. Even if service is running already, then goto is not required anymore, because service creation and start will be skipped. With this patch code looks clearer. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:25:38 -07:00
Stanislav Kinsbursky	8e24614443	NFS: callback service start function introduced This is just a code move, which from my POW makes code looks better. I.e. now on start we have 3 different stages: 1) Service creation. 2) Service per-net data allocation. 3) Service start. Patch also renames goto label "out_err:" into "err_start:" to reflect new changes. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:25:35 -07:00
Stanislav Kinsbursky	691c457ae6	NFS: callback up - transport backchannel cleanup No need to assign transports backchannel server explicitly in nfs41_callback_up() - there is nfs_callback_bc_serv() function for this. By using it, nfs4_callback_up() and nfs41_callback_up() can be called without transport argument. Note: service have to be passed to nfs_callback_bc_serv() instead of callback, since callback link can be uninitialized. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:25:29 -07:00
Stanislav Kinsbursky	c946556b87	NFS: move per-net callback thread initialization to nfs_callback_up_net() v4: 1) Callback transport creation routine selection by version simlified. This new function in now called before nfs_minorversion_callback_svc_setup()). Also few small changes: 1) current network namespace in nfs_callback_up() was replaced by transport net. 2) svc_shutdown_net() was moved prior to callback usage counter decrement (because in case of per-net data allocation faulure svc_shutdown_net() have to be skipped). Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:25:24 -07:00
Stanislav Kinsbursky	dd018428dc	NFS: callback service creation function introduced This function creates service if it's not exist, or increase usage counter of the existent, and returns pointer to it. Usage counter will be droppepd by svc_destroy() later in nfs_callback_up(). Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:25:11 -07:00
Stanislav Kinsbursky	c8ceb4124b	NFS: pass net to nfs_callback_down() Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:24:51 -07:00
Weston Andros Adamson	6168f62cbd	NFSv4: Add ACCESS operation to OPEN compound The OPEN operation has no way to differentiate an open for read and an open for execution - both look like read to the server. This allowed users to read files that didn't have READ access but did have EXEC access, which is obviously wrong. This patch adds an ACCESS call to the OPEN compound to handle the difference between OPENs for reading and execution. Since we're going through the trouble of calling ACCESS, we check all possible access bits and cache the results hopefully avoiding an ACCESS call in the future. Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:20:11 -07:00
Sage Weil	6816282dab	ceph: propagate layout error on osd request creation If we are creating an osd request and get an invalid layout, return an EINVAL to the caller. We switch up the return to have an error code instead of NULL implying -ENOMEM. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-10-01 17:20:00 -05:00
Bryan Schumaker	57a51048da	NFS: Use kzalloc() instead of kmalloc() in the idmapper This will allocate memory that has already been zeroed, allowing us to remove the memset later on. Signed-off-by: Bryan Schumaker <bjchuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:18:44 -07:00
Bryan Schumaker	6938867edb	NFS: Remove bad delegations during open recovery I put the client into an open recovery loop by: Client: Open file read half Server: Expire client (echo 0 > /sys/kernel/debug/nfsd/forget_clients) Client: Drop vm cache (echo 3 > /proc/sys/vm/drop_caches) finish reading file This causes a loop because the client never updates the nfs4_state after discovering that the delegation is invalid. This means it will keep trying to read using the bad delegation rather than attempting to re-open the file. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> CC: stable@vger.kernel.org [3.4+] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:17:25 -07:00
Bryan Schumaker	fcb6d9c6b7	NFS: Always use the open stateid when checking for expired opens If we are reading through a delegation, and the delegation is OK then state->stateid will still point to a delegation stateid and not an open stateid. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:17:17 -07:00
J. Bruce Fields	0d22f68f02	nfsd4: don't allow reclaims of expired clients When a confirmed client expires, we normally also need to expire any stable storage record which would allow that client to reclaim state on the next boot. We forgot to do this in some cases. (For example, in destroy_clientid, and in the cases in exchange_id and create_session that destroy and existing confirmed client.) But in most other cases, there's really no harm to calling nfsd4_client_record_remove(), because it is a no-op in the case the client doesn't have an existing The single exception is destroying a client on shutdown, when we want to keep the stable storage records so we can recognize which clients will be allowed to reclaim when we come back up. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:40:04 -04:00
J. Bruce Fields	6a3b156342	nfsd4: remove redundant callback probe Both nfsd4_init_conn and alloc_init_session are probing the callback channel, harmless but pointless. Also, nfsd4_init_conn should probably be probing in the "unknown" case as well. In fact I don't see any harm to just doing it unconditionally when we get a new backchannel connection. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:40:03 -04:00
J. Bruce Fields	8f9d3d3b7c	nfsd4: expire old client earlier Before we had to delay expiring a client till we'd found out whether the session and connection allocations would succeed. That's no longer necessary. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:40:03 -04:00
J. Bruce Fields	81f0b2a496	nfsd4: separate session allocation and initialization This will allow some further simplification. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:40:02 -04:00
J. Bruce Fields	a827bcb242	nfsd4: clean up session allocation Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:40:01 -04:00
J. Bruce Fields	1377b69e68	nfsd4: minor free_session cleanup Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:40:00 -04:00
J. Bruce Fields	e1ff371f9d	nfsd4: new_conn_from_crses should only allocate Do the initialization in the caller, and clarify that the only failure ever possible here was due to allocation. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:40:00 -04:00
J. Bruce Fields	3ba6367124	nfsd4: separate connection allocation and initialization It'll be useful to have connection allocation and initialization as separate functions. Also, note we'd been ignoring the alloc_conn error return in bind_conn_to_session. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:39:59 -04:00
J. Bruce Fields	4973050148	nfsd4: reject bad forechannel attrs earlier This could simplify the logic a little later. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:39:58 -04:00
J. Bruce Fields	d15c077e44	nfsd4: enforce per-client sessions/no-sessions distinction Something like creating a client with setclientid and then trying to confirm it with create_session may not crash the server, but I'm not completely positive of that, and in any case it's obviously bad client behavior. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:39:58 -04:00
J. Bruce Fields	c116a0af76	nfsd4: set cl_minorversion at create time And remove some mostly obsolete comments. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:39:57 -04:00
J. Bruce Fields	68eb35081e	nfsd4: don't pin clientids to pseudoflavors I added cr_flavor to the data compared in same_creds without any justification, in `d5497fc693` "nfsd4: move rq_flavor into svc_cred". Recent client changes then started making mount -osec=krb5 server:/export /mnt/ echo "hello" >/mnt/TMP umount /mnt/ mount -osec=krb5i server:/export /mnt/ echo "hello" >/mnt/TMP to fail due to a clid_inuse on the second open. Mounting sequentially like this with different flavors probably isn't that common outside artificial tests. Also, the real bug here may be that the server isn't just destroying the former clientid in this case (because it isn't good enough at recognizing when the old state is gone). But it prompted some discussion and a look back at the spec, and I think the check was probably wrong. Fix and document. Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:39:14 -04:00
Dmitry Torokhov	dde3ada3d0	Merge branch 'next' into for-linus Prepare first set of updates for 3.7 merge window.	2012-10-01 14:20:58 -07:00
Linus Torvalds	40689ac479	dlm for 3.7 There are two main patches in this set, both related to the userland dlm_controld daemon. The first fixes a deadlock between dlm_controld and the dlm_send workqueue when both access configfs data simultaneously. The second reworks some code to get around a long standing, but intentional, unlock balance warning. The userland daemon no longer takes a lock that is later released from the kernel. The other commits are minor fixes and changes. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJQaeyUAAoJEDgbc8f8gGmqokAQALIt28LyaIAh9bfc0AWyLldM t81sT2gDR/uF0r2sLHdevunCecUbxR8CN5i+pxbTf1OsDkrhYQDx7GtQvTUefc2M /+JA/8UosZkrsTC0LvayWbsbC15c8epwxmERel/feinZ7dmgqNGBLaMPGJzZMv5/ KvHPRMg/JSJOpnwC64xZ097RsNkX5P8lwnK7U5ivs4vLCYR3ac3+V2+t/hxAMONE dwt7EZ/ADCBMUkQDTgisJA/Xr4yub+tTxUMshwkX+ve8nzSdj8dPzMFfky0bSady +x/r3GSVScQ//qqp8a01Jb2ZeFhKMfpTcPWve4chsngszsSzElJeMbw7numVuU98 7hCUbH5MeSuQ2Nx3qYfi1LhrV0687lbkNLlqET+3PX/r6m6T4SqucOmPv2SeSet9 rLGYNM1hJxx1rPASVtEsE/Xhr1agI58EO9tNb9p79wS4X5fcGqUsQXv4gRnu2/LN GRUTADC7nFDXy6BItYDwsuQfuJjMaDIsZ49dEUXYTlzav/zYK4IpU3fGuG3z/UDZ pQl+g0aK7rsq5hk7jxGfdHDBvhM+kcRMeQuTJqxJuJvLs0pmLWvFJXbyn+xh7y9Y MZYSSI4L3j1s4u6SkGUsMRezfcfJEYQdOzlIs7HEaLG++yBgcScF70ZgJByxzmfr eb9tpzwQvGLY4CvUVbEi =Ghb3 -----END PGP SIGNATURE----- Merge tag 'dlm-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "There are two main patches in this set, both related to the userland dlm_controld daemon. The first fixes a deadlock between dlm_controld and the dlm_send workqueue when both access configfs data simultaneously. The second reworks some code to get around a long standing, but intentional, unlock balance warning. The userland daemon no longer takes a lock that is later released from the kernel. The other commits are minor fixes and changes." * tag 'dlm-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: check the maximum size of a request from user dlm: cleanup send_to_sock routine dlm: convert add_sock routine return value type to void dlm: remove redundant variable assignments dlm: fix unlock balance warnings dlm: fix uninitialized spinlock dlm: fix deadlock between dlm_send and dlm_controld	2012-10-01 13:51:58 -07:00
Wei Yongjun	b905a7f8b7	ceph: convert to use le32_add_cpu() Convert cpu_to_le32(le32_to_cpu(E1) + E2) to use le32_add_cpu(). dpatch engine is used to auto generate this patch. (https://github.com/weiyj/dpatch) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Sage Weil <sage@inktank.com>	2012-10-01 14:30:54 -05:00
Yan, Zheng	3e8f43a089	ceph: Fix oops when handling mdsmap that decreases max_mds When i >= newmap->m_max_mds, ceph_mdsmap_get_addr(newmap, i) return NULL. Passing NULL to memcmp() triggers oops. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com>	2012-10-01 14:30:54 -05:00
Alex Elder	c98f533c94	ceph: let path portion of mount "device" be optional A recent change to /sbin/mountall causes any trailing '/' character in the "device" (or fs_spec) field in /etc/fstab to be stripped. As a result, an entry for a ceph mount that intends to mount the root of the name space ends up with now path portion, and the ceph mount option processing code rejects this. That is, an entry in /etc/fstab like: cephserver:port:/ /mnt ceph defaults 0 0 provides to the ceph code just "cephserver:port:" as the "device," and that gets rejected. Although this is a bug in /sbin/mountall, we can have the ceph mount code support an empty/nonexistent path, interpreting it to mean the root of the name space. RFC 5952 offers recommendations for how to express IPv6 addresses, and recommends the usage found in RFC 3986 (which specifies the format for URI's) for representing both IPv4 and IPv6 addresses that include port numbers. (See in particular the definition of "authority" found in the Appendix of RFC 3986.) According to those standards, no host specification will ever contain a '/' character. As a result, it is sufficient to scan a provided "device" from an /etc/fstab entry for the first '/' character, and if it's found, treat that as the beginning of the path. If no '/' character is present, we can treat the entire string as the monitor host specification(s), and assume the path to be the root of the name space. We'll still require a ':' to separate the host portion from the (possibly empty) path portion. This means that we can more formally define how ceph will interpret the "device" it's provided when processing a mount request: "device" will look like: <server_spec>[,<server_spec>...]:[<path>] where <server_spec> is <ip>[:<port>] <path> is optional, but if present must begin with '/' This addresses http://tracker.newdream.net/issues/2919 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>	2012-10-01 14:30:49 -05:00
Linus Torvalds	3498d13b80	TTY merge for 3.7-rc1 As we skipped the merge window for 3.6-rc1 for the tty tree, everything is now settled down and working properly, so we are ready for 3.7-rc1. Here's the patchset, it's big, but the large changes are removing a firmware file and adding a staging tty driver (it depended on the tty core changes, so it's going through this tree instead of the staging tree.) All of these patches have been in the linux-next tree for a while. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iEYEABECAAYFAlBp36oACgkQMUfUDdst+yk4WgCdEy13hot8fI2Lqnc7W0LKu7GX 4p8AoLTjzrXhLosxdijskDQ9X1OtjrxU =S5Ng -----END PGP SIGNATURE----- Merge tag 'tty-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull TTY changes from Greg Kroah-Hartman: "As we skipped the merge window for 3.6-rc1 for the tty tree, everything is now settled down and working properly, so we are ready for 3.7-rc1. Here's the patchset, it's big, but the large changes are removing a firmware file and adding a staging tty driver (it depended on the tty core changes, so it's going through this tree instead of the staging tree.) All of these patches have been in the linux-next tree for a while. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" Fix up more-or-less trivial conflicts in - drivers/char/pcmcia/synclink_cs.c: tty NULL dereference fix vs tty_port_cts_enabled() helper function - drivers/staging/{Kconfig,Makefile}: add-add conflict (dgrp driver added close to other staging drivers) - drivers/staging/ipack/devices/ipoctal.c: "split ipoctal_channel from iopctal" vs "TTY: use tty_port_register_device" * tag 'tty-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (235 commits) tty/serial: Add kgdb_nmi driver tty/serial/amba-pl011: Quiesce interrupts in poll_get_char tty/serial/amba-pl011: Implement poll_init callback tty/serial/core: Introduce poll_init callback kdb: Turn KGDB_KDB=n stubs into static inlines kdb: Implement disable_nmi command kernel/debug: Mask KGDB NMI upon entry serial: pl011: handle corruption at high clock speeds serial: sccnxp: Make 'default' choice in switch last serial: sccnxp: Remove mask termios caps for SW flow control serial: sccnxp: Report actual baudrate back to core serial: samsung: Add poll_get_char & poll_put_char Powerpc 8xx CPM_UART setting MAXIDL register proportionaly to baud rate Powerpc 8xx CPM_UART maxidl should not depend on fifo size Powerpc 8xx CPM_UART too many interrupts Powerpc 8xx CPM_UART desynchronisation serial: set correct baud_base for EXSYS EX-41092 Dual 16950 serial: omap: fix the reciever line error case 8250: blacklist Winbond CIR port 8250_pnp: do pnp probe before legacy probe ...	2012-10-01 12:26:52 -07:00
Miao Xie	90abccf2c6	Revert "Btrfs: do not do filemap_write_and_wait_range in fsync" This reverts commit `0885ef5b56` After applying the above patch, the performance slowed down because the dirty page flush can only be done by one task, so revert it. The following is the test result of sysbench: Before After 24MB/s 39MB/s Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:22 -04:00
Josef Bacik	698d0082c4	Btrfs: remove bytes argument from do_chunk_alloc Everybody is just making stuff up, and it's just used to see if we really do need to alloc a chunk, and since we do this when we already know we really do it's just a waste of space. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:21 -04:00
Josef Bacik	ea658badc4	Btrfs: delay block group item insertion So we have lots of places where we try to preallocate chunks in order to make sure we have enough space as we make our allocations. This has historically meant that we're constantly tweaking when we should allocate a new chunk, and historically we have gotten this horribly wrong so we way over allocate either metadata or data. To try and keep this from happening we are going to make it so that the block group item insertion is done out of band at the end of a transaction. This will allow us to create chunks even if we are trying to make an allocation for the extent tree. With this patch my enospc tests run faster (didn't expect this) and more efficiently use the disk space (this is what I wanted). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:21 -04:00
Kent Overstreet	be3940c0a9	btrfs: Kill some bi_idx references For immutable bio vecs, I've been auditing and removing bi_idx references. These were harmless, but removing them will make auditing easier. scrub_bio_end_io_worker() was open coding a bio_reset() - but this doesn't appear to have been needed for anything as right after it does a bio_put(), and perusing the code it doesn't appear anything else was holding a reference to the bio. The other use end_bio_extent_readpage() was just for a pr_debug() - changed it to something that might be a bit more useful. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Chris Mason <chris.mason@oracle.com> CC: Stefan Behrens <sbehrens@giantdisaster.de>	2012-10-01 15:19:21 -04:00
Miao Xie	962197babe	Btrfs: fix unnecessary warning when the fragments make the space alloc fail When we wrote some data by compress mode into a btrfs filesystem which was full of the fragments, the kernel will report: BTRFS warning (device xxx): Aborting unused transaction. The reason is: We can not find a long enough free space to store the compressed data because of the fragmentary free space, and the compressed data can not be splited, so the kernel outputed the above message. In fact, btrfs can deal with this problem very well: it fall back to uncompressed IO, split the uncompressed data into small ones, and then store them into to the fragmentary free space. So we shouldn't output the above warning message. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:20 -04:00
Josef Bacik	69ffb54347	Btrfs: create a pinned em when writing to a prealloc range in DIO Wade Cline reported a problem where he was getting garbage and warnings when writing to a preallocated range via O_DIRECT. This is because we weren't creating our normal pinned extent_map for the range we were writing to, which was causing all sorts of issues. This patch fixes the problem and makes his testcase much happier. Thanks, Reported-by: Wade Cline <clinew@linux.vnet.ibm.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:20 -04:00
Josef Bacik	6df7881a84	Btrfs: move the sb_end_intwrite until after the throttle logic Sage reported the following lockdep backtrace ===================================== [ BUG: bad unlock balance detected! ] 3.6.0-rc2-ceph-00171-gc7ed62d #1 Not tainted ------------------------------------- btrfs-cleaner/7607 is trying to release lock (sb_internal) at: [<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs] but there are no more locks to release! other info that might help us debug this: 1 lock held by btrfs-cleaner/7607: #0: (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b405>] cleaner_kthread+0x95/0x120 [btrfs] stack backtrace: Pid: 7607, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00171-gc7ed62d #1 Call Trace: [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs] [<ffffffff810afa9e>] print_unlock_inbalance_bug+0xfe/0x110 [<ffffffff810b289e>] lock_release_non_nested+0x1ee/0x310 [<ffffffff81172f9b>] ? kmem_cache_free+0x7b/0x160 [<ffffffffa004106c>] ? put_transaction+0x8c/0x130 [btrfs] [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs] [<ffffffff810b2a95>] lock_release+0xd5/0x220 [<ffffffff81173071>] ? kmem_cache_free+0x151/0x160 [<ffffffff8117d9ed>] __sb_end_write+0x7d/0x90 [<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs] [<ffffffff81079850>] ? __init_waitqueue_head+0x60/0x60 [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40 [<ffffffffa0042758>] __btrfs_end_transaction+0x368/0x3c0 [btrfs] [<ffffffffa0042808>] btrfs_end_transaction_throttle+0x18/0x20 [btrfs] [<ffffffffa00318f0>] btrfs_drop_snapshot+0x410/0x600 [btrfs] [<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0 [<ffffffffa00430ef>] btrfs_clean_old_snapshots+0xaf/0x150 [btrfs] [<ffffffffa003b405>] ? cleaner_kthread+0x95/0x120 [btrfs] [<ffffffffa003b419>] cleaner_kthread+0xa9/0x120 [btrfs] [<ffffffffa003b370>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs] [<ffffffff810791ee>] kthread+0xae/0xc0 [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10 [<ffffffff81635430>] ? retint_restore_args+0x13/0x13 [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0 [<ffffffff8163e740>] ? gs_change+0x13/0x13 This is because the throttle stuff can commit the transaction, which expects to be the one stopping the intwrite stuff, but we've already done it in the __btrfs_end_transaction. Moving the sb_end_intewrite after this logic makes the lockdep go away. Thanks, Tested-by: Sage Weil <sage@inktank.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:19 -04:00
Liu Bo	425d17a290	Btrfs: use larger limit for translation of logical to inode This is the change of the kernel side. Translation of logical to inode used to have an upper limit 4k on inode container's size, but the limit is not large enough for a data with a great many of refs, so when resolving logical address, we can end up with "ioctl ret=0, bytes_left=0, bytes_missing=19944, cnt=510, missed=2493" This changes to regard 64k as the upper limit and use vmalloc instead of kmalloc to get memory more easily. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:19 -04:00
Liu Bo	df031f0752	Btrfs: use helper for logical resolve We already have a helper, iterate_inodes_from_logical(), for logical resolve, so just use it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:18 -04:00
Liu Bo	69917e4312	Btrfs: fix a bug in parsing return value in logical resolve In logical resolve, we parse extent_from_logical()'s 'ret' as a kind of flag. It is possible to lose our errors because (-EXXXX & BTRFS_EXTENT_FLAG_TREE_BLOCK) is true. I'm not sure if it is on purpose, it just looks too hacky if it is. I'd rather use a real flag and a 'ret' to catch errors. Acked-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Liu Bo <liub.liubo@gmail.com>	2012-10-01 15:19:18 -04:00
liubo	0647d6bd16	Btrfs: cleanup for unused ref cache stuff As ref cache has been removed from btrfs, there is no user on its lock and its check. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:17 -04:00
Miao Xie	8407aa4643	Btrfs: fix corrupted metadata in the snapshot When we delete a inode, we will remove all the delayed items including delayed inode update, and then truncate all the relative metadata. If there is lots of metadata, we will end the current transaction, and start a new transaction to truncate the left metadata. In this way, we will leave a inode item that its link counter is > 0, and also may leave some directory index items in fs/file tree after the current transaction ends. In other words, the metadata in this fs/file tree is inconsistent. If we create a snapshot for this tree now, we will find a inode with corrupted metadata in the new snapshot, and we won't continue to drop the left metadata, because its link counter is not 0. We fix this problem by updating the inode item before the current transaction ends. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:17 -04:00
David Sterba	837e197283	btrfs: polish names of kmem caches Usecase: watch 'grep btrfs < /proc/slabinfo' easy to watch all caches in one go. Signed-off-by: David Sterba <dsterba@suse.cz>	2012-10-01 15:19:16 -04:00
Josef Bacik	a80c8dcf7e	Btrfs: fix our overcommit math I noticed I was seeing large lags when running my torrent test in a vm on my laptop. While trying to make it lag less I noticed that our overcommit math was taking into account the number of bytes we wanted to reclaim, not the number of bytes we actually wanted to allocate, which means we wouldn't overcommit as often. This patch fixes the overcommit math and makes shrink_delalloc() use that logic so that it will stop looping faster. We still have pretty high spikes of latency, but the test now takes 3 minutes less time (about 5% faster). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:16 -04:00
Josef Bacik	dea31f5233	Btrfs: wait on async pages when shrinking delalloc Mitch reported a problem where you could get an ENOSPC error when untarring a kernel git tree onto a 16gb file system with compress-force=zlib. This is because compression is a huge pain, it will return from ->writepages() without having actually created any ordered extents. To get around this we check to see if the async submit counter is up, and if it is wait until it drops to 0 before doing our normal ordered wait dance. With this patch I can now untar a kernel git tree onto a 16gb file system without getting ENOSPC errors. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:15 -04:00
Liu Bo	9e8a4a8b0b	Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag We're going to use this flag EXTENT_DEFRAG to indicate which range belongs to defragment so that we can implement snapshow-aware defrag: We set the EXTENT_DEFRAG flag when dirtying the extents that need defragmented, so later on writeback thread can differentiate between normal writeback and writeback started by defragmentation. Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:15 -04:00
Tsutomu Itoh	3d6b5c3b5c	Btrfs: check return value of ulist_alloc() properly ulist_alloc() has the possibility of returning NULL. So, it is necessary to check the return value. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>	2012-10-01 15:19:14 -04:00
Tsutomu Itoh	f54fb859da	Btrfs: fix error handling in delete_block_group_cache() btrfs_iget() never return NULL. So, NULL check is unnecessary. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>	2012-10-01 15:19:14 -04:00
Miao Xie	903889f462	Btrfs: fix wrong size for the reservation when doing, file pre-allocation. When we ran fsstress(a program in xfstests), the filesystem hung up when it is full. It was because the space reserved in btrfs_fallocate() was wrong, btrfs_fallocate() just used the size of the pre-allocation to reserve the space, didn't took the block size aligning into account, so the size of the reserved space was less than the allocated space, it caused the over reserve problem and made the filesystem hung up when invoking cow_file_range(). Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:14 -04:00
Miao Xie	69ce977a17	Btrfs: output more information when aborting a unused transaction handle Though we dump the stack information when aborting a unused transaction handle, we don't know the correct place where we decide to abort the transaction handle if one function has several place where the transaction abort function is invoked and jumps to the same place after this call. And beside that we also don't know the reason why we jump to abort the current handle. So I modify the transaction abort function and make it output the function name, line and error information. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:13 -04:00
Miao Xie	2ecb79239b	Btrfs: fix unprotected ->log_batch We forget to protect ->log_batch when syncing a file, this patch fix this problem by atomic operation. And ->log_batch is used to check if there are parallel sync operations or not, so it is unnecessary to reset it to 0 after the sync operation of the current log tree complete. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:12 -04:00
Miao Xie	48c03c4bcf	Btrfs: fix wrong size for the reservation of the, snapshot creation We should insert/update 6 items(root ref, root backref, dir item, dir index, root item and parent inode) when creating a snapshot, not 5 items, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:12 -04:00
Miao Xie	42874b3db7	Btrfs: fix the snapshot that should not exist The snapshot should be the image of the fs tree before it was created, so the metadata of the snapshot should not exist in the its tree. But now, we found the directory item and directory name index is in both the snapshot tree and the fs tree. It introduces some problems and makes the users feel strange: # mkfs.btrfs /dev/sda1 # mount /dev/sda1 /mnt # mkdir /mnt/1 # cd /mnt/1 # btrfs subvolume snapshot /mnt snap0 # ls -a /mnt/1/snap0/1 . .. [no other file/dir] # ll /mnt/1/snap0/ total 0 drwxr-xr-x 1 root root 10 Ju1 24 12:11 1 ^^^ There is no file/dir in it, but it's size is 10 # cd /mnt/1/snap0/1/snap0 [Enter a unexisted directory successfully...] There is nothing in the directory 1 in snap0, but btrfs told the length of this directory is 10. Beside that, we can enter an unexisted directory, it is very strange to the users. # btrfs subvolume snapshot /mnt/1/snap0 /mnt/snap1 # ll /mnt/1/snap0/1/ total 0 [None] # ll /mnt/snap1/1/ total 0 drwxr-xr-x 1 root root 0 Ju1 24 12:14 snap0 And the source of snap1 did have any directory in Directory 1, but snap1 have a snap0, it is different between the source and the snapshot. So I think we should insert directory item and directory name index and update the parent inode as the last step of snapshot creation, and do not leave the useless metadata in the file tree. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:12 -04:00
Miao Xie	66d8f3dd1c	Btrfs: add a new "type" field into the block reservation structure Sometimes we need choose the method of the reservation according to the type of the block reservation, such as the reservation for the delayed inode update. Now we identify the type just by comparing the address of the reservation variants, it is very ugly if it is a temporary one because we need compare it with all the common reservation variants. So we add a new "type" field to keep the type the reservation variants. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:11 -04:00
Miao Xie	6352b91da1	Btrfs: use a slab for ordered extents allocation The ordered extent allocation is in the fast path of the IO, so use a slab to improve the speed of the allocation. "Size of the struct is 280, so this will fall into the size-512 bucket, giving 8 objects per page, while own slab will pack 14 objects into a page. Another benefit I see is to check for leaked objects when the module is removed (and the cache destroy takes place)." -- David Sterba Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:11 -04:00
Miao Xie	b9a8cc5bef	Btrfs: fix file extent discount problem in the, snapshot If a snapshot is created while we are writing some data into the file, the i_size of the corresponding file in the snapshot will be wrong, it will be beyond the end of the last file extent. And btrfsck will report: root 256 inode 257 errors 100 Steps to reproduce: # mkfs.btrfs <partition> # mount <partition> <mnt> # cd <mnt> # dd if=/dev/zero of=tmpfile bs=4M count=1024 & # for ((i=0; i<4; i++)) > do > btrfs sub snap . $i > done This because the algorithm of disk_i_size update is wrong. Though there are some ordered extents behind the current one which we use to update disk_i_size, it doesn't mean those extents will be dealt with in the same transaction. So We shouldn't use the offset of those extents to update disk_i_size. Or we will get the wrong i_size in the snapshot. We fix this problem by recording the max real i_size. If we find there is a ordered extent which is in front of the current one and doesn't complete, we will record the end of the current one into that ordered extent. Surely, if the current extent holds the end of other extent(it must be greater than the current one because it is behind the current one), we will record the number that the current extent holds. In this way, we can exclude the ordered extents that may not be dealth with in the same transaction, and be easy to know the real disk_i_size. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:10 -04:00
Miao Xie	361048f586	Btrfs: fix full backref problem when inserting shared block reference If we create several snapshots at the same time, the following BUG_ON() will be triggered. kernel BUG at fs/btrfs/extent-tree.c:6047! Steps to reproduce: # mkfs.btrfs <partition> # mount <partition> <mnt> # cd <mnt> # for ((i=0;i<2400;i++)); do touch long_name_to_make_tree_more_deep$i; done # for ((i=0; i<4; i++)) > do > mkdir $i > for ((j=0; j<200; j++)) > do > btrfs sub snap . $i/$j > done & > done The reason is: Before transaction commit, some operations changed the fs tree and new tree blocks were allocated because of COW. We used the implicit non-shared back reference for those newly allocated tree blocks because they were not shared by two or more trees. And then we created the first snapshot for the fs tree, according to the back reference rules, we also used implicit back refs for the child tree blocks of the root node of the fs tree, now those child nodes/leaves were shared by two trees. Then We didn't deal with the delayed references, and continued to change the fs tree(created the second snapshot and inserted the dir item of the new snapshot into the fs tree). According to the rules of the back reference, we added full back refs for those tree blocks whose parents have be shared by two trees. Now some newly allocated tree blocks had two types of the references. As we know, the delayed reference system handles these delayed references from back to front, and the full delayed reference is inserted after the implicit ones. So when we dealt with the back references of those newly allocated tree blocks, the full references was dealt with at first. And if the first reference is a shared back reference and the tree block that the reference points to is newly allocated, It would be considered as a tree block which is shared by two or more trees when it is allocated and should be a full back reference not a implicit one, the flag of its reference also should be set to FULL_BACKREF. But in fact, it was a non-shared tree block with a implicit reference at beginning, so it was not compulsory to set the flags to FULL_BACKREF. So BUG_ON was triggered. We have several methods to fix this bug: 1. deal with delayed references after the snapshot is created and before we change the source tree of the snapshot. This is the easiest and safest way. 2. modify the sort method of the delayed reference tree, make the full delayed references be inserted before the implicit ones. It is also very easy, but I don't know if it will introduce some problems or not. 3. modify select_delayed_ref() and make it select the implicit delayed reference at first. This way is not so good because it may wastes CPU time if we have lots of delayed references. 4. set the flags to FULL_BACKREF, this method is a little complex comparing with the 1st way. I chose the 1st way to fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:10 -04:00
Miao Xie	6fa9700e73	Btrfs: fix error path in create_pending_snapshot() This patch fixes the following problem: - If we failed to deal with the delayed dir items, we should abort transaction, just as its comment said. Fix it. - If root reference or root back reference insertion failed, we should abort transaction. Fix it. - Fix the double free problem of pending->inherit. - Do not restore the trans->rsv if we doesn't change it. - make the error path more clearly. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:09 -04:00
Wei Yongjun	cf93dccea6	Btrfs: fix possible memory leak in scrub_setup_recheck_block() bbio has been malloced in btrfs_map_block() and should be freed before leaving from the error handling cases. spatch with a semantic match is used to found this problem. (http://coccinelle.lip6.fr/) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>	2012-10-01 15:19:09 -04:00
Josef Bacik	7014cdb493	Btrfs: btrfs_drop_extent_cache should never fail I noticed this when I was doing the fsync stuff, we allocate split extents if we drop an extent range that is in the middle of an existing extent. This BUG()'s if we fail to allocate memory, but the fact is this is just a cache, we will just regenerate the cache if we need it, the important part is that we free the range we are given. This can be done without allocations, so if we fail to allocate splits just skip the splitting stage and free our em and look for more extents to drop. This also makes btrfs_drop_extent_cache a void since nobody was checking the return value anyway. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:09 -04:00
Sage Weil	ac14aed665	Btrfs: do not take cleanup_work_sem in btrfs_run_delayed_iputs() Josef has suggested that this is not necessary. Removing it also avoids this lockdep splat (after the new sb_internal locking stuff was added): [ 604.090449] ====================================================== [ 604.114819] [ INFO: possible circular locking dependency detected ] [ 604.139262] 3.6.0-rc2-ceph-00144-g463b030 #1 Not tainted [ 604.162193] ------------------------------------------------------- [ 604.186139] btrfs-cleaner/6669 is trying to acquire lock: [ 604.209555] (sb_internal#2){.+.+..}, at: [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs] [ 604.257100] [ 604.257100] but task is already holding lock: [ 604.300366] (&fs_info->cleanup_work_sem){.+.+..}, at: [<ffffffffa0048002>] btrfs_run_delayed_iputs+0x72/0x130 [btrfs] [ 604.352989] [ 604.352989] which lock already depends on the new lock. [ 604.352989] [ 604.427104] [ 604.427104] the existing dependency chain (in reverse order) is: [ 604.478493] [ 604.478493] -> #1 (&fs_info->cleanup_work_sem){.+.+..}: [ 604.529313] [<ffffffff810b2c82>] lock_acquire+0xa2/0x140 [ 604.559621] [<ffffffff81632b69>] down_read+0x39/0x4e [ 604.589382] [<ffffffffa004db98>] btrfs_lookup_dentry+0x218/0x550 [btrfs] [ 604.596161] btrfs: unlinked 1 orphans [ 604.675002] [<ffffffffa006aadd>] create_subvol+0x62d/0x690 [btrfs] [ 604.708859] [<ffffffffa006d666>] btrfs_mksubvol.isra.52+0x346/0x3a0 [btrfs] [ 604.772466] [<ffffffffa006d7f2>] btrfs_ioctl_snap_create_transid+0x132/0x190 [btrfs] [ 604.842245] [<ffffffffa006d8ae>] btrfs_ioctl_snap_create+0x5e/0x80 [btrfs] [ 604.912852] [<ffffffffa00708ae>] btrfs_ioctl+0x138e/0x1990 [btrfs] [ 604.951888] [<ffffffff8118e9b8>] do_vfs_ioctl+0x98/0x560 [ 604.989961] [<ffffffff8118ef11>] sys_ioctl+0x91/0xa0 [ 605.026628] [<ffffffff8163d569>] system_call_fastpath+0x16/0x1b [ 605.064404] [ 605.064404] -> #0 (sb_internal#2){.+.+..}: [ 605.126832] [<ffffffff810b25e8>] __lock_acquire+0x1ac8/0x1b90 [ 605.163671] [<ffffffff810b2c82>] lock_acquire+0xa2/0x140 [ 605.200228] [<ffffffff8117dac6>] __sb_start_write+0xc6/0x1b0 [ 605.236818] [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs] [ 605.274029] [<ffffffffa00431a3>] btrfs_start_transaction+0x13/0x20 [btrfs] [ 605.340520] [<ffffffffa004ccfa>] btrfs_evict_inode+0x19a/0x330 [btrfs] [ 605.378720] [<ffffffff811972c8>] evict+0xb8/0x1c0 [ 605.416057] [<ffffffff811974d5>] iput+0x105/0x210 [ 605.452373] [<ffffffffa0048082>] btrfs_run_delayed_iputs+0xf2/0x130 [btrfs] [ 605.521627] [<ffffffffa003b5e1>] cleaner_kthread+0xa1/0x120 [btrfs] [ 605.560520] [<ffffffff810791ee>] kthread+0xae/0xc0 [ 605.598094] [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10 [ 605.636499] [ 605.636499] other info that might help us debug this: [ 605.636499] [ 605.736504] Possible unsafe locking scenario: [ 605.736504] [ 605.801931] CPU0 CPU1 [ 605.835126] ---- ---- [ 605.867093] lock(&fs_info->cleanup_work_sem); [ 605.898594] lock(sb_internal#2); [ 605.931954] lock(&fs_info->cleanup_work_sem); [ 605.965359] lock(sb_internal#2); [ 605.994758] [ 605.994758] * DEADLOCK * [ 605.994758] [ 606.075281] 2 locks held by btrfs-cleaner/6669: [ 606.104528] #0: (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b5d5>] cleaner_kthread+0x95/0x120 [btrfs] [ 606.165626] #1: (&fs_info->cleanup_work_sem){.+.+..}, at: [<ffffffffa0048002>] btrfs_run_delayed_iputs+0x72/0x130 [btrfs] [ 606.231297] [ 606.231297] stack backtrace: [ 606.287723] Pid: 6669, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00144-g463b030 #1 [ 606.347823] Call Trace: [ 606.376184] [<ffffffff8162a77c>] print_circular_bug+0x1fb/0x20c [ 606.409243] [<ffffffff810b25e8>] __lock_acquire+0x1ac8/0x1b90 [ 606.441343] [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs] [ 606.474583] [<ffffffff810b2c82>] lock_acquire+0xa2/0x140 [ 606.505934] [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs] [ 606.539429] [<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0 [ 606.571719] [<ffffffff8117dac6>] __sb_start_write+0xc6/0x1b0 [ 606.603498] [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs] [ 606.637405] [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs] [ 606.670165] [<ffffffff81172e75>] ? kmem_cache_alloc+0xb5/0x160 [ 606.702144] [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs] [ 606.735562] [<ffffffffa00256a6>] ? block_rsv_add_bytes+0x56/0x80 [btrfs] [ 606.769861] [<ffffffffa00431a3>] btrfs_start_transaction+0x13/0x20 [btrfs] [ 606.804575] [<ffffffffa004ccfa>] btrfs_evict_inode+0x19a/0x330 [btrfs] [ 606.838756] [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40 [ 606.872010] [<ffffffff811972c8>] evict+0xb8/0x1c0 [ 606.903800] [<ffffffff811974d5>] iput+0x105/0x210 [ 606.935416] [<ffffffffa0048082>] btrfs_run_delayed_iputs+0xf2/0x130 [btrfs] [ 606.970510] [<ffffffffa003b5d5>] ? cleaner_kthread+0x95/0x120 [btrfs] [ 607.005648] [<ffffffffa003b5e1>] cleaner_kthread+0xa1/0x120 [btrfs] [ 607.040724] [<ffffffffa003b540>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs] [ 607.104740] [<ffffffff810791ee>] kthread+0xae/0xc0 [ 607.137119] [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10 [ 607.169797] [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10 [ 607.202472] [<ffffffff81635430>] ? retint_restore_args+0x13/0x13 [ 607.235884] [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0 [ 607.268731] [<ffffffff8163e740>] ? gs_change+0x13/0x13 Signed-off-by: Sage Weil <sage@inktank.com>	2012-10-01 15:19:08 -04:00
Sage Weil	e209db7ace	Btrfs: set journal_info in async trans commit worker We expect current->journal_info to point to the trans handle we are committing. Signed-off-by: Sage Weil <sage@inktank.com>	2012-10-01 15:19:08 -04:00
Sage Weil	6fc4e35485	Btrfs: pass lockdep rwsem metadata to async commit transaction The freeze rwsem is taken by sb_start_intwrite() and dropped during the commit_ or end_transaction(). In the async case, that happens in a worker thread. Tell lockdep the calling thread is releasing ownership of the rwsem and the async thread is picking it up. XFS plays the same trick in fs/xfs/xfs_aops.c. Signed-off-by: Sage Weil <sage@inktank.com>	2012-10-01 15:19:07 -04:00
Josef Bacik	2aaa665581	Btrfs: add hole punching This patch adds hole punching via fallocate. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:07 -04:00
Josef Bacik	2671485d39	Btrfs: remove unused hint byte argument for btrfs_drop_extents I audited all users of btrfs_drop_extents and found that nobody actually uses the hint_byte argument. I'm sure it was used for something at some point but it's not used now, and the way the pinning works the disk bytenr would never be immediately useful anyway so lets just remove it. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:06 -04:00
Liu Bo	d279440511	Btrfs: check if an inode has no checksum when logging it This is based on Josef's "Btrfs: turbo charge fsync". If an inode is a BTRFS_INODE_NODATASUM one, we don't need to look for csum items any more. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:06 -04:00
Liu Bo	46d8bc3424	Btrfs: fix a bug in checking whether a inode is already in log This is based on Josef's "Btrfs: turbo charge fsync". The current btrfs checks if an inode is in log by comparing root's last_log_commit to inode's last_sub_trans[2]. But the problem is that this root->last_log_commit is shared among inodes. Say we have N inodes to be logged, after the first inode, root's last_log_commit is updated and the N-1 remained files will be skipped. This fixes the bug by keeping a local copy of root's last_log_commit inside each inode and this local copy will be maintained itself. [1]: we regard each log transaction as a subset of btrfs's transaction, i.e. sub_trans Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:06 -04:00
Miao Xie	321f0e7022	Btrfs: fix wrong orphan count of the fs/file tree If we add a new orphan item, we should increase the atomic counter, not decrease it. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:05 -04:00
Liu Bo	4e2f84e63d	Btrfs: improve fsync by filtering extents that we want This is based on Josef's "Btrfs: turbo charge fsync". The above Josef's patch performs very good in random sync write test, because we won't have too much extents to merge. However, it does not performs good on the test: dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync The reason is when we do sequencial sync write, we need to merge the current extent just with the previous one, so that we can get accumulated extents to log: A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ... So we'll have to flush more and more checksum into log tree, which is the bottleneck according to my tests. But we can avoid this by telling fsync the real extents that are needed to be logged. With this, I did the above dd sync write test (size=50m), w/o (orig) w/ (josef's) w/ (this) SATA 104KB/s 109KB/s 121KB/s ramdisk 1.5MB/s 1.5MB/s 10.7MB/s (613%) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:05 -04:00
Josef Bacik	ca7e70f590	Btrfs: do not needlessly restart the transaction for enospc We will stop and restart a transaction every time we move to a different leaf when truncating a file. This is for enospc reasons, but really we could probably get away with doing this a little better by actually working until we hit an ENOSPC. So add a ->failfast flag to the block_rsv and set it when we do truncates which will fail as soon as the block rsv runs out of space, and then at that point we can stop and restart the transaction and refill the block rsv and carry on. This will make rm'ing of a file with lots of extents a bit faster. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:04 -04:00
Liu Bo	06d3d22b45	Btrfs: cleanup extents after we finish logging inode This is based on Josef's "Btrfs: turbo charge fsync". We should cleanup those extents after we've finished logging inode, otherwise we may do redundant work on them. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:04 -04:00
Josef Bacik	0fa83cdb1d	Btrfs: only warn if we hit an error when doing the tree logging I hit this a couple times while working on my fsync patch (all my bugs, not normal operation), but with my new stuff we could have new errors from cases I have not encountered, so instead of BUG()'ing we should be WARN()'ing so that we are notified there is a problem but the user doesn't lose their data. We can easily commit the transaction in the case that the tree logging fails and still be fine, so let's try and be as nice to the user as possible. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:03 -04:00
Josef Bacik	5dc562c541	Btrfs: turbo charge fsync At least for the vm workload. Currently on fsync we will 1) Truncate all items in the log tree for the given inode if they exist and 2) Copy all items for a given inode into the log The problem with this is that for things like VMs you can have lots of extents from the fragmented writing behavior, and worst yet you may have only modified a few extents, not the entire thing. This patch fixes this problem by tracking which transid modified our extent, and then when we do the tree logging we find all of the extents we've modified in our current transaction, sort them and commit them. We also only truncate up to the xattrs of the inode and copy that stuff in normally, and then just drop any extents in the range we have that exist in the log already. Here are some numbers of a 50 meg fio job that does random writes and fsync()s after every write Original Patched SATA drive 82KB/s 140KB/s Fusion drive 431KB/s 2532KB/s So around 2-6 times faster depending on your hardware. There are a few corner cases, for example if you truncate at all we have to do it the old way since there is no way to be sure what is in the log is ok. This probably could be done smarter, but if you write-fsync-truncate-write-fsync you deserve what you get. All this work is in RAM of course so if your inode gets evicted from cache and you read it in and fsync it we'll do it the slow way if we are still in the same transaction that we last modified the inode in. The biggest cool part of this is that it requires no changes to the recovery code, so if you fsync with this patch and crash and load an old kernel, it will run the recovery and be a-ok. I have tested this pretty thoroughly with an fsync tester and everything comes back fine, as well as xfstests. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:03 -04:00
Josef Bacik	224ecce517	Btrfs: fix possible corruption when fsyncing written prealloced extents While working on my fsync patch my fsync tester kept hitting mismatching md5sums when I would randomly write to a prealloc'ed region, syncfs() and then write to the prealloced region some more and then fsync() and then immediately reboot. This is because the tree logging code will skip writing csums for file extents who's generation is less than the current running transaction. When we mark extents as written we haven't been updating their generation so they were always being skipped. This wouldn't happen if you were to preallocate and then write in the same transaction, but if you for example prealloced a VM you could definitely run into this problem. This patch makes my fsync tester happy again. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:02 -04:00
Josef Bacik	54338b5cc4	Btrfs: do not allocate chunks as agressively Swinging this pendulum back the other way. We've been allocating chunks up to 2% of the disk no matter how much we actually have allocated. So instead fix this calculation to only allocate chunks if we have more than 80% of the space available allocated. Please test this as it will likely cause all sorts of ENOSPC problems to pop up suddenly. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:02 -04:00
Josef Bacik	7c735313bd	Btrfs: update last trans if we don't update the inode There is a completely impossible situation to hit where you can preallocate a file, fsync it, write into the preallocated region, have the transaction commit twice and then fsync and then immediately lose power and lose all of the contents of the write. This patch fixes this just so I feel better about the situation and because it is lightweight, we just update the last_trans when we finish an ordered IO and we don't update the inode itself. This way we are completely safe and I feel better. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:02 -04:00
Jan Schmidt	995e01b7af	Btrfs: fix gcc warnings for 32bit compiles Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-10-01 15:19:01 -04:00
Chris Mason	74dd17fbe3	Btrfs: fix btrfs send for inline items and compression The btrfs send code was assuming the offset of the file item into the extent translated to bytes on disk. If we're compressed, this isn't true, and so it was off into extents owned by other files. It was also improperly handling inline extents. This solves a crash where we may have gone past the end of the file extent item by not testing early enough for an inline extent. It also solves problems where we have a whole between the end of the inline item and the start of the full extent. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-10-01 15:19:00 -04:00
Alexander Block	6d85ed05e1	Btrfs: don't treat top/root directory inode as deleted/reused We can't do the deleted/reused logic for top/root inodes as it would create a stream that tries to delete and recreate the root dir. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:19:00 -04:00
Alexander Block	2981e225f7	Btrfs: ignore non-FS inodes for send/receive We have to ignore inode/space cache objects in send/receive. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:59 -04:00
Alexander Block	2f28f4787c	Btrfs: pass root instead of parent_root to iterate_inode_ref We need to pass the root that we determined earlier to iterate_inode_ref. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:58 -04:00
Alexander Block	d8347fa444	Btrfs: use <= instead of < in is_extent_unchanged Used the wrong compare operator here. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:58 -04:00
Alexander Block	3954096d4b	Btrfs: fix check for changed extent in is_extent_unchanged The previous check was working fine, but this check should be easier to read. Also, we could theoritically have some exotic bugs with the previous checks. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:57 -04:00
Alexander Block	5dc67d0ba9	Btrfs: free nce and nce_head on error in name_cache_insert Both were leaked in case of error. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:56 -04:00
Alexander Block	3e126f32f8	Btrfs: remove unused tmp_path from iterate_dir_item A leftover from older code and unused now. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:55 -04:00
Alexander Block	e938c8ad54	Btrfs: code cleanups for send/receive Doing some code cleanups as suggested by Arne. Changes do not change any logic. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:55 -04:00
Alexander Block	766702ef49	Btrfs: add/fix comments/documentation for send/receive As the subject already said, add/fix comments. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:54 -04:00
Alexander Block	e479d9bb5f	Btrfs: update send_progress at correct places Updating send_progress in process_recorded_refs was not correct. It got updated too early in the cur_inode_new_gen case. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:53 -04:00
Alexander Block	34d73f54e2	Btrfs: make aux field of ulist 64 bit Btrfs send/receive uses the aux field to store inode numbers. On 32 bit machines this may become a problem. Also fix all users of ulist_add and ulist_add_merged. Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:53 -04:00
Alexander Block	7e0926fe5f	Btrfs: fix use of radix_tree for name_cache in send/receive We can't easily use the index of the radix tree for inums as the radix tree uses 32bit indexes on 32bit kernels. For 32bit kernels, we now use the lower 32bit of the inum as index and an additional list to store multiple entries per radix tree entry. Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:52 -04:00
Alexander Block	17589bd96e	Btrfs: fix memory leak for name_cache in send/receive When everything is done, name_cache_free is called which however forgot to call kfree on the cache entries. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:51 -04:00
Alexander Block	adbe7fb6c4	Btrfs: don't break in the final loop of find_extent_clone If we break, we may miss the clone from send_root which we prefer over all other clones. Commit is a result of Arne's review. Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:50 -04:00
Alexander Block	52f9e53ede	Btrfs: use normal return path for root == send_root case Don't have a seperate return path for the mentioned case. Now we do the same "take lowest inode/offset" logic for all found clones. Commit is a result of Arne's review. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:50 -04:00
Alexander Block	35075bb046	Btrfs: use kmalloc instead of stack for backref_ctx Make sure to never get in trouble due to the backref_ctx which was on the stack before. Commit is a result of Arne's review. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:49 -04:00
Alexander Block	ee849c0472	Btrfs: rename backref_ctx::found_in_send_root to found_itself The new name should be easier to understand/read. Commit is a result of Arne's review. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:48 -04:00
Alexander Block	d27aed5e24	Btrfs: remove unused use_list from send/receive code use_list is a leftover and unused. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:48 -04:00
Alexander Block	ccf1626b49	Btrfs: add correct parent to check_dirs when dir got moved We only added the parent for the new position of a moved dir. We also need to add the old parent of the moved dir. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:47 -04:00
Alexander Block	9ea3ef516d	Btrfs: remove unused code with #if 0 fs_path_remove is not used at the moment due to a previous patch. Remove it for now (with #if 0) to avoid compile warnings. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:46 -04:00
Alexander Block	b9291affaa	Btrfs: add missing check for dir != tmp_dir to is_first_ref We missed that check which resultet in all refs with the same name being reported as first_ref. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:45 -04:00
Alexander Block	1f4692da95	Btrfs: fix cur_ino < parent_ino case for send/receive When the current inodes inum is smaller then the inum of the parent directory strange things were happending due to wrong path resolution and other bugs. Fix this with a new approach for the problem. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:45 -04:00
Alexander Block	85a7b33b96	Btrfs: add rdev to get_inode_info in send/receive We need rdev in the next commit. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:44 -04:00
Linus Torvalds	06d2fe153b	Driver core merge for 3.7-rc1 Here is the big driver core update for 3.7-rc1. A number of firmware_class.c updates (as you saw a month or so ago), and some hyper-v updates and some printk fixes as well. All patches that are outside of the drivers/base area have been acked by the respective maintainers, and have all been in the linux-next tree for a while. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iEYEABECAAYFAlBp3vkACgkQMUfUDdst+ylQoACgldktGFgkCLzH+rGYthrXOC5P 9hUAnjmOhdoHlMTL81vWTlH+BrGernym =khrr -----END PGP SIGNATURE----- Merge tag 'driver-core-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core merge from Greg Kroah-Hartman: "Here is the big driver core update for 3.7-rc1. A number of firmware_class.c updates (as you saw a month or so ago), and some hyper-v updates and some printk fixes as well. All patches that are outside of the drivers/base area have been acked by the respective maintainers, and have all been in the linux-next tree for a while. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" * tag 'driver-core-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits) memory: tegra{20,30}-mc: Fix reading incorrect register in mc_readl() device.h: Add missing inline to #ifndef CONFIG_PRINTK dev_vprintk_emit memory: emif: Add ifdef CONFIG_DEBUG_FS guard for emif_debugfs_[init\|exit] Documentation: Fixes some translation error in Documentation/zh_CN/gpio.txt Documentation: Remove 3 byte redundant code at the head of the Documentation/zh_CN/arm/booting Documentation: Chinese translation of Documentation/video4linux/omap3isp.txt device and dynamic_debug: Use dev_vprintk_emit and dev_printk_emit dev: Add dev_vprintk_emit and dev_printk_emit netdev_printk/netif_printk: Remove a superfluous logging colon netdev_printk/dynamic_netdev_dbg: Directly call printk_emit dev_dbg/dynamic_debug: Update to use printk_emit, optimize stack driver-core: Shut up dev_dbg_reatelimited() without DEBUG tools/hv: Parse /etc/os-release tools/hv: Check for read/write errors tools/hv: Fix exit() error code tools/hv: Fix file handle leak Tools: hv: Implement the KVP verb - KVP_OP_GET_IP_INFO Tools: hv: Rename the function kvp_get_ip_address() Tools: hv: Implement the KVP verb - KVP_OP_SET_IP_INFO Tools: hv: Add an example script to configure an interface ...	2012-10-01 12:10:44 -07:00
Linus Torvalds	81f56e5375	Linux support for the 64-bit ARM architecture (AArch64) Features currently supported: - 39-bit address space for user and kernel (each) - 4KB and 64KB page configurations - Compat (32-bit) user applications (ARMv7, EABI only) - Flattened Device Tree (mandated for all AArch64 platforms) - ARM generic timers -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) iQIcBAABAgAGBQJQabRiAAoJEGvWsS0AyF7xXgcQAK+FTXt0ikdQYMkV5AIZXb9i xHRhuiZWx2vKyk0mCqpyGLY58GSmSb6uTBg/2P2Ej7vXdH/RB2goPzjlspfjkDL4 o8RJp7eQ07Uz3KRDYEJgMP8xKZid6KFG93RJ6TjjpKZLuDBdwiG1GP1vb0jVcWfo ttZrj/aI8lMcqrh3Vq5qefP7GWP1OVATqeaGTiT7oo38pXwF3t237xfBr2iDGFBp ZgIRddrxpa7JYUesfJDDDdGHvLq7Vh2jJV+io9qasBZDrtppGJIhZ0vUni2DgIi7 r4i1LcynDN4JaG0maZ4U/YQm74TCD4BqxV8GJ7zwLPTWeN+of+skjhPSLOkA+0fp I+sWjXlv200gDfJZ9qnUld2kFpoDfJi2b7fNDouSDd2OhmVOVWG3jnVP4Z7meVSb O8BYzWDdsAiabuwciUY3OsmW6424lT93b2v86Vncs4unKMvEjOPxYZbUxhqX8f2j gsmWwwD/yS4THx2B6OyW9VT3I5J6miqs2Glt/GG6vPWT5AKQJn9jCxKaBGhPMPIs xe5/GycBYjdk/Y8qRjegxFbEqzQuiRzmkeFn5jwjmBLqpGNbZDpvMaL6adhAKM5/ v6UIKa91ra4fC9N0h6G61pOc9N9DbT8wPbCbdYY0RMTMRuLDZDgAM3Bvz0r2APdD 96leNy6vx684hbkCSLJs =buJB -----END PGP SIGNATURE----- Merge tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64 Pull arm64 support from Catalin Marinas: "Linux support for the 64-bit ARM architecture (AArch64) Features currently supported: - 39-bit address space for user and kernel (each) - 4KB and 64KB page configurations - Compat (32-bit) user applications (ARMv7, EABI only) - Flattened Device Tree (mandated for all AArch64 platforms) - ARM generic timers" * tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: (35 commits) arm64: ptrace: remove obsolete ptrace request numbers from user headers arm64: Do not set the SMP/nAMP processor bit arm64: MAINTAINERS update arm64: Build infrastructure arm64: Miscellaneous header files arm64: Generic timers support arm64: Loadable modules arm64: Miscellaneous library functions arm64: Performance counters support arm64: Add support for /proc/sys/debug/exception-trace arm64: Debugging support arm64: Floating point and SIMD arm64: 32-bit (compat) applications support arm64: User access library functions arm64: Signal handling support arm64: VDSO support arm64: System calls handling arm64: ELF definitions arm64: SMP support arm64: DMA mapping API ...	2012-10-01 11:51:57 -07:00
Steve French	1d4ab90776	[CIFS] Fix indentation of fs/cifs/Kconfig entries make menuconfig for cifs shows multiple entries toward the end of the list with the incorrect indentation (probably a bug in Kconfig parsing of items that are dependant on the module (cifs=m instead of just CONFIG_CIFS). This patch fixes the indentation of all but the last entry (CIFS_ACL) which I don't know how to fix. It also clarifies wording in two places Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2012-10-01 12:48:03 -05:00
Steve French	e4aa25e780	[CIFS] Fix SMB2 negotiation support to select only one dialect (based on vers=) Based on whether the user (on mount command) chooses: vers=3.0 (for smb3.0 support) vers=2.1 (for smb2.1 support) or (with subsequent patch, which will allow SMB2 support) vers=2.0 (for original smb2.02 dialect support) send only one dialect at a time during negotiate (we had been sending a list). Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2012-10-01 12:26:22 -05:00
Linus Torvalds	99dbb1632f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull the trivial tree from Jiri Kosina: "Tiny usual fixes all over the place" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits) doc: fix old config name of kprobetrace fs/fs-writeback.c: cleanup riteback_sb_inodes kerneldoc btrfs: fix the commment for the action flags in delayed-ref.h btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID vfs: fix kerneldoc for generic_fh_to_parent() treewide: fix comment/printk/variable typos ipr: fix small coding style issues doc: fix broken utf8 encoding nfs: comment fix platform/x86: fix asus_laptop.wled_type module parameter mfd: printk/comment fixes doc: getdelays.c: remember to close() socket on error in create_nl_socket() doc: aliasing-test: close fd on write error mmc: fix comment typos dma: fix comments spi: fix comment/printk typos in spi Coccinelle: fix typo in memdup_user.cocci tmiofb: missing NULL pointer checks tools: perf: Fix typo in tools/perf tools/testing: fix comment / output typos ...	2012-10-01 09:06:36 -07:00
Linus Torvalds	e151960a23	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw Pull GFS2 updates from Steven Whitehouse: "The major feature this time is the "rbm" conversion in the resource group code. The new struct gfs2_rbm specifies the location of an allocatable block in (resource group, bitmap, offset) form. There are a number of added helper functions, and later patches then rewrite some of the resource group code in terms of this new structure. Not only does this give us a nice code clean up, but it also removes some of the previous restrictions where extents could not cross bitmap boundaries, for example. In addition to that, there are a few bug fixes and clean ups, but the rbm work is by far the majority of this patch set in terms of number of changed lines." * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: (27 commits) GFS2: Write out dirty inode metadata in delayed deletes GFS2: fix s_writers.counter imbalance in gfs2_ail_empty_gl GFS2: Fix infinite loop in rbm_find GFS2: Consolidate free block searching functions GFS2: Get rid of I_MUTEX_QUOTA usage GFS2: Stop block extents at the end of bitmaps GFS2: Fix unclaimed_blocks() wrapping bug and clean up GFS2: Improve block reservation tracing GFS2: Fall back to ignoring reservations, if there are no other blocks left GFS2: Fix ->show_options() for statfs slow GFS2: Use rbm for gfs2_setbit() GFS2: Use rbm for gfs2_testbit() GFS2: Eliminate unnecessary check for state > 3 in bitfit GFS2: Eliminate redundant calls to may_grant GFS2: Combine functions gfs2_glock_dq_wait and wait_on_demote GFS2: Combine functions gfs2_glock_wait and wait_on_holder GFS2: inline __gfs2_glock_schedule_for_reclaim GFS2: change function gfs2_direct_IO to use a normal gfs2_glock_dq GFS2: rbm code cleanup GFS2: Fix case where reservation finished at end of rgrp ...	2012-10-01 08:51:04 -07:00
Theodore Ts'o	041bbb6d36	ext4: fix mtime update in nodelalloc mode Commits `5e8830dc85` and `41c4d25f78` introduced a regression into v3.6-rc1 for ext4 in nodealloc mode, such that mtime updates would not take place for files modified via mmap if the page was already in the page cache. This would also affect ext3 file systems mounted using the ext4 file system driver. The problem was that ext4_page_mkwrite() had a shortcut which would avoid calling __block_page_mkwrite() under some circumstances, and the above two commit transferred the responsibility of calling file_update_time() to __block_page_mkwrite --- which woudln't get called in some circumstances. Since __block_page_mkwrite() only has three callers, block_page_mkwrite(), ext4_page_mkwrite, and nilfs_page_mkwrite(), the best way to solve this is to move the responsibility for calling file_update_time() to its caller. This problem was found via xfstests #215 with a file system mounted with -o nodelalloc. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: stable@vger.kernel.org	2012-09-30 23:04:56 -04:00
Dmitry Monakhov	6f2080e644	ext4: fix ext_remove_space for punch_hole case Inode is allowed to have empty leaf only if it this is blockless inode. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-30 23:03:50 -04:00
Dmitry Monakhov	02d262dffc	ext4: punch_hole should wait for DIO writers punch_hole is the place where we have to wait for all existing writers (writeback, aio, dio), but currently we simply flush pended end_io request which is not sufficient. Other issue is that punch_hole performed w/o i_mutex held which obviously result in dangerous data corruption due to write-after-free. This patch performs following changes: - Guard punch_hole with i_mutex - Recheck inode flags under i_mutex - Block all new dio readers in order to prevent information leak caused by read-after-free pattern. - punch_hole now wait for all writers in flight NOTE: XXX write-after-free race is still possible because new dirty pages may appear due to mmap(), and currently there is no easy way to stop writeback while punch_hole is in progress. [ Fixed error return from ext4_ext_punch_hole() to make sure that we release i_mutex before returning EPERM or ETXTBUSY -- Ted ] Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-30 23:03:42 -04:00
Al Viro	38b983b346	generic sys_execve() Selected by __ARCH_WANT_SYS_EXECVE in unistd.h. Requires * working current_pt_regs() * NOT doing a syscall-in-kernel kind of kernel_execve() implementation. Using generic kernel_execve() is fine. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-30 22:20:51 -04:00
Al Viro	282124d186	generic kernel_execve() based mostly on arm and alpha versions. Architectures can define __ARCH_WANT_KERNEL_EXECVE and use it, provided that * they have working current_pt_regs(), even for kernel threads. * kernel_thread-spawned threads do have space for pt_regs in the normal location. Normally that's as simple as switching to generic kernel_thread() and making sure that kernel threads do not go through return from syscall path; call the payload from equivalent of ret_from_fork if we are in a kernel thread (or just have separate ret_from_kernel_thread and make copy_thread() use it instead of ret_from_fork in kernel thread case). * they have ret_from_kernel_execve(); it is called after successful do_execve() done by kernel_execve() and gets normal pt_regs location passed to it as argument. It's essentially a longjmp() analog - it should set sp, etc. to the situation expected at the return for syscall and go there. Eventually the need for that sucker will disappear, but that'll take some surgery on kernel_thread() payloads. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-30 13:36:39 -04:00
Miklos Szeredi	8110e16d42	vfs: dcache: fix deadlock in tree traversal IBM reported a deadlock in select_parent(). This was found to be caused by taking rename_lock when already locked when restarting the tree traversal. There are two cases when the traversal needs to be restarted: 1) concurrent d_move(); this can only happen when not already locked, since taking rename_lock protects against concurrent d_move(). 2) racing with final d_put() on child just at the moment of ascending to parent; rename_lock doesn't protect against this rare race, so it can happen when already locked. Because of case 2, we need to be able to handle restarting the traversal when rename_lock is already held. This patch fixes all three callers of try_to_ascend(). IBM reported that the deadlock is gone with this patch. [ I rewrote the patch to be smaller and just do the "goto again" if the lock was already held, but credit goes to Miklos for the real work. - Linus ] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-09-29 17:41:40 -07:00
Brian Norris	74d83beaa2	JFFS2: don't fail on bitflips in OOB JFFS2 was designed without thought for OOB bitflips, it seems, but they can occur and will be reported to JFFS2 via mtd_read_oob()[1]. We don't want to fail on these transactions, since the data was corrected. [1] Few drivers report bitflips for OOB-only transactions. With such drivers, this patch should have no effect. Signed-off-by: Brian Norris <computersforpeace@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2012-09-29 15:34:13 +01:00
Artem Bityutskiy	a445f784ae	JFFS2: fix unmount regression This patch fixes regression introduced by "8bdc81c jffs2: get rid of jffs2_sync_super". We submit a delayed work in order to make sure the write-buffer is synchronized at some point. But we do not flush it when we unmount, which causes an oops when we unmount the file-system and then the delayed work is executed. This patch fixes the issue by adding a "cancel_delayed_work_sync()" infocation in the '->sync_fs()' handler. This will make sure the delayed work is canceled on sync, unmount and re-mount. And because VFS always callse 'sync_fs()' before unmounting or remounting, this fixes the issue. Reported-by: Ludovic Desroches <ludovic.desroches@atmel.com> Cc: stable@vger.kernel.org [3.5+] Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Tested-by: Ludovic Desroches <ludovic.desroches@atmel.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2012-09-29 14:58:42 +01:00
Dmitry Monakhov	1f555cfa29	ext4: serialize truncate with owerwrite DIO workers Jan Kara have spotted interesting issue: There are potential data corruption issue with direct IO overwrites racing with truncate: Like: dio write truncate_task ->ext4_ext_direct_IO ->overwrite == 1 ->down_read(&EXT4_I(inode)->i_data_sem); ->mutex_unlock(&inode->i_mutex); ->ext4_setattr() ->inode_dio_wait() ->truncate_setsize() ->ext4_truncate() ->down_write(&EXT4_I(inode)->i_data_sem); ->__blockdev_direct_IO ->ext4_get_block ->submit_io() ->up_read(&EXT4_I(inode)->i_data_sem); # truncate data blocks, allocate them to # other inode - bad stuff happens because # dio is still in flight. In order to serialize with truncate dio worker should grab extra i_dio_count reference before drop i_mutex. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-29 00:58:26 -04:00
Dmitry Monakhov	1b65007e98	ext4: endless truncate due to nonlocked dio readers If we have enough aggressive DIO readers, truncate and other dio waiters will wait forever inside inode_dio_wait(). It is reasonable to disable nonlock DIO read optimization during truncate. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-29 00:56:15 -04:00
Dmitry Monakhov	1c9114f9c0	ext4: serialize unlocked dio reads with truncate Current serialization will works only for DIO which holds i_mutex, but nonlocked DIO following race is possible: dio_nolock_read_task truncate_task ->ext4_setattr() ->inode_dio_wait() ->ext4_ext_direct_IO ->ext4_ind_direct_IO ->__blockdev_direct_IO ->ext4_get_block ->truncate_setsize() ->ext4_truncate() #alloc truncated blocks #to other inode ->submit_io() #INFORMATION LEAK In order to serialize with unlocked DIO reads we have to rearrange wait sequence 1) update i_size first 2) if i_size about to be reduced wait for outstanding DIO requests 3) and only after that truncate inode blocks Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-29 00:55:23 -04:00
Dmitry Monakhov	17335dcc47	ext4: serialize dio nonlocked reads with defrag workers Inode's block defrag and ext4_change_inode_journal_flag() may affect nonlocked DIO reads result, so proper synchronization required. - Add missed inode_dio_wait() calls where appropriate - Check inode state under extra i_dio_count reference. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-29 00:41:21 -04:00
Dmitry Monakhov	28a535f9a0	ext4: completed_io locking cleanup Current unwritten extent conversion state-machine is very fuzzy. - For unknown reason it performs conversion under i_mutex. What for? My diagnosis: We already protect extent tree with i_data_sem, truncate and punch_hole should wait for DIO, so the only data we have to protect is end_io->flags modification, but only flush_completed_IO and end_io_work modified this flags and we can serialize them via i_completed_io_lock. Currently all these games with mutex_trylock result in the following deadlock truncate: kworker: ext4_setattr ext4_end_io_work mutex_lock(i_mutex) inode_dio_wait(inode) ->BLOCK DEADLOCK<- mutex_trylock() inode_dio_done() #TEST_CASE1_BEGIN MNT=/mnt_scrach unlink $MNT/file fallocate -l $((102410241024)) $MNT/file aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file sleep 2 truncate -s 0 $MNT/file #TEST_CASE1_END Or use 286's xfstests https://github.com/dmonakhov/xfstests/blob/devel/286 This patch makes state machine simple and clean: (1) xxx_end_io schedule final extent conversion simply by calling ext4_add_complete_io(), which append it to ei->i_completed_io_list NOTE1: because of (2A) work should be queued only if ->i_completed_io_list was empty, otherwise the work is scheduled already. (2) ext4_flush_completed_IO is responsible for handling all pending end_io from ei->i_completed_io_list Flushing sequence consists of following stages: A) LOCKED: Atomically drain completed_io_list to local_list B) Perform extents conversion C) LOCKED: move converted io's to to_free list for final deletion This logic depends on context which we was called from. D) Final end_io context destruction NOTE1: i_mutex is no longer required because end_io->flags modification is protected by ei->ext4_complete_io_lock Full list of changes: - Move all completion end_io related routines to page-io.c in order to improve logic locality - Move open coded logic from various xx_end_xx routines to ext4_add_complete_io() - remove EXT4_IO_END_FSYNC - Improve SMP scalability by removing useless i_mutex which does not protect io->flags anymore. - Reduce lock contention on i_completed_io_lock by optimizing list walk. - Rename ext4_end_io_nolock to end4_end_io and make it static - Check flush completion status to ext4_ext_punch_hole(). Because it is not good idea to punch blocks from corrupted inode. Changes since V3 (in request to Jan's comments): Fall back to active flush_completed_IO() approach in order to prevent performance issues with nolocked DIO reads. Changes since V2: Fix use-after-free caused by race truncate vs end_io_work Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-29 00:14:55 -04:00
Dmitry Monakhov	82e5422911	ext4: fix unwritten counter leakage ext4_set_io_unwritten_flag() will increment i_unwritten counter, so once we mark end_io with EXT4_END_IO_UNWRITTEN we have to revert it back on error path. - add missed error checks to prevent counter leakage - ext4_end_io_nolock() will clear EXT4_END_IO_UNWRITTEN flag to signal that conversion finished. - add BUG_ON to ext4_free_end_io() to prevent similar leakage in future. Visible effect of this bug is that unaligned aio_stress may deadlock Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-28 23:36:25 -04:00
Dmitry Monakhov	e27f41e1b7	ext4: give i_aiodio_unwritten a more appropriate name AIO/DIO prefix is wrong because it account unwritten extents which also may be scheduled from buffered write endio Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-28 23:24:52 -04:00
Dmitry Monakhov	f45ee3a1ea	ext4: ext4_inode_info diet Generic inode has unused i_private pointer which may be used as cur_aio_dio storage. TODO: If cur_aio_dio will be passed as an argument to get_block_t this allow to have concurent AIO_DIO requests. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-28 23:21:09 -04:00
Shirish Pargaonkar	c052e2b423	cifs: obtain file access during backup intent lookup (resend) Rebased and resending the patch. Path based queries can fail for lack of access, especially during lookup during open. open itself would actually succeed becasue of back up intent bit but queries (either path or file handle based) do not have a means to specifiy backup intent bit. So query the file info during lookup using trans2 / findfirst / file_id_full_dir_info to obtain file info as well as file_id/inode value. Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Acked-by: Jeff Layton <jlayton@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-28 15:32:28 -05:00
Trond Myklebust	849b286fd0	NFSv4.1: nfs4_proc_layoutreturn must always drop the plh_block_lgets count Currently it does not do so if the RPC call failed to start. Fix is to move the decrement of plh_block_lgets into nfs4_layoutreturn_release. Also remove a redundant test of task->tk_status in nfs4_layoutreturn_done: if lrp->res.lrs_present is set, then obviously the RPC call succeeded. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:18 -04:00
Trond Myklebust	65857d5768	NFSv4.1: _pnfs_return_layout() shouldn't invalidate the layout on failure Failure of the layoutreturn allocation fails is not a good reason to mark the pnfs_layout_hdr as having failed a layoutget or i/o. Just exit cleanly. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:18 -04:00
Trond Myklebust	e5929f3cff	NFSv4.1: Remove the NFS_LAYOUT_RETURNED state It serves no purpose that the test for whether or not we have valid layout segments doesn't already serve. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:17 -04:00
Trond Myklebust	173f77e9c5	NFSv4.1: Clear NFS_LAYOUT_BULK_RECALL when the layout segments are freed Once all the affected layout segments have been freed up, clear the NFS_LAYOUT_BULK_RECALL flag so that we can reuse the pnfs_layout_hdr Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:17 -04:00
Trond Myklebust	8006bfba36	NFSv4.1: Get rid of the NFS_LAYOUT_DESTROYED state We already have a mechanism for blocking LAYOUTGET by means of the plh_block_lgets counter. The only "service" that NFS_LAYOUT_DESTROYED provides at this point is to block layoutget once the layout segment list is empty, which basically means that you have to wait until the pnfs_layout_hdr is destroyed before you can do pNFS on that file again. This patch enables the reuse of the pnfs_layout_hdr if the layout segment list is empty. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:16 -04:00
Trond Myklebust	579342785f	NFSv4.1: Remove unused 'default allocation' for pnfs_alloc_layout_hdr() ...and ditto for pnfs_free_layout_hdr() Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:16 -04:00
Trond Myklebust	a9136d4914	NFSv4.1: Get rid of pNFS spin lock debugging asserts... These are all in static declared functions that are called only once. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:16 -04:00
Trond Myklebust	8f0d27dc5d	NFSv4.1: Balance pnfs_layout_hdr refcount in pnfs_layout_(insert\|remove)_lseg Ensure that the reference count for pnfs_layout_hdr reverts to the original value after a call to pnfs_layout_remove_lseg(). Note that the caller is expected to hold a reference to the struct pnfs_layout_hdr. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:15 -04:00
Trond Myklebust	905ca191cf	NFSv4.1: Clean up pnfs_put_lseg() There is no longer a need to use pnfs_free_lseg_list(). Just call pnfs_free_lseg() directly. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:15 -04:00
Trond Myklebust	9c6263819f	NFSv4.1: Clean up the removal of pnfs_layout_hdr from the server list Move the code into pnfs_free_layout_hdr(), and add checks to get_layout_by_fh_locked to ensure that they don't reference a layout that is being freed. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:14 -04:00
Trond Myklebust	6622c3ea05	NFSv4.1: Free the pnfs_layout_hdr outside the inode->i_lock None of the existing pNFS layout drivers seem to require the inode to be locked while they free the layout header. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:14 -04:00
Trond Myklebust	01d39ce82b	NFSv4.1: Remove redundant reference to the pnfs_layout_hdr Each layout segment already holds a reference to the pnfs_layout_hdr, so there is no need to hold an extra reference that is released once the last layout segment is freed. Ensure that pnfs_find_alloc_layout() always returns a reference to the pnfs_layout_hdr, which will be matched by the final call to pnfs_put_layout_hdr() in pnfs_update_layout(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:13 -04:00
Trond Myklebust	57036a3776	NFSv4.1: Rename the pnfs_put_lseg_common to pnfs_layout_remove_lseg The latter name is more descriptive of the actual function. Also rename pnfs_insert_layout to pnfs_layout_insert_lseg. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:13 -04:00
Trond Myklebust	bb346f6397	NFSv4.1: reset the inode MDS threshold counters on layout destruction Instead of resetting the inode MDS threshold counters when we mark the layout for destruction, do it as part of freeing the layout. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:12 -04:00
Trond Myklebust	965938b83b	NFSv4.1: Get rid of pNFS layout state "NFS_LAYOUT_INVALID" In all cases where we set NFS_LAYOUT_INVALID, we also set NFS_LAYOUT_DESTROYED. Furthermore, in all cases where we test for NFS_LAYOUT_INVALID, we should also be testing for NFS_LAYOUT_DESTROYED, since the latter means that we hold no valid layout segments. Ergo the two are redundant. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:12 -04:00
Trond Myklebust	1f7977c136	NFSv4.1: Simplify the pNFS return-on-close code Confine it to the nfs4_do_close() code. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:12 -04:00
Trond Myklebust	7fdab069b7	NFSv4.1: Fix a race in the pNFS return-on-close code If we sleep after dropping the inode->i_lock, then we are no longer atomic with respect to the rpc_wake_up() call in pnfs_layout_remove_lseg(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:11 -04:00
Trond Myklebust	115ce575cb	NFSv4.1: pnfs_layout_io_set_failed must clear invalid lsegs If pnfs_layout_io_test_failed() authorises a retry of the failed layoutgets, we should clear the existing layout segments so that we start afresh. Do this in pnfs_layout_io_set_failed(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:11 -04:00
Trond Myklebust	3e62121493	NFSv4.1: Don't drop the pnfs_layout_hdr after a layoutget failure We want to cache the pnfs_layout_hdr after a layoutget or i/o failure so that pnfs_update_layout() can find it and know when it is time to retry. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:10 -04:00
Trond Myklebust	830ffb5657	NFSv4.1: Fix a reference leak in pnfs_update_layout If we exit after the call to pnfs_find_alloc_layout(), we have to ensure that we put the struct pnfs_layout_hdr. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:10 -04:00
Trond Myklebust	1dfed2737d	NFSv4.1: pNFS data servers may be temporarily offline In cases where the pNFS data server is just temporarily out of service, we want to mark it as such, and then try again later. Typically that will be in cases of network connection errors etc. This patch allows us to mark the devices as being "unavailable" for such transient errors, and will make them available for retries after a 2 minute timeout period. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:09 -04:00
Trond Myklebust	25c7533357	NFSv4.1: Retry pNFS after a 2 minute timeout If we had to fall back to read/write through MDS, then assume that we should retry pNFS after a suitable timeout period. The following patch sets a timeout of 2 minutes. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:09 -04:00
Trond Myklebust	b9e028fd89	NFSv4.1: Add helpers for setting/reading the I/O fail bit ...and make them local to the pnfs.c file. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:09 -04:00
Trond Myklebust	f86bbcf85d	NFSv4.1: Replace dprintk() in pnfs_update_layout with something less buggy Dereferencing nfsi->layout in order to read plh_flags without holding a spin lock is bug prone. Furthermore, the dprintk() tells you nothing about whether or not the call succeeded. Replace it with something that tells you about whether or not a valid layout segment was returned for the inode in question. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:08 -04:00
Trond Myklebust	78e4e05c64	NFSv4.1: Replace get_device_info() with filelayout_get_device_info() Fix the namespace pollution issue. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:08 -04:00
Trond Myklebust	9369a431bc	NFSv4.1: Cleanup; add "pnfs_" prefix to put_lseg() and get_lseg() Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:07 -04:00
Trond Myklebust	70c3bd2bdf	NFSv4.1: Cleanup; add "pnfs_" prefix to get_layout_hdr() and put_layout_hdr() Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:07 -04:00
Trond Myklebust	49a85061b0	NFSv4.1: Cleanup add a "pnfs_" prefix to mark_matching_lsegs_invalid Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:06 -04:00
Trond Myklebust	a0b0a6e39b	NFS: Clean up the pNFS layoutget interface Ensure that we do return errors from nfs4_proc_layoutget() and that we don't mark the layout as having failed if the error was due to a signal or resource problem on the client side. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:06 -04:00
Trond Myklebust	dcfc4f2546	NFS: Write the entire file if a server reboot occurs during fsync() This is to ensure that we don't clear the NFS_CONTEXT_RESEND_WRITES flag while there are still writes that haven't been resent. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:05 -04:00
Trond Myklebust	05990d1bf2	NFS: Fix fdatasync/fsync() when confronted with a server reboot If the server reboots before it can commit the unstable writes to disk, then nfs_commit_release_pages() will detect this when it compares the verifier returned by COMMIT to the one returned by WRITE. When this happens, the client needs to resend those writes in order to guarantee that they make it to stable storage. This patch adds a signalling mechanism to notify fsync() that it needs to retry all writes before it can exit. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:05 -04:00
Trond Myklebust	795a88c968	NFSv4: Convert the nfs4_lock_state->ls_flags to a bit field Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:04 -04:00
Trond Myklebust	2a369153c8	NFS: Clean up helper function nfs4_select_rw_stateid() We want to be able to pass on the information that the page was not dirtied under a lock. Instead of adding a flag parameter, do this by passing a pointer to a 'struct nfs_lock_owner' that may be NULL. Also reuse this structure in struct nfs_lock_context to carry the fl_owner_t and pid_t. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:04 -04:00
Trond Myklebust	b3c54de6f8	NFS: Convert nfs_get_lock_context to return an ERR_PTR on failure We want to be able to distinguish between allocation failures, and the case where the lock context is not needed (because there are no locks). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 16:03:03 -04:00
David S. Miller	6a06e5e1bb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/team/team.c drivers/net/usb/qmi_wwan.c net/batman-adv/bat_iv_ogm.c net/ipv4/fib_frontend.c net/ipv4/route.c net/l2tp/l2tp_netlink.c The team, fib_frontend, route, and l2tp_netlink conflicts were simply overlapping changes. qmi_wwan and bat_iv_ogm were of the "use HEAD" variety. With help from Antonio Quartulli. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-28 14:40:49 -04:00
Trond Myklebust	0cac120233	NFSv4: Ensure that idmap_pipe_downcall sanity-checks the downcall data Use the idmapper upcall data to verify that the legacy idmapper daemon is indeed responding to an upcall that we sent. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Bryan Schumaker <bjschuma@netapp.com>	2012-09-28 13:43:34 -04:00
Trond Myklebust	e9ab41b620	NFSv4: Clean up the legacy idmapper upcall Replace the BUG_ON(idmap->idmap_key_cons != NULL) with a WARN_ON_ONCE(). Then get rid of the ACCESS_ONCE(idmap->idmap_key_cons). Then add helper functions for starting, finishing and aborting the legacy upcall. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Bryan Schumaker <bjschuma@netapp.com>	2012-09-28 13:42:41 -04:00
Linus Torvalds	7596824e66	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "A couple of fixes; one for automount/lazy umount race, another a classic "we don't protect the refcount transition to zero with the lock that protects looking for object in hash" kind of crap in lockd." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: close the race in nlmsvc_free_block() do_add_mount()/umount -l races	2012-09-28 10:02:53 -07:00
Trond Myklebust	0e24d849c4	NFSv4: Remove BUG_ON() and ACCESS_ONCE() calls in the idmapper The use of ACCESS_ONCE() is wrong, since the various routines that set/clear idmap->idmap_key_cons should be strictly ordered w.r.t. each other, and the idmap->idmap_mutex ensures that only one thread at a time may be in an upcall situation. Also replace the BUG_ON()s with WARN_ON_ONCE() where appropriate. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-28 12:27:11 -04:00
Shaohua Li	02f3939e1a	block: makes bio_split support bio without data discard bio hasn't data attached. We hit a BUG_ON with such bio. This makes bio_split works for such bio. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-28 10:38:48 +02:00
James Morris	bf53083445	Linux 3.6-rc7 -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (GNU/Linux) iQEcBAABAgAGBQJQX7MuAAoJEHm+PkMAQRiG0h0IAJURkrMCAQUxA+Ik66ReH89s LQcVd0U9uL4UUOi7f5WR64Vf9Cfu6VVGX9ZKSvjpNskvlQaUQPMIt4pMe6g4X4dI u0bApEy4XZz3nGabUAghIU8jJ8cDmhCG6kPpSiS7pi7KHc0yIa4WFtJRrIpGaIWT xuK38YOiOHcSDRlLyWZzainMncQp/ixJdxnqVMTonkVLk0q0b84XzOr4/qlLE5lU i+TsK3PRKdQXgvZ4CebL+srPBwWX1dmgP3VkeBloQbSSenSeELICbFWavn2ml+sF GXi4dO93oNquL/Oy5SwI666T4uNcrRPaS+5X+xSZgBW/y2aQVJVJuNZg6ZP/uWk= =0v2l -----END PGP SIGNATURE----- Merge tag 'v3.6-rc7' into next Linux 3.6-rc7 Requested by David Howells so he can merge his key susbsystem work into my tree with requisite -linus changesets.	2012-09-28 13:37:32 +10:00
J. Bruce Fields	fd51790949	trivial select_parent documentation fix "Search list for X" sounds like you're trying to find X on a list. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-09-27 15:43:08 -07:00
Wei Yongjun	ba39ebb614	ext4: convert to use leXX_add_cpu() Convert cpu_to_leXX(leXX_to_cpu(E1) + E2) to use leXX_add_cpu(). dpatch engine is used to auto generate this patch. (https://github.com/weiyj/dpatch) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-27 09:37:53 -04:00
Carlos Maiolino	6d1ab10e69	ext4: ext4_bread usage audit When ext4_bread() returns NULL and err is set to zero, this means there is no phyical block mapped to the specified logical block number. (Previous to commit `90b0a97323`, err was uninitialized in this case, which caused other problems.) The directory handling routines use ext4_bread() in many places, the fact that ext4_bread() now returns NULL with err set to zero could cause problems since a number of these functions will simply return the value of err if the result of ext4_bread() was the NULL pointer, causing the caller of the function to think that the function was successful. Since directories should never contain holes, this case can only happen if the file system is corrupted. This commit audits all of the callers of ext4_bread(), and makes sure they do the right thing if a hole in a directory is found by ext4_bread(). Some ext4_bread() callers did not need any changes either because they already had its own hole detector paths. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-27 09:31:33 -04:00
Wang Sheng-Hui	cbb4ee830e	ext4: remove redundant offset check in mext_check_arguments() In the check code above, if orig_start != donor_start, we would return -EINVAL. So here, orig_start should be equal with donor_start. Remove the redundant check here. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-27 08:00:01 -04:00
Eric Sandeen	c25f9bc614	ext4: don't clear orphan list on ro mount with errors If the file system contains errors and it is being mounted read-only, don't clear the orphan list. We should minimize changes to the file system if it is mounted read-only. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-26 23:30:12 -04:00
Pavel Shilovsky	f065fd099f	CIFS: Fix possible freed pointer dereference in CIFS_SessSetup Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-26 22:15:24 -05:00
Pavel Shilovsky	4ca3a99ca4	CIFS: Fix possible freed pointer dereference in SMB2_sess_setup and remove redundant (rsp == NULL) checks after SendReceive2. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-26 22:15:18 -05:00
Jan Kara	b794e7a6eb	jbd2: fix assertion failure in commit code due to lacking transaction credits ext4 users of data=journal mode with blocksize < pagesize were occasionally hitting assertion failure in jbd2_journal_commit_transaction() checking whether the transaction has at least as many credits reserved as buffers attached. The core of the problem is that when a file gets truncated, buffers that still need checkpointing or that are attached to the committing transaction are left with buffer_mapped set. When this happens to buffers beyond i_size attached to a page stradding i_size, subsequent write extending the file will see these buffers and as they are mapped (but underlying blocks were freed) things go awry from here. The assertion failure just coincidentally (and in this case luckily as we would start corrupting filesystem) triggers due to journal_head not being properly cleaned up as well. We fix the problem by unmapping buffers if possible (in lots of cases we just need a buffer attached to a transaction as a place holder but it must not be written out anyway). And in one case, we just have to bite the bullet and wait for transaction commit to finish. CC: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Jan Kara <jack@suse.cz>	2012-09-26 23:11:13 -04:00
Pavel Shilovsky	760ad0cac1	CIFS: Make ops->close return void Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-26 22:05:10 -05:00
Djalal Harouni	9b68733273	ext4: release donor reference when EXT4_IOC_MOVE_EXT ioctl fails When the EXT4_IOC_MOVE_EXT ioctl() fails on bigalloc file systems, we should jump to the 'mext_out' label to release the donor file reference. Signed-off-by: Djalal Harouni <tixxdz@opendz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-26 22:58:50 -04:00
Lukas Czerner	aaf7d73e54	ext4: enable FITRIM ioctl on bigalloc file system With a minor tweaks regarding minimum extent size to discard and discarded bytes reporting the FITRIM can be enabled on bigalloc file system and it works without any problem. This patch fixes minlen handling and discarded bytes reporting to take into consideration bigalloc enabled file systems and finally removes the restriction and allow FITRIM to be used on file system with bigalloc feature enabled. Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-26 22:21:21 -04:00
Denys Vlasenko	f34f9d186d	coredump: prevent double-free on an error path in core dumper In !CORE_DUMP_USE_REGSET case, if elf_note_info_init fails to allocate memory for info->fields, it frees already allocated stuff and returns error to its caller, fill_note_info. Which in turn returns error to its caller, elf_core_dump. Which jumps to cleanup label and calls free_note_info, which will happily try to free all info->fields again. BOOM. This is the fix. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> Cc: Venu Byravarasu <vbyravarasu@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2012-09-26 22:20:21 -04:00
Al Viro	63784dd02b	fcntl: fix misannotations __user * != * __user... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 22:20:20 -04:00
Al Viro	2744c171db	ceph: don't abuse d_delete() on failure exits Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 22:20:20 -04:00
Alan Cox	1fe0c0230a	vfs: delete surplus inode NULL check Each iteration of d_delete we reload inode from dentry->d_inode and then call S_ISDIR(inode-i_mode), so inode cannot possibly be NULL shortly afterwards unless something went horribly wrong. Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 22:20:19 -04:00
Al Viro	2903ff019b	switch simple cases of fget_light to fdget Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 22:20:08 -04:00
Jan Kara	b71fc079b5	ext4: fix fdatasync() for files with only i_size changes Code tracking when transaction needs to be committed on fdatasync(2) forgets to handle a situation when only inode's i_size is changed. Thus in such situations fdatasync(2) doesn't force transaction with new i_size to disk and that can result in wrong i_size after a crash. Fix the issue by updating inode's i_datasync_tid whenever its size is updated. CC: <stable@vger.kernel.org> # >= 2.6.32 Reported-by: Kristian Nielsen <knielsen@knielsen-hq.org> Signed-off-by: Jan Kara <jack@suse.cz>	2012-09-26 21:52:20 -04:00
Bernd Schubert	6a08f447fa	ext4: always set i_op in ext4_mknod() ext4_special_inode_operations have their own ifdef CONFIG_EXT4_FS_XATTR to mask those methods. And ext4_iget also always sets it, so there is an inconsistency. Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-09-26 21:24:57 -04:00
Al Viro	2a117354b7	switch o2hb_region_dev_write() to fget_light() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:26 -04:00
Al Viro	7b540d0646	proc_map_files_readdir(): don't bother with grabbing files all we need is their ->f_mode, so just collect _that_ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:26 -04:00
Al Viro	cb0942b812	make get_file() return its argument simplifies a bunch of callers... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:25 -04:00
Al Viro	64e09fa2e1	switch xfs_find_handle() to fget_light() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:11 -04:00
Al Viro	1ea65c9607	switch xfs_swapext() to fget_light() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:11 -04:00
Al Viro	78f7d75e5d	switch coda get_device_index() to fget_light() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:10 -04:00
Al Viro	8319aa9127	switch btrfs_ioctl_clone() to fget_light() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:09 -04:00
Al Viro	4109633f4c	switch timerfd_[sg]ettime(2) to fget_light() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:08 -04:00
Al Viro	5e196a9cf5	switch epoll_wait(2) to fget_light() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:07 -04:00
Al Viro	ecd188159e	switch btrfs_ioctl_snap_create_transid() to fget_light() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:07 -04:00
Al Viro	6bdf295401	switch EXT4_IOC_MOVE_EXT to fget_light() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:06 -04:00
Al Viro	4557c669ef	export fget_light Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:06 -04:00
Al Viro	399c9b862f	ext4: close struct file leak on EXT4_IOC_MOVE_EXT Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:06 -04:00
Al Viro	d6483b7a78	switch fchmod(2) to fget_light() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:03 -04:00
Al Viro	6b48c5b207	switch fallocate(2) to fget_light() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:03 -04:00
Al Viro	bf2965d5b5	switch ftruncate(2) to fget_light Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:02 -04:00
Al Viro	f6d2ac5ca7	namei.c: fix BS comment get_write_access() is needed for nfsd, not binfmt_aout (the latter has no business doing anything of that kind, of course) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:02 -04:00
Al Viro	c6f3d81115	don't leak O_CLOEXEC into ->f_flags Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:01 -04:00
Cyrill Gorcunov	ddd3e0771b	procfs: Convert /proc/pid/fdinfo/ handling routines to seq-file v2 This patch converts /proc/pid/fdinfo/ handling routines to seq-file which is needed to extend seq operations and plug in auxiliary fdinfo provides from subsystems like eventfd/eventpoll/fsnotify. Note the proc_fd_link no longer call for proc_fd_info, simply because the guts of proc_fd_info() got merged into ->show() of that seq_file Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:01 -04:00
Cyrill Gorcunov	faf60af17f	procfs: Move /proc/pid/fd[info] handling code to fd.[ch] This patch prepares the ground for further extension of /proc/pid/fd[info] handling code by moving fdinfo handling code into fs/proc/fd.c. I think such move makes both fs/proc/base.c and fs/proc/fd.c easier to read. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> CC: Al Viro <viro@ZenIV.linux.org.uk> CC: Alexey Dobriyan <adobriyan@gmail.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: James Bottomley <jbottomley@parallels.com> CC: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> CC: Alexey Dobriyan <adobriyan@gmail.com> CC: Matthew Helsley <matt.helsley@gmail.com> CC: "J. Bruce Fields" <bfields@fieldses.org> CC: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:01 -04:00
Al Viro	864bdb3b6c	new helper: daemonize_descriptors() descriptor-related parts of daemonize, done right. As the result we simplify the locking rules for ->files - we hold task_lock in all cases when we modify ->files. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:00 -04:00
Al Viro	179e037fc1	do_coredump(): make sure that descriptor table isn't shared Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:09:59 -04:00
Al Viro	c3c073f808	new helper: iterate_fd() iterates through the opened files in given descriptor table, calling a supplied function; we stop once non-zero is returned. Callback gets struct file , descriptor number and const void argument passed to iterator. It is called with files->file_lock held, so it is not allowed to block. tty_io, netprio_cgroup and selinux flush_unauthorized_files() converted to its use. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:09:59 -04:00
Al Viro	ad47bd7252	make expand_files() and alloc_fd() static no callers outside of fs/file.c left Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:09:58 -04:00
Al Viro	b8318b01a8	take __{set,clear}_{open_fd,close_on_exec}() into fs/file.c nobody uses those outside anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:09:58 -04:00
Al Viro	8280d16172	new helper: replace_fd() analog of dup2(), except that it takes struct file * as source. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:09:57 -04:00
Al Viro	fe17f22d7f	take purely descriptor-related stuff from fcntl.c to file.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:09:57 -04:00
Al Viro	6a6d27de34	take close-on-exec logics to fs/file.c, clean it up a bit ... and add cond_resched() there, while we are at it. We can get large latencies as is... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:09:56 -04:00
Lukas Czerner	63fedaf1c2	ext4: remove unused function ext4_ext_check_cache Remove unused function ext4_ext_check_cache() and merge the code back to the ext4_ext_in_cache(). Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-26 21:09:06 -04:00
Al Viro	483ce1d4b8	take descriptor-related part of close() to file.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:08:56 -04:00
Al Viro	0ee8cdfe6a	take fget() and friends to fs/file.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:08:56 -04:00
Al Viro	f869e8a7f7	expose a low-level variant of fd_install() for binder Similar situation to that of __alloc_fd(); do not use unless you really have to. You should not touch any descriptor table other than your own; it's a sure sign of a really bad API design. As with __alloc_fd(), you must use a first-class reference to struct files_struct; something obtained by get_files_struct(some task) (let alone direct task->files) will not do. It must be either current->files, or obtained by get_files_struct(current) by the owner of that sucker and given to you. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:08:55 -04:00
Al Viro	56007cae94	move put_unused_fd() and fd_install() to fs/file.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:08:55 -04:00
Al Viro	1983e781da	trim free_fdtable_rcu() embedded case isn't hit anymore Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:08:54 -04:00
Al Viro	b9e02af0ae	don't bother with call_rcu() in put_files_struct() At that point nobody can see us anyway; everything that looks at files_fdtable(files) is separated from the guts of put_files_struct(files) - either since files is current->files or because we fetched it under task_lock() and hadn't dropped that yet, or because we'd bumped files->count while holding task_lock()... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:08:54 -04:00
Al Viro	7cf4dc3c8d	move files_struct-related bits from kernel/exit.c to fs/file.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:08:54 -04:00
Al Viro	dcfadfa4ec	new helper: __alloc_fd() Essentially, alloc_fd() in a files_struct we own a reference to. Most of the time wanting to use it is a sign of lousy API design (such as android/binder). It's not a general-purpose interface; better that than open-coding its guts, but again, playing with other process' descriptor table is a sign of bad design. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:08:53 -04:00
Al Viro	f33ff9927f	take rlimit check to callers of expand_files() ... except for one in android, where the check is different and already done in caller. No need to recalculate rlimit many times in alloc_fd() either. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:08:53 -04:00
Al Viro	352e3b2492	fanotify: sanitize failure exits in copy_event_to_user() * do copy_to_user() before prepare_for_access_response(); that kills the need in remove_access_response(). * don't do fd_install() until we are past the last possible failure exit. Don't use sys_close() on cleanup side - just put_unused_fd() and fput(). Less racy that way... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:08:52 -04:00
Al Viro	5b249b1b07	pipe(2) - race-free error recovery don't mess with sys_close() if copy_to_user() fails; just postpone fd_install() until we know it hasn't. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:08:52 -04:00
Al Viro	c921b40d62	autofs4: don't open-code fd_install() The only difference between autofs_dev_ioctl_fd_install() and fd_install() is __set_close_on_exec() done by the latter. Just use get_unused_fd_flags(O_CLOEXEC) to allocate the descriptor and be done with that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:08:51 -04:00
Al Viro	1a7bd2265f	make get_unused_fd_flags() a function ... and get_unused_fd() a macro around it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:08:50 -04:00
Al Viro	5905db5ca0	Merge remote branch 'origin' into for-next	2012-09-26 21:07:20 -04:00
Wei Yongjun	85556c9a50	ext4: use kmem_cache_zalloc instead of kmem_cache_alloc/memset Using kmem_cache_zalloc() instead of kmem_cache_alloc() and memset(). spatch with a semantic match is used to found this problem. (http://coccinelle.lip6.fr/) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-26 20:43:37 -04:00
Carlos Maiolino	2ea0392983	xfs: Make inode32 a remountable option As inode64 is the default option now, and was also made remountable previously, inode32 can also be remounted on-the-fly when it is needed. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-09-26 16:01:28 -05:00
Carlos Maiolino	4056c1d08d	xfs: add inode64->inode32 transition into xfs_set_inode32() To make inode32 a remountable option, xfs_set_inode32() should be able to make a transition from inode64 option, disabling inode allocation on higher AGs. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-09-26 15:59:50 -05:00
Carlos Maiolino	4c0837224c	xfs: Fix mp->m_maxagi update during inode64 remount With the changes made on xfs_set_inode64(), to make it behave as xfs_set_inode32() (now leaving to the caller the responsibility to update mp->m_maxagi), we use the return value of xfs_set_inode64() to update mp->m_maxagi during remount. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-09-26 15:58:21 -05:00
Carlos Maiolino	2d2194f61f	xfs: reduce code duplication handling inode32/64 options Add xfs_set_inode32() to be used to enable inode32 allocation mode. this will reduce the amount of duplicated code needed to mount/remount a filesystem with inode32 option. This patch also changes xfs_set_inode64() to return the maximum AG number that inodes can be allocated instead of set mp->m_maxagi by itself, so that the behaviour is the same as xfs_set_inode32(). This simplifies code that calls these functions and needs to know the maximum AG that inodes can be allocated in. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-09-26 15:56:33 -05:00
Carlos Maiolino	08bf540412	xfs: make inode64 as the default allocation mode since 64-bit inodes can be accessed while using inode32, and these can also be used on 32-bit kernels, there is no reason to still keep inode32 as the default mount option. If the filesystem cannot handle 64bit inode numbers (i.e CONFIG_LBDAF is not enabled and BITS_PER_LONG == 32), XFS_MOUNT_SMALL_INUMS will still be set by default, so inode64 is not an unconditional default value. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-09-26 15:54:19 -05:00
Carlos Maiolino	8aea3ff411	xfs: Fix m_agirotor reset during AG selection xfs_ialloc_next_ag() currently resets m_agirotor when it is equal to m_maxagi: if (++mp->m_agirotor == mp->m_maxagi) mp->m_agirotor = 0; But, if for some reason mp->m_maxagi changes to a lower value than current m_agirotor, this condition will never be true, causing m_agirotor to exceed the maximum allowed value (m_maxagi). This implies mainly during lookups for xfs_perag structs in its radix tree, since the agno value used for the lookup is based on m_agirotor. An out-of-range m_agirotor may cause a lookup failure which in case will return NULL. As an example, the value of m_maxagi is decreased during inode64->inode32 remount process, case where I've found this problem. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-09-26 15:42:42 -05:00
Carlos Maiolino	c3a58fecdd	Make inode64 a remountable option Actually, there is no reason about why a user must umount and mount a XFS filesystem to enable 'inode64' option. So, this patch makes this a remountable option. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-09-26 15:41:39 -05:00
Jeff Layton	4f2b86aba8	cifs: change DOS/NT/POSIX mapping of ERRnoresource ERRnoresource is an ERRSRV level (aka server-side) error and means "No resources currently available for request". Currently that maps to POSIX -ENOBUFS. No NT errors map to it currently. NT_STATUS_INSUFFICIENT_RESOURCES and NT_STATUS_INSUFF_SERVER_RESOURCES are also similar in meaning. Currently the client maps those to ERRnomem, which maps to -ENOMEM in POSIX. All of these mappings seem to be quite wrong to me and are confusing for users. All of the above errors indicate problems on the server, not the client. Reporting -ENOMEM or -ENOBUFS implies that the client is running out of resources. This patch changes those mappings. The NT_* errors are changed to map to the SRV level ERRnoresource. That error is in turn changed to return -EREMOTEIO which is the only POSIX error I could find that conveys that something went wrong on the server. While we're at it, change the SMB2 equivalent error to return the same. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Suresh Jayaraman <sjayaraman@suse.com> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-26 12:55:42 -05:00
Dmitry Monakhov	8c85447391	ext4: reimplement uninit extent optimization for move_extent_per_page() Uninitialized extent may became initialized(parallel writeback task) at any moment after we drop i_data_sem, so we have to recheck extent's state after we hold page's lock and i_data_sem. If we about to change page's mapping we must hold page's lock in order to serialize other users. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-26 12:54:52 -04:00
Dmitry Monakhov	bb55748805	ext4: clean up online defrag bugs in move_extent_per_page() Non-full list of bugs: 1) uninitialized extent optimization does not hold page's lock, and simply replace brunches after that writeback code goes crazy because block mapping changed under it's feets kernel BUG at fs/ext4/inode.c:1434! ( 288'th xfstress) 2) uninitialized extent may became initialized right after we drop i_data_sem, so extent state must be rechecked 3) Locked pages goes uptodate via following sequence: ->readpage(page); lock_page(page); use_that_page(page) But after readpage() one may invalidate it because it is uptodate and unlocked (reclaimer does that) As result kernel bug at include/linux/buffer_head.c:133! 4) We call write_begin() with already opened stansaction which result in following deadlock: ->move_extent_per_page() ->ext4_journal_start()-> hold journal transaction ->write_begin() ->ext4_da_write_begin() ->ext4_nonda_switch() ->writeback_inodes_sb_if_idle() --> will wait for journal_stop() 5) try_to_release_page() may fail and it does fail if one of page's bh was pinned by journal 6) If we about to change page's mapping we MUST hold it's lock during entire remapping procedure, this is true for both pages(original and donor one) Fixes: - Avoid (1) and (2) simply by temproraly drop uninitialized extent handling optimization, this will be reimplemented later. - Fix (3) by manually forcing page to uptodate state w/o dropping it's lock - Fix (4) by rearranging existing locking: from: journal_start(); ->write_begin to: write_begin(); journal_extend() - Fix (5) simply by checking retvalue - Fix (6) by locking both (original and donor one) pages during extent swap with help of mext_page_double_lock() Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-26 12:52:07 -04:00
Trond Myklebust	13fe4ba1b6	NFSv4.1: decode_getdeviceinfo should check xdr_read_pages() return value Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-26 12:43:10 -04:00
Dmitry Monakhov	f066055a34	ext4: online defrag is not supported for journaled files Proper block swap for inodes with full journaling enabled is truly non obvious task. In order to be on a safe side let's explicitly disable it for now. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-09-26 12:32:54 -04:00
Dmitry Monakhov	03bd8b9b89	ext4: move_extent code cleanup - Remove usless checks, because it is too late to check that inode != NULL at the moment it was referenced several times. - Double lock routines looks very ugly and locking ordering relays on order of i_ino, but other kernel code rely on order of pointers. Let's make them simple and clean. - check that inodes belongs to the same SB as soon as possible. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-09-26 12:32:19 -04:00
Fengguang Wu	3eab7315c8	fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared blkdev_mmap() isn't used outside of fs/block_dev.c, mark it as static. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-26 09:57:55 +02:00
Mikulas Patocka	62ac665ff9	blockdev: turn a rw semaphore into a percpu rw semaphore This avoids cache line bouncing when many processes lock the semaphore for read. New percpu lock implementation The lock consists of an array of percpu unsigned integers, a boolean variable and a mutex. When we take the lock for read, we enter rcu read section, check for a "locked" variable. If it is false, we increase a percpu counter on the current cpu and exit the rcu section. If "locked" is true, we exit the rcu section, take the mutex and drop it (this waits until a writer finished) and retry. Unlocking for read just decreases percpu variable. Note that we can unlock on a difference cpu than where we locked, in this case the counter underflows. The sum of all percpu counters represents the number of processes that hold the lock for read. When we need to lock for write, we take the mutex, set "locked" variable to true and synchronize rcu. Since RCU has been synchronized, no processes can create new read locks. We wait until the sum of percpu counters is zero - when it is, there are no readers in the critical section. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-26 07:46:43 +02:00
Mikulas Patocka	b87570f5d3	Fix a crash when block device is read and block size is changed at the same time The kernel may crash when block size is changed and I/O is issued simultaneously. Because some subsystems (udev or lvm) may read any block device anytime, the bug actually puts any code that changes a block device size in jeopardy. The crash can be reproduced if you place "msleep(1000)" to blkdev_get_blocks just before "bh->b_size = max_blocks << inode->i_blkbits;". Then, run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct" While it is waiting in msleep, run "blockdev --setbsz 2048 /dev/ram0" You get a BUG. The direct and non-direct I/O is written with the assumption that block size does not change. It doesn't seem practical to fix these crashes one-by-one there may be many crash possibilities when block size changes at a certain place and it is impossible to find them all and verify the code. This patch introduces a new rw-lock bd_block_size_semaphore. The lock is taken for read during I/O. It is taken for write when changing block size. Consequently, block size can't be changed while I/O is being submitted. For asynchronous I/O, the patch only prevents block size change while the I/O is being submitted. The block size can change when the I/O is in progress or when the I/O is being finished. This is acceptable because there are no accesses to block size when asynchronous I/O is being finished. The patch prevents block size changing while the device is mapped with mmap. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-26 07:46:40 +02:00
Tao Ma	0acdb8876f	ext4: don't call update_backups() multiple times for the same bg When performing an online resize, we add a bunch of groups at one time in ext4_flex_group_add, so in most cases a lot of group descriptors will be in the same group block. But in the end of this function, update_backups will be called for every group descriptor and the same block will be copied and journalled again and again. It is really a waste. Fix things so we only update a particular bg descriptor block once and skip subsequent updates of the same block. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-26 00:08:57 -04:00
Dmitry Monakhov	7f1468d1d5	ext4: fix double unlock buffer mess during fs-resize bh_submit_read() is responsible for unlock bh on endio. In addition, we need to use bh_uptodate_or_lock() to avoid races. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-25 23:19:25 -04:00
Jaeden Amero	142e5460a6	compat_ioctl: Avoid using undefined RS-485 IOCTLs Wrap the use of TIOCSRS485 and TIOCGRS485 in #ifdef so that we avoid adding undefined IOCTLs to the ioctl pointer list as compatible ioctls. This change was motivated by a build error on a MIPS build. tree: git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git tty-next head: `ac57e7f38e` commit: `84c3b84860` [10/16] compat_ioctl: Add RS-485 IOCTLs to the list config: mips-fuloong2e_defconfig All related error/warning messages: fs/compat_ioctl.c:869:1: error: 'TIOCSRS485' undeclared here (not in a function) fs/compat_ioctl.c:870:1: error: 'TIOCGRS485' undeclared here (not in a function) vim +869 fs/compat_ioctl.c 863 COMPATIBLE_IOCTL(TIOCSPGRP) 864 COMPATIBLE_IOCTL(TIOCGPGRP) 865 COMPATIBLE_IOCTL(TIOCGPTN) 866 COMPATIBLE_IOCTL(TIOCSPTLCK) 867 COMPATIBLE_IOCTL(TIOCSERGETLSR) 868 COMPATIBLE_IOCTL(TIOCSIG) > 869 COMPATIBLE_IOCTL(TIOCSRS485) 870 COMPATIBLE_IOCTL(TIOCGRS485) 871 #ifdef TCGETS2 872 COMPATIBLE_IOCTL(TCGETS2) Reported-by: Fengguang Wu <fengguang.wu@intel.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jaeden Amero <jaeden.amero@ni.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2012-09-25 15:40:56 -07:00
J. Bruce Fields	6e67b5d184	nfsd4: fix bind_conn_to_session xdr comment Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-09-25 13:26:42 -04:00
NeilBrown	62d98c9354	NFS4: avoid underflow when converting error to pointer. In nfs4_create_sec_client, 'flavor' can hold a negative error code (returned from nfs4_negotiate_security), even though it is an 'enum' and hence unsigned. The code is careful to cast it to an (int) before testing if it is negative, however it doesn't cast to an (int) before calling ERR_PTR. On a machine where "void*" is larger than "int", this results in the unsigned equivalent of -1 (e.g. 0xffffffff) being converted to a pointer. Subsequent code determines that this is not negative, and so dereferences it with predictable results. So: cast 'flavor' to a (signed) int before passing to ERR_PTR. cc: Benny Halevy <bhalevy@tonian.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-25 10:38:54 -04:00
Wei Yongjun	e8d920c58d	NFS: fix the return value check by using IS_ERR In case of error, the function rpcauth_create() returns ERR_PTR() and never returns NULL pointer. The NULL test in the return value check should be replaced with IS_ERR(). dpatch engine is used to auto generated this patch. (https://github.com/weiyj/dpatch) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-09-25 10:36:37 -04:00
Jeff Layton	1b35920490	cifs: remove support for deprecated "forcedirectio" and "strictcache" mount options ...and make the default cache=strict as promised for 3.7. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:35 -05:00
Jeff Layton	52b0c3427e	cifs: remove support for CIFS_IOC_CHECKUMOUNT ioctl ...as promised for 3.7. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:35 -05:00
Pavel Shilovsky	e5d0488719	CIFS: Fix possible memory leaks in SMB2 code and add missed increments of failed async read and write requests. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:34 -05:00
Pavel Shilovsky	e4e3703555	CIFS: Fix endian conversion of IndexNumber by making it __le64 rather than __u64 in FILE_AL_INFO structure. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:34 -05:00
Steve French	12e8a20824	Trivial endian fixes Some trivial endian fixes for the SMB2 code. One warning remains which I asked Pavel to look at. Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:34 -05:00
Steve French	ba02e89915	MARK SMB2 support EXPERIMENTAL Now that the merge of the remaining pieces needed for SMB2 (SMB2.1 dialect) are in, and most test cases pass, we can consider SMB2.1 EXPERIMENTAL rather than "BROKEN." Reviewed-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:34 -05:00
Steve French	5efeb09707	Update cifs version number With SMB2 support, update from version 1.79 to 2.0 to make it easier for users to recognize which version has SMB2 support. Signed-off-by: Steve French <sfrench@us.ibm.com> Reviewed-by: Jeff Layton <jlayton@redhat.com>	2012-09-24 21:46:34 -05:00
Jeff Layton	3d6d854a13	cifs: add FL_CLOSE to fl_flags mask in cifs_read_flock FL_CLOSE is quite common when you close a file on which you hold a lock. The spurious "Unknown lock flags" message in cFYI is confusing in this case. Reported-by: Alexander Bokovoy <abokovoy@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <sfrench@us.ibm.com>	2012-09-24 21:46:34 -05:00
Sachin Prabhu	ecdb69e2cc	cifs: Mangle string used for unc in /proc/mounts The string for "unc=" in /proc/mounts needs to be escaped. The current behaviour can create problems in cases when mounting a share starting with a number. example: >mount -t cifs -o username=test,password=x vm140-31:/17000-test /mnt >mount -o remount,password=x /mnt mount error: could not resolve address for vm140-31x00-test: Unknown error The sub-string "\170" which is part of the unc for the mount above in /proc/mounts is interpreted as character'x' in the case above. Escaping the string fixes the problem. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2012-09-24 21:46:34 -05:00
Jeff Layton	101b92d959	cifs: cleanups for cifs_mkdir_qinfo Rename inode pointers for better clarity. Move the d_instantiate call to the end of the function to prevent other tasks from seeing it before we've finished constructing it. Since we should have exclusive access to the inode at this point, remove the spinlock around i_nlink update. Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2012-09-24 21:46:34 -05:00
Pavel Shilovsky	233839b1df	CIFS: Fix fast lease break after open problem Now we walk though cifsFileInfo's list for every incoming lease break and look for an equivalent there. That approach misses lease breaks that come just after an open response - we don't have time to populate new cifsFileInfo structure to the list. Fix this by adding new list of pending opens and look for a lease there if we didn't find it in the list of cifsFileInfo structures. Signed-off-by: Pavel Shilovsky <pshilovsky@etersoft.ru> Signed-off-by: Steve French <sfrench@us.ibm.com>	2012-09-24 21:46:33 -05:00
Pavel Shilovsky	0822f51426	CIFS: Add SMB2.1 lease break support Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <sfrench@us.ibm.com>	2012-09-24 21:46:33 -05:00
Pavel Shilovsky	25078105fb	CIFS: Fix cache coherency for read oplock case When we have a file opened with read oplock and we are writing a data to this file, we need to store the data in the cache and then send to the server to ensure that the next read operation will get a coherent data. Also mark it as CONFIG_CIFS_SMB2 because it's more suitable for SMB2 code but can fix some CIFS problems too (when server delays sending an oplock break after a write request). We can drop this ifdefs dependence in future. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <sfrench@us.ibm.com>	2012-09-24 21:46:33 -05:00
Pavel Shilovsky	b8c32dbb0d	CIFS: Request SMB2.1 leases if server supports them and we need oplocks. Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2012-09-24 21:46:33 -05:00
Pavel Shilovsky	579f905323	CIFS: Check for mandatory brlocks on read/write Currently CIFS code accept read/write ops on mandatory locked area when two processes use the same file descriptor - it's wrong. Fix this by serializing io and brlock operations on the inode. Signed-off-by: Pavel Shilovsky <pshilovsky@etersoft.ru> Signed-off-by: Steve French <sfrench@us.ibm.com>	2012-09-24 21:46:33 -05:00
Pavel Shilovsky	1b4b55a1d9	CIFS: Turn lock mutex into rw semaphore and allow several processes to walk through the lock list and read can_cache_brlcks value if they are not going to modify them. Signed-off-by: Pavel Shilovsky <pshilovsky@etersoft.ru> Signed-off-by: Steve French <sfrench@us.ibm.com>	2012-09-24 21:46:33 -05:00
Pavel Shilovsky	b140799a11	CIFS: Use brlock cache for SMB2 Signed-off-by: Pavel Shilovsky <pshilovsky@etersoft.ru> Signed-off-by: Steve French <sfrench@us.ibm.com>	2012-09-24 21:46:33 -05:00
Pavel Shilovsky	f7ba7fe685	CIFS: Add brlock support for SMB2 Signed-off-by: Pavel Shilovsky <pshilovsky@etersoft.ru>	2012-09-24 21:46:33 -05:00
Pavel Shilovsky	027e8eec31	CIFS: Handle SMB2 lock flags Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>	2012-09-24 21:46:32 -05:00
Pavel Shilovsky	d39a4f710b	CIFS: Move brlock code to ops struct Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>	2012-09-24 21:46:32 -05:00
Pavel Shilovsky	f45d34167c	CIFS: Remove spinlock dependence in brlock processing Now we need to lock/unlock a spinlock while processing brlock ops on the inode. Move brlocks of a fid to a separate list and attach all such lists to the inode. This let us not hold a spinlock. Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>	2012-09-24 21:46:32 -05:00
Pavel Shilovsky	1c0bd60b56	CIFS: Add NTLMSSP sec type to defaults to let us negotiate SMB2 without specifying sec type explicitly. Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>	2012-09-24 21:46:32 -05:00
Jeff Layton	71953fc6e4	cifs: remove kmap lock and rsize limit Now that we aren't abusing the kmap address space, there's no need for this lock or to impose a limit on the rsize. Signed-off-by: Jeff Layton <jlayton@redhat.com>	2012-09-24 21:46:32 -05:00
Jeff Layton	5819575ec6	cifs: replace kvec array in readdata with a single kvec The array is no longer needed. We just need a single kvec to hold the header for signature checking. Signed-off-by: Jeff Layton <jlayton@redhat.com>	2012-09-24 21:46:32 -05:00
Jeff Layton	8321fec436	cifs: convert async read code to use pages array without kmapping Replace the "marshal_iov" function with a "read_into_pages" function. That function will copy the read data off the socket and into the pages array, kmapping and reading pages one at a time. Signed-off-by: Jeff Layton <jlayton@redhat.com>	2012-09-24 21:46:32 -05:00
Jeff Layton	c5fab6f4f0	cifs: turn the pages list in cifs_readdata into an array We'll need an array to put into a smb_rqst, so convert this into an array instead of (ab)using the lru list_head. Signed-off-by: Jeff Layton <jlayton@redhat.com>	2012-09-24 21:46:32 -05:00
Jeff Layton	f4e49cd2dc	cifs: allocate kvec array for cifs_readdata as a separate allocation Eventually, we're going to want to append a list of pages to cifs_readdata instead of a list of kvecs. To prepare for that, turn the kvec array allocation into a separate one and just keep a pointer to it in the readdata. Signed-off-by: Jeff Layton <jlayton@redhat.com>	2012-09-24 21:46:31 -05:00
Jeff Layton	67c1f52951	cifs: add deprecation warning to sockopt=TCP_NODELAY option Now that we're using TCP_CORK on the socket, there's no value in continuting to support this option. Schedule it for removal in 3.9. Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2012-09-24 21:46:31 -05:00
Jeff Layton	c84ce4a7b2	cifs: remove the kmap size limit from wsize Now that we're not kmapping so much at once, there's no need to cap the wsize at the amount that can be simultaneously kmapped. Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:31 -05:00
Jeff Layton	eddb079deb	cifs: convert async write code to pass in data via rq_pages array Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:31 -05:00
Jeff Layton	fec344e3f3	cifs: change cifs_call_async to use smb_rqst structs For now, none of the callers populate rq_pages. That will be done for writes in a later patch. While we're at it, change the prototype of setup_async_request not to need a return pointer argument. Just return the pointer to the mid_q_entry or an ERR_PTR. Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:31 -05:00
Jeff Layton	fb308a6f22	cifs: teach signing routines how to deal with arrays of pages in a smb_rqst Use the smb_send_rqst helper function to kmap each page in the array and update the hash for that chunk. Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:31 -05:00
Jeff Layton	97bc00b394	cifs: teach smb_send_rqst how to handle arrays of pages Add code that allows smb_send_rqst to send an array of pages after the initial kvec array has been sent. For now, we simply kmap the page array and send it using the standard smb_send_kvec function. Eventually, we may want to convert this code to use kernel_sendpage under the hood and avoid the kmap altogether for the page data. Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:31 -05:00
Jeff Layton	b8eed28375	cifs: cork the socket before a send and uncork it afterward We want to send SMBs as "atomically" as possible. Prior to sending any data on the socket, cork it to make sure that no non-full frames go out. Afterward, uncork it to make sure all of the data gets pushed out to the wire. Note that this more or less renders the socket=TCP_NODELAY mount option obsolete. When TCP_CORK and TCP_NODELAY are used on the same socket, TCP_NODELAY is essentially ignored. Acked-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:31 -05:00
Jeff Layton	6f49f46b18	cifs: convert send code to use smb_rqst structs Again, just a change in the arguments and some function renaming here. In later patches, we'll change this code to deal with page arrays. In this patch, we add a new smb_send_rqst wrapper and have smb_sendv call that. Then we move most of the existing smb_sendv code into a new function -- smb_send_kvec. This seems a little redundant, but later we'll flesh this out to deal with arrays of pages. Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:31 -05:00
Jeff Layton	0b688cfc8b	cifs: change smb2 signing routines to use smb_rqst structs Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:30 -05:00
Jeff Layton	bf5ea0e2f2	cifs: change signing routines to deal with smb_rqst structs We need a way to represent a call to be sent on the wire that does not require having all of the page data kmapped. Behold the smb_rqst struct. This new struct represents an array of kvecs immediately followed by an array of pages. Convert the signing routines to use these structs under the hood and turn the existing functions for this into wrappers around that. For now, we're just changing these functions to take different args. Later, we'll teach them how to deal with arrays of pages. Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:30 -05:00
Pavel Shilovsky	6fc05c25ca	CIFS: Add statfs support for SMB2 Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:30 -05:00
Pavel Shilovsky	76ec5e3384	CIFS: Move statfs to ops struct Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:30 -05:00
Pavel Shilovsky	983c88a497	CIFS: Add oplock break support for SMB2 Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:30 -05:00
Pavel Shilovsky	95a3f2f377	CIFS: Move oplock break to ops struct Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:30 -05:00
Pavel Shilovsky	2e44b28878	CIFS: Process oplocks for SMB2 Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:30 -05:00
Pavel Shilovsky	d324f08d6a	CIFS: Add readdir support for SMB2 Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:30 -05:00
Pavel Shilovsky	92fc65a74a	CIFS: Move readdir code to ops struct Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:29 -05:00
Pavel Shilovsky	1feeaac753	CIFS: Add set_file_info support for SMB2 Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:29 -05:00
Pavel Shilovsky	6bdf6dbd66	CIFS: Move set_file_info to ops struct Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:29 -05:00
Pavel Shilovsky	c839ff244b	CIFS: Add SMB2 support for set_file_size Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:29 -05:00
Pavel Shilovsky	d143341815	CIFS: Move set_file_size to ops struct Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:29 -05:00
Pavel Shilovsky	568798cc62	CIFS: Add SMB2 support for hardlink operation Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:29 -05:00
Steve French	d6e906f1b5	CIFS: Move hardlink to ops struct Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Steve French <smfrench@gmail.com>	2012-09-24 21:46:29 -05:00

... 4 5 6 7 8 ...

29104 Commits