linux

Author	SHA1	Message	Date
Nate Diller	01f2705daf	fs: convert core functions to zero_user_page It's very common for file systems to need to zero part or all of a page, the simplist way is just to use kmap_atomic() and memset(). There's actually a library function in include/linux/highmem.h that does exactly that, but it's confusingly named memclear_highpage_flush(), which is descriptive of how it does the work rather than what the purpose is. So this patchset renames the function to zero_user_page(), and calls it from the various places that currently open code it. This first patch introduces the new function call, and converts all the core kernel callsites, both the open-coded ones and the old memclear_highpage_flush() ones. Following this patch is a series of conversions for each file system individually, per AKPM, and finally a patch deprecating the old call. The diffstat below shows the entire patchset. [akpm@linux-foundation.org: fix a few things] Signed-off-by: Nate Diller <nate.diller@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:55 -07:00
NeilBrown	b41eeef14d	knfsd: avoid Oops if buggy userspace performs confusing filehandle->dentry mapping When a lookup request arrives, nfsd uses information provided by userspace (mountd) to find the right filesystem. It then assumes that the same filehandle type as the incoming filehandle can be used to create an outgoing filehandle. However if mountd is buggy, or maybe just being creative, the filesystem may not support that filesystem type, and the kernel could oops, particularly if 'ex_uuid' is NULL but a FSID_UUID* filehandle type is used. So add some proper checking that the fsid version/type from the incoming filehandle is actually supportable, and ignore that information if it isn't supportable. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:54 -07:00
NeilBrown	072f62ed85	knfsd: various nfsd xdr cleanups 1/ decode_sattr and decode_sattr3 never return NULL, so remove several checks for that. ditto for xdr_decode_hyper. 2/ replace some open coded XDR_QUADLEN calls with calls to XDR_QUADLEN 3/ in decode_writeargs, simply an 'if' to use a single calculation. .page_len is the length of that part of the packet that did not fit in the first page (the head). So the length of the data part is the remainder of the head, plus page_len. 3/ other minor cleanups. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:54 -07:00
Christoph Hellwig	f725b217b1	knfsd: trivial makefile cleanup kbuild directly interprets <modulename>-y as objects to build into a module, no need to assign it to the old foo-objs variable. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:54 -07:00
NeilBrown	402acd29e5	knfsd: avoid use of unitialised variables on error path when nfs exports We need to zero various parts of 'exp' before any 'goto out', otherwise when we go to free the contents... we die. Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:54 -07:00
Jeff Layton	cd123012d9	RPC: add wrapper for svc_reserve to account for checksum When the kernel calls svc_reserve to downsize the expected size of an RPC reply, it fails to account for the possibility of a checksum at the end of the packet. If a client mounts a NFSv2/3 with sec=krb5i/p, and does I/O then you'll generally see messages similar to this in the server's ring buffer: RPC request reserved 164 but used 208 While I was never able to verify it, I suspect that this problem is also the root cause of some oopses I've seen under these conditions: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=227726 This is probably also a problem for other sec= types and for NFSv4. The large reserved size for NFSv4 compound packets seems to generally paper over the problem, however. This patch adds a wrapper for svc_reserve that accounts for the possibility of a checksum. It also fixes up the appropriate callers of svc_reserve to call the wrapper. For now, it just uses a hardcoded value that I determined via testing. That value may need to be revised upward as things change, or we may want to eventually add a new auth_op that attempts to calculate this somehow. Unfortunately, there doesn't seem to be a good way to reliably determine the expected checksum length prior to actually calculating it, particularly with schemes like spkm3. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Neil Brown <neilb@suse.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Acked-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:54 -07:00
Eric W. Biederman	6697164335	nfsd/nfs4state: remove unnecessary daemonize call Acked-by: Neil Brown <neilb@suse.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:54 -07:00
Peter Staubach	f34b95689d	The NFSv2/NFSv3 server does not handle zero length WRITE requests correctly The NFSv2 and NFSv3 servers do not handle WRITE requests for 0 bytes correctly. The specifications indicate that the server should accept the request, but it should mostly turn into a no-op. Currently, the server will return an XDR decode error, which it should not. Attached is a patch which addresses this issue. It also adds some boundary checking to ensure that the request contains as much data as was requested to be written. It also correctly handles an NFSv3 request which requests to write more data than the server has stated that it is prepared to handle. Previously, there was some support which looked like it should work, but wasn't quite right. Signed-off-by: Peter Staubach <staubach@redhat.com> Acked-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:54 -07:00
Adrian Bunk	8842c9655b	remove nfs4_acl_add_ace() nfs4_acl_add_ace() can now be removed. Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Neil Brown <neilb@cse.unsw.edu.au> Acked-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:54 -07:00
Oleg Nesterov	28e53bddf8	unify flush_work/flush_work_keventd and rename it to cancel_work_sync flush_work(wq, work) doesn't need the first parameter, we can use cwq->wq (this was possible from the very beginnig, I missed this). So we can unify flush_work_keventd and flush_work. Also, rename flush_work() to cancel_work_sync() and fix all callers. Perhaps this is not the best name, but "flush_work" is really bad. (akpm: this is why the earlier patches bypassed maintainers) Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Jeff Garzik <jeff@garzik.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Tejun Heo <htejun@gmail.com> Cc: Auke Kok <auke-jan.h.kok@intel.com>, Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:53 -07:00
Andrew Morton	a9df62c758	aio: use flush_work() Migrate AIO over to use flush_work(). Cc: "Maciej W. Rozycki" <macro@linux-mips.org> Cc: David Howells <dhowells@redhat.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:51 -07:00
David Howells	31143d5d51	AFS: implement basic file write support Implement support for writing to regular AFS files, including: (1) write (2) truncate (3) fsync, fdatasync (4) chmod, chown, chgrp, utime. AFS writeback attempts to batch writes into as chunks as large as it can manage up to the point that it writes back 65535 pages in one chunk or it meets a locked page. Furthermore, if a page has been written to using a particular key, then should another write to that page use some other key, the first write will be flushed before the second is allowed to take place. If the first write fails due to a security error, then the page will be scrapped and reread before the second write takes place. If a page is dirty and the callback on it is broken by the server, then the dirty data is not discarded (same behaviour as NFS). Shared-writable mappings are not supported by this patch. [akpm@linux-foundation.org: fix a bunch of warnings] Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:50 -07:00
David Howells	416351f28d	AFS: AFS fixups Make some miscellaneous changes to the AFS filesystem: (1) Assert RCU barriers on module exit to make sure RCU has finished with callbacks in this module. (2) Correctly handle the AFS server returning a zero-length read. (3) Split out data zapping calls into one function (afs_zap_data). (4) Rename some afs_file_() functions to afs_() where they apply to non-regular files too. (5) Be consistent about the presentation of volume ID:vnode ID in debugging output. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:50 -07:00
Josef 'Jeff' Sipek	2dfdd266b9	fs: use path_walk in do_path_lookup Since path_walk sets the total_link_count to 0 and calls link_path_walk, we can just call path_walk directly. Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:50 -07:00
Josef 'Jeff' Sipek	62ce39c531	fs: fix indentation in do_path_lookup Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:49 -07:00
Akinobu Mita	92f4c701aa	use simple_read_from_buffer() in fs/ Cleanup using simple_read_from_buffer() in binfmt_misc, configfs, and sysfs. Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Joel Becker <joel.becker@oracle.com> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:49 -07:00
Uwe Kleine-König	5886269962	fix file specification in comments Many files include the filename at the beginning, serveral used a wrong one. Signed-off-by: Uwe Kleine-König <ukleinek@informatik.uni-freiburg.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>	2007-05-09 08:58:16 +02:00
Alexander E. Patrakov	148e423f90	Remove obsolete fat_cvf help text The text removed by the following patch refers to functionality that never worked, to non-existing documentation file, and to mount options marked as obsolete in the module. Signed-off-by: Alexander E. Patrakov <patrakov@ums.usu.ru> Signed-off-by: Adrian Bunk <bunk@stusta.de>	2007-05-09 08:58:15 +02:00
Michael Opdenacker	59c51591a0	Fix occurrences of "the the " Signed-off-by: Michael Opdenacker <michael@free-electrons.com> Signed-off-by: Adrian Bunk <bunk@stusta.de>	2007-05-09 08:57:56 +02:00
Robert P. J. Day	beb7dd86a1	Fix misspellings collected by members of KJ list. Fix the misspellings of "propogate", "writting" and (oh, the shame :-) "kenrel" in the source tree. Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Adrian Bunk <bunk@stusta.de>	2007-05-09 07:14:03 +02:00
WANG Cong	ccf6780dc3	Style fix in fs/select.c Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Adrian Bunk <bunk@stusta.de>	2007-05-09 07:10:02 +02:00
Ronni Nielsen	0f8952c2fa	fs/libfs.c: >80 columns line break fix Signed-off-by: Ronni Nielsen <theronni@gmail.com> Signed-off-by: Adrian Bunk <bunk@stusta.de>	2007-05-09 06:44:57 +02:00
David Rientjes	4b8df8915a	smaps: only define clear_refs for CONFIG_MMU /proc/pid/clear_refs is only defined in the CONFIG_MMU case, so make sure we don't have any references to clear_refs_smap() in generic procfs code. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 20:41:14 -07:00
Linus Torvalds	7b82dc0e64	Remove suid/sgid bits on [f]truncate() .. to match what we do on write(). This way, people who write to files by using [f]truncate + writable mmap have the same semantics as if they were using the write() family of system calls. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 20:10:00 -07:00
Linus Torvalds	60c9b2746f	Merge git://oss.sgi.com:8090/xfs/xfs-2.6 * git://oss.sgi.com:8090/xfs/xfs-2.6: [XFS] Add lockdep support for XFS [XFS] Fix race in xfs_write() b/w dmapi callout and direct I/O checks. [XFS] Get rid of redundant "required" in msg. [XFS] Export via a function xfs_buftarg_list for use by kdb/xfsidbg. [XFS] Remove unused ilen variable and references. [XFS] Fix to prevent the notorious 'NULL files' problem after a crash. [XFS] Fix race condition in xfs_write(). [XFS] Fix uquota and oquota enforcement problems. [XFS] propogate return codes from flush routines [XFS] Fix quotaon syscall failures for group enforcement requests. [XFS] Invalidate quotacheck when mounting without a quota type. [XFS] reducing the number of random number functions. [XFS] remove more misc. unused args [XFS] the "aendp" arg to xfs_dir2_data_freescan is always NULL, remove it. [XFS] The last argument "lsn" of xfs_trans_commit() is always called with	2007-05-08 11:59:33 -07:00
Linus Torvalds	02a93208ed	Merge branch 'for-2.6.22' of git://git.kernel.dk/data/git/linux-2.6-block * 'for-2.6.22' of git://git.kernel.dk/data/git/linux-2.6-block: [PATCH] ll_rw_blk: fix missing bounce in blk_rq_map_kern() [PATCH] splice: always call into page_cache_readahead() [PATCH] splice(): fix interaction with readahead	2007-05-08 11:34:52 -07:00
Linus Torvalds	18062a91d2	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6: JFS: Fix race waking up jfsIO kernel thread JFS: use __set_current_state() Copy i_flags to jfs inode flags on write JFS: document uid, gid, and umask mount options in jfs.txt	2007-05-08 11:32:30 -07:00
Dmitriy Monakhov	951744fea0	udf: possible null pointer dereference while load_partition sb_read may return NULL, let's explicitly check it. Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org> Cc: Jan Kara <jack@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:22 -07:00
Jan Kara	31170b6ad4	udf: support files larger than 1G Make UDF work correctly for files larger than 1GB. As no extent can be longer than (1<<30)-blocksize bytes, we have to create several extents if a big hole is being created. As a side-effect, we now don't discard preallocated blocks when creating a hole. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:21 -07:00
Jan Kara	948b9b2c96	udf: add assertions Add a few assertions into udf_discard_prealloc() to check that the file is sane (mostly helps debugging further patches ;). Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:21 -07:00
Jan Kara	3bf25cb40d	udf: use get_bh() Make UDF use get_bh() instead of directly accessing b_count and use brelse() instead of udf_release_data() which does just brelse()... Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:21 -07:00
Jan Kara	ff116fc8d1	UDF: introduce struct extent_position Introduce a structure extent_position to store a position of an extent and the corresponding buffer_head in one place. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:21 -07:00
Jan Kara	60448b1d6d	udf: use sector_t and loff_t for file offsets Use sector_t and loff_t for file offsets in UDF filesystem. Otherwise an overflow may occur for long files. Also make inode_bmap() return offset in the extent in number of blocks instead of number of bytes - for most callers this is more convenient. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:21 -07:00
Peter Zijlstra	277866a0e3	nfs: fix congestion control: use atomic_longs Change the atomic_t in struct nfs_server to atomic_long_t in anticipation of machines that can handle 8+TB of (4K) pages under writeback. However I suspect other things in NFS will start going bang by then. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:21 -07:00
Ulrich Drepper	1c710c896e	utimensat implementation Implement utimensat(2) which is an extension to futimesat(2) in that it a) supports nano-second resolution for the timestamps b) allows to selectively ignore the atime/mtime value c) allows to selectively use the current time for either atime or mtime d) supports changing the atime/mtime of a symlink itself along the lines of the BSD lutimes(3) functions For this change the internally used do_utimes() functions was changed to accept a timespec time value and an additional flags parameter. Additionally the sys_utime function was changed to match compat_sys_utime which already use do_utimes instead of duplicating the work. Also, the completely missing futimensat() functionality is added. We have such a function in glibc but we have to resort to using /proc/self/fd/* which not everybody likes (chroot etc). Test application (the syscall number will need per-arch editing): #include <errno.h> #include <fcntl.h> #include <time.h> #include <sys/time.h> #include <stddef.h> #include <syscall.h> #define __NR_utimensat 280 #define UTIME_NOW ((1l << 30) - 1l) #define UTIME_OMIT ((1l << 30) - 2l) int main(void) { int status = 0; int fd = open("ttt", O_RDWR\|O_CREAT\|O_EXCL, 0666); if (fd == -1) error (1, errno, "failed to create test file \"ttt\""); struct stat64 st1; if (fstat64 (fd, &st1) != 0) error (1, errno, "fstat failed"); struct timespec t[2]; t[0].tv_sec = 0; t[0].tv_nsec = 0; t[1].tv_sec = 0; t[1].tv_nsec = 0; if (syscall(__NR_utimensat, AT_FDCWD, "ttt", t, 0) != 0) error (1, errno, "utimensat failed"); struct stat64 st2; if (fstat64 (fd, &st2) != 0) error (1, errno, "fstat failed"); if (st2.st_atim.tv_sec != 0 \|\| st2.st_atim.tv_nsec != 0) { puts ("atim not reset to zero"); status = 1; } if (st2.st_mtim.tv_sec != 0 \|\| st2.st_mtim.tv_nsec != 0) { puts ("mtim not reset to zero"); status = 1; } if (status != 0) goto out; t[0] = st1.st_atim; t[1].tv_sec = 0; t[1].tv_nsec = UTIME_OMIT; if (syscall(__NR_utimensat, AT_FDCWD, "ttt", t, 0) != 0) error (1, errno, "utimensat failed"); if (fstat64 (fd, &st2) != 0) error (1, errno, "fstat failed"); if (st2.st_atim.tv_sec != st1.st_atim.tv_sec \|\| st2.st_atim.tv_nsec != st1.st_atim.tv_nsec) { puts ("atim not set"); status = 1; } if (st2.st_mtim.tv_sec != 0 \|\| st2.st_mtim.tv_nsec != 0) { puts ("mtim changed from zero"); status = 1; } if (status != 0) goto out; t[0].tv_sec = 0; t[0].tv_nsec = UTIME_OMIT; t[1] = st1.st_mtim; if (syscall(__NR_utimensat, AT_FDCWD, "ttt", t, 0) != 0) error (1, errno, "utimensat failed"); if (fstat64 (fd, &st2) != 0) error (1, errno, "fstat failed"); if (st2.st_atim.tv_sec != st1.st_atim.tv_sec \|\| st2.st_atim.tv_nsec != st1.st_atim.tv_nsec) { puts ("mtim changed from original time"); status = 1; } if (st2.st_mtim.tv_sec != st1.st_mtim.tv_sec \|\| st2.st_mtim.tv_nsec != st1.st_mtim.tv_nsec) { puts ("mtim not set"); status = 1; } if (status != 0) goto out; sleep (2); t[0].tv_sec = 0; t[0].tv_nsec = UTIME_NOW; t[1].tv_sec = 0; t[1].tv_nsec = UTIME_NOW; if (syscall(__NR_utimensat, AT_FDCWD, "ttt", t, 0) != 0) error (1, errno, "utimensat failed"); if (fstat64 (fd, &st2) != 0) error (1, errno, "fstat failed"); struct timeval tv; gettimeofday(&tv,NULL); if (st2.st_atim.tv_sec <= st1.st_atim.tv_sec \|\| st2.st_atim.tv_sec > tv.tv_sec) { puts ("atim not set to NOW"); status = 1; } if (st2.st_mtim.tv_sec <= st1.st_mtim.tv_sec \|\| st2.st_mtim.tv_sec > tv.tv_sec) { puts ("mtim not set to NOW"); status = 1; } if (symlink ("ttt", "tttsym") != 0) error (1, errno, "cannot create symlink"); t[0].tv_sec = 0; t[0].tv_nsec = 0; t[1].tv_sec = 0; t[1].tv_nsec = 0; if (syscall(__NR_utimensat, AT_FDCWD, "tttsym", t, AT_SYMLINK_NOFOLLOW) != 0) error (1, errno, "utimensat failed"); if (lstat64 ("tttsym", &st2) != 0) error (1, errno, "lstat failed"); if (st2.st_atim.tv_sec != 0 \|\| st2.st_atim.tv_nsec != 0) { puts ("symlink atim not reset to zero"); status = 1; } if (st2.st_mtim.tv_sec != 0 \|\| st2.st_mtim.tv_nsec != 0) { puts ("symlink mtim not reset to zero"); status = 1; } if (status != 0) goto out; t[0].tv_sec = 1; t[0].tv_nsec = 0; t[1].tv_sec = 1; t[1].tv_nsec = 0; if (syscall(__NR_utimensat, fd, NULL, t, 0) != 0) error (1, errno, "utimensat failed"); if (fstat64 (fd, &st2) != 0) error (1, errno, "fstat failed"); if (st2.st_atim.tv_sec != 1 \|\| st2.st_atim.tv_nsec != 0) { puts ("atim not reset to one"); status = 1; } if (st2.st_mtim.tv_sec != 1 \|\| st2.st_mtim.tv_nsec != 0) { puts ("mtim not reset to one"); status = 1; } if (status == 0) puts ("all OK"); out: close (fd); unlink ("ttt"); unlink ("tttsym"); return status; } [akpm@linux-foundation.org: add missing i386 syscall table entry] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Cc: Alexey Dobriyan <adobriyan@openvz.org> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:18 -07:00
Jeff Layton	1a1c9bb433	inode numbering: change libfs sb creation routines to avoid collisions with their root inodes This patch makes it so that simple_fill_super and get_sb_pseudo assign their root inodes to be number 1. It also fixes up a couple of callers of simple_fill_super that were passing in files arrays that had an index at number 1, and adds a warning for any caller that sends in such an array. It would have been nice to have made it so that it wasn't possible to make such a collision, but some callers need to be able to control what inode number their entries get, so I think this is the best that can be done. Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:16 -07:00
Jeff Layton	866b04fccb	inode numbering: make static counters in new_inode and iunique be 32 bits The problems are: - on filesystems w/o permanent inode numbers, i_ino values can be larger than 32 bits, which can cause problems for some 32 bit userspace programs on a 64 bit kernel. We can't do anything for filesystems that have actual >32-bit inode numbers, but on filesystems that generate i_ino values on the fly, we should try to have them fit in 32 bits. We could trivially fix this by making the static counters in new_inode and iunique 32 bits, but... - many filesystems call new_inode and assume that the i_ino values they are given are unique. They are not guaranteed to be so, since the static counter can wrap. This problem is exacerbated by the fix for #1. - after allocating a new inode, some filesystems call iunique to try to get a unique i_ino value, but they don't actually add their inodes to the hashtable, and so they're still not guaranteed to be unique if that counter wraps. This patch set takes the simpler approach of simply using iunique and hashing the inodes afterward. Christoph H. previously mentioned that he thought that this approach may slow down lookups for filesystems that currently hash their inodes. The questions are: 1) how much would this slow down lookups for these filesystems? 2) is it enough to justify adding more infrastructure to avoid it? What might be best is to start with this approach and then only move to using IDR or some other scheme if these extra inodes in the hashtable prove to be problematic. I've done some cursory testing with this patch and the overhead of hashing and unhashing the inodes with pipefs is pretty low -- just a few seconds of system time added on to the creation and destruction of 10 million pipes (very similar to the overhead that the IDR approach would add). The hard thing to measure is what effect this has on other filesystems. I'm open to ways to try and gauge this. Again, I've only converted pipefs as an example. If this approach is acceptable then I'll start work on patches to convert other filesystems. With a pretty-much-worst-case microbenchmark provided by Eric Dumazet <dada1@cosmosbay.com>: hashing patch (pipebench): sys 1m15.329s sys 1m16.249s sys 1m17.169s unpatched (pipebench): sys 1m9.836s sys 1m12.541s sys 1m14.153s Which works out to 1.05642174294555027017. So ~5-6% slowdown. This patch: When a 32-bit program that was not compiled with large file offsets does a stat and gets a st_ino value back that won't fit in the 32 bit field, glibc (correctly) generates an EOVERFLOW error. We can't do anything about fs's with larger permanent inode numbers, but when we generate them on the fly, we ought to try and have them fit within a 32 bit field. This patch takes the first step toward this by making the static counters in these two functions be 32 bits. [jlayton@redhat.com: mention that it's only the case for 32bit, non-LFS stat] Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:16 -07:00
Alexey Kuznetsov	b140f25108	Invalid return value of execve() resulting in oopses When elf loader fails to map executable (due to memory shortage or because binary is malformed), it can return 0. Normally, this is invisible because process is killed with SIGKILL and it never returns to user space. But if exec() is called from kernel thread (hotplug, whatever) consequences are more interesting and vary depending on architecture. i386. Nothing especially interesting, execve() just returns with "success" :-) x86_64. Fake zero frame is used on way to caller, RSP/RIP are loaded with zeros, ergo... double fault. ia64. Similar to i386, but r32...r95 are corrupted. Sometimes it oopses due to return to zero PC, sometimes it sees NaT in rXX and oopses due to NaT consumption. Signed-off-by: Alexey Kuznetsov <alexey@openvz.org> Signed-off-by: Kirill Korotaev <dev@openvz.org> Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:15 -07:00
Akinobu Mita	0c28f287aa	procfs: use simple_read_from_buffer() Cleanup using simple_read_from_buffer() in procfs. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:14 -07:00
Andreas Schwab	83ae1b79c8	Fix error handling in HDIO_GETGEO compat wrapper Don't clobber error from sys_ioctl in HDIO_GETGEO compat wrapper. Signed-off-by: Andreas Schwab <schwab@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:14 -07:00
Stephen Mollett	c007c06e3c	udf: decrement correct link count in udf_rmdir It appears that a minor thinko occurred in udf_rmdir and the (already-cleared) link count on the directory that is being removed was being decremented instead of the link count on its parent directory. This gives rise to lots of kernel messages similar to: UDF-fs warning (device loop1): udf_rmdir: empty directory has nlink != 2 (8) when removing directory trees. No other ill effects have been observed but I guess it could theoretically result in the link count overflowing on a very long-lived, much modified directory. Signed-off-by: Stephen Mollett <molletts@yahoo.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Jan Kara <jack@ucw.cz> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:14 -07:00
OGAWA Hirofumi	c483bab099	fat: fix VFAT compat ioctls on 64-bit systems If you compile and run the below test case in an msdos or vfat directory on an x86-64 system with -m32 you'll get garbage in the kernel_dirent struct followed by a SIGSEGV. The patch fixes this. Reported and initial fix by Bart Oldeman #include <sys/types.h> #include <sys/ioctl.h> #include <dirent.h> #include <stdio.h> #include <unistd.h> #include <fcntl.h> struct kernel_dirent { long d_ino; long d_off; unsigned short d_reclen; char d_name[256]; /* We must not include limits.h! */ }; #define VFAT_IOCTL_READDIR_BOTH _IOR('r', 1, struct kernel_dirent [2]) #define VFAT_IOCTL_READDIR_SHORT _IOR('r', 2, struct kernel_dirent [2]) int main(void) { int fd = open(".", O_RDONLY); struct kernel_dirent de[2]; while (1) { int i = ioctl(fd, VFAT_IOCTL_READDIR_BOTH, (long)de); if (i == -1) break; if (de[0].d_reclen == 0) break; printf("SFN: reclen=%2d off=%d ino=%d, %-12s", de[0].d_reclen, de[0].d_off, de[0].d_ino, de[0].d_name); if (de[1].d_reclen) printf("\tLFN: reclen=%2d off=%d ino=%d, %s", de[1].d_reclen, de[1].d_off, de[1].d_ino, de[1].d_name); printf("\n"); } return 0; } Signed-off-by: Bart Oldeman <bartoldeman@users.sourceforge.net> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:14 -07:00
Jan Kara	4f99ed67cc	ext3: copy i_flags to inode flags on write Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into ext2-specific i_flags. Hence, when someone sets these flags via a different interface than ioctl, they are stored correctly. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:13 -07:00
OGAWA Hirofumi	28ec039c21	fat: don't use free_clusters for fat32 It seems that the recent Windows changed specification, and it's undocumented. Windows doesn't update ->free_clusters correctly. This patch doesn't use ->free_clusters by default. (instead, add "usefree" for forcing to use it) Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Juergen Beisert <juergen127@kreuzholzen.de> Cc: Andreas Schwab <schwab@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:13 -07:00
Milind Arun Choudhary	5ab2f7e0fd	reiserfs: use __set_current_state() use __set_current_state(TASK_) instead of current->state = TASK_, in fs/reiserfs Signed-off-by: Milind Arun Choudhary <milindchoudhary@gmail.com> Cc: <reiserfs-dev@namesys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:13 -07:00
Pavel Emelianov	97f0678467	jbd: check for error returned by kthread_create on creating journal thread If the thread failed to create the subsequent wait_event will hang forever. This is likely to happen if kernel hits max_threads limit. Will be critical for virtualization systems that limit the number of tasks and kernel memory usage within the container. (akpm: JBD should be converted fully to the kthread API: kthread_should_stop() and kthread_stop()). Cc: <linux-ext4@vger.kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:13 -07:00
Miklos Szeredi	ee6f958291	check privileges before setting mount propagation There's a missing check for CAP_SYS_ADMIN in do_change_type(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:12 -07:00
Jan Kara	28be5abb40	ext3: copy i_flags to inode flags on write A patch that stores inode flags such as S_IMMUTABLE, S_APPEND, etc. from i_flags to EXT3_I(inode)->i_flags when inode is written to disk. The same thing is done on GETFLAGS ioctl. Quota code changes these flags on quota files (to make it harder for sysadmin to screw himself) and these changes were not correctly propagated into the filesystem (especially, lsattr did not show them and users were wondering...). Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into ext3-specific i_flags. Hence, when someone sets these flags via a different interface than ioctl, they are stored correctly. Signed-off-by: Jan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:12 -07:00
Pavel Emelianov	b5e618181a	Introduce a handy list_first_entry macro There are many places in the kernel where the construction like foo = list_entry(head->next, struct foo_struct, list); are used. The code might look more descriptive and neat if using the macro list_first_entry(head, type, member) \ list_entry((head)->next, type, member) Here is the macro itself and the examples of its usage in the generic code. If it will turn out to be useful, I can prepare the set of patches to inject in into arch-specific code, drivers, networking, etc. Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Kirill Korotaev <dev@openvz.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Zach Brown <zach.brown@oracle.com> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: John McCutchan <ttb@tentacle.dhs.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: john stultz <johnstul@us.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:11 -07:00
Eric W. Biederman	1bd0cf1fc7	smbfs: remove unnecessary allow_signal Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:11 -07:00
Jeffrey Layton	3361c7bebb	make iunique use a do/while loop rather than its obscure goto loop A while back, Christoph mentioned that he thought that iunique ought to be cleaned up to use a more conventional loop construct. This patch does that, turning the strange goto loop into a do/while. Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:10 -07:00
John Johansen	9d0633cfed	Remove redundant check from proc_sys_setattr() notify_change() already calls security_inode_setattr() before calling iop->setattr. Alan sayeth This is a behaviour change on all of these and limits some behaviour of existing established security modules When inode_change_ok is called it has side effects. This includes clearing the SGID bit on attribute changes caused by chmod. If you make this change the results of some rulesets may be different before or after the change is made. I'm not saying the change is wrong but it does change behaviour so that needs looking at closely (ditto all other attribute twiddles) Signed-off-by: Steve Beattie <sbeattie@suse.de> Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: John Johansen <jjohansen@suse.de> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:10 -07:00
John Johansen	1e8123fded	Remove redundant check from proc_setattr() notify_change() already calls security_inode_setattr() before calling iop->setattr. Signed-off-by: Tony Jones <tonyj@suse.de> Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: John Johansen <jjohansen@suse.de> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:10 -07:00
Martin Peschke	09f0892ec7	proc: cleanup: use seq_release_private() where appropriate We can save some lines of code by using seq_release_private(). Signed-off-by: Martin Peschke <mp3@de.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:09 -07:00
Christoph Hellwig	6272e26679	cleanup compat ioctl handling Merge all compat ioctl handling into compat_ioctl.c instead of splitting it over compat.c and compat_ioctl.c. This also allows to get rid of ioctl32.h Signed-off-by: Christoph Hellwig <hch@lst.de> Looks-good-to: Andi Kleen <ak@suse.de> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:09 -07:00
Philippe De Muyter	19d0e8ce85	partition: add support for sysv68 partitions Add support for the Motorola sysv68 disk partition (slices in motorola doc). Signed-off-by: Philippe De Muyter <phdm@macqel.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:09 -07:00
Christoph Hellwig	644fd4f5de	merge compat_ioctl.h into compat_ioctl.c Now that there is no arch-specific compat ioctl handling left there is not point in having a separate copat_ioctl.h, so merge it into compat_ioctl.c Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:09 -07:00
Milind Arun Choudhary	1525dccbc2	ROUND_UP macro cleanup in fs/smbfs/request.c ROUND_UP macro cleanup use ALIGN Signed-off-by: Milind Arun Choudhary <milindchoudhary@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:09 -07:00
Milind Arun Choudhary	022a169244	ROUND_UP macro cleanup in fs/(select\|compat\|readdir).c ROUND_UP macro cleanup use,ALIGN or DIV_ROUND_UP where ever appropriate. Signed-off-by: Milind Arun Choudhary <milindchoudhary@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:09 -07:00
Alexey Dobriyan	7e80d0d0b6	i386: sched.h inclusion from module.h is baack linux/module.h -> linux/elf.h -> asm-i386/elf.h -> linux/utsname.h -> linux/sched.h Noticeably cut the number of files which are rebuild upon touching sched.h and cut down pulled junk from every module.h inclusion. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:08 -07:00
Alexey Dobriyan	9d65cb4a17	Fix race between cat /proc//wchan and rmmod et al kallsyms_lookup() can go iterating over modules list unprotected which is OK for emergency situations (oops), but not OK for regular stuff like /proc//wchan. Introduce lookup_symbol_name()/lookup_module_symbol_name() which copy symbol name into caller-supplied buffer or return -ERANGE. All copying is done with module_mutex held, so... Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:08 -07:00
Alexey Dobriyan	ffb4512276	Simplify kallsyms_lookup() Several kallsyms_lookup() pass dummy arguments but only need, say, module's name. Make kallsyms_lookup() accept NULLs where possible. Also, makes picture clearer about what interfaces are needed for all symbol resolving business. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Cc: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:08 -07:00
kalash nainwal	98701d1b0f	(re)register_binfmt returns with -EBUSY When a binary format is unregistered and re-registered, register_binfmt fails with -EBUSY. The reason is that unregister_binfmt does not set fmt->next to NULL, and seeing (fmt->next != NULL), register_binfmt fails with -EBUSY. One can find his way around by explicitly setting fmt->next to NULL after unregistering, but that is kind of unclean (one should better be using only the interfaces, and not the interal members, isn't it?) Attached one-liner can fix it. Signed-off-by: Kalash Nainwal <kalash.nainwal@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:08 -07:00
Randy Dunlap	e63340ae6b	header cleaning: don't include smp_lock.h when not used Remove includes of <linux/smp_lock.h> where it is not used/needed. Suggested by Al Viro. Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc, sparc64, and arm (all 59 defconfigs). Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:07 -07:00
Adrian Bunk	e5f00f42f3	make remove_inode_dquot_ref() static remove_inode_dquot_ref() can now become static. Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:05 -07:00
Alexey Dobriyan	ca509f69de	Protect tty drivers list with tty_mutex Additions and removal from tty_drivers list were just done as well as iterating on it for /proc/tty/drivers generation. testing: modprobe/rmmod loop of simple module which does nothing but tty_register_driver() vs cat /proc/tty/drivers loop BUG: unable to handle kernel paging request at virtual address 6b6b6b6b printing eip: c01cefa7 *pde = 00000000 Oops: 0000 [#1] PREEMPT last sysfs file: devices/pci0000:00/0000:00:1d.7/usb5/5-0:1.0/bInterfaceProtocol Modules linked in: ohci_hcd af_packet e1000 ehci_hcd uhci_hcd usbcore xfs CPU: 0 EIP: 0060:[<c01cefa7>] Not tainted VLI EFLAGS: 00010297 (2.6.21-rc4-mm1 #4) EIP is at vsnprintf+0x3a4/0x5fc eax: 6b6b6b6b ebx: f6cb50f2 ecx: 6b6b6b6b edx: fffffffe esi: c0354700 edi: f6cb6000 ebp: 6b6b6b6b esp: f31f5e68 ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068 Process cat (pid: 31864, ti=f31f4000 task=c1998030 task.ti=f31f4000) Stack: 00000000 c0103f20 c013003a c0103f20 00000000 f6cb50da 0000000a 00000f0e f6cb50f2 00000010 00000014 ffffffff ffffffff 00000007 c0354753 f6cb50f2 f73e39dc f73e39dc 00000001 c0175416 f31f5ed8 f31f5ed4 0ee00000 f32090bc Call Trace: [<c0103f20>] restore_nocheck+0x12/0x15 [<c013003a>] mark_held_locks+0x6d/0x86 [<c0103f20>] restore_nocheck+0x12/0x15 [<c0175416>] seq_printf+0x2e/0x52 [<c0192895>] show_tty_range+0x35/0x1f3 [<c0175416>] seq_printf+0x2e/0x52 [<c0192add>] show_tty_driver+0x8a/0x1d9 [<c01758f6>] seq_read+0x70/0x2ba [<c0175886>] seq_read+0x0/0x2ba [<c018d8e6>] proc_reg_read+0x63/0x9f [<c015e764>] vfs_read+0x7d/0xb5 [<c018d883>] proc_reg_read+0x0/0x9f [<c015eab1>] sys_read+0x41/0x6a [<c0103e4e>] sysenter_past_esp+0x5f/0x99 ======================= Code: 00 8b 4d 04 e9 44 ff ff ff 8d 4d 04 89 4c 24 50 8b 6d 00 81 fd ff 0f 00 00 b8 a4 c1 35 c0 0f 46 e8 8b 54 24 2c 89 e9 89 c8 eb 06 <80> 38 00 74 07 40 4a 83 fa ff 75 f4 29 c8 89 c6 8b 44 24 28 89 EIP: [<c01cefa7>] vsnprintf+0x3a4/0x5fc SS:ESP 0068:f31f5e68 Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:05 -07:00
Mark Fasheh	ef51c97623	Remove do_sync_file_range() Remove do_sync_file_range() and convert callers to just use do_sync_mapping_range(). Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:04 -07:00
Randy Dunlap	880ebdc516	reiserfs: proc support requires PROC_FS REISER_FS /proc option needs to depend on PROC_FS. fs/reiserfs/procfs.c: In function 'show_super': fs/reiserfs/procfs.c:134: error: 'reiserfs_proc_info_data_t' has no member named 'max_hash_collisions' fs/reiserfs/procfs.c:134: error: 'reiserfs_proc_info_data_t' has no member named 'breads' fs/reiserfs/procfs.c:135: error: 'reiserfs_proc_info_data_t' has no member named 'bread_miss' fs/reiserfs/procfs.c:135: error: 'reiserfs_proc_info_data_t' has no member named 'search_by_key' fs/reiserfs/procfs.c:136: error: 'reiserfs_proc_info_data_t' has no member named 'search_by_key_fs_changed' fs/reiserfs/procfs.c:136: error: 'reiserfs_proc_info_data_t' has no member named 'search_by_key_restarted' fs/reiserfs/procfs.c:137: error: 'reiserfs_proc_info_data_t' has no member named 'insert_item_restarted' fs/reiserfs/procfs.c:137: error: 'reiserfs_proc_info_data_t' has no member named 'paste_into_item_restarted' fs/reiserfs/procfs.c:138: error: 'reiserfs_proc_info_data_t' has no member named 'cut_from_item_restarted' fs/reiserfs/procfs.c:139: error: 'reiserfs_proc_info_data_t' has no member named 'delete_solid_item_restarted' fs/reiserfs/procfs.c:139: error: 'reiserfs_proc_info_data_t' has no member named 'delete_item_restarted' fs/reiserfs/procfs.c:140: error: 'reiserfs_proc_info_data_t' has no member named 'leaked_oid' fs/reiserfs/procfs.c:140: error: 'reiserfs_proc_info_data_t' has no member named 'leaves_removable' fs/reiserfs/procfs.c: In function 'show_per_level': fs/reiserfs/procfs.c:184: error: 'reiserfs_proc_info_data_t' has no member named 'balance_at' fs/reiserfs/procfs.c:185: error: 'reiserfs_proc_info_data_t' has no member named 'sbk_read_at' fs/reiserfs/procfs.c:186: error: 'reiserfs_proc_info_data_t' has no member named 'sbk_fs_changed' fs/reiserfs/procfs.c:187: error: 'reiserfs_proc_info_data_t' has no member named 'sbk_restarted' fs/reiserfs/procfs.c:188: error: 'reiserfs_proc_info_data_t' has no member named 'free_at' fs/reiserfs/procfs.c:189: error: 'reiserfs_proc_info_data_t' has no member named 'items_at' fs/reiserfs/procfs.c:190: error: 'reiserfs_proc_info_data_t' has no member named 'can_node_be_removed' fs/reiserfs/procfs.c:191: error: 'reiserfs_proc_info_data_t' has no member named 'lnum' fs/reiserfs/procfs.c:192: error: 'reiserfs_proc_info_data_t' has no member named 'rnum' fs/reiserfs/procfs.c:193: error: 'reiserfs_proc_info_data_t' has no member named 'lbytes' fs/reiserfs/procfs.c:194: error: 'reiserfs_proc_info_data_t' has no member named 'rbytes' fs/reiserfs/procfs.c:195: error: 'reiserfs_proc_info_data_t' has no member named 'get_neighbors' fs/reiserfs/procfs.c:196: error: 'reiserfs_proc_info_data_t' has no member named 'get_neighbors_restart' fs/reiserfs/procfs.c:197: error: 'reiserfs_proc_info_data_t' has no member named 'need_l_neighbor' fs/reiserfs/procfs.c:197: error: 'reiserfs_proc_info_data_t' has no member named 'need_r_neighbor' fs/reiserfs/procfs.c: In function 'show_bitmap': fs/reiserfs/procfs.c:224: error: 'reiserfs_proc_info_data_t' has no member named 'free_block' fs/reiserfs/procfs.c:225: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap' fs/reiserfs/procfs.c:226: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap' fs/reiserfs/procfs.c:227: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap' fs/reiserfs/procfs.c:228: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap' fs/reiserfs/procfs.c:229: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap' fs/reiserfs/procfs.c:230: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap' fs/reiserfs/procfs.c:230: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap' fs/reiserfs/procfs.c: In function 'show_journal': fs/reiserfs/procfs.c:384: error: 'reiserfs_proc_info_data_t' has no member named 'journal' fs/reiserfs/procfs.c:385: error: 'reiserfs_proc_info_data_t' has no member named 'journal' fs/reiserfs/procfs.c:386: error: 'reiserfs_proc_info_data_t' has no member named 'journal' fs/reiserfs/procfs.c:387: error: 'reiserfs_proc_info_data_t' has no member named 'journal' fs/reiserfs/procfs.c:388: error: 'reiserfs_proc_info_data_t' has no member named 'journal' fs/reiserfs/procfs.c:389: error: 'reiserfs_proc_info_data_t' has no member named 'journal' fs/reiserfs/procfs.c:390: error: 'reiserfs_proc_info_data_t' has no member named 'journal' fs/reiserfs/procfs.c:391: error: 'reiserfs_proc_info_data_t' has no member named 'journal' fs/reiserfs/procfs.c:392: error: 'reiserfs_proc_info_data_t' has no member named 'journal' fs/reiserfs/procfs.c:393: error: 'reiserfs_proc_info_data_t' has no member named 'journal' fs/reiserfs/procfs.c:394: error: 'reiserfs_proc_info_data_t' has no member named 'journal' fs/reiserfs/procfs.c:395: error: 'reiserfs_proc_info_data_t' has no member named 'journal' fs/reiserfs/procfs.c:395: error: 'reiserfs_proc_info_data_t' has no member named 'journal' fs/reiserfs/procfs.c:395: error: 'reiserfs_proc_info_data_t' has no member named 'journal' fs/reiserfs/procfs.c: In function 'reiserfs_proc_info_init': fs/reiserfs/procfs.c:504: warning: implicit declaration of function '__PINFO' fs/reiserfs/procfs.c:504: error: request for member 'lock' in something not a structure or union fs/reiserfs/procfs.c: In function 'reiserfs_proc_info_done': fs/reiserfs/procfs.c:544: error: request for member 'lock' in something not a structure or union fs/reiserfs/procfs.c:545: error: request for member 'exiting' in something not a structure or union fs/reiserfs/procfs.c:546: error: request for member 'lock' in something not a structure or union Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:04 -07:00
Alexey Dobriyan	19c5d45a09	/proc/*/oom_score oops re badness Eternal quest to make while true; do cat /proc/fs/xfs/stat >/dev/null 2>/dev/null; done while true; do find /proc -type f 2>/dev/null \| xargs cat >/dev/null 2>/dev/null; done while true; do modprobe xfs; rmmod xfs; done work reliably continues and now kernel oopses in the following way: BUG: unable to handle ... at virtual address 6b6b6b6b EIP is at badness process: cat proc_oom_score proc_info_read sys_fstat64 vfs_read proc_info_read sys_read Failing code is prefetch hidden in list_for_each_entry() in badness(). badness() is reachable from two points. One is proc_oom_score, another is out_of_memory() => select_bad_process() => badness(). Second path grabs tasklist_lock, while first doesn't. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:04 -07:00
Eric Dumazet	c23fbb6bcb	VFS: delay the dentry name generation on sockets and pipes 1) Introduces a new method in 'struct dentry_operations'. This method called d_dname() might be called from d_path() to build a pathname for special filesystems. It is called without locks. Future patches (if we succeed in having one common dentry for all pipes/sockets) may need to change prototype of this method, but we now use : char d_dname(struct dentry dentry, char buffer, int buflen); 2) Adds a dynamic_dname() helper function that eases d_dname() implementations 3) Defines d_dname method for sockets : No more sprintf() at socket creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... 4) Defines d_dname method for pipes : No more sprintf() at pipe creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a nice* speedup on my Pentium(M) 1.6 Ghz : 3.090 s instead of 3.450 s Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Christoph Hellwig <hch@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:03 -07:00
Miklos Szeredi	2793274298	add file position info to proc Add support for finding out the current file position, open flags and possibly other info in the future. These new entries are added: /proc/PID/fdinfo/FD /proc/PID/task/TID/fdinfo/FD For each fd the information is provided in the following format: pos: 1234 flags: 0100002 [bunk@stusta.de: make struct proc_fdinfo_file_operations static] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:03 -07:00
Eric Dumazet	c5141e6d64	procfs: reorder struct pid_dentry to save space on 64bit archs, and constify them Change the order of fields of struct pid_entry (file fs/proc/base.c) in order to avoid a hole on 64bit archs. (8 bytes saved per object) Also change all pid_entry arrays to be const qualified, to make clear they must not be modified. Before (on x86_64) : # size fs/proc/base.o text data bss dec hex filename 15549 2192 0 17741 454d fs/proc/base.o After : # size fs/proc/base.o text data bss dec hex filename 17229 176 0 17405 43fd fs/proc/base.o Thats 336 bytes saved on kernel size on x86_64 Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:03 -07:00
Kees Cook	5096add84b	proc: maps protection The /proc/pid/ "maps", "smaps", and "numa_maps" files contain sensitive information about the memory location and usage of processes. Issues: - maps should not be world-readable, especially if programs expect any kind of ASLR protection from local attackers. - maps cannot just be 0400 because "-D_FORTIFY_SOURCE=2 -O2" makes glibc check the maps when %n is in a *printf call, and a setuid(getuid()) process wouldn't be able to read its own maps file. (For reference see http://lkml.org/lkml/2006/1/22/150) - a system-wide toggle is needed to allow prior behavior in the case of non-root applications that depend on access to the maps contents. This change implements a check using "ptrace_may_attach" before allowing access to read the maps contents. To control this protection, the new knob /proc/sys/kernel/maps_protect has been added, with corresponding updates to the procfs documentation. [akpm@linux-foundation.org: build fixes] [akpm@linux-foundation.org: New sysctl numbers are old hat] Signed-off-by: Kees Cook <kees@outflux.net> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:02 -07:00
Christoph Hellwig	5843205b55	namei.c: remove utterly outdated comment We don't have a routine called namei() anymore since at least 2.3.x, and the comment is just totally out of sync with the current lookup logic. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:02 -07:00
Christoph Hellwig	acb0c854fa	vfs: remove superflous sb == NULL checks inode->i_sb is always set, not need to check for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:02 -07:00
Alexey Dobriyan	578c8183c1	proc: remove pathetic ->deleted WARN_ON WARN_ON(de && de->deleted); is sooo unreliable. Why? proc_lookup remove_proc_entry =========== ================= lock_kernel(); spin_lock(&proc_subdir_lock); [find proc entry] spin_unlock(&proc_subdir_lock); spin_lock(&proc_subdir_lock); [find proc entry] proc_get_inode ============== WARN_ON(de && de->deleted); ... if (!atomic_read(&de->count)) free_proc_entry(de); else de->deleted = 1; So, if you have some strange oops [1], and doesn't see this WARN_ON it means nothing. [1] try_module_get() of module which doesn't exist, two lines below should suffice, or not? Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:02 -07:00
Darrick J. Wong	59cd0cbc75	Fix race between proc_readdir and remove_proc_entry Fix the following race: proc_readdir remove_proc_entry ============ ================= spin_lock(&proc_subdir_lock); [choose PDE to start filldir from] spin_unlock(&proc_subdir_lock); spin_lock(&proc_subdir_lock); [find PDE] [free PDE, refcount is 0] spin_unlock(&proc_subdir_lock); /* boom */ if (filldir(dirent, de->name, ... [de_put on error path --adobriyan] Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:02 -07:00
Alexey Dobriyan	7695650a92	Fix race between proc_get_inode() and remove_proc_entry() proc_lookup remove_proc_entry =========== ================= lock_kernel(); spin_lock(&proc_subdir_lock); [find PDE with refcount 0] spin_unlock(&proc_subdir_lock); spin_lock(&proc_subdir_lock); [find PDE with refcount 0] [check refcount and free PDE] spin_unlock(&proc_subdir_lock); proc_get_inode: de_get(de); /* boom */ Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:01 -07:00
Miklos Szeredi	79c0b2df79	add filesystem subtype support There's a slight problem with filesystem type representation in fuse based filesystems. From the kernel's view, there are just two filesystem types: fuse and fuseblk. From the user's view there are lots of different filesystem types. The user is not even much concerned if the filesystem is fuse based or not. So there's a conflict of interest in how this should be represented in fstab, mtab and /proc/mounts. The current scheme is to encode the real filesystem type in the mount source. So an sshfs mount looks like this: sshfs#user@server:/ /mnt/server fuse rw,nosuid,nodev,... This url-ish syntax works OK for sshfs and similar filesystems. However for block device based filesystems (ntfs-3g, zfs) it doesn't work, since the kernel expects the mount source to be a real device name. A possibly better scheme would be to encode the real type in the type field as "type.subtype". So fuse mounts would look like this: /dev/hda1 /mnt/windows fuseblk.ntfs-3g rw,... user@server:/ /mnt/server fuse.sshfs rw,nosuid,nodev,... This patch adds the necessary code to the kernel so that this can be correctly displayed in /proc/mounts. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:01 -07:00
Davide Libenzi	6192bd536f	epoll: optimizations and cleanups Epoll is doing multiple passes over the ready set at the moment, because of the constraints over the f_op->poll() call. Looking at the code again, I noticed that we already hold the epoll semaphore in read, and this (together with other locking conditions that hold while doing an epoll_wait()) can lead to a smarter way [1] to "ship" events to userspace (in a single pass). This is a stress application that can be used to test the new code. It spwans multiple thread and call epoll_wait() and epoll_ctl() from many threads. Stress tested on my dual Opteron 254 w/out any problems. http://www.xmailserver.org/totalmess.c This is not a benchmark, just something that tries to stress and exploit possible problems with the new code. Also, I made a stupid micro-benchmark: http://www.xmailserver.org/epwbench.c [1] Considering that epoll must be thread-safe, there are five ways we can be hit during an epoll_wait() transfer loop (ep_send_events()): 1) The epoll fd going away and calling ep_free This just can't happen, since we did an fget() in sys_epoll_wait 2) An epoll_ctl(EPOLL_CTL_DEL) This can't happen because epoll_ctl() gets ep->sem in write, and we're holding it in read during ep_send_events() 3) An fd stored inside the epoll fd going away This can't happen because in eventpoll_release_file() we get ep->sem in write, and we're holding it in read during ep_send_events() 4) Another epoll_wait() happening on another thread They both can be inside ep_send_events() at the same time, we get (splice) the ready-list under the spinlock, so each one will get its own ready list. Note that an fd cannot be at the same time inside more than one ready list, because ep_poll_callback() will not re-queue it if it sees it already linked: if (ep_is_linked(&epi->rdllink)) goto is_linked; Another case that can happen, is two concurrent epoll_wait(), coming in with a userspace event buffer of size, say, ten. Suppose there are 50 event ready in the list. The first epoll_wait() will "steal" the whole list, while the second, seeing no events, will go to sleep. But at the end of ep_send_events() in the first epoll_wait(), we will re-inject surplus ready fds, and we will trigger the proper wake_up to the second epoll_wait(). 5) ep_poll_callback() hitting us asyncronously This is the tricky part. As I said above, the ep_is_linked() test done inside ep_poll_callback(), will guarantee us that until the item will result linked to a list, ep_poll_callback() will not try to re-queue it again (read, write data on any of its members). When we do a list_del() in ep_send_events(), the item will still satisfy the ep_is_linked() test (whatever data is written in prev/next, it'll never be its own pointer), so ep_poll_callback() will still leave us alone. It's only after the eventual smp_mb()+INIT_LIST_HEAD(&epi->rdllink) that it'll become visible to ep_poll_callback(), but at the point we're already past it. [akpm@osdl.org: 80 cols] Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:01 -07:00
Dmitriy Monakhov	fedee54d8f	ext3: dirindex error pointer issues - ext3_dx_find_entry() exit with out setting proper error pointer - do_split() exit with out setting proper error pointer it is realy painful because many callers contain folowing code: de = do_split(handle,dir, &bh, frame, &hinfo, &retval); if (!(de)) return retval; <<< WOW retval wasn't changed by do_split(), so caller failed <<< but return SUCCESS :) - Rearrange do_split() error path. Current error path is realy ugly, all this up and down jump stuff doesn't make code easy to understand. [dmonakhov@sw.ru: fix annoying fake error messages] Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org> Cc: Andreas Dilger <adilger@clusterfs.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:01 -07:00
Badari Pulavarty	e3222c4ecc	Merge sys_clone()/sys_unshare() nsproxy and namespace handling sys_clone() and sys_unshare() both makes copies of nsproxy and its associated namespaces. But they have different code paths. This patch merges all the nsproxy and its associated namespace copy/clone handling (as much as possible). Posted on container list earlier for feedback. - Create a new nsproxy and its associated namespaces and pass it back to caller to attach it to right process. - Changed all copy__ns() routines to return a new copy of namespace instead of attaching it to task->nsproxy. - Moved the CAP_SYS_ADMIN checks out of copy__ns() routines. - Removed unnessary !ns checks from copy__ns() and added BUG_ON() just incase. - Get rid of all individual unshare__ns() routines and make use of copy_*_ns() instead. [akpm@osdl.org: cleanups, warning fix] [clg@fr.ibm.com: remove dup_namespaces() declaration] [serue@us.ibm.com: fix CONFIG_IPC_NS=n, clone(CLONE_NEWIPC) retval] [akpm@linux-foundation.org: fix build with CONFIG_SYSVIPC=n] Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <containers@lists.osdl.org> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:00 -07:00
Nick Piggin	4fc75ff481	exec: fix remove_arg_zero Petr Tesarik discovered a problem in remove_arg_zero(). He writes: When a script is loaded, load_script() replaces argv[0] with the name of the interpreter and the filename passed to the exec syscall. However, there is no guarantee that the length of the interpreter name plus the length of the filename is greater than the length of the original argv[0]. If the difference happens to cross a page boundary, setup_arg_pages() will call put_dirty_page() [aka install_arg_page()] with an address outside the VMA. Therefore, remove_arg_zero() must free all pages which would be unused after the argument is removed. So, rewrite the remove_arg_zero function without gotos, with a few comments, and with the commonly used explicit index/offset. This fixes the problem and makes it easier to understand as well. [a.p.zijlstra@chello.nl: add comment] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Petr Tesarik <ptesarik@suse.cz> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:00 -07:00
Robert P. J. Day	f87367a6b1	reiserfs: correct misspelled "REISERFS_PROC_INFO" to "CONFIG_REISERFS_PROC_INFO" Correct the misspelling of the preprocessor check of a Kconfig option to refer to CONFIG_REISERFS_PROC_INFO and not just the incorrect REISERFS_PROC_INFO. Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:00 -07:00
Alexey Dobriyan	fe08a9d498	reiserfs: shrink superblock if no xattrs This makes in-core superblock fit into one cacheline here. Before: struct dentry * xattr_root; /* 124 4 / / --- cacheline 1 boundary (128 bytes) --- / struct rw_semaphore xattr_dir_sem; / 128 12 / int j_errno; / 140 4 / }; / size: 144, cachelines: 2 / / sum members: 142, holes: 1, sum holes: 2 / / last cacheline: 16 bytes / After: int j_errno; / 124 4 / / --- cacheline 1 boundary (128 bytes) --- / }; / size: 128, cachelines: 1 / / sum members: 126, holes: 1, sum holes: 2 */ Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: <reiserfs-dev@namesys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:15:00 -07:00
Dmitriy Monakhov	2d3466a348	reiserfs: possible null pointer dereference during resize sb_read may return NULL, let's explicitly check it. If so free new bitmap blocks array, after this we may safely exit as it done above during bitmap allocation. Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:14:59 -07:00
Dmitriy Monakhov	82f703bb8c	freevxfs: possible null pointer dereference fix sb_read may return NULL, so let's explicitly check it. Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:14:59 -07:00
Vignesh Babu BM	1368c4f248	is_power_of_2 in fs/block_dev.c Replace (n & (n-1)) in the context of power of 2 checks with is_power_of_2 Signed-off-by: vignesh babu <vignesh.babu@wipro.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:14:59 -07:00
Vignesh Babu BM	e1b5c1d3da	is_power_of_2 in fs/hfs Replace (n & (n-1)) in the context of power of 2 checks with is_power_of_2 Signed-off-by: vignesh babu <vignesh.babu@wipro.com> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:14:59 -07:00
Vignesh Babu BM	e7d709c096	is_power_of_2 in fat Replacing (n & (n-1)) in the context of power of 2 checks with is_power_of_2 Signed-off-by: vignesh babu <vignesh.babu@wipro.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:14:59 -07:00
Florin Malita	3972b7f67b	devpts: add fsnotify create event Currently, devpts doesn't generate an fsnotify event upon pts creation because the regular vfs paths aren't involved. Deallocation, on the other hand, correctly generates a nameremove event thanks to the d_delete() invocation in devpts_pty_kill(). This patch adds the missing fsnotify_create() trigger in devpts_pty_new(). Signed-off-by: Florin Malita <fmalita@gmail.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:14:59 -07:00
Chris Snook	1ae7075bcd	use use SEEK_MAX to validate user lseek arguments Add SEEK_MAX and use it to validate lseek arguments from userspace. Signed-off-by: Chris Snook <csnook@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:14:59 -07:00
Chris Snook	7b8e89249b	use symbolic constants in generic lseek code Convert magic numbers to SEEK_* values from fs.h Signed-off-by: Chris Snook <csnook@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:14:59 -07:00
Andrew Morton	24c32d733d	mm: shrink parent dentries when shrinking slab Teach the dentry slab shrinker to aggressively shrink parent dentries when shrinking the dentry cache. This is done to attempt to improve the situation where the dentry slab cache gets a lot of internal fragmentation due to pages containing directory dentries. It is expected that this change will cause some of those dentries to be reaped earlier, and with less scanning. Needs careful testing. Cc: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:14:58 -07:00
Miklos Szeredi	d52b908646	fix quadratic behavior of shrink_dcache_parent() The time shrink_dcache_parent() takes, grows quadratically with the depth of the tree under 'parent'. This starts to get noticable at about 10,000. These kinds of depths don't occur normally, and filesystems which invoke shrink_dcache_parent() via d_invalidate() seem to have other depth dependent timings, so it's not even easy to expose this problem. However with FUSE it's easy to create a deep tree and d_invalidate() will also get called. This can make a syscall hang for a very long time. This is the original discovery of the problem by Russ Cox: http://article.gmane.org/gmane.comp.file-systems.fuse.devel/3826 The following patch fixes the quadratic behavior, by optionally allowing prune_dcache() to prune ancestors of a dentry in one go, instead of doing it one at a time. Common code in dput() and prune_one_dentry() is extracted into a new helper function d_kill(). shrink_dcache_parent() as well as shrink_dcache_sb() are converted to use the ancestry-pruner option. Only for shrink_dcache_memory() is this behavior not desirable, so it keeps using the old algorithm. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Maneesh Soni <maneesh@in.ibm.com> Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Neil Brown <neilb@suse.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:14:58 -07:00
William Cohen	97dc32cdb1	reduce size of task_struct on 64-bit machines This past week I was playing around with that pahole tool (http://oops.ghostprotocols.net:81/acme/dwarves/) and looking at the size of various struct in the kernel. I was surprised by the size of the task_struct on x86_64, approaching 4K. I looked through the fields in task_struct and found that a number of them were declared as "unsigned long" rather than "unsigned int" despite them appearing okay as 32-bit sized fields. On x86_64 "unsigned long" ends up being 8 bytes in size and forces 8 byte alignment. Is there a reason there a reason they are "unsigned long"? The patch below drops the size of the struct from 3808 bytes (60 64-byte cachelines) to 3760 bytes (59 64-byte cachelines). A couple other fields in the task struct take a signficant amount of space: struct thread_struct thread; 688 struct held_lock held_locks[30]; 1680 CONFIG_LOCKDEP is turned on in the .config [akpm@linux-foundation.org: fix printk warnings] Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:14:58 -07:00
Markus Rechberger	4d7bf11d64	ext2/3/4: fix file date underflow on ext2 3 filesystems on 64 bit systems Taken from http://bugzilla.kernel.org/show_bug.cgi?id=5079 signed long ranges from -2.147.483.648 to 2.147.483.647 on x86 32bit 10000011110110100100111110111101 .. -2,082,844,739 10000011110110100100111110111101 .. 2,212,122,557 <- this currently gets stored on the disk but when converting it to a 64bit signed long value it loses its sign and becomes positive. Cc: Andreas Dilger <adilger@dilger.ca> Cc: <linux-ext4@vger.kernel.org> Andreas says: This patch is now treating timestamps with the high bit set as negative times (before Jan 1, 1970). This means we lose 1/2 of the possible range of timestamps (lopping off 68 years before unix timestamp overflow - now only 30 years away :-) to handle the extremely rare case of setting timestamps into the distant past. If we are only interested in fixing the underflow case, we could just limit the values to 0 instead of storing negative values. At worst this will skew the timestamp by a few hours for timezones in the far east (files would still show Jan 1, 1970 in "ls -l" output). That said, it seems 32-bit systems (mine at least) allow files to be set into the past (01/01/1907 works fine) so it seems this patch is bringing the x86_64 behaviour into sync with other kernels. On the plus side, we have a patch that is ready to add nanosecond timestamps to ext3 and as an added bonus adds 2 high bits to the on-disk timestamp so this extends the maximum date to 2242. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:14:58 -07:00
Alexey Dobriyan	8948e11f45	Allow access to /proc/$PID/fd after setuid() /proc/$PID/fd has r-x------ permissions, so if process does setuid(), it will not be able to access /proc/*/fd/. This breaks fstatat() emulation in glibc. open("foo", O_RDONLY\|O_DIRECTORY) = 4 setuid32(65534) = 0 stat64("/proc/self/fd/4/bar", 0xbfafb298) = -1 EACCES (Permission denied) Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Acked-By: Kirill Korotaev <dev@openvz.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:14:58 -07:00
Andrew Morton	7e4c3690b0	block_write_full_page(): report ENOSPC block_write_full_page() forgot to propagate ENPSOC into the address_space. Cc: Guillaume Chazarain <guichaz@yahoo.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:14:57 -07:00
Guillaume Chazarain	3e9f45bd18	Factor outstanding I/O error handling Cleanup: setting an outstanding error on a mapping was open coded too many times. Factor it out in mapping_set_error(). Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:14:57 -07:00
Jeff Dike	f1adc05e77	uml: hostfs style fixes hostfs needed some style goodness. Signed-off-by: Jeff Dike <jdike@linux.intel.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:14:57 -07:00
Alberto Bertogli	5822b7faca	uml: make hostfs_setattr() support operations on unlinked open files This patch allows hostfs_setattr() to work on unlinked open files by calling set_attr() (the userspace part) with the inode's fd. Without this, applications that depend on doing attribute changes to unlinked open files will fail. It works by using the fd versions instead of the path ones (for example fchmod() instead of chmod(), fchown() instead of chown()) when an fd is available. Signed-off-by: Alberto Bertogli <albertito@gmail.com> Signed-off-by: Jeff Dike <jdike@linux.intel.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:14:57 -07:00
Dmitriy Monakhov	0ceb331433	mm: move common segment checks to separate helper function [akpm@linux-foundation.org: cleanup] Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org> Cc: Christoph Hellwig <hch@lst.de> Acked-by: Anton Altaparmakov <aia21@cam.ac.uk> Acked-by: David Chinner <dgc@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 11:14:57 -07:00
Jens Axboe	86aa5ac53e	[PATCH] splice: always call into page_cache_readahead() Don't try to guess what the read-ahead logic will do, allow it to make its own decisions. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-05-08 08:46:19 +02:00
Fengguang Wu	9ae9d68cbf	[PATCH] splice(): fix interaction with readahead Eric Dumazet, thank you for disclosing this bug. Readahead logic somehow fails to populate the page range with data. It can be because 1) the readahead routine is not always called in the following lines of fs/splice.c: if (!loff \|\| nr_pages > 1) page_cache_readahead(mapping, &in->f_ra, in, index, nr_pages); 2) even called, page_cache_readahead() wont guarantee the pages are there. It wont submit readahead I/O for pages already in the radix tree, or when (ra_pages == 0), or after 256 cache hits. In your case, it should be because of the retried reads, which lead to excessive cache hits, and disables readahead at some time. And that _one_ failure of readahead blocks the whole read process. The application receives EAGAIN and retries the read, but __generic_file_splice_read() refuse to make progress: - in the previous invocation, it has allocated a blank page and inserted it into the radix tree, but never has the chance to start I/O for it: the test of SPLICE_F_NONBLOCK goes before that. - in the retried invocation, the readahead code will neither get out of the cache hit mode, nor will it submit I/O for an already existing page. Cc: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-05-08 08:44:36 +02:00
Lachlan McIlroy	f7c66ce3f7	[XFS] Add lockdep support for XFS SGI-PV: 963965 SGI-Modid: xfs-linux-melb:xfs-kern:28485a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>	2007-05-08 13:50:19 +10:00
Lachlan McIlroy	71dfd5a396	[XFS] Fix race in xfs_write() b/w dmapi callout and direct I/O checks. In xfs_write() the iolock is dropped and reacquired in XFS_SEND_DATA() which means that the file could change from not-cached to cached and we need to redo the direct I/O checks. We should also redo the direct I/O checks when the file size changes regardless if O_APPEND is set or not. SGI-PV: 963483 SGI-Modid: xfs-linux-melb:xfs-kern:28440a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>	2007-05-08 13:50:12 +10:00
Utako Kusaka	3a02ee1828	[XFS] Get rid of redundant "required" in msg. SGI-PV: 963466 SGI-Modid: xfs-linux-melb:xfs-kern:28416a Signed-off-by: Utako Kusaka <utako@tnes.nec.co.jp> Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org>	2007-05-08 13:50:06 +10:00
Tim Shimmin	e6a0e9cdff	[XFS] Export via a function xfs_buftarg_list for use by kdb/xfsidbg. SGI-PV: 963465 SGI-Modid: xfs-linux-melb:xfs-kern:28414a Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2007-05-08 13:49:59 +10:00
Tim Shimmin	f10bb2dad0	[XFS] Remove unused ilen variable and references. SGI-PV: 907752 SGI-Modid: xfs-linux-melb:xfs-kern:28344a Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>	2007-05-08 13:49:53 +10:00
Lachlan McIlroy	ba87ea699e	[XFS] Fix to prevent the notorious 'NULL files' problem after a crash. The problem that has been addressed is that of synchronising updates of the file size with writes that extend a file. Without the fix the update of a file's size, as a result of a write beyond eof, is independent of when the cached data is flushed to disk. Often the file size update would be written to the filesystem log before the data is flushed to disk. When a system crashes between these two events and the filesystem log is replayed on mount the file's size will be set but since the contents never made it to disk the file is full of holes. If some of the cached data was flushed to disk then it may just be a section of the file at the end that has holes. There are existing fixes to help alleviate this problem, particularly in the case where a file has been truncated, that force cached data to be flushed to disk when the file is closed. If the system crashes while the file(s) are still open then this flushing will never occur. The fix that we have implemented is to introduce a second file size, called the in-memory file size, that represents the current file size as viewed by the user. The existing file size, called the on-disk file size, is the one that get's written to the filesystem log and we only update it when it is safe to do so. When we write to a file beyond eof we only update the in- memory file size in the write operation. Later when the I/O operation, that flushes the cached data to disk completes, an I/O completion routine will update the on-disk file size. The on-disk file size will be updated to the maximum offset of the I/O or to the value of the in-memory file size if the I/O includes eof. SGI-PV: 958522 SGI-Modid: xfs-linux-melb:xfs-kern:28322a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>	2007-05-08 13:49:46 +10:00
Lachlan McIlroy	2a32963130	[XFS] Fix race condition in xfs_write(). This change addresses a race in xfs_write() where, for direct I/O, the flags need_i_mutex and need_flush are setup before the iolock is acquired. The logic used to setup the flags may change between setting the flags and acquiring the iolock resulting in these flags having incorrect values. For example, if a file is not currently cached then need_i_mutex is set to zero and then if the file is cached before the iolock is acquired we will fail to do the flushinval before the direct write. The flush (and also the call to xfs_zero_eof()) need to be done with the iolock held exclusive so we need to acquire the iolock before checking for cached data (or if the write begins after eof) to prevent this state from changing. For direct I/O I've chosen to always acquire the iolock in shared mode initially and if there is a need to promote it then drop it and reacquire it. There's also some other tidy-ups including removing the O_APPEND offset adjustment since that work is done in generic_write_checks() (and we don't use offset as an input parameter anywhere). SGI-PV: 962170 SGI-Modid: xfs-linux-melb:xfs-kern:28319a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>	2007-05-08 13:49:39 +10:00
Kouta Ooizumi	e6d29426bc	[XFS] Fix uquota and oquota enforcement problems. When uquota and oquota (gquota/pquota) are enabled for accounting both are enforced if ether has enforcement active. Conditions: - Both XFS_UQUOTA_ACCT and XFS_GQUOTA_ACCT are enabled. - Either XFS_UQUOTA_ENFD or XFS_OQUOTA_ENFD is enabled. - The usage without enforce is reached at the soft limit. Problems: 1. "repquota" shows all grace time even if no enforcement. 2. we cannot make a file over a hard limits even if no enforcement. SGI-PV: 962291 SGI-Modid: xfs-linux-melb:xfs-kern:28272a Signed-off-by: Kouta Ooizumi <k-ooizumi@tnes.nec.co.jp> Signed-off-by: Donald Douwsma <donaldd@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>	2007-05-08 13:49:33 +10:00
Lachlan McIlroy	d3cf209476	[XFS] propogate return codes from flush routines This patch handles error return values in fs_flush_pages and fs_flushinval_pages. It changes the prototype of fs_flushinval_pages so we can propogate the errors and handle them at higher layers. I also modified xfs_itruncate_start so that it could propogate the error further. SGI-PV: 961990 SGI-Modid: xfs-linux-melb:xfs-kern:28231a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Stewart Smith <stewart@flamingspork.com> Signed-off-by: Tim Shimmin <tes@sgi.com>	2007-05-08 13:49:27 +10:00
Donald Douwsma	424ea91ba6	[XFS] Fix quotaon syscall failures for group enforcement requests. xfs_qm_scall_quotaon was incorrectly failing requests to enable group quota enforcement. Fixes logic error in OQUOTA handling. SGI-PV: 961964 SGI-Modid: xfs-linux-melb:xfs-kern:28227a Signed-off-by: Donald Douwsma <donaldd@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>	2007-05-08 13:49:15 +10:00
Donald Douwsma	646d5bdab3	[XFS] Invalidate quotacheck when mounting without a quota type. When quotas are mounted or remounted without a particular quota type the quota accounting for that type becomes invalid. Previously we were ignoring this leading to accounting errors. SGI-PV: 961964 SGI-Modid: xfs-linux-melb:xfs-kern:28225a Signed-off-by: Donald Douwsma <donaldd@sgi.com> Signed-off-by: Utako Kusaka <utako@tnes.nec.co.jp> Signed-off-by: Vlad Apostolov <vapo@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>	2007-05-08 13:49:09 +10:00
Joe Perches	e7a23a9b37	[XFS] reducing the number of random number functions. Patch provided by Joe Perches SGI-PV: 961696 SGI-Modid: xfs-linux-melb:xfs-kern:28209a Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>	2007-05-08 13:49:03 +10:00
Eric Sandeen	e9ed9d2240	[XFS] remove more misc. unused args Patch provided by Eric Sandeen. SGI-PV: 961695 SGI-Modid: xfs-linux-melb:xfs-kern:28205a Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>	2007-05-08 13:48:56 +10:00
Eric Sandeen	ef497f8a1e	[XFS] the "aendp" arg to xfs_dir2_data_freescan is always NULL, remove it. Patch provided by Eric Sandeen. SGI-PV: 961694 SGI-Modid: xfs-linux-melb:xfs-kern:28204a Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>	2007-05-08 13:48:49 +10:00
Eric Sandeen	1c72bf9003	[XFS] The last argument "lsn" of xfs_trans_commit() is always called with NULL. Patch provided by Eric Sandeen. SGI-PV: 961693 SGI-Modid: xfs-linux-melb:xfs-kern:28199a Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>	2007-05-08 13:48:42 +10:00
David Woodhouse	1c97964520	[JFFS2] Simplify and clean up jffs2_add_tn_to_tree() some more. Fixing at least a couple more bugs in the process. Signed-off-by: David Woodhouse <dwmw2@infradead.org>	2007-05-08 00:19:54 +01:00
Linus Torvalds	2d56d3c43c	Merge branch 'server-cluster-locking-api' of git://linux-nfs.org/~bfields/linux * 'server-cluster-locking-api' of git://linux-nfs.org/~bfields/linux: gfs2: nfs lock support for gfs2 lockd: add code to handle deferred lock requests lockd: always preallocate block in nlmsvc_lock() lockd: handle test_lock deferrals lockd: pass cookie in nlmsvc_testlock lockd: handle fl_grant callbacks lockd: save lock state on deferral locks: add fl_grant callback for asynchronous lock return nfsd4: Convert NFSv4 to new lock interface locks: add lock cancel command locks: allow {vfs,posix}_lock_file to return conflicting lock locks: factor out generic/filesystem switch from setlock code locks: factor out generic/filesystem switch from test_lock locks: give posix_test_lock same interface as ->lock locks: make ->lock release private data before returning in GETLK case locks: create posix-to-flock helper functions locks: trivial removal of unnecessary parentheses	2007-05-07 12:34:24 -07:00
Linus Torvalds	5cefcab3db	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (34 commits) [GFS2] Uncomment sprintf_symbol calling code [DLM] lowcomms style [GFS2] printk warning fixes [GFS2] Patch to fix mmap of stuffed files [GFS2] use lib/parser for parsing mount options [DLM] Lowcomms nodeid range & initialisation fixes [DLM] Fix dlm_lowcoms_stop hang [DLM] fix mode munging [GFS2] lockdump improvements [GFS2] Patch to detect corrupt number of dir entries in leaf and/or inode blocks [GFS2] bz 236008: Kernel gpf doing cat /debugfs/gfs2/xxx (lock dump) [DLM] fs/dlm/ast.c should #include "ast.h" [DLM] Consolidate transport protocols [DLM] Remove redundant assignment [GFS2] Fix bz 234168 (ignoring rgrp flags) [DLM] change lkid format [DLM] interface for purge (2/2) [DLM] add orphan purging code (1/2) [DLM] split create_message function [GFS2] Set drop_count to 0 (off) by default ...	2007-05-07 12:26:27 -07:00
Bryan Wu	1394f03221	blackfin architecture This adds support for the Analog Devices Blackfin processor architecture, and currently supports the BF533, BF532, BF531, BF537, BF536, BF534, and BF561 (Dual Core) devices, with a variety of development platforms including those avaliable from Analog Devices (BF533-EZKit, BF533-STAMP, BF537-STAMP, BF561-EZKIT), and Bluetechnix! Tinyboards. The Blackfin architecture was jointly developed by Intel and Analog Devices Inc. (ADI) as the Micro Signal Architecture (MSA) core and introduced it in December of 2000. Since then ADI has put this core into its Blackfin processor family of devices. The Blackfin core has the advantages of a clean, orthogonal,RISC-like microprocessor instruction set. It combines a dual-MAC (Multiply/Accumulate), state-of-the-art signal processing engine and single-instruction, multiple-data (SIMD) multimedia capabilities into a single instruction-set architecture. The Blackfin architecture, including the instruction set, is described by the ADSP-BF53x/BF56x Blackfin Processor Programming Reference http://blackfin.uclinux.org/gf/download/frsrelease/29/2549/Blackfin_PRM.pdf The Blackfin processor is already supported by major releases of gcc, and there are binary and source rpms/tarballs for many architectures at: http://blackfin.uclinux.org/gf/project/toolchain/frs There is complete documentation, including "getting started" guides available at: http://docs.blackfin.uclinux.org/ which provides links to the sources and patches you will need in order to set up a cross-compiling environment for bfin-linux-uclibc This patch, as well as the other patches (toolchain, distribution, uClibc) are actively supported by Analog Devices Inc, at: http://blackfin.uclinux.org/ We have tested this on LTP, and our test plan (including pass/fails) can be found at: http://docs.blackfin.uclinux.org/doku.php?id=testing_the_linux_kernel [m.kozlowski@tuxland.pl: balance parenthesis in blackfin header files] Signed-off-by: Bryan Wu <bryan.wu@analog.com> Signed-off-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl> Signed-off-by: Aubrey Li <aubrey.li@analog.com> Signed-off-by: Jie Zhang <jie.zhang@analog.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-07 12:12:58 -07:00
Akinobu Mita	5bc98594d5	hugetlbfs: add NULL check in hugetlb_zero_setup() If hugetlbfs module_init() fails, hugetlbfs_vfsmount is not initialized and shmget() with SHM_HUGETLB flag will cause NULL pointer dereference. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-07 12:12:57 -07:00
Christoph Lameter	50953fe9e0	slab allocators: Remove SLAB_DEBUG_INITIAL flag I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by SLAB. I think its purpose was to have a callback after an object has been freed to verify that the state is the constructor state again? The callback is performed before each freeing of an object. I would think that it is much easier to check the object state manually before the free. That also places the check near the code object manipulation of the object. Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was compiled with SLAB debugging on. If there would be code in a constructor handling SLAB_DEBUG_INITIAL then it would have to be conditional on SLAB_DEBUG otherwise it would just be dead code. But there is no such code in the kernel. I think SLUB_DEBUG_INITIAL is too problematic to make real use of, difficult to understand and there are easier ways to accomplish the same effect (i.e. add debug code before kfree). There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be clear in fs inode caches. Remove the pointless checks (they would even be pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors. This is the last slab flag that SLUB did not support. Remove the check for unimplemented flags from SLUB. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-07 12:12:57 -07:00
Benjamin Herrenschmidt	036e08568c	get_unmapped_area handles MAP_FIXED in hugetlbfs Generic hugetlb_get_unmapped_area() now handles MAP_FIXED by just calling prepare_hugepage_range() Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: William Irwin <bill.irwin@oracle.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Andi Kleen <ak@suse.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Grant Grundler <grundler@parisc-linux.org> Cc: Matthew Wilcox <willy@debian.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-07 12:12:57 -07:00
Christoph Lameter	0a31bd5f2b	KMEM_CACHE(): simplify slab cache creation This patch provides a new macro KMEM_CACHE(<struct>, <flags>) to simplify slab creation. KMEM_CACHE creates a slab with the name of the struct, with the size of the struct and with the alignment of the struct. Additional slab flags may be specified if necessary. Example struct test_slab { int a,b,c; struct list_head; } __cacheline_aligned_in_smp; test_slab_cache = KMEM_CACHE(test_slab, SLAB_PANIC) will create a new slab named "test_slab" of the size sizeof(struct test_slab) and aligned to the alignment of test slab. If it fails then we panic. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-07 12:12:55 -07:00
Peter Zijlstra	96018fdacb	mm: optimize acorn partition truncate invalidate_bdev() is superfluous when truncate_inode_pages() is also called. do call invalidate_bh_lrus() though, to avoid stale pointers. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-07 12:12:55 -07:00
Peter Zijlstra	f9a14399ae	mm: optimize kill_bdev() Remove duplicate work in kill_bdev(). It currently invalidates and then truncates the bdev's mapping. invalidate_mapping_pages() will opportunistically remove pages from the mapping. And truncate_inode_pages() will forcefully remove all pages. The only thing truncate doesn't do is flush the bh lrus. So do that explicitly. This avoids (very unlikely) but possible invalid lookup results if the same bdev is quickly re-issued. It also will prevent extreme kernel latencies which are observed when blockdevs which have a large amount of pagecache are unmounted, by avoiding invalidate_mapping_pages() on that path. invalidate_mapping_pages() has no cond_resched (it can be called under spinlock), whereas truncate_inode_pages() has one. [akpm@linux-foundation.org: restore nrpages==0 optimisation] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-07 12:12:55 -07:00
Peter Zijlstra	f98393a64c	mm: remove destroy_dirty_buffers from invalidate_bdev() Remove the destroy_dirty_buffers argument from invalidate_bdev(), it hasn't been used in 6 years (so akpm says). find * -name \.[ch] \| xargs grep -l invalidate_bdev \| while read file; do quilt add $file; sed -ie 's/invalidate_bdev($[^,]$,[^)]*)/invalidate_bdev(\1)/g' $file; done Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-07 12:12:55 -07:00
Christoph Lameter	d85f33855c	Make page->private usable in compound pages If we add a new flag so that we can distinguish between the first page and the tail pages then we can avoid to use page->private in the first page. page->private == page for the first page, so there is no real information in there. Freeing up page->private makes the use of compound pages more transparent. They become more usable like real pages. Right now we have to be careful f.e. if we are going beyond PAGE_SIZE allocations in the slab on i386 because we can then no longer use the private field. This is one of the issues that cause us not to support debugging for page size slabs in SLAB. Having page->private available for SLUB would allow more meta information in the page struct. I can probably avoid the 16 bit ints that I have in there right now. Also if page->private is available then a compound page may be equipped with buffer heads. This may free up the way for filesystems to support larger blocks than page size. We add PageTail as an alias of PageReclaim. Compound pages cannot currently be reclaimed. Because of the alias one needs to check PageCompound first. The RFC for the this approach was discussed at http://marc.info/?t=117574302800001&r=1&w=2 [nacc@us.ibm.com: fix hugetlbfs] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-07 12:12:53 -07:00
David Rientjes	b813e931b4	smaps: add clear_refs file to clear reference Adds /proc/pid/clear_refs. When any non-zero number is written to this file, pte_mkold() and ClearPageReferenced() is called for each pte and its corresponding page, respectively, in that task's VMAs. This file is only writable by the user who owns the task. It is now possible to measure _approximately_ how much memory a task is using by clearing the reference bits with echo 1 > /proc/pid/clear_refs and checking the reference count for each VMA from the /proc/pid/smaps output at a measured time interval. For example, to observe the approximate change in memory footprint for a task, write a script that clears the references (echo 1 > /proc/pid/clear_refs), sleeps, and then greps for Pgs_Referenced and extracts the size in kB. Add the sizes for each VMA together for the total referenced footprint. Moments later, repeat the process and observe the difference. For example, using an efficient Mozilla: accumulated time referenced memory ---------------- ----------------- 0 s 408 kB 1 s 408 kB 2 s 556 kB 3 s 1028 kB 4 s 872 kB 5 s 1956 kB 6 s 416 kB 7 s 1560 kB 8 s 2336 kB 9 s 1044 kB 10 s 416 kB This is a valuable tool to get an approximate measurement of the memory footprint for a task. Cc: Hugh Dickins <hugh@veritas.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> [akpm@linux-foundation.org: build fixes] [mpm@selenic.com: rename for_each_pmd] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-07 12:12:52 -07:00
David Rientjes	f79f177c25	smaps: add pages referenced count to smaps Adds an additional unsigned long field to struct mem_size_stats called 'referenced'. For each pte walked in the smaps code, this field is incremented by PAGE_SIZE if it has pte-reference bits. An additional line was added to the /proc/pid/smaps output for each VMA to indicate how many pages within it are currently marked as referenced or accessed. Cc: Hugh Dickins <hugh@veritas.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-07 12:12:52 -07:00
David Rientjes	826fad1b93	smaps: extract pmd walker from smaps code Extracts the pmd walker from smaps-specific code in fs/proc/task_mmu.c. The new struct pmd_walker includes the struct vm_area_struct of the memory to walk over. Iteration begins at the vma->vm_start and completes at vma->vm_end. A pointer to another data structure may be stored in the private field such as struct mem_size_stats, which acts as the smaps accumulator. For each pmd in the VMA, the action function is called with a pointer to its struct vm_area_struct, a pointer to the pmd_t, its start and end addresses, and the private field. The interface for walking pmd's in a VMA for fs/proc/task_mmu.c is now: void for_each_pmd(struct vm_area_struct vma, void (action)(struct vm_area_struct vma, pmd_t pmd, unsigned long addr, unsigned long end, void private), void private); Since the pmd walker is now extracted from the smaps code, smaps_one_pmd() is invoked for each pmd in the VMA. Its behavior and efficiency is identical to the existing implementation. Cc: Hugh Dickins <hugh@veritas.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-07 12:12:52 -07:00
Adrian Bunk	ac267728f1	mm/slab.c: proper prototypes Add proper prototypes in include/linux/slab.h. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-07 12:12:52 -07:00
Nick Piggin	3d67f2d7c0	fs: buffer don't PageUptodate without page locked __block_write_full_page is calling SetPageUptodate without the page locked. This is unusual, but not incorrect, as PG_writeback is still set. However the next patch will require that SetPageUptodate always be called with the page locked. Simply don't bother setting the page uptodate in this case (it is unusual that the write path does such a thing anyway). Instead just leave it to the read side to bring the page uptodate when it notices that all buffers are uptodate. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-07 12:12:51 -07:00
Nick Piggin	6fe6900e1e	mm: make read_cache_page synchronous Ensure pages are uptodate after returning from read_cache_page, which allows us to cut out most of the filesystem-internal PageUptodate calls. I didn't have a great look down the call chains, but this appears to fixes 7 possible use-before uptodate in hfs, 2 in hfsplus, 1 in jfs, a few in ecryptfs, 1 in jffs2, and a possible cleared data overwritten with readpage in block2mtd. All depending on whether the filler is async and/or can return with a !uptodate page. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-07 12:12:51 -07:00
Adrian Bunk	d2ba27e800	proper prototype for hugetlb_get_unmapped_area() Add a proper prototype for hugetlb_get_unmapped_area() in include/linux/hugetlb.h. Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-07 12:12:51 -07:00
David Woodhouse	fcf3cafb3e	[JFFS2] Remove another bogus optimisation in jffs2_add_tn_to_tree() We attempted to insert new nodes into the tree by just using rb_replace_node to let them replace an earlier node which they completely overlapped. However, that could place the new node into the wrong place in the tree, since its start could be node only before the start of the victim, but before the node _before_ the victim in the tree (if that previous node actually ends _after_ the new node, thus isn't entirely overlapped and wasn't itself chosen to be the victim). Signed-off-by: David Woodhouse <dwmw2@infradead.org>	2007-05-07 13:16:13 +01:00
Marc Eshel	586759f03e	gfs2: nfs lock support for gfs2 Add NFS lock support to GFS2. Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Acked-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-06 20:38:50 -04:00
Marc Eshel	1a8322b2b0	lockd: add code to handle deferred lock requests Rewrite nlmsvc_lock() to use the asynchronous interface. As with testlock, we answer nlm requests in nlmsvc_lock by first looking up the block and then using the results we find in the block if B_QUEUED is set, and calling vfs_lock_file() otherwise. If this a new lock request and we get -EINPROGRESS return on a non-blocking request then we defer the request. Also modify nlmsvc_unlock() to call the filesystem method if appropriate. Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2007-05-06 20:38:50 -04:00
Marc Eshel	f812048020	lockd: always preallocate block in nlmsvc_lock() Normally we could skip ever having to allocate a block in the case where the client asks for a non-blocking lock, or asks for a blocking lock that succeeds immediately. However we're going to want to always look up a block first in order to check whether we're revisiting a deferred lock call, and to be prepared to handle the case where the filesystem returns -EINPROGRESS--in that case we want to make sure the lock we've given the filesystem is the one embedded in the block that we'll use to track the deferred request. Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2007-05-06 20:38:50 -04:00
Marc Eshel	5ea0d75037	lockd: handle test_lock deferrals Rewrite nlmsvc_testlock() to use the new asynchronous interface: instead of immediately doing a posix_test_lock(), we first look for a matching block. If the subsequent test_lock returns anything other than -EINPROGRESS, we then remove the block we've found and return the results. If it returns -EINPROGRESS, then we defer the lock request. In the case where the block we find in the first step has B_QUEUED set, we bypass the vfs_test_lock entirely, instead using the block to decide how to respond: with nlm_lck_denied if B_TIMED_OUT is set. with nlm_granted if B_GOT_CALLBACK is set. by dropping if neither B_TIMED_OUT nor B_GOT_CALLBACK is set Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2007-05-06 20:38:50 -04:00
Marc Eshel	85f3f1b3f7	lockd: pass cookie in nlmsvc_testlock Change NLM internal interface to pass more information for test lock; we need this to make sure the cookie information is pushed down to the place where we do request deferral, which is handled for testlock by the following patch. Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2007-05-06 20:38:50 -04:00
Marc Eshel	0e4ac9d935	lockd: handle fl_grant callbacks Add code to handle file system callback when the lock is finally granted. Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2007-05-06 20:38:50 -04:00
Marc Eshel	2b36f412ab	lockd: save lock state on deferral We need to keep some state for a pending asynchronous lock request, so this patch adds that state to struct nlm_block. This also adds a function which defers the request, by calling rqstp->rq_chandle.defer and storing the resulting deferred request in a nlm_block structure which we insert into lockd's global block list. That new function isn't called yet, so it's dead code until a later patch. Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2007-05-06 20:38:50 -04:00
Marc Eshel	2beb6614f5	locks: add fl_grant callback for asynchronous lock return Acquiring a lock on a cluster filesystem may require communication with remote hosts, and to avoid blocking lockd or nfsd threads during such communication, we allow the results to be returned asynchronously. When a ->lock() call needs to block, the file system will return -EINPROGRESS, and then later return the results with a call to the routine in the fl_grant field of the lock_manager_operations struct. This differs from the case when ->lock returns -EAGAIN to a blocking lock request; in that case, the filesystem calls fl_notify when the lock is granted, and the caller retries the original lock. So while fl_notify is merely a hint to the caller that it should retry, fl_grant actually communicates the final result of the lock operation (with the lock already acquired in the succesful case). Therefore fl_grant takes a lock, a status and, for the test lock case, a conflicting lock. We also allow fl_grant to return an error to the filesystem, to handle the case where the fl_grant requests arrives after the lock manager has already given up waiting for it. Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2007-05-06 20:38:49 -04:00
Marc Eshel	fd85b8170d	nfsd4: Convert NFSv4 to new lock interface Convert NFSv4 to the new lock interface. We don't define any callback for now, so we're not taking advantage of the asynchronous feature--that's less critical for the multi-threaded nfsd then it is for the single-threaded lockd. But this does allow a cluster filesystems to export cluster-coherent locking to NFS. Note that it's cluster filesystems that are the issue--of the filesystems that define lock methods (nfs, cifs, etc.), most are not exportable by nfsd. Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2007-05-06 20:38:49 -04:00
Marc Eshel	9b9d2ab415	locks: add lock cancel command Lock managers need to be able to cancel pending lock requests. In the case where the exported filesystem manages its own locks, it's not sufficient just to call posix_unblock_lock(); we need to let the filesystem know what's happening too. We do this by adding a new fcntl lock command: FL_CANCELLK. Some day this might also be made available to userspace applications that could benefit from an asynchronous locking api. Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>	2007-05-06 20:38:28 -04:00
Marc Eshel	150b393456	locks: allow {vfs,posix}_lock_file to return conflicting lock The nfsv4 protocol's lock operation, in the case of a conflict, returns information about the conflicting lock. It's unclear how clients can use this, so for now we're not going so far as to add a filesystem method that can return a conflicting lock, but we may as well return something in the local case when it's easy to. Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>	2007-05-06 19:23:24 -04:00
Marc Eshel	7723ec9777	locks: factor out generic/filesystem switch from setlock code Factor out the code that switches between generic and filesystem-specific lock methods; eventually we want to call this from lock managers (lockd and nfsd) too; currently they only call the generic methods. This patch does that for all the setlk code. Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>	2007-05-06 18:08:49 -04:00
J. Bruce Fields	3ee17abd14	locks: factor out generic/filesystem switch from test_lock Factor out the code that switches between generic and filesystem-specific lock methods; eventually we want to call this from lock managers (lockd and nfsd) too; currently they only call the generic methods. This patch does that for test_lock. Note that this hasn't been necessary until recently, because the few filesystems that define ->lock() (nfs, cifs...) aren't exportable via NFS. However GFS (and, in the future, other cluster filesystems) need to implement their own locking to get cluster-coherent locking, and also want to be able to export locking to NFS (lockd and NFSv4). So we accomplish this by factoring out code such as this and exporting it for the use of lockd and nfsd. Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>	2007-05-06 18:06:44 -04:00
Marc Eshel	9d6a8c5c21	locks: give posix_test_lock same interface as ->lock posix_test_lock() and ->lock() do the same job but have gratuitously different interfaces. Modify posix_test_lock() so the two agree, simplifying some code in the process. Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>	2007-05-06 17:39:00 -04:00
J. Bruce Fields	70cc6487a4	locks: make ->lock release private data before returning in GETLK case The file_lock argument to ->lock is used to return the conflicting lock when found. There's no reason for the filesystem to return any private information with this conflicting lock, but nfsv4 is. Fix nfsv4 client, and modify locks.c to stop calling fl_release_private for it in this case. Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Cc: "Trond Myklebust" <Trond.Myklebust@netapp.com>"	2007-05-06 17:38:19 -04:00
David Woodhouse	96dd8d25d1	[JFFS2] Remove broken insert_point optimisation in jffs2_add_tn_to_tree() The original code would remember, during the first pass over the tree, a suitable place to start the insertion from when we eventually come to add a new node. The optimisation was broken, and we sometimes ended up inserting a new node in the wrong place because we started the insertion from the wrong point. Just ditch the optimisation and start the insertion from the root of the tree, for now. I'll try it again when I'm feeling cleverer. Signed-off-by: David Woodhouse <dwmw2@infradead.org>	2007-05-06 14:41:40 +01:00
Linus Torvalds	b7405e1643	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: [CIFS] Fix typo in cifs readme from previous commit [CIFS] Make sec=none force an anonymous mount [CIFS] Change semaphore to mutex for cifs lock_sem [CIFS] Fix oops in reset_cifs_unix_caps on reconnect [CIFS] UID/GID override on CIFS mounts to Samba [CIFS] prefixpath mounts to servers supporting posix paths used wrong slash [CIFS] Update cifs version to 1.49 [CIFS] Replace kmalloc/memset combination with kzalloc [CIFS] Add IPv6 support [CIFS] New CIFS POSIX mkdir performance improvement (part 2) [CIFS] New CIFS POSIX mkdir performance improvement [CIFS] Add write perm for usr to file on windows should remove r/o dos attr [CIFS] Remove unnecessary parm to cifs_reopen_file [CIFS] Switch cifsd to kthread_run from kernel_thread [CIFS] Remove unnecessary checks	2007-05-05 15:30:53 -07:00
Steve French	0ec54aa8af	[CIFS] Fix typo in cifs readme from previous commit Signed-off-by: Steve French <sfrench@us.ibm.com>	2007-05-05 22:08:06 +00:00
Linus Torvalds	ea62ccd00f	Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6 * 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (231 commits) [PATCH] i386: Don't delete cpu_devs data to identify different x86 types in late_initcall [PATCH] i386: type may be unused [PATCH] i386: Some additional chipset register values validation. [PATCH] i386: Add missing !X86_PAE dependincy to the 2G/2G split. [PATCH] x86-64: Don't exclude asm-offsets.c in Documentation/dontdiff [PATCH] i386: avoid redundant preempt_disable in __unlazy_fpu [PATCH] i386: white space fixes in i387.h [PATCH] i386: Drop noisy e820 debugging printks [PATCH] x86-64: Fix allnoconfig error in genapic_flat.c [PATCH] x86-64: Shut up warnings for vfat compat ioctls on other file systems [PATCH] x86-64: Share identical video.S between i386 and x86-64 [PATCH] x86-64: Remove CONFIG_REORDER [PATCH] x86-64: Print type and size correctly for unknown compat ioctls [PATCH] i386: Remove copy_*_user BUG_ONs for (size < 0) [PATCH] i386: Little cleanups in smpboot.c [PATCH] x86-64: Don't enable NUMA for a single node in K8 NUMA scanning [PATCH] x86: Use RDTSCP for synchronous get_cycles if possible [PATCH] i386: Add X86_FEATURE_RDTSCP [PATCH] i386: Implement X86_FEATURE_SYNC_RDTSC on i386 [PATCH] i386: Implement alternative_io for i386 ... Fix up trivial conflict in include/linux/highmem.h manually. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-05 14:55:20 -07:00
Dave Kleikamp	05ec9e26be	JFS: Fix race waking up jfsIO kernel thread It's possible for a journal I/O request to be added to the log_redrive queue and the jfsIO thread to be awakened after the thread releases log_redrive_lock but before it sets its state to TASK_INTERRUPTIBLE. The jfsIO thread should set the state before giving up the spinlock, so the waking thread will really wake it. Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>	2007-05-05 14:24:05 -05:00
David Woodhouse	1123e2a859	[JFFS2] Remember to calculate overlap on nodes which replace older nodes This fixes a problem Artem found with the integck test tool -- we weren't correctly keeping track of the 'overlap' flag in some cases, which led to the nodes being played back in an incorrect order and file corruption. Signed-off-by: David Woodhouse <dwmw2@infradead.org>	2007-05-05 16:29:34 +01:00
David Woodhouse	3fddb6c985	[JFFS2] Don't advance c->wbuf_ofs to next eraseblock after wbuf flush After flushing the last page of an eraseblock, don't leave the wbuf 'offset' field pointing at the start of the next physical eraseblock. This was causing a BUG() on NOR-ECC (Sibley) flash, where we start writing a little further in, after the cleanmarker. Debugged by Alexander Belyakov <abelyako@googlemail.com> Signed-off-by: David Woodhouse <dwmw2@infradead.org>	2007-05-05 09:52:49 +01:00
Linus Torvalds	fa24aa561a	Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: ocfs2: Force use of GFP_NOFS in ocfs2_write() ocfs2: fix sparse warnings in fs/ocfs2/cluster ocfs2: fix sparse warnings in fs/ocfs2/dlm ocfs2: fix sparse warnings in fs/ocfs2 [PATCH] Copy i_flags to ocfs2 inode flags on write [PATCH] ocfs2: use __set_current_state() ocfs2: Wrap access of directory allocations with ip_alloc_sem. [PATCH] fs/ocfs2/: make 3 functions static ocfs2: Implement compat_ioctl()	2007-05-04 20:44:54 -07:00
Jeff Layton	8426c39c12	[CIFS] Make sec=none force an anonymous mount We had a customer report that attempting to make CIFS mount with a null username (i.e. doing an anonymous mount) doesn't work. Looking through the code, it looks like CIFS expects a NULL username from userspace in order to trigger an anonymous mount. The mount.cifs code doesn't seem to ever pass a null username to the kernel, however. It looks also like the kernel can take a sec=none option, but it only seems to look at it if the username is already NULL. This seems redundant and effectively makes sec=none useless. The following patch makes sec=none force an anonymous mount. Signed-off-by: Steve French <sfrench@us.ibm.com>	2007-05-05 03:27:49 +00:00
Linus Torvalds	4d4700707c	Merge git://git.linux-nfs.org/pub/linux/nfs-2.6 * git://git.linux-nfs.org/pub/linux/nfs-2.6: (28 commits) NFS: Fix a compile glitch on 64-bit systems NFS: Clean up nfs_create_request comments spkm3: initialize hash spkm3: remove bad kfree, unnecessary export spkm3: fix spkm3's use of hmac NFS4: invalidate cached acl on setacl NFS: Fix directory caching problem - with test case and patch. NFS: Set meaningful value for fattr->time_start in readdirplus results. NFS: Added support to turn off the NFSv3 READDIRPLUS RPC. SUNRPC: RPC client should retry with different versions of rpcbind SUNRPC: remove old portmapper NFS: switch NFSROOT to use new rpcbind client SUNRPC: switch the RPC server to use the new rpcbind registration API SUNRPC: switch socket-based RPC transports to use rpcbind SUNRPC: introduce rpcbind: replacement for in-kernel portmapper SUNRPC: Eliminate side effects from rpc_malloc SUNRPC: RPC buffer size estimates are too large NLM: Shrink the maximum request size of NLM4 requests NFS: Use pgoff_t in structures and functions that pass page cache offsets NFS: Clean up nfs_sync_mapping_wait() ...	2007-05-04 19:55:11 -07:00
Linus Torvalds	7e20ef030d	Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 * master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (49 commits) [SCTP]: Set assoc_id correctly during INIT collision. [SCTP]: Re-order SCTP initializations to avoid race with sctp_rcv() [SCTP]: Fix the SO_REUSEADDR handling to be similar to TCP. [SCTP]: Verify all destination ports in sctp_connectx. [XFRM] SPD info TLV aggregation [XFRM] SAD info TLV aggregationx [AF_RXRPC]: Sort out MTU handling. [AF_IUCV/IUCV] : Add missing section annotations [AF_IUCV]: Implementation of a skb backlog queue [NETLINK]: Remove bogus BUG_ON [IPV6]: Some cleanups in include/net/ipv6.h [TCP]: zero out rx_opt in tcp_disconnect() [BNX2]: Fix TSO problem with small MSS. [NET]: Rework dev_base via list_head (v3) [TCP] Highspeed: Limited slow-start is nowadays in tcp_slow_start [BNX2]: Update version and reldate. [BNX2]: Print bus information for PCIE devices. [BNX2]: Add 1-shot MSI handler for 5709. [BNX2]: Restructure PHY event handling. [BNX2]: Add indirect spinlock. ...	2007-05-04 19:36:58 -07:00
Trond Myklebust	84dde76c4a	NFS: Fix a compile glitch on 64-bit systems fs/nfs/pagelist.c:226: error: conflicting types for 'nfs_pageio_init' include/linux/nfs_page.h:80: error: previous declaration of 'nfs_pageio_init' was here Thanks to Andrew for spotting this... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-05-04 14:44:06 -04:00
Pavel Emelianov	7562f876cd	[NET]: Rework dev_base via list_head (v3) Cleanup of dev_base list use, with the aim to simplify making device list per-namespace. In almost every occasion, use of dev_base variable and dev->next pointer could be easily replaced by for_each_netdev loop. A few most complicated places were converted to using first_netdev()/next_netdev(). Signed-off-by: Pavel Emelianov <xemul@openvz.org> Acked-by: Kirill Korotaev <dev@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-05-03 15:13:45 -07:00
David Howells	ec9c948546	[AFS]: Adjust the new netdevice scanning code Adjust the new netdevice scanning code provided by Patrick McHardy: (1) Restore the function banner comments that were dropped. (2) Rather than using an array size of 6 in some places and an array size of ETH_ALEN in others, pass a pointer instead and pass the array size through so that we can actually check it. (3) Do the buffer fill count check before checking the for_primary_ifa condition again. This permits us to skip that check should maxbufs be reached before we run out of interfaces. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-05-03 03:29:41 -07:00
Patrick McHardy	dc1f6bff6a	[AFS]: Replace rtnetlink client by direct dev_base walking Replace the large and complicated rtnetlink client by two simple functions for getting the MAC address for the first ethernet device and building a list of IPv4 addresses. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-05-03 03:28:49 -07:00
Patrick McHardy	5b35fad9d4	[AFS]: Fix memory leak in SRXAFSCB_GetCapabilities The interface array is not freed on exit. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-05-03 03:27:39 -07:00
David Howells	fbb3fcba72	[AFS]: Fix use of __exit functions from __init path Fix use of __exit functions from __init path. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-05-03 03:12:46 -07:00
David Howells	80c72fe415	[AFS/AF_RXRPC]: Miscellaneous fixes. Make miscellaneous fixes to AFS and AF_RXRPC: () Make AF_RXRPC select KEYS rather than RXKAD or AFS_FS in Kconfig. () Don't use FS_BINARY_MOUNTDATA. () Remove a done 'TODO' item in a comemnt on afs_get_sb(). () Don't pass a void * as the page pointer argument of kmap_atomic() as this breaks on m68k. Patch from Geert Uytterhoeven <geert@linux-m68k.org>. () Use match_() functions rather than doing my own parsing. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-05-03 03:11:29 -07:00
Roland Dreier	796e5661f6	[CIFS] Change semaphore to mutex for cifs lock_sem Originally at http://lkml.org/lkml/2006/9/2/86 The recent change to "allow Windows blocking locks to be cancelled via a CANCEL_LOCK call" introduced a new semaphore in struct cifsFileInfo, lock_sem. However, semaphores used as mutexes are deprecated these days, and there's no reason to add a new one to the kernel. Therefore, convert lock_sem to a struct mutex (and also fix one indentation glitch on one of the lines changed anyway). Signed-off-by: Roland Dreier <roland@digitalvampire.org> Signed-off-by: Jan Engelhardt <jengelh@gmx.de> Signed-off-by: Steve French <sfrench@us.ibm.com>	2007-05-03 04:33:45 +00:00
Steve French	0b2365f826	[CIFS] Fix oops in reset_cifs_unix_caps on reconnect Signed-off-by: Steve French <sfrench@us.ibm.com>	2007-05-03 04:30:13 +00:00
Greg Kroah-Hartman	823bccfc40	remove "struct subsystem" as it is no longer needed We need to work on cleaning up the relationship between kobjects, ksets and ktypes. The removal of 'struct subsystem' is the first step of this, especially as it is not really needed at all. Thanks to Kay for fixing the bugs in this patch. Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2007-05-02 18:57:59 -07:00
Randy Dunlap	2609e7b9be	sysfs: printk format warning Fix sysfs printk format warning: fs/sysfs/bin.c:62: warning: format '%d' expects type 'int', but argument 4 has type 'size_t' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2007-05-02 18:57:59 -07:00
Mark Fasheh	9315f130e1	ocfs2: Force use of GFP_NOFS in ocfs2_write() We can otherwise recurse into the file system. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2007-05-02 15:08:34 -07:00
Mark Fasheh	5fdf1e6771	ocfs2: fix sparse warnings in fs/ocfs2/cluster Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2007-05-02 15:08:23 -07:00
Mark Fasheh	a7d25539fd	ocfs2: fix sparse warnings in fs/ocfs2/dlm Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2007-05-02 15:08:15 -07:00
Mark Fasheh	1ca1a111b1	ocfs2: fix sparse warnings in fs/ocfs2 None of these are actually harmful, but the noise makes looking for real problems difficult. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2007-05-02 15:08:08 -07:00
Jan Kara	6e4b0d5692	[PATCH] Copy i_flags to ocfs2 inode flags on write Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into ocfs2-specific ip_attr. Hence, when someone sets these flags via a different interface than ioctl, they are stored correctly. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2007-05-02 15:07:58 -07:00
Milind Arun Choudhary	5c2c9d383e	[PATCH] ocfs2: use __set_current_state() use __set_current_state(TASK_) instead of current->state = TASK_, in fs/ocfs2 Signed-off-by: Milind Arun Choudhary <milindchoudhary@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2007-05-02 15:07:50 -07:00
Joel Becker	ee19a77956	ocfs2: Wrap access of directory allocations with ip_alloc_sem. OCFS2_I(inode)->ip_alloc_sem is a read-write semaphore protecting local concurrent access of ocfs2 inodes. However, ocfs2 directories were not taking the semaphore while they accessed or modified the allocation tree. ocfs2_extend_dir() needs to take the semaphore in a write mode when it adds to the allocation. All other directory users get there via ocfs2_bread(), which takes the semaphore in read mode. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2007-05-02 15:07:42 -07:00
Adrian Bunk	6cb129f567	[PATCH] fs/ocfs2/: make 3 functions static This patch makes the following needlessly global functions static: - aops.c: ocfs2_write_data_page() - dlmglue.c: ocfs2_dump_meta_lvb_info() - file.c: ocfs2_set_inode_size() Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2007-05-02 15:07:27 -07:00
Mark Fasheh	586d232b19	ocfs2: Implement compat_ioctl() We need this to support 32 bit system calls on 64 bit kernels. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2007-05-02 15:07:16 -07:00
Andi Kleen	2724b6db66	[PATCH] x86-64: Shut up warnings for vfat compat ioctls on other file systems vfat implements compat handlers for these ioctls, but when they were executed on other file systems the kernel would still complain about an unknown compat ioctl. Just declare them as compatible and let them be rejected when not needed by the normal path. This makes wine runs a lot quieter Signed-off-by: Andi Kleen <ak@suse.de>	2007-05-02 19:27:21 +02:00
Andi Kleen	a106009bdf	[PATCH] x86-64: Print type and size correctly for unknown compat ioctls Signed-off-by: Andi Kleen <ak@suse.de>	2007-05-02 19:27:21 +02:00
Andi Kleen	9d016dd43b	[PATCH] x86-64: Shut up 32bit emulation for SIOCGIFCOUNT The kernel doesn't implement it, but some programs like java use it anyways. Shut the code up. Signed-off-by: Andi Kleen <ak@suse.de>	2007-05-02 19:27:20 +02:00
Andi Kleen	421f028100	[PATCH] x86-64: Define IGNORE_IOCTL() macro for compat_ioctls Define a new IGNORE_IOCTL() to let a compat ioctl not be warned about even when it is not implemented. This is the same as COMPATIBLE_IOCTL internally, but better self documentng. Valid reasons to use this: - It is implemented with ->compat_ioctl on some device, but programs call it on others too. - The ioctl is not implemented in the native kernel, but programs call it commonly anyways. Most other reasons are not valid. Signed-off-by: Andi Kleen <ak@suse.de>	2007-05-02 19:27:20 +02:00
Ian Campbell	79e030114a	[PATCH] i386: Allow i386 crash kernels to handle x86_64 dumps The specific case I am encountering is kdump under Xen with a 64 bit hypervisor and 32 bit kernel/userspace. The dump created is 64 bit due to the hypervisor but the dump kernel is 32 bit for maximum compatibility. It's possibly less likely to be useful in a purely native scenario but I see no reason to disallow it. [akpm@linux-foundation.org: build fix] Signed-off-by: Ian Campbell <ian.campbell@xensource.com> Signed-off-by: Andi Kleen <ak@suse.de> Acked-by: Vivek Goyal <vgoyal@in.ibm.com> Cc: Horms <horms@verge.net.au> Cc: Magnus Damm <magnus.damm@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2007-05-02 19:27:09 +02:00
Jason Uhlenkott	a19b89cad5	NFS: Clean up nfs_create_request comments Remove some stale comments about hard limits which went away in 2.5. Signed-off-by: Jason Uhlenkott <juhlenko@akamai.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-05-02 07:37:29 -07:00
J. Bruce Fields	08efa202eb	NFS4: invalidate cached acl on setacl The ACL that the server sets may not be exactly the one we set--for example, it may silently turn off bits that it does not support. So we should remove any cached ACL so that any subsequent request for the ACL will go to the server. Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-05-02 07:36:09 -07:00
David Woodhouse	7c96b7a146	[JFFS2] Remove dead file histo_mips.h Its contents were subsumed into compr_rubin.c in a previous commit, but I forgot to git-rm it. Signed-off-by: David Woodhouse <dwmw2@infradead.org>	2007-05-02 08:36:21 +01:00
Steven Whitehouse	37fde8ca6c	[GFS2] Uncomment sprintf_symbol calling code Now that the patch from -mm has gone upstream, we can uncomment the code in GFS2 which uses sprintf_symbol. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Robert Peterson <rpeterso@redhat.com>	2007-05-01 09:51:39 +01:00
David Teigland	617e82e10c	[DLM] lowcomms style Replace some printk with log_print, and fix some simple cases of lines over 80. Also, return -ENOTCONN if lowcomms_start fails due to no local IP address being available. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:51 +01:00
akpm@linux-foundation.org	f391a4ead6	[GFS2] printk warning fixes alpha: fs/gfs2/dir.c: In function 'gfs2_dir_read_leaf': fs/gfs2/dir.c:1322: warning: format '%llu' expects type 'long long unsigned int', but argument 3 has type 'sector_t' fs/gfs2/dir.c: In function 'gfs2_dir_read': fs/gfs2/dir.c:1455: warning: format '%llu' expects type 'long long unsigned int', but argument 3 has type '__u64' Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:48 +01:00
Steven Whitehouse	bf126aee6d	[GFS2] Patch to fix mmap of stuffed files If a stuffed file is mmaped and a page fault is generated at some offset above the initial page, we need to create a zero page to hang the buffer heads off before we can unstuff the file. This is a fix for bz #236087 Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:46 +01:00
Josef Bacik	476c006be0	[GFS2] use lib/parser for parsing mount options This patch converts the mount option parsing to use the kernels lib/parser stuff like all of the other filesystems. I tested this and it works well. Thank you, Signed-off-by: Josef Bacik <jwhiter@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:43 +01:00
Patrick Caulfield	30d3a2373f	[DLM] Lowcomms nodeid range & initialisation fixes Fix a few range & initialization bugs in lowcomms. - max_nodeid is really the highest nodeid encountered, so all loops must include it in their iterations. - clean dlm_local_count & connection_idr so we can do a clean restart. - Remove a spurious BUG_ON Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:41 +01:00
Josef Bacik	2439fe5072	[DLM] Fix dlm_lowcoms_stop hang When you attempt to release a lockspace in DLM, it will hang trying to down a semaphore that has already been downed. The attached patch fixes the problem. Signed-off-by: Josef Bacik <jwhiter@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Patrick Caulfield <pcaulfie@redhat.com>	2007-05-01 09:11:38 +01:00
David Teigland	7d3c1feb80	[DLM] fix mode munging There are flags to enable two specialized features in the dlm: 1. CONVDEADLK causes the dlm to resolve conversion deadlocks internally by changing the granted mode of locks to NL. 2. ALTPR/ALTCW cause the dlm to change the requested mode of locks to PR or CW to grant them if the normal requested mode can't be granted. GFS direct i/o exercises both of these features, especially when mixed with buffered i/o. The dlm has problems with them. The first problem is on the master node. If it demotes a lock as a part of converting it, the actual step of converting the lock isn't being done after the demotion, the lock is just left sitting on the granted queue with a granted mode of NL. I think the mistaken assumption was that the call to grant_pending_locks() would grant it, but that function naturally doesn't look at locks on the granted queue. The second problem is on the process node. If the master either demotes or gives an altmode, the munging of the gr/rq modes is never done in the process copy of the lock, leaving the master/process copies out of sync. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:36 +01:00
Robert Peterson	5f8820960c	[GFS2] lockdump improvements The patch below consists of the following changes (in code order): 1. I fixed a minor compiler warning regarding the printing of a kernel symbol address. 2. I implemented a suggestion from Dave Teigland that moves the debugfs information for gfs2 into a subdirectory so we can easily expand our use of debugfs in the future. The current code keeps the glock information in: /debug/gfs2/<fs> With the patch, the new code keeps the glock information in: /debug/gfs2/<fs>/glock That will allow us to create more debugfs files in the future. 3. This fixes a bug whereby a failed mount attempt causes the debugfs file to not be deleted. Failed mount attempts should always clean up after themselves, including deleting the debugfs file and/or directory. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:33 +01:00
Steven Whitehouse	bdd19a22f8	[GFS2] Patch to detect corrupt number of dir entries in leaf and/or inode blocks This patch detects when the number of entries in a leaf block or inode block (in the case of stuffed directories) is corrupt and informs the user. It prevents us from running off the end of the array thats been allocated for the sorting in this case, Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:30 +01:00
Robert Peterson	7a0079d9e3	[GFS2] bz 236008: Kernel gpf doing cat /debugfs/gfs2/xxx (lock dump) This is for Bugzilla Bug 236008: Kernel gpf doing cat /debugfs/gfs2/xxx (lock dump) seen at the "gfs2 summit". This also fixes the bug that caused garbage to be printed by the "initialized at" field. I apologize for the kludge, but that code will all be ripped out anyway when the official sprint_symbol function becomes available in the Linux kernel. I also changed some formatting so that spaces are replaced by proper tabs. Signed-off-by: Robert Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:28 +01:00
Adrian Bunk	8fa1de386f	[DLM] fs/dlm/ast.c should #include "ast.h" Every file should include the headers containing the prototypes for it's global functions. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:25 +01:00
Patrick Caulfield	6ed7257b46	[DLM] Consolidate transport protocols This patch consolidates the TCP & SCTP protocols for the DLM into a single file and makes it switchable at run-time (well, at least before the DLM actually starts up!) For RHEL5 this patch requires Neil Horman's patch that expands the in-kernel socket API but that has already been twice ACKed so it should be OK. The patch adds a new lowcomms.c file that replaces the existing lowcomms-sctp.c & lowcomms-tcp.c files. Signed-off-By: Patrick Caulfield <pcaulfie@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:23 +01:00
Patrick Caulfield	fc7c44f03d	[DLM] Remove redundant assignment This patch removes a redundant (and incorrect) assignment from compat_output Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:20 +01:00
Steven Whitehouse	a43a49066d	[GFS2] Fix bz 234168 (ignoring rgrp flags) Ths following patch makes GFS2 use the rgrp flags properly. Although there are also separate flags for both data and metadata as well, I've not implemented these as there seems little use for them. On the otherhand, the "noalloc" flag is generally useful for future changes we might which to make, so this ensures that we interpret it correctly. In addition I fixed the comment above the function which was incorrect. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:17 +01:00
David Teigland	ce03f12b37	[DLM] change lkid format A lock id is a uint32 and is used as an opaque reference to the lock. For userland apps, the lkid is passed up, through libdlm, as the return value from a write() on the dlm device. This created a problem when the high bit was 1, making the lkid look like an error. This is fixed by changing how the lkid is composed. The low 16 bits identified the hash bucket for the lock and the high 16 bits were a per-bucket counter (which eventually hit 0x8000 causing the problem). These are simply swapped around; the number of hash table buckets is far below 0x8000, making all lkid's positive when viewed as signed. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:15 +01:00
David Teigland	72c2be776b	[DLM] interface for purge (2/2) Add code to accept purge commands from userland. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:12 +01:00
David Teigland	8499137d4e	[DLM] add orphan purging code (1/2) Add code for purging orphan locks. A process can also purge all of its own non-orphan locks by passing a pid of zero. Code already exists for processes to create persistent locks that become orphans when the process exits, but the complimentary capability for another process to then purge these orphans has been missing. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:10 +01:00
David Teigland	7e4dac3359	[DLM] split create_message function This splits the current create_message() function into two parts so that later patches can call the new lower-level _create_message() function when they don't have an rsb struct. No functional change in this patch. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:07 +01:00
Steven Whitehouse	f01963f264	[GFS2] Set drop_count to 0 (off) by default This sets the drop_count to 0 by default which is a better default for most people. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:05 +01:00
David Teigland	b9af8a788a	[GFS2] use log_error before LM_OUT_ERROR We always want to see the details of the error returned to gfs, but log_debug is often turned off, so use log_error (printk). Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:02 +01:00
David Teigland	ef0c2bb05f	[DLM] overlapping cancel and unlock Full cancel and force-unlock support. In the past, cancel and force-unlock wouldn't work if there was another operation in progress on the lock. Now, both cancel and unlock-force can overlap an operation on a lock, meaning there may be 2 or 3 operations in progress on a lock in parallel. This support is important not only because cancel and force-unlock are explicit operations that an app can use, but both are used implicitly when a process exits while holding locks. Summary of changes: - add-to and remove-from waiters functions were rewritten to handle situations with more than one remote operation outstanding on a lock - validate_unlock_args detects when an overlapping cancel/unlock-force can be sent and when it needs to be delayed until a request/lookup reply is received - processing request/lookup replies detects when cancel/unlock-force occured during the op, and carries out the delayed cancel/unlock-force - manipulation of the "waiters" (remote operation) state of a lock moved under the standard rsb mutex that protects all the other lock state - the two recovery routines related to locks on the waiters list changed according to the way lkb's are now locked before accessing waiters state - waiters recovery detects when lkb's being recovered have overlapping cancel/unlock-force, and may not recover such locks - revert_lock (cancel) returns a value to distinguish cases where it did nothing vs cases where it actually did a cancel; the cancel completion ast should only be done when cancel did something - orphaned locks put on new list so they can be found later for purging - cancel must be called on a lock when making it an orphan - flag user locks (ENDOFLIFE) at the end of their useful life (to the application) so we can return an error for any further cancel/unlock-force - we weren't setting COMP/BAST ast flags if one was already set, so we'd lose either a completion or blocking ast - clear an unread bast on a lock that's become unlocked Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:11:00 +01:00
Patrick Caulfield	0320672702	[DLM] fix coverity-spotted stupidity Replacement patch to remove redundant code rather than moving it around. Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:10:57 +01:00
Robert Peterson	04b933f27b	[GFS2] Red Hat bz 228540: owner references In Testing the previously posted and accepted patch for https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=228540 I uncovered some gfs2 badness. It turns out that the current gfs2 code saves off a process pointer when glocks is taken in both the glock and glock holder structures. Those structures will persist in memory long after the process has ended; pointers to poisoned memory. This problem isn't caused by the 228540 fix; the new capability introduced by the fix just uncovered the problem. I wrote this patch that avoids saving process pointers and instead saves off the process pid. Rather than referencing the bad pointers, it now does process lookups. There is special code that makes the output nicer for printing holder information for processes that have ended. This patch also adds a stub for the new "sprint_symbol" function that exists in Andrew Morton's -mm patch set, but won't go into the base kernel until 2.6.22, since it adds functionality but doesn't fix a bug. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:10:55 +01:00
Benjamin Marzinski	172e045a7f	[GFS2] flush the log if a transaction can't allocate space This is a fix for bz #208514. When GFS2 frees up space, the freed blocks aren't available for reuse until the resource group is successfully written to the ondisk journal. So in rare cases, GFS2 operations will fail, saying that the filesystem is out of space, when in reality, you are just waiting for a log flush. For instance, on a 1Gig filesystem, if I continually write 10 Mb to a file, and then truncate it, after a hundred interations, the write will fail with -ENOSPC, even though the filesystem is just 1% full. The attached patch calls a log flush in these cases. I tested this patch fairly heavily to check if there were any locking issues that I missed, and it seems to work just fine. Also, this patch only does the log flush if get_local_rgrp makes a complete loop of resource groups without skipping any do to locking issues. The code would be slightly simpler if it just always did the log flush after the first failed pass, and you could only ever have to go through the loop twice, instead of up to three times. However, I guessed that failing to find a rg simply do to locking issues would be common enough to skip the log flush in that case, but I'm not certain that this is the right way to go. Either way, I don't suppose this code will be hit all that often. Signed-off-by: Benjamin E. Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:10:52 +01:00
Benjamin Marzinski	6883562588	[GFS2] Fix log entry list corruption When glock_lo_add and rg_lo_add attempt to add an element to the log, they check to see if has already been added before locking the log. If another process adds that element to the log in this window between the check and locking the log, the element will be added to the list twice. This causes the log element list to become corrupted in such a way that the log element can never be successfully removed from the list. This patch pulls the list_empty() check inside the log lock, to remove this window. Signed-off-by: Benjamin E. Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:10:50 +01:00
Steven Whitehouse	f35ac346bc	[GFS2] Speed up lock_dlm's locking (move sprintf) The following patch speeds up lock_dlm's locking by moving the sprintf out from the lock acquisition path and into the lock creation path. This reduces the amount of CPU time used in acquiring locks by a fair amount. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Acked-by: David Teigland <teigland@redhat.com>	2007-05-01 09:10:47 +01:00
Patrick Caulfield	254da030df	[DLM] Don't delete misc device if lockspace removal fails Currently if the lockspace removal fails the misc device associated with a lockspace is left deleted. After that there is no way to access the orphaned lockspace from userland. This patch recreates the misc device if th dlm_release_lockspace fails. I believe this is better than attempting to remove the lockspace first because that leaves an unattached device lying around. The potential gap in which there is no access to the lockspace between removing the misc device and recreating it is acceptable ... after all the application is trying to remove it, and only new users of the lockspace will be affected. Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:10:44 +01:00
Steven Whitehouse	420d2a1028	[GFS2] Fix a bug on i386 due to evaluation order Since gcc didn't evaluate the last two terms of the expression in glock.c:1881 as a constant expression, it resulted in an error on i386 due to the lack of a 64bit divide instruction. This adds some brackets to fix the problem. This was reported by Andrew Morton. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org>	2007-05-01 09:10:42 +01:00
Steven Whitehouse	3b8249f617	[GFS2] Fix bz 224480 and cleanup glock demotion code This patch prevents the printing of a warning message in cases where the fs is functioning normally by handing off responsibility for unlinked, but still open inodes, to another node for eventual deallocation. Also, there is now an improved system for ensuring that such requests to other nodes do not get lost. The callback on the iopen lock is only ever called when i_nlink == 0 and when a node is unable to deallocate it due to it still being in use on another node. When a node receives the callback therefore, it knows that i_nlink must be zero, so we mark it as such (in gfs2_drop_inode) in order that it will then attempt deallocation of the inode itself. As an additional benefit, queuing a demote request no longer requires a memory allocation. This simplifies the code for dealing with gfs2_holders as it removes one special case. There are two new fields in struct gfs2_glock. gl_demote_state is the state which the remote node has requested and gl_demote_time is the time when the request came in. Both fields are only valid when the GLF_DEMOTE flag is set in gl_flags. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:10:39 +01:00
Josef Whiter	1de9139092	[GFS2] Fix bz 231380, unlock page before dequeing glocks in gfs2_commit_write If we are writing a file, and in the middle of writing the file another node attempts to get a shared lock on that file (by doing a du for example) the process doing the writing will hang waiting on lock_page. The reason for this is because when we have waiters on a exclusive glock, we will go through and flush out all dirty pages associated with that inode and release the lock. The problem is that when we flush the dirty pages, we could hit a page that we have locked durring the generic_file_buffered_write part of this operation. This patch unlocks the page before we go to dequeue the lock and locks it immediatly afterwards, since generic_file_buffered_write needs the page locked when the commit_write is completed. This patch resolves the problem, however if somebody sees a better way to do this please don't hesistate to yell. Signed-off-by: Josef Whiter <jwhiter@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:10:37 +01:00
Patrick Caulfield	89adc934f3	[DLM] Fix uninitialised variable in receiving The length of the second element of the kvec array was not initialised before being added to the first one. This could cause invalid lengths to be passed to kernel_recvmsg Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:10:34 +01:00
Josef Whiter	5c7342d894	[GFS2] fix bz 231369, gfs2 will oops if you specify an invalid mount option If you specify an invalid mount option when trying to mount a gfs2 filesystem, gfs2 will oops. The attached patch resolves this problem. Signed-off-by: Josef Whiter <jwhiter@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:10:32 +01:00
Robert Peterson	7c52b166c5	[GFS2] Add gfs2_tool lockdump support to gfs2 (bz 228540) The attached patch resolves bz 228540. This adds the capability for gfs2 to dump gfs2 locks through the debugfs file system. This used to exist in gfs1 as "gfs_tool lockdump" but it's missing from gfs2 because all the ioctls were stripped out. Please see the bugzilla for more history about the fix. This patch is also attached to the bugzilla record. The patch is against Steve Whitehouse's latest nmw git tree kernel (2.6.21-rc1) and has been tested on system trin-10. Signed-off-by: Robert Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2007-05-01 09:10:29 +01:00
Neil Brown	83672d392f	NFS: Fix directory caching problem - with test case and patch. Try running this script in an NFS mounted directory (Client relatively recent - 2.6.18 has the problem as does 2.6.20). ------------------------------------------------------ #!/bin/bash # # This script will produce the following errormessage from tar: # # tar: newdir/innerdir/innerfile: file changed as we read it # create dirs rm -rf nfstest mkdir -p nfstest/dir/innerdir # create files (should not be empty) echo "Hello World!" >nfstest/dir/file echo "Hello World!" >nfstest/dir/innerdir/innerfile # problem only happens if we sleep before chmod sleep 1 # change file modes chmod -R a+r nfstest # rename dir mv nfstest/dir nfstest/newdir # tar it tar -cf nfstest/nfstest.tar -C nfstest newdir # restore old dir name mv nfstest/newdir nfstest/dir -------------------------------------------------------- What happens: The 'chmod -R' does a readdir_plus in each directory and the results get cached in the page cache. It then updates the ctime on each file by one second. When this happens, the post-op attributes are used to update the ctime stored on the client to match the value in the kernel. The 'mv' calls shrink_dcache_parent on the directory tree which flushes all the dentries (so a new lookup will be required) but doesn't flush the inodes or pagecache. The 'tar' does a readdir on each directory, but (in the case of 'innerdir' at least) satisfies it from the pagecache and uses the READDIRPLUS data to update all the inodes. In the case of 'innerdir/innerfile', the ctime is out of date. 'tar' then calls 'lstat' on innerdir/innerfile getting an old ctime. It then opens the file (triggering a GETATTR), reads the content, and then calls fstat to see if anything has changed. It finds that ctime has changed and so complains. The problem seems to be that the cache readdirplus info is kept around for too long. My patch below discards pagecache data for directories when dentry_iput is called on them. This effectively removes the symptom which convinces me that I correctly understand the problem. However I'm not convinced that is a proper solution, as there could easily be other races that trigger the same problem without being affected by this 'fix'. One possibility would be to require that readdirplus pagecache data be only used once to instantiate an inode. Somehow it should then be invalidated so that if the dentry subsequently disappears, it will cause a new request to the server to fill in the stat data. Another possibility is to compare the cache_change_attribute on the inode with something similar for the readdirplus info and reject the info from readdirplus if it is too old. I haven't tried to implement these and would value other opinions before I do. Thanks, NeilBrown Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-04-30 22:17:19 -07:00
Neil Brown	1f4eab7e7c	NFS: Set meaningful value for fattr->time_start in readdirplus results. Don't use uninitialsed value for fattr->time_start in readdirplus results. The 'fattr' structure filled in by nfs3_decode_direct does not get a value for ->time_start set. Thus if an entry is for an inode that we already have in cache, when nfs_readdir_lookup calls nfs_fhget, it will call nfs_refresh_inode and may update the inode with out-of-date information. Directories are read a page at a time, so each page could have a different timestamp that "should" be used to set the time_start for the fattr for info in that page. However storing the timestamp per page is awkward. (We could stick in the first 4 bytes and only read 4092 bytes, but that is a bigger code change than I am interested it). This patch ignores the readdir_plus attributes if a readdir finds the information already in cache, and otherwise sets ->time_start to the time the readdir request was sent to the server. It might be nice to store - in the directory inode - the time stamp for the earliest readdir request that is still in the page cache, so that we don't ignore attribute data that we don't have to. This patch doesn't do that. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-04-30 22:17:18 -07:00
Steve Dickson	74dd34e6e8	NFS: Added support to turn off the NFSv3 READDIRPLUS RPC. READDIRPLUS can be a performance hindrance when the client is working with large directories. In addition, some servers still have bugs in their implementations (e.g. Tru64 returns wrong values for the fsid). Add a mount flag to enable users to turn it off at mount time following the implementation in Apple's NFS client. Signed-off-by: Steve Dickson <steved@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-04-30 22:17:16 -07:00
Chuck Lever	00a6e7bbf9	SUNRPC: RPC client should retry with different versions of rpcbind Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-04-30 22:17:16 -07:00
Chuck Lever	df8b172a88	NFS: switch NFSROOT to use new rpcbind client It is arguable whether NFSROOT will support IPv6, and thus whether rpcb_getport_external needs to support rpcbind versions greater than 2. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-04-30 22:17:14 -07:00
Chuck Lever	2bea90d43a	SUNRPC: RPC buffer size estimates are too large The RPC buffer size estimation logic in net/sunrpc/clnt.c always significantly overestimates the requirements for the buffer size. A little instrumentation demonstrated that in fact rpc_malloc was never allocating the buffer from the mempool, but almost always called kmalloc. To compute the size of the RPC buffer more precisely, split p_bufsiz into two fields; one for the argument size, and one for the result size. Then, compute the sum of the exact call and reply header sizes, and split the RPC buffer precisely between the two. That should keep almost all RPC buffers within the 2KiB buffer mempool limit. And, we can finally be rid of RPC_SLACK_SPACE! Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-04-30 22:17:10 -07:00
Chuck Lever	511d2e8855	NLM: Shrink the maximum request size of NLM4 requests NLM version 4 requests estimate the call and reply header sizes rather conservatively, using the very maximum size allowed in the protocol even though Linux always uses only a small fraction of the allowable space. Reduce the size of caller and lock arguments to conserve RPC buffer space while XDR encoding NLM4 arguments. Add compile-time checks to ensure the hostname string won't overflow NLM protocol maximums. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-04-30 22:17:09 -07:00
Trond Myklebust	ca52fec152	NFS: Use pgoff_t in structures and functions that pass page cache offsets Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-04-30 22:17:09 -07:00
Trond Myklebust	724c439c20	NFS: Clean up nfs_sync_mapping_wait() It has no business touching wbc->pages_skipped. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-04-30 22:17:08 -07:00
Trond Myklebust	8d5658c949	NFS: Fix a buffer overflow in the allocation of struct nfs_read/writedata Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-04-30 22:17:07 -07:00
Trond Myklebust	c63c7b0513	NFS: Fix a race when doing NFS write coalescing Currently we do write coalescing in a very inefficient manner: one pass in generic_writepages() in order to lock the pages for writing, then one pass in nfs_flush_mapping() and/or nfs_sync_mapping_wait() in order to gather the locked pages for coalescing into RPC requests of size "wsize". In fact, it turns out there is actually a deadlock possible here since we only start I/O on the second pass. If the user signals the process while we're in nfs_sync_mapping_wait(), for instance, then we may exit before starting I/O on all the requests that have been queued up. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-04-30 22:17:06 -07:00
Trond Myklebust	8b09bee308	NFS: Cleanup for nfs_readpages() Do the coalescing of read requests into block sized requests at start of I/O as we scan through the pages instead of going through a second pass. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-04-30 22:17:05 -07:00
Trond Myklebust	bcb71bba7e	NFS: Another cleanup of the read/write request coalescing code Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-04-30 22:17:04 -07:00
Trond Myklebust	d8a5ad75cc	NFS: Cleanup the coalescing code Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-04-30 22:17:04 -07:00
Trond Myklebust	91e59c368c	NFS: Don't wait for congestion in nfs_update_request() It is redundant, and will interfere with the call to balance_dirty_pages_ratelimited_nr in generic_file_write(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-04-30 22:17:03 -07:00
Amnon Aaronsohn	1a0ba9ae48	NFS: statfs error-handling fix The nfs statfs function returns a success code on error, and fills the output buffer with invalid values. The attached patch makes it return a correct error code instead. Signed-off-by: Amnon Aaronsohn <amnonaar@gmail.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> (Modified patch to reinstate the dprintk())	2007-04-30 22:17:02 -07:00
Trond Myklebust	d585158b60	NFS: Fix nfs_set_page_dirty() Be more careful about testing page->mapping. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2007-04-30 22:17:02 -07:00
Jeff Mahoney	1173a729fc	reiserfs: suppress lockdep warning We're getting lockdep warnings due to a post-2.6.21-rc7 bugfix. The xattr_sem can never be taken in the manner described. Internal inodes are protected by I_PRIVATE. Add the appropriate annotation. Cc: <stable@kernel.org> Cc: "Antonino A. Daplas" <adaplas@pol.net> Cc: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-04-30 16:40:40 -07:00
Steve French	4523cc3044	[CIFS] UID/GID override on CIFS mounts to Samba When CIFS Unix Extensions are negotiated we get the Unix uid and gid owners of the file from the server (on the Unix Query Path Info levels), but if the server's uids don't match the client uid's users were having to disable the Unix Extensions (which turned off features they still wanted). The changeset patch allows users to override uid and/or gid for file/directory owner with a default uid and/or gid specified at mount (as is often done when mounting from Linux cifs client to Windows server). This changeset also displays the uid and gid used by default in /proc/mounts (if applicable). Also cleans up code by adding some of the missing spaces after "if" keywords per-kernel style guidelines (as suggested by Randy Dunlap when he reviewed the patch). Signed-off-by: Steve French <sfrench@us.ibm.com>	2007-04-30 20:13:06 +00:00
Linus Torvalds	cd9bb7e736	Merge branch 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block * 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block: [PATCH] elevator: elv_list_lock does not need irq disabling [BLOCK] Don't pin lots of memory in mempools cfq-iosched: speedup cic rb lookup ll_rw_blk: add io_context private pointer cfq-iosched: get rid of cfqq hash cfq-iosched: tighten queue request overlap condition cfq-iosched: improve sync vs async workloads cfq-iosched: never allow an async queue idling cfq-iosched: get rid of ->dispatch_slice cfq-iosched: don't pass unused preemption variable around cfq-iosched: get rid of ->cur_rr and ->cfq_list cfq-iosched: slice offset should take ioprio into account [PATCH] cfq-iosched: style cleanups and comments cfq-iosched: sort IDLE queues into the rbtree cfq-iosched: sort RT queues into the rbtree [PATCH] cfq-iosched: speed up rbtree handling cfq-iosched: rework the whole round-robin list concept cfq-iosched: minor updates cfq-iosched: development update cfq-iosched: improve preemption for cooperating tasks	2007-04-30 08:12:39 -07:00
Jens Axboe	5972511b77	[BLOCK] Don't pin lots of memory in mempools Currently we scale the mempool sizes depending on memory installed in the machine, except for the bio pool itself which sits at a fixed 256 entry pre-allocation. There's really no point in "optimizing" this OOM path, we just need enough preallocated to make progress. A single unit is enough, lets scale it down to 2 just to be on the safe side. This patch saves ~150kb of pinned kernel memory on a 32-bit box. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-04-30 09:08:17 +02:00
Paul Mackerras	49e1900d4c	Merge branch 'linux-2.6' into for-2.6.22	2007-04-30 12:38:01 +10:00

... 3 4 5 6 7 ...

5652 Commits