linux/fs
Andrea Arcangeli 7766755a2f Fix /proc dcache deadlock in do_exit
This patch fixes a sles9 system hang in start_this_handle from a customer
with some heavy workload where all tasks are waiting on kjournald to commit
the transaction, but kjournald waits on t_updates to go down to zero (it
never does).

This was reported as a lowmem shortage deadlock but when checking the debug
data I noticed the VM wasn't under pressure at all (well it was really
under vm pressure, because lots of tasks hanged in the VM prune_dcache
methods trying to flush dirty inodes, but no task was hanging in GFP_NOFS
mode, the holder of the journal handle should have if this was a vm issue
in the first place).

No task was apparently holding the leftover handle in the committing
transaction, so I deduced t_updates was stuck to 1 because a journal_stop
was never run by some path (this turned out to be correct).  With a debug
patch adding proper reverse links and stack trace logging in ext3 deployed
in production, I found journal_stop is never run because
mark_inode_dirty_sync is called inside release_task called by do_exit.
(that was quite fun because I would have never thought about this
subtleness, I thought a regular path in ext3 had a bug and it forgot to
call journal_stop)

do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
come back to run journal_stop)

The reason is that shrink_dcache_parent is racy by design (feature not
a bug) and it can do blocking I/O in some case, but the point is that
calling shrink_dcache_parent at the last stage of do_exit isn't safe
for self-reaping tasks.

I guess the memory pressure of the unbalanced highmem system allowed
to trigger this more easily.

Now mainline doesn't have this line in iput (like sles9 has):

    	     if (inode->i_state & I_DIRTY_DELAYED)
	     			mark_inode_dirty_sync(inode);

so it will probably not crash with ext3, but for example ext2 implements an
I/O-blocking ext2_put_inode that will lead to similar screwups with
ext2_free_blocks never coming back and it's definitely wrong to call
blocking-IO paths inside do_exit.  So this should fix a subtle bug in
mainline too (not verified in practice though).  The equivalent fix for
ext3 is also not verified yet to fix the problem in sles9 but I don't have
doubt it will (it usually takes days to crash, so it'll take weeks to be
sure).

An alternate fix would be to offload that work to a kernel thread, but I
don't think a reschedule for this is worth it, the vm should be able to
collect those entries for the synchronous release_task.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:18 -08:00
..
9p 9p: use copy of the options value instead of original 2007-11-06 08:02:53 -06:00
adfs Slab API: remove useless ctor parameter and reorder parameters 2007-10-17 08:42:45 -07:00
affs fs: mark nibblemap const 2007-10-17 08:42:47 -07:00
afs vfs: Add 64 bit i_version support 2008-01-28 23:58:27 -05:00
autofs Use task_pid_nr() instead of pid_nr(task_pid()) 2007-10-19 11:53:43 -07:00
autofs4 pid namespaces: round up the API 2007-10-19 11:53:37 -07:00
befs fs/: Spelling fixes 2008-02-03 17:33:42 +02:00
bfs regression: bfs endianness bug 2007-12-05 09:25:20 -08:00
cifs Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
coda coda: convert struct class_device to struct device 2008-01-24 20:40:05 -08:00
configfs configfs: file.c fix possible recursive locking 2008-01-25 15:05:47 -08:00
cramfs fs/cramfs/inode.c: replace hardcoded value with preprocessor constant 2007-10-18 14:37:29 -07:00
debugfs Kobject: convert fs/* from kobject_unregister() to kobject_put() 2008-01-24 20:40:40 -08:00
devpts devpts: add fsnotify create event 2007-05-08 11:14:59 -07:00
dlm dlm: static initialization improvements 2008-01-30 11:04:43 -06:00
ecryptfs Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
efs exportfs: make struct export_operations const 2007-10-22 08:13:21 -07:00
exportfs exportfs: update documentation 2007-10-22 08:13:21 -07:00
ext2 ext2: Fix the max file size for ext2 file system. 2008-01-28 23:58:26 -05:00
ext3 Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
ext4 Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
fat fat: optimize fat_count_free_clusters() 2008-01-08 16:10:35 -08:00
freevxfs fs/: Spelling fixes 2008-02-03 17:33:42 +02:00
fuse Kobject: convert fs/* from kobject_unregister() to kobject_put() 2008-01-24 20:40:40 -08:00
gfs2 Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
hfs hfs: fix coverity-found null deref 2008-01-17 15:38:58 -08:00
hfsplus Slab API: remove useless ctor parameter and reorder parameters 2007-10-17 08:42:45 -07:00
hostfs uml: fix hostfs style 2007-10-16 09:43:07 -07:00
hpfs Slab API: remove useless ctor parameter and reorder parameters 2007-10-17 08:42:45 -07:00
hppfs
hugetlbfs hugetlb: allow sticky directory mount option 2008-02-05 09:44:14 -08:00
isofs exportfs: make struct export_operations const 2007-10-22 08:13:21 -07:00
jbd spinlock: lockbreak cleanup 2008-01-30 13:31:20 +01:00
jbd2 spinlock: lockbreak cleanup 2008-01-30 13:31:20 +01:00
jffs2 Typoes: "whith" -> "with" 2008-02-03 15:14:02 +02:00
jfs Spelling fixes: lenght->length 2008-02-03 15:42:53 +02:00
lockd NLM: tear down RPC clients in nlm_shutdown_hosts 2008-02-01 16:42:15 -05:00
minix limit minixfs printks on corrupted dir i_size 2007-10-17 08:42:53 -07:00
msdos
ncpfs vm audit: add VM_DONTEXPAND to mmap for drivers that need it 2008-02-04 07:55:38 -08:00
nfs Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
nfs_common
nfsd nfsd: more careful input validation in nfsctl write methods 2008-02-01 16:42:15 -05:00
nls sparse pointer use of zero as null 2007-10-18 14:37:31 -07:00
ntfs is_vmalloc_addr(): Check if an address is within the vmalloc boundaries 2008-02-05 09:44:14 -08:00
ocfs2 Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
openpromfs [SPARC]: Constify function pointer tables. 2008-01-22 18:29:20 -08:00
partitions Kobject: convert fs/* from kobject_unregister() to kobject_put() 2008-01-24 20:40:40 -08:00
proc Fix /proc dcache deadlock in do_exit 2008-02-05 09:44:18 -08:00
qnx4 Slab API: remove useless ctor parameter and reorder parameters 2007-10-17 08:42:45 -07:00
ramfs Remove valueless definition of hard-selected RAMFS option 2007-10-17 08:42:56 -07:00
reiserfs Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
romfs fs/romfs/inode.c: trivial improvements 2007-10-17 08:42:47 -07:00
smbfs Merge branch 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc 2008-02-01 11:45:47 +11:00
sysfs Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 2008-01-25 17:19:08 -08:00
sysv Slab API: remove useless ctor parameter and reorder parameters 2007-10-17 08:42:45 -07:00
udf fs/udf/balloc.c: mark a variable as uninitialized_var() 2007-10-17 08:43:00 -07:00
ufs ufs: fix nexstep dir block size 2007-12-05 09:21:18 -08:00
vfat
xfs is_vmalloc_addr(): Check if an address is within the vmalloc boundaries 2008-02-05 09:44:14 -08:00
aio.c core: remove last users of empty FASTCALL macro 2008-01-30 13:31:17 +01:00
anon_inodes.c anon-inodes use open coded atomic_inc for the shared inode 2007-10-17 08:43:00 -07:00
attr.c VFS: make notify_change pass ATTR_KILL_S*ID to setattr operations 2007-10-18 14:37:22 -07:00
bad_inode.c sendfile: remove bad_sendfile() from bad_file_ops 2007-07-10 08:04:15 +02:00
binfmt_aout.c mm: fix exit_mmap BUG() on a.out binary exit 2007-12-20 07:49:53 -08:00
binfmt_elf_fdpic.c pid namespaces: changes to show virtual ids to user 2007-10-19 11:53:40 -07:00
binfmt_elf.c fs/binfmt_elf.c: spello fix 2008-02-03 18:05:15 +02:00
binfmt_em86.c Convert files to UTF-8 and some cleanups 2007-10-19 23:21:04 +02:00
binfmt_flat.c binfmt_flat: warning fixes 2007-10-17 08:42:54 -07:00
binfmt_misc.c Convert files to UTF-8 and some cleanups 2007-10-19 23:21:04 +02:00
binfmt_script.c Convert files to UTF-8 and some cleanups 2007-10-19 23:21:04 +02:00
binfmt_som.c core_pattern: ignore RLIMIT_CORE if core_pattern is a pipe 2007-10-17 08:42:50 -07:00
bio.c __bio_clone: don't calculate hw/phys segment counts 2008-01-28 10:04:46 +01:00
block_dev.c Driver core: convert block from raw kobjects to core devices 2008-01-24 20:40:36 -08:00
buffer.c bufferhead: revert constructor removal 2008-02-05 09:44:14 -08:00
char_dev.c Kobject: rename kobject_init_ng() to kobject_init() 2008-01-24 20:40:38 -08:00
compat_binfmt_elf.c x86: compat_binfmt_elf 2008-01-30 13:31:46 +01:00
compat_ioctl.c remove __attribute_used__ 2008-01-28 23:21:18 +01:00
compat.c timerfd: new timerfd API 2008-02-05 09:44:07 -08:00
dcache.c dcache: don't expose uninitialized memory in /proc/<pid>/fd/<fd> 2007-10-22 08:13:18 -07:00
dcookies.c Remove fs.h from mm.h 2007-07-29 17:09:29 -07:00
direct-io.c Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
dnotify.c mm: Remove slab destructors from kmem_cache_create(). 2007-07-20 10:11:58 +09:00
dquot.c Don't send quota messages repeatedly when hardlimit reached 2007-12-23 12:54:36 -08:00
drop_caches.c invalidate_mapping_pages(): add cond_resched 2007-07-16 09:05:36 -07:00
eventfd.c eventfd use waitqueue lock ... 2007-05-18 13:09:34 -07:00
eventpoll.c lockdep: annotate epoll 2008-02-05 09:44:07 -08:00
exec.c exec: rework the group exit and fix the race with kill 2008-02-05 09:44:07 -08:00
fcntl.c pid namespaces: changes to show virtual ids to user 2007-10-19 11:53:40 -07:00
fifo.c Detach sched.h from mm.h 2007-05-21 09:18:19 -07:00
file_table.c fs/file_table.c: use list_for_each_entry() instead of list_for_each() 2007-10-19 11:53:38 -07:00
file.c
filesystems.c add filesystem subtype support 2007-05-08 11:15:01 -07:00
fs-writeback.c Revert "writeback: introduce writeback_control.more_io to indicate more io" 2008-01-14 21:21:29 -08:00
generic_acl.c Introduce is_owner_or_cap() to wrap CAP_FOWNER use with fsuid check 2007-07-17 12:00:03 -07:00
inode.c ext4: Add inode version support in ext4 2008-01-28 23:58:27 -05:00
inotify_user.c change inotifyfs magic as the same magic is used for futexfs 2007-10-17 08:43:00 -07:00
inotify.c [PATCH] new helper - inotify_evict_watch() 2007-10-21 02:37:38 -04:00
internal.h cleanup compat ioctl handling 2007-05-08 11:15:09 -07:00
ioctl.c drop obsolete sys_ioctl export 2007-07-16 09:05:48 -07:00
ioprio.c cfq-iosched: relax IOPRIO_CLASS_IDLE restrictions 2008-01-28 11:38:15 +01:00
Kconfig nfsd: select CONFIG_PROC_FS in nfsv4 and gss server cases 2008-02-01 16:42:04 -05:00
Kconfig.binfmt x86: compat_binfmt_elf Kconfig 2008-01-30 13:31:46 +01:00
libfs.c Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
locks.c pid-namespaces-vs-locks-interaction 2008-02-03 17:51:36 -05:00
Makefile x86: compat_binfmt_elf Kconfig 2008-01-30 13:31:46 +01:00
mbcache.c fs: Fix to correct the mbcache entries counter 2007-10-25 15:18:29 -07:00
mpage.c Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
namei.c Use access mode instead of open flags to determine needed permissions 2008-01-12 14:47:58 -08:00
namespace.c kobject: convert main fs kobject to use kobject_create 2008-01-24 20:40:13 -08:00
nfsctl.c nfsctl: use vfs_path_lookup 2007-07-19 10:04:45 -07:00
no-block.c
open.c mark sys_open/sys_read exports unused 2007-11-14 18:45:42 -08:00
pipe.c sched: affine sync wakeups 2007-10-15 17:00:19 +02:00
pnode.c Introduce a handy list_first_entry macro 2007-05-08 11:15:11 -07:00
pnode.h [PATCH] new helpers - collect_mounts() and release_collected_mounts() 2007-10-21 02:37:25 -04:00
posix_acl.c
quota_v1.c
quota_v2.c
quota.c [IA64] Fix build failure in fs/quota.c 2007-07-27 15:40:13 -07:00
read_write.c ext4: export iov_shorten from kernel for ext4's use 2008-01-28 23:58:27 -05:00
read_write.h
readdir.c Use mutex_lock_killable in vfs_readdir 2007-12-06 17:39:54 -05:00
select.c fs/select, remove unused macros 2007-10-19 11:53:41 -07:00
seq_file.c [FS] seq_file: Introduce the seq_open_private() 2007-10-10 16:55:33 -07:00
signalfd.c Fix a small number of "memeber" typoes. 2008-02-03 15:12:15 +02:00
splice.c splice: always updated atime in direct splice 2008-02-01 09:26:32 +01:00
stack.c
stat.c header cleaning: don't include smp_lock.h when not used 2007-05-08 11:15:07 -07:00
super.c Convert files to UTF-8 and some cleanups 2007-10-19 23:21:04 +02:00
sync.c Introduce fixed sys_sync_file_range2() syscall, implement on PowerPC and ARM 2007-06-28 11:38:30 -07:00
timerfd.c timerfd: new timerfd API 2008-02-05 09:44:07 -08:00
utimes.c VFS: check nanoseconds in utimensat 2007-10-17 08:42:52 -07:00
xattr_acl.c
xattr.c [PATCH] pass dentry to audit_inode()/audit_inode_child() 2007-10-21 02:37:18 -04:00