This consolidates all of the broken-out overrun handling and ensures that
we have sensible defaults per port type, in addition to making sure that
overruns are flagged appropriately in the error mask for parts that
haven't explicitly disabled support for it.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Make the PCM device structures used in devices.c
and devices-da8xx.c static as they are used only
in the respective files.
This was found when trying to build a single image
for DaVinci and DA8x devices using runtime P2V support.
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
The code which does the chained handler setup was overwriting
chip_data.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Holger Hans Peter Freyther <holger@freyther.de>
Signed-off-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Speedlink VAD Cezanne devices have a hardware bug that makes the cursor "jump"
from one place to another every now and then. The issue is relative motion
events erroneously reported by the device, each having a distance value of
+256. A distance of 256 can in fact never occur due to real motion, so those
events can safely be ignored. The driver also drops useless EV_REL events with
a value of 0, which the device sends every time it sends a "real" EV_REL or
EV_KEY event.
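A minimal sketch of such a filter, assuming the usual hid_driver .event
callback shape (a nonzero return discards the event); the real driver code
may differ in details:
  #include <linux/hid.h>
  #include <linux/input.h>
  /* Sketch: drop the two bogus relative motion values described above. */
  static int speedlink_event(struct hid_device *hdev, struct hid_field *field,
                             struct hid_usage *usage, __s32 value)
  {
          if (usage->type == EV_REL &&
              (usage->code == REL_X || usage->code == REL_Y)) {
                  if (value == 256 || value == 0) /* bogus jump / useless 0 */
                          return 1;               /* swallow the event */
          }
          return 0;                               /* pass everything else on */
  }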
Signed-off-by: Stefan Kriwanek <mail@stefankriwanek.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Fixes these errors after the removal of interrupt.h from netdevice.h:
drivers/net/ll_temac_main.c: In function 'temac_open':
drivers/net/ll_temac_main.c:859:2: error: implicit declaration of function 'request_irq'
drivers/net/ll_temac_main.c:870:2: error: implicit declaration of function 'free_irq'
drivers/net/ll_temac_main.c: In function 'temac_poll_controller':
drivers/net/ll_temac_main.c:903:2: error: implicit declaration of function 'disable_irq'
drivers/net/ll_temac_main.c:909:2: error: implicit declaration of function 'enable_irq'
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
gUSA special cases r15 for part of its login/logout sequence, meaning that
any parameters need to be explicitly prohibited from accidentally being
assigned that particular register, and the compiler ultimately needs to
use a temporary instead.
Certain configurations have begun generating code paths that do indeed
get allocated r15, resulting in immediate corruption of the exchanged
value. This was observed in (amongst others) exit_mm() code generation
where the xchg_u32 call was immediately corrupting a structure address.
As this is a general gUSA restriction, the rest of the users likewise
need to be updated to ensure sensible constraints.
References: https://bugzilla.stlinux.com/show_bug.cgi?id=11229
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com>
Reviewed-by: Stuart Menefy <stuart.menefy@st.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Add the 'sync_super' function pointer to MD array structure (struct mddev_s)
If device-mapper (dm-raid.c) is to define its own on-disk superblock and be
able to load it, there must still be a way for MD to initiate superblock
updates. The simplest way to make this happen is to provide a pointer in
the MD array structure that can be set by device-mapper (or other module)
with a function to do this. If the function has been set, it will be used;
otherwise, the method will be looked up via 'super_types' as usual.
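A sketch of the dispatch this implies (type and field names follow the
description; treat the exact placement as illustrative):
  /* Sketch: prefer the module-supplied hook, else fall back to the
   * super_types method as before. */
  static void sync_super(mddev_t *mddev, mdk_rdev_t *rdev)
  {
          if (mddev->sync_super) {        /* set by dm-raid or other module */
                  mddev->sync_super(mddev, rdev);
                  return;
          }
          super_types[mddev->major_version].sync_super(mddev, rdev);
  }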
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD RAID1: Changes to allow RAID1 to be used by device-mapper (dm-raid.c)
Added the necessary congestion function and conditionalized calls that
require an array 'queue' or 'gendisk'.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Move personality and sync/recovery thread starting outside md_run.
Moving the wakeup's of the personality and sync/recovery threads out of
md_run and into do_md_run and mddev_resume solves two issues:
1) It allows bitmap_load to be called before the sync_thread is run and
2) when MD personalities are used by device-mapper (dm-raid.c), the start-up
of the array is better aligned with device-mapper primitives
(CTR/resume/suspend/DTR). I/O - in this case, recovery operations - should
not happen until after a resume has taken place.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Make message a bit clearer by s/blocks/k/
I chose 'k' vs 'kiB' or 'kB' because it is what is used earlier in the
message. 'k' may be a bit ambiguous, but I think it's better than "blocks",
which normally means 512 bytes but means 1024 bytes in MD.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Disallow resync I/O while the RAID array is suspended.
Recovery, resync, and metadata I/O should not be allowed while a device is
suspended.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Don't attempt md_integrity_register if there is no gendisk struct available.
When MD arrays are built via device-mapper, the gendisk structure is not
available via mddev.
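In sketch form (guard placement illustrative):
  /* Sketch: dm-built arrays have no gendisk, so skip integrity
   * registration entirely. */
  if (!mddev->gendisk)
          return 0;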
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If we request a lock and then abort (e.g., ^C), we need to send a matching
unlock request to the MDS to unwind our lock attempt to avoid indefinitely
blocking other clients.
Reported-by: Brian Chrisman <brchrisman@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Getting ENOENT is equivalent to reading 0 bytes. Make that correction
before setting up the hit_stripe and was_short flags.
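A hedged sketch of the ordering (variable names follow the description, not
necessarily the exact ceph code):
  /* Sketch: fold ENOENT into "read 0 bytes" before deriving the flags. */
  if (ret == -ENOENT)
          ret = 0;
  hit_stripe = this_len < left;            /* more stripes left to read */
  was_short = ret >= 0 && ret < this_len;  /* object smaller than asked */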
Fixes the following case:
dd if=/dev/zero of=/mnt/fs_depot/dd3 bs=1 seek=1048576 count=0
dd if=/mnt/fs_depot/dd3 of=/root/ddout1 skip=8 bs=500 count=2 iflag=direct
Reported-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
If we get a short read from the OSD because the object is small, we need to
zero the remainder of the buffer. For O_DIRECT reads, the attempted range
is not trimmed to i_size by the VFS, so we were actually looping
indefinitely.
Fix by trimming to i_size, and then unconditionally zeroing the trailing
range.
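A minimal sketch of the two steps, with illustrative names:
  /* Sketch: O_DIRECT bypasses the VFS i_size trim, so do it by hand,
   * then zero whatever the OSD did not supply. */
  if (pos + len > i_size)
          len = i_size - pos;                    /* trim to EOF */
  if (ret >= 0 && ret < len)
          memset(buf + ret, 0, len - ret);       /* zero the trailing range */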
Reported-by: Jeff Wu <cpwu@tnsoft.com.cn>
Signed-off-by: Sage Weil <sage@newdream.net>
If we cancel a write, trigger the safe completions to prevent a sync from
blocking indefinitely in ceph_osdc_sync().
Signed-off-by: Sage Weil <sage@newdream.net>
We should use ihold whenever we already have a stable inode ref, even
when we aren't holding i_lock. This avoids adding new and unnecessary
locking dependencies.
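In sketch form (ihold() is the existing helper; the call site is
illustrative):
  /* Sketch: we already hold a stable reference, so just take another
   * one - unlike igrab(), ihold() does not touch i_lock. */
  ihold(inode);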
Signed-off-by: Sage Weil <sage@newdream.net>
We may write 4 bytes too many when we reinitialize the anti-replay
window in the replay advance functions. This patch fixes this by
adjusting the last index of the initialization loop.
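The shape of the bug, with hypothetical names (the real loops live in the
xfrm replay advance functions):
  /* Sketch: reinitializing a window of nr 32-bit words. The inclusive
   * bound writes one extra u32 (4 bytes) past the end of the bitmap. */
  for (i = 0; i <= nr; i++)       /* buggy: off by one */
          bmp[i] = 0;
  for (i = 0; i < nr; i++)        /* fixed: last index adjusted */
          bmp[i] = 0;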
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86/amd-iommu: Fix boot crash with hidden PCI devices
x86/amd-iommu: Use only per-device dma_ops
x86/amd-iommu: Fix 3 possible endless loops
* 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
drm/nv40: fall back to paged dma object for the moment
drm/nouveau: fix leak of gart mm node
drm/nouveau: fix vram page mapping when crossing page table boundaries
drm/nv17-nv40: Fix modesetting failure when pitch == 4096px (fdo bug 35901).
drm/nouveau: don't create accel engine objects when noaccel=1
drm/nvc0: recognise 0xdX chipsets as NV_C0
drm/i915: Add a no lvds quirk for the Asus EeeBox PC EB1007
drm/i915: Share the common force-audio property between connectors
drm/i915: Remove unused enum "chip_family"
drm/i915: fix relaxed tiling on gen2: tile height
drm/i915/crt: Explicitly return false if connected to a digital monitor
drm/i915: Replace ironlake_compute_wm0 with g4x_compute_wm0
drm/i915: Only print out the actual number of fences for i915_error_state
drm/i915: s/addr & ~PAGE_MASK/offset_in_page(addr)/
drm: i915: correct return status in intel_hdmi_mode_valid()
drm/i915: fix regression after clock gating init split
drm/i915: fix if statement in ivybridge irq handler
* 'drm-radeon-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
drm/radeon/kms/atom: fix PHY init
drm/radeon/kms: add missing Evergreen texture formats to the CS parser
drm/radeon/kms: viewport height has to be even
drm/radeon/kms: remove duplicate reg from r600 safe regs
drm/radeon/kms: add support for Llano Fusion APUs
drm/radeon/kms: add llano pci ids
drm/radeon/kms: fill in asic struct for llano
drm/radeon/kms: add family ids for llano APUs
drm/radeon: fix oops in ttm reserve when pageflipping (v2)
drm/radeon/kms: clean up the radeon kms Kconfig
drm/radeon/kms: fix thermal sensor reading on juniper
drm/radeon/kms: add missing case for cayman thermal sensor
drm/radeon/kms: add blit support for cayman (v2)
drm/radeon/kms/blit: workaround some hw issues on evergreen+
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timers: Consider slack value in mod_timer()
clockevents: Handle empty cpumask gracefully
* 'kvm-updates/3.0' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: Initialize kvm before registering the mmu notifier
KVM: x86: use proper port value when checking io instruction permission
KVM: add missing void __user * cast to access_ok() call
We had to say goodbye when David passed away recently. David had a
huge impact on our community, both personally in the lives of the
people he worked with, and technically in the design and maintenance
of several subsystems. He is greatly missed.
He also leaves behind a number of much loved subsystems now orphaned.
This patch updates the MAINTAINERS file for the areas that David was
responsible for and adds an entry for him to the CREDITS file.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Harry Wei <harryxiyou@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
_sdata needs to be declared in the linker script now as of commit
a2d063ac21 ("extable, core_kernel_data(): Make sure all archs define
_sdata")
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
die_if_no_fixup() shouldn't use get_user() as it doesn't call set_fs() to
indicate that it wants to probe a kernel address. Instead it should use
probe_kernel_read().
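A hedged sketch of the safe variant (probe_kernel_read() returns 0 on
success; the surrounding names are illustrative):
  /* Sketch: fetch the faulting opcode without the set_fs() dance. */
  u16 opcode;
  if (probe_kernel_read(&opcode, (void *)regs->pc, sizeof(opcode)) < 0)
          goto no_fixup;  /* hypothetical label: treat as unreadable */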
This fixes the problem of gdb seeing SIGILL rather than SIGTRAP when hitting
the KGDB special breakpoint upon SysRq+g being seen. The problem was that
die_if_no_fixup() was failing to read the opcode of the instruction that caused
the exception, and thus not fixing up the exception.
This caused gdb to get an S04 response to the $? request in its remote protocol
rather than S05 - which would then cause it to continue with $C04 rather than
$c in an attempt to pass the signal on to the inferior process. The kernel,
however, does not support $Cnn, and so objects by returning an E22 response,
indicating an error. gdb does not expect this and prints:
warning: Remote failure reply: E22
and then returns to the gdb command prompt unable to continue.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One of the kernel debugger cacheflush variants escaped proper testing. Two of
the labels are wrong, being derived from the code that was copied to construct
the variant.
The first label results in the following assembler message:
AS arch/mn10300/mm/cache-dbg-flush-by-reg.o
arch/mn10300/mm/cache-dbg-flush-by-reg.S: Assembler messages:
arch/mn10300/mm/cache-dbg-flush-by-reg.S:123: Error: symbol `debugger_local_cache_flushinv_no_dcache' is already defined
And the second label results in the following linker message:
arch/mn10300/mm/built-in.o:(.text+0x1d39): undefined reference to `mn10300_local_icache_inv_range_reg_end'
arch/mn10300/mm/built-in.o:(.text+0x1d39): relocation truncated to fit: R_MN10300_PCREL16 against undefined symbol `mn10300_local_icache_inv_range_reg_end'
To test this file the following configuration pieces must be set:
CONFIG_AM34=y
CONFIG_MN10300_CACHE_WBACK=y
CONFIG_MN10300_DEBUGGER_CACHE_FLUSH_BY_REG=y
CONFIG_MN10300_CACHE_MANAGE_BY_REG=y
CONFIG_AM34_HAS_CACHE_SNOOP=n
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap-2.6: (21 commits)
ARM: OMAP4: MMC: increase delay for pbias
arm: omap2plus: move NAND_BLOCK_SIZE out of boards
omap4: hwmod: Enable the keypad
omap3: Free Beagle rev gpios when they are read, so others can read them later
arm: omap3: beagle: Ensure msecure is mux'd to be able to set the RTC
omap: rx51: Don't power up speaker amplifier at bootup
omap: rx51: Set regulator V28_A always on
ARM: OMAP4: MMC: no regulator off during probe for eMMC
arm: omap2plus: fix ads7846 pendown gpio request
ARM: OMAP2: Add missing iounmap in omap4430_phy_init
ARM: omap4: Pass core and wakeup mux tables to omap4_mux_init
ARM: omap2+: mux: Allow board mux settings to be NULL
OMAP4: fix return value of omap4_l3_init
OMAP: iovmm: fix SW flags passed by user
arch/arm/mach-omap1/dma.c: Invert calls to platform_device_put and platform_device_del
OMAP2+: mux: fix compilation warnings
OMAP: SRAM: Fix warning: format '%08lx' expects type 'long unsigned int'
arm: omap3: cm-t3517: fix section mismatch warning
OMAP2+: Fix 9 section mismatch(es) warnings from mach-omap2/built-in.o
ARM: OMAP2: Add missing include of linux/gpio.h
...
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
vfs: make unlink() and rmdir() return ENOENT in preference to EROFS
lmLogOpen() broken failure exit
usb: remove bad dput after dentry_unhash
more conservative S_NOSEC handling
Note that this adds a little overhead to account for the inodes
moved/enqueued from b_dirty to b_io. The "moved" accounting may later be
used to limit the number of inodes that can be moved in one shot, in order
to keep spinlock hold time under control.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
It is valuable to know how the dirty inodes are iterated and their IO size.
"writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"
- "state" reflects inode->i_state at the end of writeback_single_inode()
- "index" reflects mapping->writeback_index after the ->writepages() call
- "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode()
- "wrote" is the number of pages actually written
v2: add trace event writeback_single_inode_requeue as proposed by Dave.
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Remove two unused struct writeback_control fields:
.encountered_congestion (completely unused)
.nonblocking (never set; checked/shown in XFS, NFS, btrfs)
The .for_background check in nfs_write_inode() is also removed btw,
as .for_background implies WB_SYNC_NONE.
Reviewed-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
When wbc.more_io was first introduced, it indicated whether there was at
least one superblock whose s_more_io list contained more IO work. Now, with
per-bdi writeback, it can be replaced with a simple b_more_io test.
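In sketch form (b_more_io is the per-bdi list in struct bdi_writeback):
  /* Sketch: the old wbc.more_io flag becomes a direct test of the
   * per-bdi list, evaluated where the flag used to be consulted. */
  static inline bool wb_has_more_io(struct bdi_writeback *wb)
  {
          return !list_empty(&wb->b_more_io);
  }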
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.
Notes about the tmpfs/ramfs behavior changes:
In 2.6.36 and older kernels, tmpfs writes will sleep inside
balance_dirty_pages() as long as we are over the (dirty+background)/2
global throttle threshold. This is because both the dirty pages and the
threshold will be 0 for tmpfs/ramfs. Hence this test will always
evaluate to TRUE:
dirty_exceeded =
(bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
|| (nr_reclaimable + nr_writeback >= dirty_thresh);
For 2.6.37, someone complained that the current logic does not allow the
users to set vm.dirty_ratio=0. So commit 4cbec4c8b9 changed the test to
dirty_exceeded =
(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
|| (nr_reclaimable + nr_writeback > dirty_thresh);
So 2.6.37 will behave differently for tmpfs/ramfs: it will never get
throttled unless the global dirty threshold is exceeded (which is very
unlikely to happen; once it does, it will block many tasks).
I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means
that on a busy writing server, tmpfs write()s may get livelocked! The
"inadvertent" throttling can hardly help any workload because of its
"either no throttling, or throttled to death" property.
So based on 2.6.37, this patch won't bring more noticeable changes.
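The skip itself can be as small as the following sketch
(bdi_cap_account_dirty() is the existing capability test; the exact
placement in the throttling path is illustrative):
  /* Sketch: backings with no dirty accounting (tmpfs, ramfs) are
   * never throttled, so bail out before any checks. */
  if (!bdi_cap_account_dirty(mapping->backing_dev_info))
          return;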
CC: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Clarify the bdi_dirty_limit() comment.
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
This removes writeback_control.wb_start and does more straightforward
sync livelock prevention by setting .older_than_this to prevent extra
inodes from being enqueued in the first place.
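A sketch of the mechanism (in this era wbc.older_than_this is a pointer to
a jiffies cutoff; placement illustrative):
  /* Sketch: fix the expiry cutoff once, when the work starts, so
   * queue_io() never enqueues inodes dirtied after that moment. */
  unsigned long oldest_jif = jiffies;
  wbc.older_than_this = &oldest_jif;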
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Code refactor for more logical code layout.
No behavior change.
- remove the mis-named __writeback_inodes_sb()
- wb_writeback()/writeback_inodes_wb() will decide when to queue_io()
before calling __writeback_inodes_wb()
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
as it's currently the most contended lock in the system for metadata-heavy
workloads. It won't help for single-filesystem workloads, for
which we'll need the I/O-less balance_dirty_pages, but at least we
can dedicate a cpu to spinning on each bdi now for larger systems.
Based on earlier patches from Nick Piggin and Dave Chinner.
It reduces lock contentions to 1/4 in this test case:
10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram
lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla 2.6.39-rc3:
inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23
------------------
inode_wb_list_lock 2 [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
inode_wb_list_lock 34 [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
inode_wb_list_lock 12893 [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
inode_wb_list_lock 10702 [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
------------------
inode_wb_list_lock 2 [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
inode_wb_list_lock 19 [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
inode_wb_list_lock 5550 [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
inode_wb_list_lock 8511 [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157
2.6.39-rc3 + patch:
&(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37
------------------------
&(&wb->list_lock)->rlock 10 [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
&(&wb->list_lock)->rlock 1493 [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
&(&wb->list_lock)->rlock 3652 [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
&(&wb->list_lock)->rlock 1412 [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
------------------------
&(&wb->list_lock)->rlock 3 [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
&(&wb->list_lock)->rlock 6 [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
&(&wb->list_lock)->rlock 2061 [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
&(&wb->list_lock)->rlock 2629 [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
[hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old]
[akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
There is no point in carrying different refill policies between for_kupdate
and other types of work. Use a consistent "refill b_io iff empty" policy,
which can guarantee fairness in an easy-to-understand way.
A b_io refill will set up a _fixed_ work set of all currently eligible
inodes and start a new round of walking through b_io. The "fixed" work set
means no new inodes will be added to the work set during the walk.
Only when a complete walk over b_io is done will the inodes that are
eligible at that time be enqueued and the walk started over.
This procedure provides fairness among the inodes because it guarantees
that each inode is synced once and only once per round, so no inode
will be starved.
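The policy reduces to a one-line guard at the top of the walk, sketched
here with illustrative placement:
  /* Sketch: only refill the work set once the previous one is drained. */
  if (list_empty(&wb->b_io))
          queue_io(wb, wbc->older_than_this);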
This change relies on wb_writeback() to keep retrying as long as we make
some progress cleaning pages and/or inodes. Without that ability, the old
logic for background work relied on aggressively queuing all eligible
inodes into b_io every time, but that is not a guarantee.
The test script below completes slightly faster now:
2.6.39-rc3 2.6.39-rc3-dyn-expire+
------------------------------------------------
all elapsed 256.043 252.367
stddev 24.381 12.530
tar elapsed 30.097 28.808
dd elapsed 13.214 11.782
#!/bin/zsh
cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/
umount /dev/sda7
mkfs.xfs -f /dev/sda7
mount /dev/sda7 /fs
echo 3 > /proc/sys/vm/drop_caches
tic=$(cat /proc/uptime|cut -d' ' -f2)
cd /fs
time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &
wait
sync
tac=$(cat /proc/uptime|cut -d' ' -f2)
echo elapsed: $((tac - tic))
It maintains roughly the same small vs. large file writeout shares, and
offers large files a better chance to be written in nice 4M chunks.
Analysis from Dave Chinner, in great detail:
Let's say we have lots of inodes with 100 dirty pages being created,
and one large writeback going on. We expire 8 new inodes for every
1024 pages we write back.
With the old code, we do:
b_more_io (large inode) -> b_io (1l)
8 newly expired inodes -> b_io (1l, 8s)
writeback large inode 1024 pages -> b_more_io
b_more_io (large inode) -> b_io (8s, 1l)
8 newly expired inodes -> b_io (8s, 1l, 8s)
writeback 8 small inodes 800 pages
1 large inode 224 pages -> b_more_io
b_more_io (large inode) -> b_io (8s, 1l)
8 newly expired inodes -> b_io (8s, 1l, 8s)
.....
Your new code:
b_more_io (large inode) -> b_io (1l)
8 newly expired inodes -> b_io (1l, 8s)
writeback large inode 1024 pages -> b_more_io
(b_io == 8s)
writeback 8 small inodes 800 pages
b_io empty: (1800 pages written)
b_more_io (large inode) -> b_io (1l)
14 newly expired inodes -> b_io (1l, 14s)
writeback large inode 1024 pages -> b_more_io
(b_io == 14s)
writeback 10 small inodes 1000 pages
1 small inode 24 pages -> b_more_io (1l, 1s(24))
writeback 5 small inodes 500 pages
b_io empty: (2548 pages written)
b_more_io (large inode) -> b_io (1l, 1s(24))
20 newly expired inodes -> b_io (1l, 1s(24), 20s)
......
Rough progression of pages written at b_io refill:
Old code:
total large file % of writeback
1024 224 21.9% (fixed)
New code:
total large file % of writeback
1800 1024 ~55%
2550 1024 ~40%
3050 1024 ~33%
3500 1024 ~29%
3950 1024 ~26%
4250 1024 ~24%
4500 1024 ~22.7%
4700 1024 ~21.7%
4800 1024 ~21.3%
4800 1024 ~21.3%
(pretty much steady state from here)
Ok, so the steady state is reached with a similar percentage of
writeback to the large file as the existing code. Ok, that's good,
but providing some evidence that it doesn't change the share of
writeback to the large file should be in the commit message ;)
The other advantage to this is that we always write 1024 page chunks
to the large file, rather than smaller "whatever remains" chunks.
CC: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Dynamically compute the dirty expire timestamp at queue_io() time.
writeback_control.older_than_this used to be determined at entrance to
the kupdate writeback work. This _static_ timestamp may go stale if the
kupdate work runs on and on. The flusher may then get stuck with some old
busy inodes, never considering newly expired inodes thereafter.
This has two possible problems:
- It is unfair for a large dirty inode to delay (for a long time) the
writeback of small dirty inodes.
- As time goes by, the large and busy dirty inode may contain only
_freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
delaying the expired dirty pages to the end of LRU lists, triggering
the evil pageout(). Nevertheless this patch merely addresses part
of the problem.
v2: keep policy changes inside wb_writeback() and keep the
wbc.older_than_this visibility as suggested by Dave.
CC: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
they only populate possibly a subset of the eligible inodes into b_io at
entrance time. When the queued set of inodes has all been synced, they just
return, possibly with all queued inode pages written but wbc.nr_to_write
still > 0.
For kupdate and background writeback, there may be more eligible inodes
sitting in b_dirty when the current set of b_io inodes is completed. So
it is necessary to try another round of writeback as long as we made some
progress in this round. When there are no more eligible inodes, no more
inodes will be enqueued in queue_io(), hence nothing could/will be
synced and we may safely bail.
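In outline, the retry loop looks like this sketch (names illustrative):
  /* Sketch: keep writing as long as the last round made progress and
   * the work's stop condition has not been met yet. */
  do {
          progress = writeback_inodes_wb(wb, &wbc);
  } while (progress && over_bground_thresh());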
For example, imagine 100 inodes
i0, i1, i2, ..., i90, i91, i99
At queue_io() time, i90-i99 happen to be expired and moved to s_io for
IO. When finished successfully, if their total size is less than
MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
quit the background work (w/o this patch) while it's still over the
background threshold. This will be a fairly normal/frequent case, I guess.
Now that we do tagged sync and update inode->dirtied_when after the sync,
this change won't livelock sync(1). I actually tried to write 1 page
per 1ms with this command
write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
and do sync(1) at the same time. The sync completes quickly on ext4,
xfs, btrfs.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
The flusher works on dirty inodes in batches, and may quit prematurely
if a batch of inodes happens to be only metadata-dirtied: in this case
wbc->nr_to_write won't be decreased at all, which stands for "no pages
written" but is also misinterpreted as "no progress".
So introduce writeback_control.inodes_written to count the inodes that get
cleaned from the VFS point of view. A non-zero value means there is some
progress on writeback, in which case more writeback can be tried.
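A sketch of the accounting (the field name is from the description; the
increment site is illustrative):
  /* Sketch: after writing one inode, record it as progress even when
   * no pages were written (metadata-only dirty). */
  if (!(inode->i_state & I_DIRTY))
          wbc->inodes_written++;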
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>