Commit Graph

263272 Commits

Author SHA1 Message Date
Paul Mundt
debf950716 serial: sh-sci: Generalize overrun handling.
This consolidates all of the broken out overrun handling and ensures that
we have sensible defaults per-port type, in addition to making sure that
overruns are flagged appropriately in the error mask for parts that
haven't explicitly disabled support for it.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2011-06-08 18:19:37 +09:00
Sekhar Nori
7e9f194521 davinci: make PCM platform devices static
Make the PCM device structures used in devices.c
and devices-da8xx.c static as they are used only
in the respective files.

This was found when trying to build a single image
for DaVinci and DA8x devices using runtime P2V support.

Signed-off-by: Sekhar Nori <nsekhar@ti.com>
2011-06-08 14:41:37 +05:30
Thomas Gleixner
7416401661 arm: davinci: Fix fallout from generic irq chip conversion
The code which does the chained handler setup was overwriting
chip_data.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Holger Hans Peter Freyther <holger@freyther.de>
Signed-off-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
2011-06-08 14:33:52 +05:30
Paul Mundt
b030340161 serial: sh-sci: Kill off some more unused definitions.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2011-06-08 17:13:20 +09:00
Paul Mundt
a01cdc1068 serial: sh-sci: Tidy up ioread/write wrappers, kill off unused SCI helper.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2011-06-08 17:06:25 +09:00
Stefan Kriwanek
74bc695313 HID: Add driver to fix Speedlink VAD Cezanne support
Speedlink VAD Cezanne have a hardware bug that makes the cursor "jump" from one
place to another every now and then. The issue are relative motion events
erroneously reported by the device, each having a distance value of +256. This
256 can in fact never occur due to real motion, therefore those events can
safely be ignored.  The driver also drops useless EV_REL events with a value of
0, that the device sends every time it sends an "real" EV_REL or EV_KEY event.

Signed-off-by: Stefan Kriwanek <mail@stefankriwanek.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-06-08 09:45:37 +02:00
Stephen Rothwell
ffbc03bc75 net: add needed interrupt.h
Fixes these errors after the removal of interrupt.h from netdevice.h:

drivers/net/ll_temac_main.c: In function 'temac_open':
drivers/net/ll_temac_main.c:859:2: error: implicit declaration of function 'request_irq'
drivers/net/ll_temac_main.c:870:2: error: implicit declaration of function 'free_irq'
drivers/net/ll_temac_main.c: In function 'temac_poll_controller':
drivers/net/ll_temac_main.c:903:2: error: implicit declaration of function 'disable_irq'
drivers/net/ll_temac_main.c:909:2: error: implicit declaration of function 'enable_irq'

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-08 00:15:34 -07:00
Srinivas KANDAGATLA
5bdbd4fa4d sh: Fix up xchg/cmpxchg corruption with gUSA RB.
gUSA special cases r15 for part of its login/out sequence, meaning that
any parameters need to be explicitly prohibited from accidentally being
assigned that particular register, and the compiler ultimately needs to
use a temporary instead.

Certain configurations have begun generating code paths that do indeed
get allocated r15, resulting in immediate corruption of the exchanged
value. This was observed in (amongst others) exit_mm() code generation
where the xchg_u32 call was immediately corrupting a structure address.

As this is a general gUSA restriction, the rest of the users likewise
need to be updated to ensure sensible constraints.

References: https://bugzilla.stlinux.com/show_bug.cgi?id=11229
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com>
Reviewed-by: Stuart Menefy <stuart.menefy@st.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2011-06-08 15:22:39 +09:00
Jonathan Brassow
076f968b37 MD: add sync_super to mddev_t struct
Add the 'sync_super' function pointer to MD array structure (struct mddev_s)

If device-mapper (dm-raid.c) is to define its own on-disk superblock and be
able to load it, there must still be a way for MD to initiate superblock
updates.  The simplest way to make this happen is to provide a pointer in
the MD array structure that can be set by device-mapper (or other module)
with a function to do this.  If the function has been set, it will be used;
otherwise, the method with be looked up via 'super_types' as usual.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-06-08 15:11:31 +10:00
Jonathan Brassow
1ed7242e59 MD: raid1 changes to allow use by device mapper
MD RAID1: Changes to allow RAID1 to be used by device-mapper (dm-raid.c)

Added the necessary congestion function and conditionalize calls requiring an
array 'queue' or 'gendisk'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-06-08 15:11:31 +10:00
Jonathan Brassow
0fd018af37 MD: move thread wakeups into resume
Move personality and sync/recovery thread starting outside md_run.

Moving the wakeup's of the personality and sync/recovery threads out of
md_run and into do_md_run and mddev_resume solves two issues:
1) It allows bitmap_load to be called before the sync_thread is run and
2) when MD personalities are used by device-mapper (dm-raid.c), the start-up
of the array is better alligned with device-mapper primatives
(CTR/resume/suspend/DTR).  I/O - in this case, recovery operations - should
not happen until after a resume has taken place.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-06-08 15:11:31 +10:00
Jonathan Brassow
ac42450c7c MD: possible typo
Make message a bit clearer by s/blocks/k/

I chose 'k' vs 'kiB' or 'kB' because it is what is used earlier in the
message.  'k' may be a bit ambigous, but I think it's better than "blocks"
which normally means 512, but means 1024 in MD.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-06-08 15:11:31 +10:00
Jonathan Brassow
68866e425b MD: no sync IO while suspended
Disallow resync I/O while the RAID array is suspended.

Recovery, resync, and metadata I/O should not be allowed while a device is
suspended.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-06-08 15:10:08 +10:00
Jonathan Brassow
629acb6aba MD: no integrity register if no gendisk
Don't attempt md_integrity_register if there is no gendisk struct available.

When MD arrays are built via device-mapper, the gendisk structure is not
available via mddev.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-06-08 15:10:08 +10:00
Sage Weil
0c1f91f271 ceph: unwind canceled flock state
If we request a lock and then abort (e.g., ^C), we need to send a matching
unlock request to the MDS to unwind our lock attempt to avoid indefinitely
blocking other clients.

Reported-by: Brian Chrisman <brchrisman@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-07 21:36:45 -07:00
Sage Weil
0e98728fa3 ceph: fix ENOENT logic in striped_read
Getting ENOENT is equivalent to reading 0 bytes.  Make that correction
before setting up the hit_stripe and was_short flags.

Fixes the following case:
 dd if=/dev/zero of=/mnt/fs_depot/dd3 bs=1 seek=1048576 count=0
 dd if=/mnt/fs_depot/dd3 of=/root/ddout1 skip=8 bs=500 count=2 iflag=direct

Reported-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-07 21:34:16 -07:00
Sage Weil
c3cd62839a ceph: fix short sync reads from the OSD
If we get a short read from the OSD because the object is small, we need to
zero the remainder of the buffer.  For O_DIRECT reads, the attempted range
is not trimmed to i_size by the VFS, so we were actually looping
indefinitely.

Fix by trimming by i_size, and the unconditionally zeroing the trailing
range.

Reported-by: Jeff Wu <cpwu@tnsoft.com.cn>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-07 21:34:14 -07:00
Sage Weil
2584547230 ceph: fix sync vs canceled write
If we cancel a write, trigger the safe completions to prevent a sync from
blocking indefinitely in ceph_osdc_sync().

Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-07 21:34:13 -07:00
Sage Weil
70b666c3b4 ceph: use ihold when we already have an inode ref
We should use ihold whenever we already have a stable inode ref, even
when we aren't holding i_lock.  This avoids adding new and unnecessary
locking dependencies.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-07 21:34:11 -07:00
Steffen Klassert
e756682c8b xfrm: Fix off by one in the replay advance functions
We may write 4 byte too much when we reinitialize the anti replay
window in the replay advance functions. This patch fixes this by
adjusting the last index of the initialization loop.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-07 21:14:39 -07:00
Linus Torvalds
cb0a02ecf9 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  genirq: Ensure we locate the passed IRQ in irq_alloc_descs()
  genirq: Fix descriptor init on non-sparse IRQs
  irq: Handle spurios irq detection for threaded irqs
  genirq: Print threaded handler in spurious debug output
2011-06-07 19:21:11 -07:00
Linus Torvalds
d681f1204d Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86/amd-iommu: Fix boot crash with hidden PCI devices
  x86/amd-iommu: Use only per-device dma_ops
  x86/amd-iommu: Fix 3 possible endless loops
2011-06-07 19:20:53 -07:00
Linus Torvalds
6715a52a58 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Fix/clarify set_task_cpu() locking rules
  lockdep: Fix lock_is_held() on recursion
  sched: Fix schedstat.nr_wakeups_migrate
  sched: Fix cross-cpu clock sync on remote wakeups
2011-06-07 19:20:28 -07:00
Linus Torvalds
ef2398019b Merge branch 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6
* 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
  drm/nv40: fall back to paged dma object for the moment
  drm/nouveau: fix leak of gart mm node
  drm/nouveau: fix vram page mapping when crossing page table boundaries
  drm/nv17-nv40: Fix modesetting failure when pitch == 4096px (fdo bug 35901).
  drm/nouveau: don't create accel engine objects when noaccel=1
  drm/nvc0: recognise 0xdX chipsets as NV_C0
  drm/i915: Add a no lvds quirk for the Asus EeeBox PC EB1007
  drm/i915: Share the common force-audio property between connectors
  drm/i915: Remove unused enum "chip_family"
  drm/915: fix relaxed tiling on gen2: tile height
  drm/i915/crt: Explicitly return false if connected to a digital monitor
  drm/i915: Replace ironlake_compute_wm0 with g4x_compute_wm0
  drm/i915: Only print out the actual number of fences for i915_error_state
  drm/i915: s/addr & ~PAGE_MASK/offset_in_page(addr)/
  drm: i915: correct return status in intel_hdmi_mode_valid()
  drm/i915: fix regression after clock gating init split
  drm/i915: fix if statement in ivybridge irq handler
2011-06-07 19:09:26 -07:00
Linus Torvalds
12871a0bd6 Merge branch 'drm-radeon-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6
* 'drm-radeon-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
  drm/radeon/kms/atom: fix PHY init
  drm/radeon/kms: add missing Evergreen texture formats to the CS parser
  drm/radeon/kms: viewport height has to be even
  drm/radeon/kms: remove duplicate reg from r600 safe regs
  drm/radeon/kms: add support for Llano Fusion APUs
  drm/radeon/kms: add llano pci ids
  drm/radeon/kms: fill in asic struct for llano
  drm/radeon/kms: add family ids for llano APUs
  drm/radeon: fix oops in ttm reserve when pageflipping (v2)
  drm/radeon/kms: clean up the radeon kms Kconfig
  drm/radeon/kms: fix thermal sensor reading on juniper
  drm/radeon/kms: add missing case for cayman thermal sensor
  drm/radeon/kms: add blit support for cayman (v2)
  drm/radeon/kms/blit: workaround some hw issues on evergreen+
2011-06-07 19:09:17 -07:00
Linus Torvalds
ecff4fcc7b Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  timers: Consider slack value in mod_timer()
  clockevents: Handle empty cpumask gracefully
2011-06-07 19:07:22 -07:00
Linus Torvalds
58a9a36b54 Merge branch 'kvm-updates/3.0' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/3.0' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: Initialize kvm before registering the mmu notifier
  KVM: x86: use proper port value when checking io instruction permission
  KVM: add missing void __user * cast to access_ok() call
2011-06-07 19:06:28 -07:00
Grant Likely
22b174f8b7 MAINTAINERS: Saying goodbye to David Brownell
We had to say goodbye when David passed away recently.  David had a
huge impact on our community, both personally in the lives of the
people he worked with, and technically in the design and maintenance
of several subsystems.  He is greatly missed.

He also leaves behind a number of much loved subsystems now orphaned.
This patch updates the MAINTAINERS file for the areas that David was
responsible for and adds an entry for him to the CREDITS file.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Harry Wei <harryxiyou@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-07 19:05:26 -07:00
Linus Torvalds
9c125d2af8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6:
  fat: Fix corrupt inode flags when remove ATTR_SYS flag
2011-06-07 19:04:21 -07:00
David Howells
40182373ac MN10300: Add missing _sdata declaration
_sdata needs to be declared in the linker script now as of commit
a2d063ac21 ("extable, core_kernel_data(): Make sure all archs define
_sdata")

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-07 19:03:53 -07:00
David Howells
db1c9dfa64 MN10300: die_if_no_fixup() shouldn't use get_user() as it doesn't call set_fs()
die_if_no_fixup() shouldn't use get_user() as it doesn't call set_fs() to
indicate that it wants to probe a kernel address.  Instead it should use
probe_kernel_read().

This fixes the problem of gdb seeing SIGILL rather than SIGTRAP when hitting
the KGDB special breakpoint upon SysRq+g being seen.  The problem was that
die_if_no_fixup() was failing to read the opcode of the instruction that caused
the exception, and thus not fixing up the exception.

This caused gdb to get a S04 response to the $? request in its remote protocol
rather than S05 - which would then cause it to continue with $C04 rather than
$c in an attempt to pass the signal onto the inferior process.  The kernel,
however, does not support $Cnn, and so objects by returning an E22 response,
indicating an error.  gdb does not expect this and prints:

	warning: Remote failure reply: E22

and then returns to the gdb command prompt unable to continue.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-07 19:03:52 -07:00
David Howells
2e65d1f6ee MN10300: Fix one of the kernel debugger cacheflush variants
One of the kernel debugger cacheflush variants escaped proper testing.  Two of
the labels are wrong, being derived from the code that was copied to construct
the variant.

The first label results in the following assembler message:

    AS      arch/mn10300/mm/cache-dbg-flush-by-reg.o
  arch/mn10300/mm/cache-dbg-flush-by-reg.S: Assembler messages:
  arch/mn10300/mm/cache-dbg-flush-by-reg.S:123: Error: symbol `debugger_local_cache_flushinv_no_dcache' is already defined

And the second label results in the following linker message:

  arch/mn10300/mm/built-in.o:(.text+0x1d39): undefined reference to `mn10300_local_icache_inv_range_reg_end'
  arch/mn10300/mm/built-in.o:(.text+0x1d39): relocation truncated to fit: R_MN10300_PCREL16 against undefined symbol `mn10300_local_icache_inv_range_reg_end'

To test this file the following configuration pieces must be set:

	CONFIG_AM34=y
	CONFIG_MN10300_CACHE_WBACK=y
	CONFIG_MN10300_DEBUGGER_CACHE_FLUSH_BY_REG=y
	CONFIG_MN10300_CACHE_MANAGE_BY_REG=y
	CONFIG_AM34_HAS_CACHE_SNOOP=n

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-07 19:03:52 -07:00
Linus Torvalds
aec040e29e Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6:
  [S390] fix kvm defines for 31 bit compile
  [S390] use generic RCU page-table freeing code
  [S390] qdio: Split SBAL entry flags
  [S390] kvm-s390: fix stfle facilities numbers >=64
  [S390] kvm-s390: Fix host crash on misbehaving guests
2011-06-07 18:47:53 -07:00
Linus Torvalds
8ea656bdc9 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap-2.6
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap-2.6: (21 commits)
  ARM: OMAP4: MMC: increase delay for pbias
  arm: omap2plus: move NAND_BLOCK_SIZE out of boards
  omap4: hwmod: Enable the keypad
  omap3: Free Beagle rev gpios when they are read, so others can read them later
  arm: omap3: beagle: Ensure msecure is mux'd to be able to set the RTC
  omap: rx51: Don't power up speaker amplifier at bootup
  omap: rx51: Set regulator V28_A always on
  ARM: OMAP4: MMC: no regulator off during probe for eMMC
  arm: omap2plus: fix ads7846 pendown gpio request
  ARM: OMAP2: Add missing iounmap in omap4430_phy_init
  ARM: omap4: Pass core and wakeup mux tables to omap4_mux_init
  ARM: omap2+: mux: Allow board mux settings to be NULL
  OMAP4: fix return value of omap4_l3_init
  OMAP: iovmm: fix SW flags passed by user
  arch/arm/mach-omap1/dma.c: Invert calls to platform_device_put and platform_device_del
  OMAP2+: mux: fix compilation warnings
  OMAP: SRAM: Fix warning: format '%08lx' expects type 'long unsigned int'
  arm: omap3: cm-t3517: fix section mismatch warning
  OMAP2+: Fix 9 section mismatch(es) warnings from mach-omap2/built-in.o
  ARM: OMAP2: Add missing include of linux/gpio.h
  ...
2011-06-07 18:46:37 -07:00
Linus Torvalds
d205df9955 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  GFS2: Processes waiting on inode glock that no processes are holding
2011-06-07 18:44:10 -07:00
Linus Torvalds
24210071e0 Merge branch 'fbdev-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/fbdev-3.x
* 'fbdev-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/fbdev-3.x:
  video: Fix use-after-free by vga16fb on rmmod
  video: Convert vmalloc/memset to vzalloc
  efifb: Disallow manual bind and unbind
  efifb: Fix mismatched request/release_mem_region
  efifb: Enable write-combining
  drivers/video/pxa168fb.c: add missing clk_put
  drivers/video/imxfb.c: add missing clk_put
  fbdev: bf537-lq035: add missing blacklight properties type
  savagefb: Use panel CVT mode as default
  fbdev: sh_mobile_lcdcfb: Fix up fallout from MERAM changes.
2011-06-07 18:41:26 -07:00
Linus Torvalds
8397345172 Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  vfs: make unlink() and rmdir() return ENOENT in preference to EROFS
  lmLogOpen() broken failure exit
  usb: remove bad dput after dentry_unhash
  more conservative S_NOSEC handling
2011-06-07 18:36:59 -07:00
Wu Fengguang
e84d0a4f8e writeback: trace event writeback_queue_io
Note that it adds a little overheads to account the moved/enqueued
inodes from b_dirty to b_io. The "moved" accounting may be later used to
limit the number of inodes that can be moved in one shot, in order to
keep spinlock hold time under control.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:23 +08:00
Wu Fengguang
251d6a471c writeback: trace event writeback_single_inode
It is valuable to know how the dirty inodes are iterated and their IO size.

"writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"

- "state" reflects inode->i_state at the end of writeback_single_inode()
- "index" reflects mapping->writeback_index after the ->writepages() call
- "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode()
- "wrote" is the number of pages actually written

v2: add trace event writeback_single_inode_requeue as proposed by Dave.

CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:23 +08:00
Wu Fengguang
846d5a091b writeback: remove .nonblocking and .encountered_congestion
Remove two unused struct writeback_control fields:

	.encountered_congestion	(completely unused)
	.nonblocking		(never set, checked/showed in XFS,NFS/btrfs)

The .for_background check in nfs_write_inode() is also removed btw,
as .for_background implies WB_SYNC_NONE.

Reviewed-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:23 +08:00
Wu Fengguang
b7a2441f99 writeback: remove writeback_control.more_io
When wbc.more_io was first introduced, it indicates whether there are
at least one superblock whose s_more_io contains more IO work. Now with
the per-bdi writeback, it can be replaced with a simple b_more_io test.

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:23 +08:00
Wu Fengguang
3efaf0faba writeback: skip balance_dirty_pages() for in-memory fs
This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.

Notes about the tmpfs/ramfs behavior changes:

As for 2.6.36 and older kernels, the tmpfs writes will sleep inside
balance_dirty_pages() as long as we are over the (dirty+background)/2
global throttle threshold.  This is because both the dirty pages and
threshold will be 0 for tmpfs/ramfs. Hence this test will always
evaluate to TRUE:

                dirty_exceeded =
                        (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
                        || (nr_reclaimable + nr_writeback >= dirty_thresh);

For 2.6.37, someone complained that the current logic does not allow the
users to set vm.dirty_ratio=0.  So commit 4cbec4c8b9 changed the test to

                dirty_exceeded =
                        (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
                        || (nr_reclaimable + nr_writeback > dirty_thresh);

So 2.6.37 will behave differently for tmpfs/ramfs: it will never get
throttled unless the global dirty threshold is exceeded (which is very
unlikely to happen; once happen, will block many tasks).

I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means
for a busy writing server, tmpfs write()s may get livelocked! The
"inadvertent" throttling can hardly bring help to any workload because
of its "either no throttling, or get throttled to death" property.

So based on 2.6.37, this patch won't bring more noticeable changes.

CC: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:22 +08:00
Wu Fengguang
6f71865627 writeback: add bdi_dirty_limit() kernel-doc
Clarify the bdi_dirty_limit() comment.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:22 +08:00
Wu Fengguang
e185dda89d writeback: avoid extra sync work at enqueue time
This removes writeback_control.wb_start and does more straightforward
sync livelock prevention by setting .older_than_this to prevent extra
inodes from being enqueued in the first place.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:22 +08:00
Wu Fengguang
e8dfc30582 writeback: elevate queue_io() into wb_writeback()
Code refactor for more logical code layout.
No behavior change.

- remove the mis-named __writeback_inodes_sb()

- wb_writeback()/writeback_inodes_wb() will decide when to queue_io()
  before calling __writeback_inodes_wb()

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:21 +08:00
Christoph Hellwig
f758eeabeb writeback: split inode_wb_list_lock into bdi_writeback.list_lock
Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
as it's currently the most contended lock in the system for metadata
heavy workloads.  It won't help for single-filesystem workloads for
which we'll need the I/O-less balance_dirty_pages, but at least we
can dedicate a cpu to spinning on each bdi now for larger systems.

Based on earlier patches from Nick Piggin and Dave Chinner.

It reduces lock contentions to 1/4 in this test case:
10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla 2.6.39-rc3:
                      inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                      ------------------
                      inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                      inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                      inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                      inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                      ------------------
                      inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                      inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                      inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                      inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157

2.6.39-rc3 + patch:
                &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                ------------------------
                &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                ------------------------
                &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f

hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:21 +08:00
Wu Fengguang
424b351fe1 writeback: refill b_io iff empty
There is no point to carry different refill policies between for_kupdate
and other type of works. Use a consistent "refill b_io iff empty" policy
which can guarantee fairness in an easy to understand way.

A b_io refill will setup a _fixed_ work set with all currently eligible
inodes and start a new round of walk through b_io. The "fixed" work set
means no new inodes will be added to the work set during the walk.
Only when a complete walk over b_io is done, new inodes that are
eligible at the time will be enqueued and the walk be started over.

This procedure provides fairness among the inodes because it guarantees
each inode to be synced once and only once at each round. So all inodes
will be free from starvations.

This change relies on wb_writeback() to keep retrying as long as we made
some progress on cleaning some pages and/or inodes. Without that ability,
the old logic on background works relies on aggressively queuing all
eligible inodes into b_io at every time. But that's not a guarantee.

The below test script completes a slightly faster now:

             2.6.39-rc3	  2.6.39-rc3-dyn-expire+
------------------------------------------------
all elapsed     256.043      252.367
stddev           24.381       12.530

tar elapsed      30.097       28.808
dd  elapsed      13.214       11.782

	#!/bin/zsh

	cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/

	umount /dev/sda7
	mkfs.xfs -f /dev/sda7
	mount /dev/sda7 /fs

	echo 3 > /proc/sys/vm/drop_caches

	tic=$(cat /proc/uptime|cut -d' ' -f2)

	cd /fs
	time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
	time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &

	wait
	sync
	tac=$(cat /proc/uptime|cut -d' ' -f2)
	echo elapsed: $((tac - tic))

It maintains roughly the same small vs. large file writeout shares, and
offers large files better chances to be written in nice 4M chunks.

Analyzes from Dave Chinner in great details:

Let's say we have lots of inodes with 100 dirty pages being created,
and one large writeback going on. We expire 8 new inodes for every
1024 pages we write back.

With the old code, we do:

	b_more_io (large inode) -> b_io (1l)
	8 newly expired inodes -> b_io (1l, 8s)

	writeback  large inode 1024 pages -> b_more_io

	b_more_io (large inode) -> b_io (8s, 1l)
	8 newly expired inodes -> b_io (8s, 1l, 8s)

	writeback  8 small inodes 800 pages
		   1 large inode 224 pages -> b_more_io

	b_more_io (large inode) -> b_io (8s, 1l)
	8 newly expired inodes -> b_io (8s, 1l, 8s)
	.....

Your new code:

	b_more_io (large inode) -> b_io (1l)
	8 newly expired inodes -> b_io (1l, 8s)

	writeback  large inode 1024 pages -> b_more_io
	(b_io == 8s)
	writeback  8 small inodes 800 pages

	b_io empty: (1800 pages written)
		b_more_io (large inode) -> b_io (1l)
		14 newly expired inodes -> b_io (1l, 14s)

	writeback  large inode 1024 pages -> b_more_io
	(b_io == 14s)
	writeback  10 small inodes 1000 pages
		   1 small inode 24 pages -> b_more_io (1l, 1s(24))
	writeback  5 small inodes 500 pages
	b_io empty: (2548 pages written)
		b_more_io (large inode) -> b_io (1l, 1s(24))
		20 newly expired inodes -> b_io (1l, 1s(24), 20s)
	......

Rough progression of pages written at b_io refill:

Old code:

	total	large file	% of writeback
	1024	224		21.9% (fixed)

New code:
	total	large file	% of writeback
	1800	1024		~55%
	2550	1024		~40%
	3050	1024		~33%
	3500	1024		~29%
	3950	1024		~26%
	4250	1024		~24%
	4500	1024		~22.7%
	4700	1024		~21.7%
	4800	1024		~21.3%
	4800	1024		~21.3%
	(pretty much steady state from here)

Ok, so the steady state is reached with a similar percentage of
writeback to the large file as the existing code. Ok, that's good,
but providing some evidence that is doesn't change the shared of
writeback to the large should be in the commit message ;)

The other advantage to this is that we always write 1024 page chunks
to the large file, rather than smaller "whatever remains" chunks.

CC: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:21 +08:00
Wu Fengguang
ba9aa8399f writeback: the kupdate expire timestamp should be a moving target
Dynamically compute the dirty expire timestamp at queue_io() time.

writeback_control.older_than_this used to be determined at entrance to
the kupdate writeback work. This _static_ timestamp may go stale if the
kupdate work runs on and on. The flusher may then stuck with some old
busy inodes, never considering newly expired inodes thereafter.

This has two possible problems:

- It is unfair for a large dirty inode to delay (for a long time) the
  writeback of small dirty inodes.

- As time goes by, the large and busy dirty inode may contain only
  _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
  delaying the expired dirty pages to the end of LRU lists, triggering
  the evil pageout(). Nevertheless this patch merely addresses part
  of the problem.

v2: keep policy changes inside wb_writeback() and keep the
wbc.older_than_this visibility as suggested by Dave.

CC: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:21 +08:00
Wu Fengguang
e6fb6da2e1 writeback: try more writeback as long as something was written
writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
they only populate possibly a subset of eligible inodes into b_io at
entrance time. When the queued set of inodes are all synced, they just
return, possibly with all queued inode pages written but still
wbc.nr_to_write > 0.

For kupdate and background writeback, there may be more eligible inodes
sitting in b_dirty when the current set of b_io inodes are completed. So
it is necessary to try another round of writeback as long as we made some
progress in this round. When there are no more eligible inodes, no more
inodes will be enqueued in queue_io(), hence nothing could/will be
synced and we may safely bail.

For example, imagine 100 inodes

        i0, i1, i2, ..., i90, i91, i99

At queue_io() time, i90-i99 happen to be expired and moved to s_io for
IO. When finished successfully, if their total size is less than
MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
quit the background work (w/o this patch) while it's still over
background threshold. This will be a fairly normal/frequent case I guess.

Now that we do tagged sync and update inode->dirtied_when after the sync,
this change won't livelock sync(1).  I actually tried to write 1 page
per 1ms with this command

	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test

and do sync(1) at the same time. The sync completes quickly on ext4,
xfs, btrfs.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:20 +08:00
Wu Fengguang
cb9bd1159c writeback: introduce writeback_control.inodes_written
The flusher works on dirty inodes in batches, and may quit prematurely
if the batch of inodes happen to be metadata-only dirtied: in this case
wbc->nr_to_write won't be decreased at all, which stands for "no pages
written" but also mis-interpreted as "no progress".

So introduce writeback_control.inodes_written to count the inodes get
cleaned from VFS POV.  A non-zero value means there are some progress on
writeback, in which case more writeback can be tried.

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:20 +08:00