This consolidates all of the broken-out overrun handling and ensures that
we have sensible defaults per port type, in addition to making sure that
overruns are flagged appropriately in the error mask for parts that
haven't explicitly disabled support for it.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Make the PCM device structures used in devices.c
and devices-da8xx.c static as they are used only
in the respective files.
This was found when trying to build a single image
for DaVinci and DA8x devices using runtime P2V support.
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
The code which does the chained handler setup was overwriting
chip_data.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Holger Hans Peter Freyther <holger@freyther.de>
Signed-off-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Speedlink VAD Cezanne devices have a hardware bug that makes the cursor "jump"
from one place to another every now and then. The issue is relative motion
events erroneously reported by the device, each having a distance value of
+256. A distance of 256 can in fact never occur due to real motion, so those
events can safely be ignored. The driver also drops useless EV_REL events with
a value of 0, which the device sends every time it sends a "real" EV_REL or
EV_KEY event.
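A minimal sketch of such a filter, assuming the usual hid_driver .event
callback shape (a nonzero return discards the event); the real driver code
may differ in details:
  #include <linux/hid.h>
  #include <linux/input.h>
  /* Sketch: drop the two bogus relative motion values described above. */
  static int speedlink_event(struct hid_device *hdev, struct hid_field *field,
                             struct hid_usage *usage, __s32 value)
  {
          if (usage->type == EV_REL &&
              (usage->code == REL_X || usage->code == REL_Y)) {
                  if (value == 256 || value == 0) /* bogus jump / useless 0 */
                          return 1;               /* swallow the event */
          }
          return 0;                               /* pass everything else on */
  }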
Signed-off-by: Stefan Kriwanek <mail@stefankriwanek.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Fixes these errors after the removal of interrupt.h from netdevice.h:
drivers/net/ll_temac_main.c: In function 'temac_open':
drivers/net/ll_temac_main.c:859:2: error: implicit declaration of function 'request_irq'
drivers/net/ll_temac_main.c:870:2: error: implicit declaration of function 'free_irq'
drivers/net/ll_temac_main.c: In function 'temac_poll_controller':
drivers/net/ll_temac_main.c:903:2: error: implicit declaration of function 'disable_irq'
drivers/net/ll_temac_main.c:909:2: error: implicit declaration of function 'enable_irq'
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
gUSA special cases r15 for part of its login/logout sequence, meaning that
any parameters need to be explicitly prohibited from accidentally being
assigned that particular register, and the compiler ultimately needs to
use a temporary instead.
Certain configurations have begun generating code paths that do indeed
get allocated r15, resulting in immediate corruption of the exchanged
value. This was observed in (amongst others) exit_mm() code generation
where the xchg_u32 call was immediately corrupting a structure address.
As this is a general gUSA restriction, the rest of the users likewise
need to be updated to ensure sensible constraints.
References: https://bugzilla.stlinux.com/show_bug.cgi?id=11229
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com>
Reviewed-by: Stuart Menefy <stuart.menefy@st.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Add the 'sync_super' function pointer to MD array structure (struct mddev_s)
If device-mapper (dm-raid.c) is to define its own on-disk superblock and be
able to load it, there must still be a way for MD to initiate superblock
updates. The simplest way to make this happen is to provide a pointer in
the MD array structure that can be set by device-mapper (or other module)
with a function to do this. If the function has been set, it will be used;
otherwise, the method will be looked up via 'super_types' as usual.
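A sketch of the dispatch this implies (type and field names follow the
description; treat the exact placement as illustrative):
  /* Sketch: prefer the module-supplied hook, else fall back to the
   * super_types method as before. */
  static void sync_super(mddev_t *mddev, mdk_rdev_t *rdev)
  {
          if (mddev->sync_super) {        /* set by dm-raid or other module */
                  mddev->sync_super(mddev, rdev);
                  return;
          }
          super_types[mddev->major_version].sync_super(mddev, rdev);
  }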
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD RAID1: Changes to allow RAID1 to be used by device-mapper (dm-raid.c)
Added the necessary congestion function and conditionalized calls that
require an array 'queue' or 'gendisk'.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Move personality and sync/recovery thread starting outside md_run.
Moving the wakeup's of the personality and sync/recovery threads out of
md_run and into do_md_run and mddev_resume solves two issues:
1) It allows bitmap_load to be called before the sync_thread is run and
2) when MD personalities are used by device-mapper (dm-raid.c), the start-up
of the array is better aligned with device-mapper primitives
(CTR/resume/suspend/DTR). I/O - in this case, recovery operations - should
not happen until after a resume has taken place.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Make message a bit clearer by s/blocks/k/
I chose 'k' vs 'kiB' or 'kB' because it is what is used earlier in the
message. 'k' may be a bit ambiguous, but I think it's better than "blocks",
which normally means 512 bytes but means 1024 bytes in MD.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Disallow resync I/O while the RAID array is suspended.
Recovery, resync, and metadata I/O should not be allowed while a device is
suspended.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Don't attempt md_integrity_register if there is no gendisk struct available.
When MD arrays are built via device-mapper, the gendisk structure is not
available via mddev.
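In sketch form (guard placement illustrative):
  /* Sketch: dm-built arrays have no gendisk, so skip integrity
   * registration entirely. */
  if (!mddev->gendisk)
          return 0;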
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If we request a lock and then abort (e.g., ^C), we need to send a matching
unlock request to the MDS to unwind our lock attempt to avoid indefinitely
blocking other clients.
Reported-by: Brian Chrisman <brchrisman@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Getting ENOENT is equivalent to reading 0 bytes. Make that correction
before setting up the hit_stripe and was_short flags.
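A hedged sketch of the ordering (variable names follow the description, not
necessarily the exact ceph code):
  /* Sketch: fold ENOENT into "read 0 bytes" before deriving the flags. */
  if (ret == -ENOENT)
          ret = 0;
  hit_stripe = this_len < left;            /* more stripes left to read */
  was_short = ret >= 0 && ret < this_len;  /* object smaller than asked */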
Fixes the following case:
dd if=/dev/zero of=/mnt/fs_depot/dd3 bs=1 seek=1048576 count=0
dd if=/mnt/fs_depot/dd3 of=/root/ddout1 skip=8 bs=500 count=2 iflag=direct
Reported-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
If we get a short read from the OSD because the object is small, we need to
zero the remainder of the buffer. For O_DIRECT reads, the attempted range
is not trimmed to i_size by the VFS, so we were actually looping
indefinitely.
Fix by trimming to i_size, and then unconditionally zeroing the trailing
range.
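A minimal sketch of the two steps, with illustrative names:
  /* Sketch: O_DIRECT bypasses the VFS i_size trim, so do it by hand,
   * then zero whatever the OSD did not supply. */
  if (pos + len > i_size)
          len = i_size - pos;                    /* trim to EOF */
  if (ret >= 0 && ret < len)
          memset(buf + ret, 0, len - ret);       /* zero the trailing range */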
Reported-by: Jeff Wu <cpwu@tnsoft.com.cn>
Signed-off-by: Sage Weil <sage@newdream.net>
If we cancel a write, trigger the safe completions to prevent a sync from
blocking indefinitely in ceph_osdc_sync().
Signed-off-by: Sage Weil <sage@newdream.net>
We should use ihold whenever we already have a stable inode ref, even
when we aren't holding i_lock. This avoids adding new and unnecessary
locking dependencies.
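In sketch form (ihold() is the existing helper; the call site is
illustrative):
  /* Sketch: we already hold a stable reference, so just take another
   * one - unlike igrab(), ihold() does not touch i_lock. */
  ihold(inode);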
Signed-off-by: Sage Weil <sage@newdream.net>
We may write 4 bytes too many when we reinitialize the anti-replay
window in the replay advance functions. This patch fixes this by
adjusting the last index of the initialization loop.
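The shape of the bug, with hypothetical names (the real loops live in the
xfrm replay advance functions):
  /* Sketch: reinitializing a window of nr 32-bit words. The inclusive
   * bound writes one extra u32 (4 bytes) past the end of the bitmap. */
  for (i = 0; i <= nr; i++)       /* buggy: off by one */
          bmp[i] = 0;
  for (i = 0; i < nr; i++)        /* fixed: last index adjusted */
          bmp[i] = 0;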
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86/amd-iommu: Fix boot crash with hidden PCI devices
x86/amd-iommu: Use only per-device dma_ops
x86/amd-iommu: Fix 3 possible endless loops
* 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
drm/nv40: fall back to paged dma object for the moment
drm/nouveau: fix leak of gart mm node
drm/nouveau: fix vram page mapping when crossing page table boundaries
drm/nv17-nv40: Fix modesetting failure when pitch == 4096px (fdo bug 35901).
drm/nouveau: don't create accel engine objects when noaccel=1
drm/nvc0: recognise 0xdX chipsets as NV_C0
drm/i915: Add a no lvds quirk for the Asus EeeBox PC EB1007
drm/i915: Share the common force-audio property between connectors
drm/i915: Remove unused enum "chip_family"
drm/i915: fix relaxed tiling on gen2: tile height
drm/i915/crt: Explicitly return false if connected to a digital monitor
drm/i915: Replace ironlake_compute_wm0 with g4x_compute_wm0
drm/i915: Only print out the actual number of fences for i915_error_state
drm/i915: s/addr & ~PAGE_MASK/offset_in_page(addr)/
drm: i915: correct return status in intel_hdmi_mode_valid()
drm/i915: fix regression after clock gating init split
drm/i915: fix if statement in ivybridge irq handler
* 'drm-radeon-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
drm/radeon/kms/atom: fix PHY init
drm/radeon/kms: add missing Evergreen texture formats to the CS parser
drm/radeon/kms: viewport height has to be even
drm/radeon/kms: remove duplicate reg from r600 safe regs
drm/radeon/kms: add support for Llano Fusion APUs
drm/radeon/kms: add llano pci ids
drm/radeon/kms: fill in asic struct for llano
drm/radeon/kms: add family ids for llano APUs
drm/radeon: fix oops in ttm reserve when pageflipping (v2)
drm/radeon/kms: clean up the radeon kms Kconfig
drm/radeon/kms: fix thermal sensor reading on juniper
drm/radeon/kms: add missing case for cayman thermal sensor
drm/radeon/kms: add blit support for cayman (v2)
drm/radeon/kms/blit: workaround some hw issues on evergreen+
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timers: Consider slack value in mod_timer()
clockevents: Handle empty cpumask gracefully
* 'kvm-updates/3.0' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: Initialize kvm before registering the mmu notifier
KVM: x86: use proper port value when checking io instruction permission
KVM: add missing void __user * cast to access_ok() call
We had to say goodbye when David passed away recently. David had a
huge impact on our community, both personally in the lives of the
people he worked with, and technically in the design and maintenance
of several subsystems. He is greatly missed.
He also leaves behind a number of much loved subsystems now orphaned.
This patch updates the MAINTAINERS file for the areas that David was
responsible for and adds an entry for him to the CREDITS file.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Harry Wei <harryxiyou@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
_sdata needs to be declared in the linker script now as of commit
a2d063ac21 ("extable, core_kernel_data(): Make sure all archs define
_sdata")
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
die_if_no_fixup() shouldn't use get_user() as it doesn't call set_fs() to
indicate that it wants to probe a kernel address. Instead it should use
probe_kernel_read().
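A hedged sketch of the safe variant (probe_kernel_read() returns 0 on
success; the surrounding names are illustrative):
  /* Sketch: fetch the faulting opcode without the set_fs() dance. */
  u16 opcode;
  if (probe_kernel_read(&opcode, (void *)regs->pc, sizeof(opcode)) < 0)
          goto no_fixup;  /* hypothetical label: treat as unreadable */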
This fixes the problem of gdb seeing SIGILL rather than SIGTRAP when hitting
the KGDB special breakpoint upon SysRq+g being seen. The problem was that
die_if_no_fixup() was failing to read the opcode of the instruction that caused
the exception, and thus not fixing up the exception.
This caused gdb to get an S04 response to the $? request in its remote protocol
rather than S05 - which would then cause it to continue with $C04 rather than
$c in an attempt to pass the signal on to the inferior process. The kernel,
however, does not support $Cnn, and so objects by returning an E22 response,
indicating an error. gdb does not expect this and prints:
warning: Remote failure reply: E22
and then returns to the gdb command prompt unable to continue.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One of the kernel debugger cacheflush variants escaped proper testing. Two of
the labels are wrong, being derived from the code that was copied to construct
the variant.
The first label results in the following assembler message:
AS arch/mn10300/mm/cache-dbg-flush-by-reg.o
arch/mn10300/mm/cache-dbg-flush-by-reg.S: Assembler messages:
arch/mn10300/mm/cache-dbg-flush-by-reg.S:123: Error: symbol `debugger_local_cache_flushinv_no_dcache' is already defined
And the second label results in the following linker message:
arch/mn10300/mm/built-in.o:(.text+0x1d39): undefined reference to `mn10300_local_icache_inv_range_reg_end'
arch/mn10300/mm/built-in.o:(.text+0x1d39): relocation truncated to fit: R_MN10300_PCREL16 against undefined symbol `mn10300_local_icache_inv_range_reg_end'
To test this file the following configuration pieces must be set:
CONFIG_AM34=y
CONFIG_MN10300_CACHE_WBACK=y
CONFIG_MN10300_DEBUGGER_CACHE_FLUSH_BY_REG=y
CONFIG_MN10300_CACHE_MANAGE_BY_REG=y
CONFIG_AM34_HAS_CACHE_SNOOP=n
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap-2.6: (21 commits)
ARM: OMAP4: MMC: increase delay for pbias
arm: omap2plus: move NAND_BLOCK_SIZE out of boards
omap4: hwmod: Enable the keypad
omap3: Free Beagle rev gpios when they are read, so others can read them later
arm: omap3: beagle: Ensure msecure is mux'd to be able to set the RTC
omap: rx51: Don't power up speaker amplifier at bootup
omap: rx51: Set regulator V28_A always on
ARM: OMAP4: MMC: no regulator off during probe for eMMC
arm: omap2plus: fix ads7846 pendown gpio request
ARM: OMAP2: Add missing iounmap in omap4430_phy_init
ARM: omap4: Pass core and wakeup mux tables to omap4_mux_init
ARM: omap2+: mux: Allow board mux settings to be NULL
OMAP4: fix return value of omap4_l3_init
OMAP: iovmm: fix SW flags passed by user
arch/arm/mach-omap1/dma.c: Invert calls to platform_device_put and platform_device_del
OMAP2+: mux: fix compilation warnings
OMAP: SRAM: Fix warning: format '%08lx' expects type 'long unsigned int'
arm: omap3: cm-t3517: fix section mismatch warning
OMAP2+: Fix 9 section mismatch(es) warnings from mach-omap2/built-in.o
ARM: OMAP2: Add missing include of linux/gpio.h
...
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
vfs: make unlink() and rmdir() return ENOENT in preference to EROFS
lmLogOpen() broken failure exit
usb: remove bad dput after dentry_unhash
more conservative S_NOSEC handling
Note that this adds a little overhead to account for the inodes
moved/enqueued from b_dirty to b_io. The "moved" accounting may later be
used to limit the number of inodes that can be moved in one shot, in order
to keep spinlock hold time under control.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
It is valuable to know how the dirty inodes are iterated and their IO size.
"writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"
- "state" reflects inode->i_state at the end of writeback_single_inode()
- "index" reflects mapping->writeback_index after the ->writepages() call
- "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode()
- "wrote" is the number of pages actually written
v2: add trace event writeback_single_inode_requeue as proposed by Dave.
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Remove two unused struct writeback_control fields:
.encountered_congestion (completely unused)
.nonblocking (never set; checked/shown in XFS, NFS, btrfs)
The .for_background check in nfs_write_inode() is also removed btw,
as .for_background implies WB_SYNC_NONE.
Reviewed-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
When wbc.more_io was first introduced, it indicated whether there was at
least one superblock whose s_more_io list contained more IO work. Now, with
per-bdi writeback, it can be replaced with a simple b_more_io test.
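In sketch form (b_more_io is the per-bdi list in struct bdi_writeback):
  /* Sketch: the old wbc.more_io flag becomes a direct test of the
   * per-bdi list, evaluated where the flag used to be consulted. */
  static inline bool wb_has_more_io(struct bdi_writeback *wb)
  {
          return !list_empty(&wb->b_more_io);
  }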
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.
Notes about the tmpfs/ramfs behavior changes:
In 2.6.36 and older kernels, tmpfs writes will sleep inside
balance_dirty_pages() as long as we are over the (dirty+background)/2
global throttle threshold. This is because both the dirty pages and the
threshold will be 0 for tmpfs/ramfs. Hence this test will always
evaluate to TRUE:
dirty_exceeded =
(bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
|| (nr_reclaimable + nr_writeback >= dirty_thresh);
For 2.6.37, someone complained that the current logic does not allow the
users to set vm.dirty_ratio=0. So commit 4cbec4c8b9 changed the test to
dirty_exceeded =
(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
|| (nr_reclaimable + nr_writeback > dirty_thresh);
So 2.6.37 will behave differently for tmpfs/ramfs: it will never get
throttled unless the global dirty threshold is exceeded (which is very
unlikely to happen; once it does, it will block many tasks).
I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means
that on a busy writing server, tmpfs write()s may get livelocked! The
"inadvertent" throttling can hardly help any workload because of its
"either no throttling, or throttled to death" property.
So based on 2.6.37, this patch won't bring more noticeable changes.
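The skip itself can be as small as the following sketch
(bdi_cap_account_dirty() is the existing capability test; the exact
placement in the throttling path is illustrative):
  /* Sketch: backings with no dirty accounting (tmpfs, ramfs) are
   * never throttled, so bail out before any checks. */
  if (!bdi_cap_account_dirty(mapping->backing_dev_info))
          return;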
CC: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Clarify the bdi_dirty_limit() comment.
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
This removes writeback_control.wb_start and does more straightforward
sync livelock prevention by setting .older_than_this to prevent extra
inodes from being enqueued in the first place.
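A sketch of the mechanism (in this era wbc.older_than_this is a pointer to
a jiffies cutoff; placement illustrative):
  /* Sketch: fix the expiry cutoff once, when the work starts, so
   * queue_io() never enqueues inodes dirtied after that moment. */
  unsigned long oldest_jif = jiffies;
  wbc.older_than_this = &oldest_jif;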
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Code refactor for more logical code layout.
No behavior change.
- remove the mis-named __writeback_inodes_sb()
- wb_writeback()/writeback_inodes_wb() will decide when to queue_io()
before calling __writeback_inodes_wb()
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
as it's currently the most contended lock in the system for metadata-heavy
workloads. It won't help for single-filesystem workloads, for
which we'll need the I/O-less balance_dirty_pages, but at least we
can dedicate a cpu to spinning on each bdi now for larger systems.
Based on earlier patches from Nick Piggin and Dave Chinner.
It reduces lock contentions to 1/4 in this test case:
10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram
lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla 2.6.39-rc3:
inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23
------------------
inode_wb_list_lock 2 [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
inode_wb_list_lock 34 [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
inode_wb_list_lock 12893 [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
inode_wb_list_lock 10702 [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
------------------
inode_wb_list_lock 2 [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
inode_wb_list_lock 19 [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
inode_wb_list_lock 5550 [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
inode_wb_list_lock 8511 [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157
2.6.39-rc3 + patch:
&(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37
------------------------
&(&wb->list_lock)->rlock 10 [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
&(&wb->list_lock)->rlock 1493 [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
&(&wb->list_lock)->rlock 3652 [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
&(&wb->list_lock)->rlock 1412 [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
------------------------
&(&wb->list_lock)->rlock 3 [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
&(&wb->list_lock)->rlock 6 [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
&(&wb->list_lock)->rlock 2061 [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
&(&wb->list_lock)->rlock 2629 [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
[hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old]
[akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
There is no point in carrying different refill policies between for_kupdate
and other types of work. Use a consistent "refill b_io iff empty" policy,
which can guarantee fairness in an easy-to-understand way.
A b_io refill will set up a _fixed_ work set of all currently eligible
inodes and start a new round of walking through b_io. The "fixed" work set
means no new inodes will be added to the work set during the walk.
Only when a complete walk over b_io is done will the inodes that are
eligible at that time be enqueued and the walk started over.
This procedure provides fairness among the inodes because it guarantees
that each inode is synced once and only once per round, so no inode
will be starved.
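The policy reduces to a one-line guard at the top of the walk, sketched
here with illustrative placement:
  /* Sketch: only refill the work set once the previous one is drained. */
  if (list_empty(&wb->b_io))
          queue_io(wb, wbc->older_than_this);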
This change relies on wb_writeback() to keep retrying as long as we make
some progress cleaning pages and/or inodes. Without that ability, the old
logic for background work relied on aggressively queuing all eligible
inodes into b_io every time, but that is not a guarantee.
The test script below completes slightly faster now:
2.6.39-rc3 2.6.39-rc3-dyn-expire+
------------------------------------------------
all elapsed 256.043 252.367
stddev 24.381 12.530
tar elapsed 30.097 28.808
dd elapsed 13.214 11.782
#!/bin/zsh
cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/
umount /dev/sda7
mkfs.xfs -f /dev/sda7
mount /dev/sda7 /fs
echo 3 > /proc/sys/vm/drop_caches
tic=$(cat /proc/uptime|cut -d' ' -f2)
cd /fs
time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &
wait
sync
tac=$(cat /proc/uptime|cut -d' ' -f2)
echo elapsed: $((tac - tic))
It maintains roughly the same small vs. large file writeout shares, and
offers large files a better chance to be written in nice 4M chunks.
Analysis from Dave Chinner, in great detail:
Let's say we have lots of inodes with 100 dirty pages being created,
and one large writeback going on. We expire 8 new inodes for every
1024 pages we write back.
With the old code, we do:
b_more_io (large inode) -> b_io (1l)
8 newly expired inodes -> b_io (1l, 8s)
writeback large inode 1024 pages -> b_more_io
b_more_io (large inode) -> b_io (8s, 1l)
8 newly expired inodes -> b_io (8s, 1l, 8s)
writeback 8 small inodes 800 pages
1 large inode 224 pages -> b_more_io
b_more_io (large inode) -> b_io (8s, 1l)
8 newly expired inodes -> b_io (8s, 1l, 8s)
.....
Your new code:
b_more_io (large inode) -> b_io (1l)
8 newly expired inodes -> b_io (1l, 8s)
writeback large inode 1024 pages -> b_more_io
(b_io == 8s)
writeback 8 small inodes 800 pages
b_io empty: (1800 pages written)
b_more_io (large inode) -> b_io (1l)
14 newly expired inodes -> b_io (1l, 14s)
writeback large inode 1024 pages -> b_more_io
(b_io == 14s)
writeback 10 small inodes 1000 pages
1 small inode 24 pages -> b_more_io (1l, 1s(24))
writeback 5 small inodes 500 pages
b_io empty: (2548 pages written)
b_more_io (large inode) -> b_io (1l, 1s(24))
20 newly expired inodes -> b_io (1l, 1s(24), 20s)
......
Rough progression of pages written at b_io refill:
Old code:
total large file % of writeback
1024 224 21.9% (fixed)
New code:
total large file % of writeback
1800 1024 ~55%
2550 1024 ~40%
3050 1024 ~33%
3500 1024 ~29%
3950 1024 ~26%
4250 1024 ~24%
4500 1024 ~22.7%
4700 1024 ~21.7%
4800 1024 ~21.3%
4800 1024 ~21.3%
(pretty much steady state from here)
Ok, so the steady state is reached with a similar percentage of
writeback to the large file as the existing code. Ok, that's good,
but providing some evidence that it doesn't change the share of
writeback to the large file should be in the commit message ;)
The other advantage to this is that we always write 1024 page chunks
to the large file, rather than smaller "whatever remains" chunks.
CC: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Dynamically compute the dirty expire timestamp at queue_io() time.
writeback_control.older_than_this used to be determined at entrance to
the kupdate writeback work. This _static_ timestamp may go stale if the
kupdate work runs on and on. The flusher may then get stuck with some old
busy inodes, never considering newly expired inodes thereafter.
This has two possible problems:
- It is unfair for a large dirty inode to delay (for a long time) the
writeback of small dirty inodes.
- As time goes by, the large and busy dirty inode may contain only
_freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
delaying the expired dirty pages to the end of LRU lists, triggering
the evil pageout(). Nevertheless this patch merely addresses part
of the problem.
v2: keep policy changes inside wb_writeback() and keep the
wbc.older_than_this visibility as suggested by Dave.
CC: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
they only populate possibly a subset of the eligible inodes into b_io at
entrance time. When the queued set of inodes has all been synced, they just
return, possibly with all queued inode pages written but wbc.nr_to_write
still > 0.
For kupdate and background writeback, there may be more eligible inodes
sitting in b_dirty when the current set of b_io inodes is completed. So
it is necessary to try another round of writeback as long as we made some
progress in this round. When there are no more eligible inodes, no more
inodes will be enqueued in queue_io(), hence nothing could/will be
synced and we may safely bail.
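In outline, the retry loop looks like this sketch (names illustrative):
  /* Sketch: keep writing as long as the last round made progress and
   * the work's stop condition has not been met yet. */
  do {
          progress = writeback_inodes_wb(wb, &wbc);
  } while (progress && over_bground_thresh());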
For example, imagine 100 inodes
i0, i1, i2, ..., i90, i91, i99
At queue_io() time, i90-i99 happen to be expired and moved to s_io for
IO. When finished successfully, if their total size is less than
MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
quit the background work (w/o this patch) while it's still over the
background threshold. This will be a fairly normal/frequent case, I guess.
Now that we do tagged sync and update inode->dirtied_when after the sync,
this change won't livelock sync(1). I actually tried to write 1 page
per 1ms with this command
write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
and do sync(1) at the same time. The sync completes quickly on ext4,
xfs, btrfs.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
The flusher works on dirty inodes in batches, and may quit prematurely
if a batch of inodes happens to be only metadata-dirtied: in this case
wbc->nr_to_write won't be decreased at all, which stands for "no pages
written" but is also misinterpreted as "no progress".
So introduce writeback_control.inodes_written to count the inodes that get
cleaned from the VFS point of view. A non-zero value means there is some
progress on writeback, in which case more writeback can be tried.
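A sketch of the accounting (the field name is from the description; the
increment site is illustrative):
  /* Sketch: after writing one inode, record it as progress even when
   * no pages were written (metadata-only dirty). */
  if (!(inode->i_state & I_DIRTY))
          wbc->inodes_written++;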
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>