While tracking down the reason for an ioremap() failure I was
distracted by the WARN_ONCE() in __ioremap_caller().
Performing a WARN_ONCE() sanity check before the mapping
is successful seems pointless if the caller sends bad values.
A case in point is when the BIOS provides erroneous screen_info
values causing vesafb_probe() to request an outrageuous size.
The WARN_ONCE is then wasted on bogosity. Move the warning to a
point where the mapping has been successfully allocated.
Addresses:
http://bugs.launchpad.net/bugs/772042
Reviewed-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Link: http://lkml.kernel.org/r/4DB99D2E.9080106@canonical.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that remap allocator is cleaned up, update comments such that they
are in docbook function description format and reflect the actual
implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-15-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Remap area size can be determined from node_remap_start_vaddr[] and
node_remap_end_vaddr[] making node_remap_size[] redundant. Remove it.
While at it, make resume_map_numa_kva() use @nr_pages for number of
pages instead of @size.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-14-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
With lowmem address reservation moved into init_alloc_remap(),
node_remap_offset[] is no longer useful. Remove it and related offset
handling code.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-13-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
pgdat allocation is handled differnetly from other remap allocations -
it's reserved during initialization. There's no reason to handle this
any differnetly. Remap allocator is initialized for every node and if
init failed the allocation will fail and pgdat allocation can fall
back to generic code like anyone else.
Remove special init-time pgdat reservation and make allocate_pgdat()
use alloc_remap() like everyone else.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-12-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
There's no reason to perform the actual remapping separately.
Collapse remap_numa_kva() into init_alloc_remap() and, while at it,
make it less verbose.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-11-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Remap alloc init is done in the following stages.
1. init_alloc_remap() calculates how much memory is necessary for each
node and reserves node local memory.
2. initmem_init() collects how much each node needs and reserves a
single contiguous lowmem area which can contain all.
3. init_remap_allocator() initializes allocator parameters from the
determined lowmem address and per-node offsets.
4. Actual remap happens.
There is no reason for the lowmem remap area to be reserved as a
single contiguous area at one go. They don't interact with each other
and the memblock allocator will put them side-by-side anyway.
This patch breaks up the single lowmem address reservation and put
per-node lowmem address reservation into init_alloc_remap() and
initializes allocator parameters directly in the function as all the
addresses are determined there. This merges steps 2 and 3 into 1.
While at it, remove now largely irrelevant comments in
init_alloc_remap().
This change causes the following behavior changes.
* Remap lowmem areas are allocated in smaller per-node chunks.
* Remap lowmem area reservation failure fail future remap allocations
instead of panicking.
* Remap allocator initialization is less verbose.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-10-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Remap allocator failure isn't fatal. The callers are required to fall
back to regular early memory allocation mechanisms on failure anyway,
so there's no reason to panic on remap init failure. Whining and
returning are enough.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-9-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Only pgdat and memmap use remap area and there isn't much benefit in
allowing per-node override. In addition, the use of node_remap_size[]
is confusing in that it contains number of bytes before remap
initialization and then number of pages afterwards.
Move remap size calculation for memap from specific NUMA config
implementations to init_alloc_remap() and make node_remap_size[]
static.
The only behavior difference is that, before this patch, numaq_32
didn't consider max_pfn when calculating the memmap size but it's
enforced after this patch, which is the right thing to do.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-8-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
@size variable in init_alloc_remap() is confusing in that it starts as
number of bytes as its name implies and then becomes number of pages.
Make it consistently represent bytes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-7-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
init_alloc_remap() is about to do more and using _kva suffix for
physical address becomes confusing because the function will be
handling both physical and virtual addresses. Rename @node_kva to
@node_pa.
This is trivial rename and doesn't cause any behavior difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-6-git-send-email-tj@kernel.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Separate the outer node walking loop and per-node logic from
calculate_numa_remap_pages(). The outer loop is collapsed into
initmem_init() and the per-node logic is moved into a new function -
init_alloc_remap().
The new function name is confusing with the existing
init_remap_allocator() and the behavior is the function isn't very
clean either at this point, but this is to prepare for further
cleanups and it will become prettier.
This function doesn't introduce any behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-5-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
memblock_find_in_range() now does top-down allocation by default, so
there's no reason for its callers to explicitly implement it by
gradually lowering the start address.
Remove redundant top-down allocation logic from init_meminit() and
calculate_numa_remap_pages().
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-4-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
When pgdat is reserved in init_remap_allocator(), PAGE_SIZE aligned
size will be used. Match the size alignment in initialization to
avoid allocation failure down the road.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-3-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
node_remap_{start|end}_vaddr[] describe [start, end) ranges; however,
alloc_remap() incorrectly failed when the current allocation + size
equaled the end but it should fail only when it goes over. Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-2-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-2.6-block:
ide: always ensure that blk_delay_queue() is called if we have pending IO
block: fix request sorting at unplug
dm: improve block integrity support
fs: export empty_aops
ide: ide_requeue_and_plug() reinstate "always plug" behaviour
blk-throttle: don't call xchg on bool
ufs: remove unessecary blk_flush_plug
block: make the flush insertion use the tail of the dispatch list
block: get rid of elv_insert() interface
block: dump request state on seeing a corrupted request completion
On an error path in inotify_init1 a normal user can trigger a double
free of struct user. This is a regression introduced by a2ae4cc9a1
("inotify: stop kernel memory leak on file creation failure").
We fix this by making sure that if a group exists the user reference is
dropped when the group is cleaned up. We should not explictly drop the
reference on error and also drop the reference when the group is cleaned
up.
The new lifetime rules are that an inotify group lives from
inotify_new_group to the last fsnotify_put_group. Since the struct user
and inotify_devs are directly tied to this lifetime they are only
changed/updated in those two locations. We get rid of all special
casing of struct user or user->inotify_devs.
Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: stable@kernel.org (2.6.37 and up)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just because we are not requeuing a request does not mean that
some aren't pending. So always issue a blk_delay_queue() if
either we are requeueing OR there's pending IO.
This fixes a boot problem for some IDE boxes.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Comparison function for list_sort() must be anticommutative,
otherwise it is not sorting in ordinary meaning.
But fortunately list_sort() always check ((*cmp)(priv, a, b) <= 0)
it not distinguish negative and zero, so comparison function can
implement only less-or-equal instead of full three-way comparison.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The current block integrity (DIF/DIX) support in DM is verifying that
all devices' integrity profiles match during DM device resume (which
is past the point of no return). To some degree that is unavoidable
(stacked DM devices force this late checking). But for most DM
devices (which aren't stacking on other DM devices) the ideal time to
verify all integrity profiles match is during table load.
Introduce the notion of an "initialized" integrity profile: a profile
that was blk_integrity_register()'d with a non-NULL 'blk_integrity'
template. Add blk_integrity_is_initialized() to allow checking if a
profile was initialized.
Update DM integrity support to:
- check all devices with _initialized_ integrity profiles match
during table load; uninitialized profiles (e.g. for underlying DM
device(s) of a stacked DM device) are ignored.
- disallow a table load that would result in an integrity profile that
conflicts with a DM device's existing (in-use) integrity profile
- avoid clearing an existing integrity profile
- validate all integrity profiles match during resume; but if they
don't all we can do is report the mismatch (during resume we're past
the point of no return)
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
With the ->sync_page() hook gone, we have a few users that
add their own static address_space_operations without any
functions defined.
fs/inode.c already has an empty_aops that it uses for init
purposes. Lets export that and use it in the places where
an otherwise empty aops was defined.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We see stalls if we don't always ensure that the queue gets run
again. Even if rq == NULL, we could have other pending requests
in the queue.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
xchg does not work portably with smaller than 32bit types.
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We already flush the per-process plugging list when context switching,
so a blk_flush_plug call just before a yield() is not needed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
It's not a preempt type request, in fact we have to insert it
behind requests that do specify INSERT_FRONT.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Merge it with __elv_add_request(), it's pretty pointless to
have a function with only two callers. The main interface
is elv_add_request()/__elv_add_request().
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently we just dump a non-informative 'request botched' message.
Lets actually try and print something sane to help debug issues
around this.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc/pseries: Fix build without CONFIG_HOTPLUG_CPU
powerpc: Set nr_cpu_ids early and use it to free PACAs
powerpc/pseries: Don't register global initcall
powerpc/kexec: Fix mismatched ifdefs for PPC64/SMP.
edac/mpc85xx: Limit setting/clearing of HID1[RFXE] to e500v1/v2 cores
powerpc/85xx: Update dts for PCIe memory maps to match u-boot of Px020RDB
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: don't warn in btrfs_add_orphan
Btrfs: fix free space cache when there are pinned extents and clusters V2
Btrfs: Fix uninitialized root flags for subvolumes
btrfs: clear __GFP_FS flag in the space cache inode
Btrfs: fix memory leak in start_transaction()
Btrfs: fix memory leak in btrfs_ioctl_start_sync()
Btrfs: fix subvol_sem leak in btrfs_rename()
Btrfs: Fix oops for defrag with compression turned on
Btrfs: fix /proc/mounts info.
Btrfs: fix compiler warning in file.c
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
ipv6: Don't pass invalid dst_entry pointer to dst_release().
mlx4: fix kfree on error path in new_steering_entry()
tcp: len check is unnecessarily devastating, change to WARN_ON
sctp: malloc enough room for asconf-ack chunk
sctp: fix auth_hmacs field's length of struct sctp_cookie
net: Fix dev dev_ethtool_get_rx_csum() for forced NETIF_F_RXCSUM
usbnet: use eth%d name for known ethernet devices
starfire: clean up dma_addr_t size test
iwlegacy: fix bugs in change_interface
carl9170: Fix tx aggregation problems with some clients
iwl3945: disable hw scan by default
wireless: rt2x00: rt2800usb.c add and identify ids
iwl3945: do not deprecate software scan
mac80211: fix aggregation frame release during timeout
cfg80211: fix BSS double-unlinking (continued)
cfg80211:: fix possible NULL pointer dereference
mac80211: fix possible NULL pointer dereference
mac80211: fix NULL pointer dereference in ieee80211_key_alloc()
ath9k: fix a chip wakeup related crash in ath9k_start
mac80211: fix a crash in minstrel_ht in HT mode with no supported MCS rates
...
This is a revert of 428d2e828c.
This is broken in the same manner as for VGA: trying to write to an
invalid address on the (currently 7-bit) i2c bus.
One notable failure appears to be for MacBooks. The scary part was that
it gave the appearance of working (i.e. reporting the absence of the
panel) on various all-in-one machines with ghost LVDS panels and not
failing for laptops.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Dave Airlie <airlied@linux.ie>
Signed-off-by: Keith Packard <keithp@keithp.com>
This is a moral revert of 6ec3d0c0e9.
Following the fix to reset the GMBUS controller after a NAK, we finally
utilize the 0xa0 probe for a CRT connection. And discover that the code
is broken. Shock.
There are a number of issues, but following a key insight from Dave
Airlie, that 0xA0 is an invalid address on a 7-bit bus (though not if we
were to enable 10-bit addressing), and would look like the EDID port
0x50, it is possible to see where the confusion starts.
In short, a write to 0xA0 is accepted by the GMBUS controller which we
interpreted as meaning the existence of a connection (a slave on the
other end of the wire ACKing the write). That was false.
During testing with a broken GMBUS implementation, which never reset an
earlier NAK, this test always reported a NAK and so we proceeded on to
the next test.
Reported-and-tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=35904
Reported-and-tested-by: Riccardo Magliocchetti <riccardo.magliocchetti@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=32612
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Dave Airlie <airlied@linux.ie>
Signed-off-by: Keith Packard <keithp@keithp.com>
Commit b3df895aeb "powerpc/kexec: Add support for FSL-BookE"
introduced the original PPC_STD_MMU_64 checks around the function
crash_kexec_wait_realmode(). Then commit c2be05481f
"powerpc: Fix default_machine_crash_shutdown #ifdef botch" changed
the ifdef around the calling site to add a check on SMP, but the
ifdef around the function itself was left unchanged, leaving an
unused function for PPC_STD_MMU_64=y and SMP=n
Rather than have two ifdefs that can get out of sync like this,
simply put the corrected conditional around the function and use
a stub to get rid of one set of ifdefs completely.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When I moved the orphan adding to btrfs_truncate I missed the fact that during
orphan cleanup we just add the orphan items to the orphan list without going
through btrfs_orphan_add, which results in lots of warnings on mount if you have
any orphan items that need to be truncated. Just remove this warning since it's
ok, this will allow all of the normal space accounting take place. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I noticed a huge problem with the free space cache that was presenting
as an early ENOSPC. Turns out when writing the free space cache out I
forgot to take into account pinned extents and more importantly
clusters. This would result in us leaking free space everytime we
unmounted the filesystem and remounted it.
I fix this by making sure to check and see if the current block group
has a cluster and writing out any entries that are in the cluster to the
cache, as well as writing any pinned extents we currently have to the
cache since those will be available for us to use the next time the fs
mounts.
This patch also adds a check to the end of load_free_space_cache to make
sure we got the right amount of free space cache, and if not make sure
to clear the cache and re-cache the old fashioned way.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
root_item->flags and root_item->byte_limit are not initialized when
a subvolume is created. This bug is not revealed until we added
readonly snapshot support - now you mount a btrfs filesystem and you
may find the subvolumes in it are readonly.
To work around this problem, we steal a bit from root_item->inode_item->flags,
and use it to indicate if those fields have been properly initialized.
When we read a tree root from disk, we check if the bit is set, and if
not we'll set the flag and initialize the two fields of the root item.
Reported-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Tested-by: Andreas Philipp <philipp.andreas@gmail.com>
cc: stable@kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
the object id of the space cache inode's key is allocated from the relative
root, just like the regular file. So we can't identify space cache inode by
checking the object id of the inode's key, and we have to clear __GFP_FS flag
at the time we look up the space cache inode.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Free btrfs_trans_handle when join_transaction() fails
in start_transaction()
Signed-off-by: Yoshinori Sano <yoshinori.sano@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_rename() does not release the subvol_sem if the transaction failed to start.
Signed-off-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we defrag a file, whose size can be fit into an inline extent,
with compression enabled, the compress type is set to be
fs_info->compress_type, which is 0 if the btrfs filesystem is mounted
without compress option. This leads to oops.
Reported-by: Daniel Blueman <daniel.blueman@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Some mount options are not displayed by /proc/mounts.
This patch displays the option such as compress_type by /proc/mounts.
Ex.
[before]
$ mount | grep sdc2
/dev/sdc2 on /test12 type btrfs (rw,space_cache,compress=lzo)
$ cat /proc/mounts | grep sdc2
/dev/sdc2 /test12 btrfs rw,relatime,compress 0 0
[after]
$ mount | grep sdc2
/dev/sdc2 on /test12 type btrfs (rw,space_cache,compress=lzo)
$ cat /proc/mounts | grep sdc2
/dev/sdc2 /test12 btrfs rw,relatime,compress=lzo,space_cache 0 0
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>