Currently, leapsecond adjustments are done at tick time. As a result,
the leapsecond was applied at the first timer tick *after* the
leapsecond (~1-10ms late depending on HZ), rather then exactly on the
second edge.
This was in part historical from back when we were always tick based,
but correcting this since has been avoided since it adds extra
conditional checks in the gettime fastpath, which has performance
overhead.
However, it was recently pointed out that ABS_TIME CLOCK_REALTIME
timers set for right after the leapsecond could fire a second early,
since some timers may be expired before we trigger the timekeeping
timer, which then applies the leapsecond.
This isn't quite as bad as it sounds, since behaviorally it is similar
to what is possible w/ ntpd made leapsecond adjustments done w/o using
the kernel discipline. Where due to latencies, timers may fire just
prior to the settimeofday call. (Also, one should note that all
applications using CLOCK_REALTIME timers should always be careful,
since they are prone to quirks from settimeofday() disturbances.)
However, the purpose of having the kernel do the leap adjustment is to
avoid such latencies, so I think this is worth fixing.
So in order to properly keep those timers from firing a second early,
this patch modifies the ntp and timekeeping logic so that we keep
enough state so that the update_base_offsets_now accessor, which
provides the hrtimer core the current time, can check and apply the
leapsecond adjustment on the second edge. This prevents the hrtimer
core from expiring timers too early.
This patch does not modify any other time read path, so no additional
overhead is incurred. However, this also means that the leap-second
continues to be applied at tick time for all other read-paths.
Apologies to Richard Cochran, who pushed for similar changes years
ago, which I resisted due to the concerns about the performance
overhead.
While I suspect this isn't extremely critical, folks who care about
strict leap-second correctness will likely want to watch
this. Potentially a -stable candidate eventually.
Originally-suggested-by: Richard Cochran <richardcochran@gmail.com>
Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Reported-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1434063297-28657-4-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently the leapsecond logic uses what looks like magic values.
Improve this by defining SECS_PER_DAY and using that macro
to make the logic more clear.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1434063297-28657-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It was reported that 868a3e915f (hrtimer: Make offset
update smarter) was causing timer problems after suspend/resume.
The problem with that change is the modification to
clock_was_set_seq in timekeeping_update is done prior to
mirroring the time state to the shadow-timekeeper. Thus the
next time we do update_wall_time() the updated sequence is
overwritten by whats in the shadow copy.
This patch moves the shadow-timekeeper mirroring to the end
of the function, after all updates have been made, so all data
is kept in sync.
(This patch also affects the update_fast_timekeeper calls which
were also problematically done prior to the mirroring).
Reported-and-tested-by: Jeremiah Mahler <jmmahler@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1434063297-28657-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
I noticed that my MPX tracepoints were producing garbage for the
lower and upper bounds:
mpx_bounds_register_exception: address referenced: 0x00007fffffffccb7 bounds: lower: 0x0 ~upper: 0xffffffffffffffff
mpx_bounds_register_exception: address referenced: 0x00007fffffffccbf bounds: lower: 0x0 ~upper: 0xffffffffffffffff
This is, of course, bogus because 0x00007fffffffccbf is *within*
the bounds. I assumed that my instruction decoder was bad and
went looking at it. But I eventually realized that I was
getting a '0' offset back from xstate_offsets[BNDREGS].
It was being skipped in the initialization, which is obviously
bogus, so remove the extra leaf++.
This also goes an initializes xstate_offsets/sizes[] to -1 so
so that bugs like this will oops instead of silently failing
in interesting ways.
This was introduced by:
39f1acd ("x86/fpu/xstate: Don't assume the first zero xfeatures zero bit means the end")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@sr71.net
Link: http://lkml.kernel.org/r/20150611193400.2E0B00DB@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The previous patch tried to continue the probe if i915 binding fails.
For for simplicity reason, we haven't implemented abort even for
controller chips that are dedicated for HDMI/DP on HSW and BDW.
However, Mengdong suggested that this can be dangerous; BIOS may
disable gfx power well although the PCI entry for HD-audio is left,
and this may result in the unexpected behavior, kernel errors, etc.
For avoiding this situation, abort the probe at i915 binding failure
only for HSW/BDW chips selectively. For other chips, it still
continues.
Fixes: bf06848bdb ('ALSA: hda - Continue probing even if i915 binding fails')
Reported-by: Mengdong Lin <mengdong.lin@intel.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Building perf out of kernel tree is currently broken because the
MANIFEST file refers to kernel files that have been removed. With this
patch make perf-targz-src-pkg succeeds as does building perf using the
generated tarfile.
Signed-off-by: David Ahern <david.ahern@oracle.com>
Link: http://lkml.kernel.org/r/1433526173-172332-1-git-send-email-david.ahern@oracle.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull drm fixes from Dave Airlie:
"i915 and radeon fixes:
i915:
fix for connector oops regression
DDC probing fix
radeon:
two radeon reverts, along with a freeze workaround and a fix"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/radeon: Make sure radeon_vm_bo_set_addr always unreserves the BO
Revert "drm/radeon: adjust pll when audio is not enabled"
Revert "drm/radeon: don't share plls if monitors differ in audio support"
drm/radeon: fix freeze for laptop with Turks/Thames GPU.
drm/i915: Fix DDC probe for passive adapters
drm/i915: Properly initialize SDVO analog connectors
We saw excessive direct memory compaction triggered by skb_page_frag_refill.
This causes performance issues and add latency. Commit 5640f76858
introduces the order-3 allocation. According to the changelog, the order-3
allocation isn't a must-have but to improve performance. But direct memory
compaction has high overhead. The benefit of order-3 allocation can't
compensate the overhead of direct memory compaction.
This patch makes the order-3 page allocation atomic. If there is no memory
pressure and memory isn't fragmented, the alloction will still success, so we
don't sacrifice the order-3 benefit here. If the atomic allocation fails,
direct memory compaction will not be triggered, skb_page_frag_refill will
fallback to order-0 immediately, hence the direct memory compaction overhead is
avoided. In the allocation failure case, kswapd is waken up and doing
compaction, so chances are allocation could success next time.
alloc_skb_with_frags is the same.
The mellanox driver does similar thing, if this is accepted, we must fix
the driver too.
V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric
V2: make the changelog clearer
Cc: Eric Dumazet <edumazet@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Debabrata Banerjee <dbavatar@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix for the regression Linus called out, and another for probing
dongles.
* tag 'drm-intel-fixes-2015-06-11' of git://anongit.freedesktop.org/drm-intel:
drm/i915: Fix DDC probe for passive adapters
drm/i915: Properly initialize SDVO analog connectors
Two regression reverts, and two fixes, one for a dpm boot freeze.
* 'drm-fixes-4.1' of git://people.freedesktop.org/~agd5f/linux:
drm/radeon: Make sure radeon_vm_bo_set_addr always unreserves the BO
Revert "drm/radeon: adjust pll when audio is not enabled"
Revert "drm/radeon: don't share plls if monitors differ in audio support"
drm/radeon: fix freeze for laptop with Turks/Thames GPU.
If a device is renamed and the original name is subsequently reused
for a new device, the following warning is generated:
sysctl duplicate entry: /net/mpls/conf/veth0//input
CPU: 3 PID: 1379 Comm: ip Not tainted 4.1.0-rc4+ #20
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
0000000000000000 0000000000000000 ffffffff81566aaf 0000000000000000
ffffffff81236279 ffff88002f7d7f00 0000000000000000 ffff88000db336d8
ffff88000db33698 0000000000000005 ffff88002e046000 ffff8800168c9280
Call Trace:
[<ffffffff81566aaf>] ? dump_stack+0x40/0x50
[<ffffffff81236279>] ? __register_sysctl_table+0x289/0x5a0
[<ffffffffa051a24f>] ? mpls_dev_notify+0x1ff/0x300 [mpls_router]
[<ffffffff8108db7f>] ? notifier_call_chain+0x4f/0x70
[<ffffffff81470e72>] ? register_netdevice+0x2b2/0x480
[<ffffffffa0524748>] ? veth_newlink+0x178/0x2d3 [veth]
[<ffffffff8147f84c>] ? rtnl_newlink+0x73c/0x8e0
[<ffffffff8147f27a>] ? rtnl_newlink+0x16a/0x8e0
[<ffffffff81459ff2>] ? __kmalloc_reserve.isra.30+0x32/0x90
[<ffffffff8147ccfd>] ? rtnetlink_rcv_msg+0x8d/0x250
[<ffffffff8145b027>] ? __alloc_skb+0x47/0x1f0
[<ffffffff8149badb>] ? __netlink_lookup+0xab/0xe0
[<ffffffff8147cc70>] ? rtnetlink_rcv+0x30/0x30
[<ffffffff8149e7a0>] ? netlink_rcv_skb+0xb0/0xd0
[<ffffffff8147cc64>] ? rtnetlink_rcv+0x24/0x30
[<ffffffff8149df17>] ? netlink_unicast+0x107/0x1a0
[<ffffffff8149e4be>] ? netlink_sendmsg+0x50e/0x630
[<ffffffff8145209c>] ? sock_sendmsg+0x3c/0x50
[<ffffffff81452beb>] ? ___sys_sendmsg+0x27b/0x290
[<ffffffff811bd258>] ? mem_cgroup_try_charge+0x88/0x110
[<ffffffff811bd5b6>] ? mem_cgroup_commit_charge+0x56/0xa0
[<ffffffff811d7700>] ? do_filp_open+0x30/0xa0
[<ffffffff8145336e>] ? __sys_sendmsg+0x3e/0x80
[<ffffffff8156c3f2>] ? system_call_fastpath+0x16/0x75
Fix this by unregistering the previous sysctl table (registered for
the path containing the original device name) and re-registering the
table for the path containing the new device name.
Fixes: 37bde79979 ("mpls: Per-device enabling of packet input")
Reported-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When programming the start of a periodic output, the code wrongly places
the seconds value into the "low" register and the nanoseconds into the
"high" register. Even though this is backwards, it slipped through my
testing, because the re-arming code in the interrupt service routine is
correct, and the signal does appear starting with the second edge.
This patch fixes the issue by programming the registers correctly.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not all architectures have io memory.
Fixes:
drivers/block/pmem.c: In function ‘pmem_alloc’:
drivers/block/pmem.c:146:2: error: implicit declaration of function ‘ioremap_nocache’ [-Werror=implicit-function-declaration]
pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
^
drivers/block/pmem.c:146:18: warning: assignment makes pointer from integer without a cast [enabled by default]
pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
^
drivers/block/pmem.c:182:2: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration]
iounmap(pmem->virt_addr);
^
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
ring buffer benchmark, where the produce_fifo was being ignored
and the producer thread's priority was being set with the consumer_fifo
parameter.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVeZEmAAoJEEjnJuOKh9ldrUYIAK9enlP7qdri5w3Urb9pNH81
gXqGINkEZWqbzwawb/b9avEXtcUB+pGGLE+ThB+s1DaEw4piLqaGyFRxlGXzU0F/
sFO/RxF+cPVtbEh8wAMHJD85g0j9kWB4Iy08rOezQiW9/YoATuk4QbrTlz6T++jD
6s4aqNUEQlxoCfWlkNmUbVIqRXrUuQGGc7bso1XY2/AAlSo1PjCDda/e5nDiCZ2d
pYr3CXiW+1xATZr1oS2aVgFcjIYqm5P3ijah1QlcvXEgD1ZYzsMsxxY7LQWCirZJ
GRFzXjZrCbTx6UnWc7CfcmtZVQpJhiKQ1Grum8/8uhjti7LwVCq99eFe5OsAe80=
=AC0N
-----END PGP SIGNATURE-----
Merge tag 'trace-rb-bm-fix-4.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ring buffer benchmark buglet fix from Steven Rostedt:
"Wang Long fixed a minor bug in the module parameter for the ring
buffer benchmark, where the produce_fifo was being ignored and the
producer thread's priority was being set with the consumer_fifo
parameter"
* tag 'trace-rb-bm-fix-4.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer-benchmark: Fix the wrong sched_priority of producer
=================================
[ INFO: inconsistent lock state ]
4.1.0-rc7+ #217 Tainted: G O
---------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/6/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
(ext_devt_lock){+.?...}, at: [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70
{SOFTIRQ-ON-W} state was registered at:
[<ffffffff810bf6b1>] __lock_acquire+0x461/0x1e70
[<ffffffff810c1947>] lock_acquire+0xb7/0x290
[<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50
[<ffffffff8143a07d>] blk_alloc_devt+0x6d/0xd0 <-- take the lock in process context
[..]
[<ffffffff810bf64e>] __lock_acquire+0x3fe/0x1e70
[<ffffffff810c00ad>] ? __lock_acquire+0xe5d/0x1e70
[<ffffffff810c1947>] lock_acquire+0xb7/0x290
[<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70
[<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50
[<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70
[<ffffffff8143a60c>] blk_free_devt+0x3c/0x70 <-- take the lock in softirq
[<ffffffff8143bfec>] part_release+0x1c/0x50
[<ffffffff8158edf6>] device_release+0x36/0xb0
[<ffffffff8145ac2b>] kobject_cleanup+0x7b/0x1a0
[<ffffffff8145aad0>] kobject_put+0x30/0x70
[<ffffffff8158f147>] put_device+0x17/0x20
[<ffffffff8143c29c>] delete_partition_rcu_cb+0x16c/0x180
[<ffffffff8143c130>] ? read_dev_sector+0xa0/0xa0
[<ffffffff810e0e0f>] rcu_process_callbacks+0x2ff/0xa90
[<ffffffff810e0dcf>] ? rcu_process_callbacks+0x2bf/0xa90
[<ffffffff81067e2e>] __do_softirq+0xde/0x600
Neil sees this in his tests and it also triggers on pmem driver unbind
for the libnvdimm tests. This fix is on top of an initial fix by Keith
for incorrect usage of mutex_lock() in this path: 2da78092dd "block:
Fix dev_t minor allocation lifetime". Both this and 2da78092dd are
candidates for -stable.
Fixes: 2da78092dd ("block: Fix dev_t minor allocation lifetime")
Cc: <stable@vger.kernel.org>
Cc: Keith Busch <keith.busch@intel.com>
Reported-by: NeilBrown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Some error paths didn't unreserve the BO. This resulted in a deadlock
down the road on the next attempt to reserve the (still reserved) BO.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=90873
Cc: stable@vger.kernel.org
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Michel Dänzer <michel.daenzer@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Laptop with Turks/Thames GPU will freeze if dpm is enabled. It seems
the SMC engine is relying on some state inside the CP engine. CP needs
to chew at least one packet for it to get in good state for dynamic
power management.
This patch simply disabled and re-enable DPM after the ring test which
is enough to avoid the freeze.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Failed in 32bit arch build like this:
CC /opt/h00206996/output/perf/arm32/builtin-record.o
util/session.c: In function ‘perf_session__warn_about_errors’:
util/session.c:1304:9: error: format ‘%lu’ expects argument of type ‘long unsigned int’,
but argument 2 has type ‘long long unsigned int’ [-Werror=format=]
builtin-report.c: In function ‘perf_evlist__tty_browse_hists’:
builtin-report.c:323:2: error: format ‘%lu’ expects argument of type ‘long unsigned int’,
but argument 3 has type ‘u64’ [-Werror=format=]
Replace %lu format strings in warning message with PRIu64 for u64
'total_lost_samples' to fix this problem.
Signed-off-by: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/1434026664-71642-1-git-send-email-hekuang@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf stat ignores the unsupported event and continue to count supported
event. But if the unsupported event is group leader, perf tool will
crash. After applying this patch, the unsupported group leader will
error out immediately.
Without this patch:
$ perf stat -x, -e '{node-prefetch-refs,cycles}' -- sleep 1
perf: util/evsel.c:1009: get_group_fd: Assertion `!(fd == -1)' failed.
Aborted (core dumped)
With this patch:
$ perf stat -x, -e '{node-prefetch-refs,cycles}' -- sleep 1
Error:
The node-prefetch-refs event is not supported.
Commiter note: Here I got a different output, but no core dump:
[acme@zoo linux]$ perf stat -x, -e '{node-prefetch-refs,cycles}' -- sleep 1
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument)
for event (node-prefetch-refs).
/bin/dmesg may provide additional information.
No CONFIG_PERF_EVENTS=y kernel support configured?
Signed-off-by: Kan Liang <kan.liang@intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Link: http://lkml.kernel.org/r/1434004360-8570-1-git-send-email-kan.liang@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Commit ab760a0 (ntb: Adding split BAR support for Haswell platforms)
changed ntb_device's mw from a fixed-size array into a pointer that is
allocated based on limits.max_mw; however, on Atom platforms, max_mw
is not initialized until ntb_device_setup(), which happens after the
allocation.
Fill out max_mw in ntb_atom_detect() to match ntb_xeon_detect(); this
happens before the use of max_mw in the ndev->mw allocation.
Fixes a null pointer dereference on Atom platforms with ntb hardware.
v2: fix typo (mw_max should be max_mw)
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Acked-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
Yet another regression by the transition to regmap cache; for better
usability, we had the fake mute control using the zero amp value for
Conexant codecs, and this was forgotten in the new hda core code.
Since the bits 4-7 are unused for the amp registers (as we follow the
syntax of AMP_GET verb), the bit 4 is now used to indicate the fake
mute. For setting this flag, snd_hda_codec_amp_update() becomes a
function from a simple macro. The bonus is that it gained a proper
function description.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
We do not check the return value of enic_dev_stats_dump(). If allocation
fails, we will hit NULL pointer reference.
Return only if memory allocation fails. For other failures, we return the
previously recorded values.
Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the crash kernel is loaded above 4GiB in memory, the
first kernel allocates only 72MiB of low-memory for the DMA
requirements of the second kernel. On systems with many
devices this is not enough and causes device driver
initialization errors and failed crash dumps. Testing by
SUSE and Redhat has shown that 256MiB is a good default
value for now and the discussion has lead to this value as
well. So set this default value to 256MiB to make sure there
is enough memory available for DMA.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
[ Reflow comment. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jörg Rödel <joro@8bytes.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: kexec@lists.infradead.org
Link: http://lkml.kernel.org/r/1433500202-25531-4-git-send-email-joro@8bytes.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When we boot a kdump kernel in high memory, there is by
default only 72MB of low memory available. The swiotlb code
takes 64MB of it (by default) so that there are only 8MB
left to allocate from. On systems with many devices this
causes page allocator warnings from
dma_generic_alloc_coherent():
systemd-udevd: page allocation failure: order:0, mode:0x280d4
CPU: 0 PID: 197 Comm: systemd-udevd Tainted: G W
3.12.28-4-default #1 Hardware name: HP ProLiant DL980 G7, BIOS
P66 07/30/2012 ffff8800781335e0 ffffffff8150b1db 00000000000280d4 ffffffff8113af90
0000000000000000 0000000000000000 ffff88007efdbb00 0000000100000000
0000000000000000 0000000000000000 0000000000000000 0000000000000001
Call Trace:
dump_trace+0x7d/0x2d0
show_stack_log_lvl+0x94/0x170
show_stack+0x21/0x50
dump_stack+0x41/0x51
warn_alloc_failed+0xf0/0x160
__alloc_pages_slowpath+0x72f/0x796
__alloc_pages_nodemask+0x1ea/0x210
dma_generic_alloc_coherent+0x96/0x140
x86_swiotlb_alloc_coherent+0x1c/0x50
ttm_dma_pool_alloc_new_pages+0xab/0x320 [ttm]
ttm_dma_populate+0x3ce/0x640 [ttm]
ttm_tt_bind+0x36/0x60 [ttm]
ttm_bo_handle_move_mem+0x55f/0x5c0 [ttm]
ttm_bo_move_buffer+0x105/0x130 [ttm]
ttm_bo_validate+0xc1/0x130 [ttm]
ttm_bo_init+0x24b/0x400 [ttm]
radeon_bo_create+0x16c/0x200 [radeon]
radeon_ring_init+0x11e/0x2b0 [radeon]
r100_cp_init+0x123/0x5b0 [radeon]
r100_startup+0x194/0x230 [radeon]
r100_init+0x223/0x410 [radeon]
radeon_device_init+0x6af/0x830 [radeon]
radeon_driver_load_kms+0x89/0x180 [radeon]
drm_get_pci_dev+0x121/0x2f0 [drm]
local_pci_probe+0x39/0x60
pci_device_probe+0xa9/0x120
driver_probe_device+0x9d/0x3d0
__driver_attach+0x8b/0x90
bus_for_each_dev+0x5b/0x90
bus_add_driver+0x1f8/0x2c0
driver_register+0x5b/0xe0
do_one_initcall+0xf2/0x1a0
load_module+0x1207/0x1c70
SYSC_finit_module+0x75/0xa0
system_call_fastpath+0x16/0x1b
0x7fac533d2788
After these warnings the code enters a fall-back path and
allocated directly from the swiotlb aperture in the end.
So remove these warnings as this is not a fatal error.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
[ Simplify, reflow comment. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jörg Rödel <joro@8bytes.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: kexec@lists.infradead.org
Link: http://lkml.kernel.org/r/1433500202-25531-3-git-send-email-joro@8bytes.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Print a warning when all allocation tries have been failed
and the function is about to return NULL.
This prepares for calling the function with __GFP_NOWARN to
suppress allocation failure warnings before all fall-backs
have failed - which we'll do to improve kdump behavior.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jörg Rödel <joro@8bytes.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: kexec@lists.infradead.org
Link: http://lkml.kernel.org/r/1433500202-25531-2-git-send-email-joro@8bytes.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the addition of sysfs multicast router support if one set
multicast_router to "2" more than once, then the port would be added to
the hlist every time and could end up linking to itself and thus causing an
endless loop for rlist walkers.
So to reproduce just do:
echo 2 > multicast_router; echo 2 > multicast_router;
in a bridge port and let some igmp traffic flow, for me it hangs up
in br_multicast_flood().
Fix this by adding a check in br_multicast_add_router() if the port is
already linked.
The reason this didn't happen before the addition of multicast_router
sysfs entries is because there's a !hlist_unhashed check that prevents
it.
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Fixes: 0909e11758 ("bridge: Add multicast_router sysfs entries")
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the TIPC connection timer expires in a probing state, a
self abort message is supposed to be generated and delivered
to the local socket. This is currently broken, and the abort
message is actually sent out to the peer node with invalid
addressing information. This will cause the link to enter
a constant retransmission state and eventually reset.
We fix this by removing the self-abort message creation and
tear down connection immediately instead.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently snd-hda-intel driver aborts the probing of Intel HD-audio
controller with i915 power well management when binding with i915
driver via hda_i915_init() fails. This is no big problem for Haswell
and Broadwell where the HD-audio controllers are dedicated to
HDMI/DP, thus i915 link is mandatory. However, Skylake, Baytrail and
Braswell have only one controller and both HDMI/DP and analog codecs
share the same bus. Thus, even if HDMI/DP isn't usable, we should
keep the controller working for other codecs.
For fixing this, this patch simply allows continuing the probing even
if hda_i915_init() call fails. This may leave stale sound components
for HDMI/DP devices that are unbound with graphics. We could abort
the probing selectively, but from the code simplicity POV, it's better
to continue in all cases.
Reported-by: Libin Yang <libin.yang@intel.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVeHo5AAoJEMsfJm/On5mBU8oQAJIAMPvFs0VqrzkgzILjDpmO
sDhNwrEMbRSyTHZMvcuvAbMpeSYAeGu4KZ1oQlxxr18Sw0QJ9Asw79drhWI9+5ic
pt6DBNaPbLlfAdbOgZaIdcETIUPE9SGPcmIMJLpVtgJ+9LmJnmSOWI358d1WFEJQ
u45OQfWM5tLxMFwFeMj9dFrAGP9f6Z+lSbdp5RoOqFZ4S6Ad/5mwy3i09Hr+n5pY
HUoSFd+Z/4WT2pFW3//x/Mwejy4FqMIp6OEk8XsrGz3S0wqKRyeuEuypLqBuRAjj
IgfVtUwlIlmRs1on3/20qOb7O/O9BYceRtFYZ7OraIf/ljaGn97tUlkcBq0BW2IK
MGLWppTngnpW5EJ6+h4qtF/KnWRPVNZKDC6hA4KMwBm8PVAdKr1bFxkG2dS3Ycz/
Tn24s8EtaSszQLNWacUhP8JblVh7OQk7QXEKTN8QT31qJHcHpS+ZY+Fk7KHHTDud
kta42WPau5GHQC4sVjmcqxdth1BpP2IWZAEUNdANTK8T1g1HOAcZqWGbYCBBiM2Q
ClP1pJ54+W8uhw3W44hm/bIs20YQ6rKYZColK8xOL1ykNUjV6HVxqN9q8+clrlo0
SQMWUtsCQn33SM48P+zE2V9NYLHaO4SSdSo6Li+KxUdU0u2IZG8IigJ+OzIWi43n
G6Lub3GQQTZg1qix7PmQ
=PGHg
-----END PGP SIGNATURE-----
Merge tag 'misc-for-linus-4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull misc fixes from Guenter Roeck:
"There are two patches here. One fixes a build error affecting the
blackfin architecture, the other fixes a build error affecting the
score architecture.
The score maintainer (Lennox Wu) has a hard time sending you the score
patch, and the blackfin maintainer (Steven Miao) has been silent since
-rc1. Since 4.1 is about to be released, I figured it would be useful
to get the patches upstream to avoid the related build failures in the
final release"
* tag 'misc-for-linus-4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
score: Fix exception handler label
blackfin: Fix build error
Merge misc fixes from Andrew Morton:
"The gcc-4.4.4 workaround has actually been merged into a KVM tree by
Paolo but it is stuck in linux-next and mainline needs it"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
arch/x86/kvm/mmu.c: work around gcc-4.4.4 bug
sched, numa: do not hint for NUMA balancing on VM_MIXEDMAP mappings
zsmalloc: fix a null pointer dereference in destroy_handle_cache()
mm: memcontrol: fix false-positive VM_BUG_ON() on -rt
checkpatch: fix "GLOBAL_INITIALISERS" test
zram: clear disk io accounting when reset zram device
memcg: do not call reclaim if !__GFP_WAIT
mm/memory_hotplug.c: set zone->wait_table to null after freeing it
Fix this compile issue with gcc-4.4.4:
arch/x86/kvm/mmu.c: In function 'kvm_mmu_pte_write':
arch/x86/kvm/mmu.c:4256: error: unknown field 'cr0_wp' specified in initializer
arch/x86/kvm/mmu.c:4257: error: unknown field 'cr4_pae' specified in initializer
arch/x86/kvm/mmu.c:4257: warning: excess elements in union initializer
...
gcc-4.4.4 (at least) has issues when using anonymous unions in
initializers.
Fixes: edc90b7dc4 ("KVM: MMU: fix SMAP virtualization")
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jovi Zhangwei reported the following problem
Below kernel vm bug can be triggered by tcpdump which mmaped a lot of pages
with GFP_COMP flag.
[Mon May 25 05:29:33 2015] page:ffffea0015414000 count:66 mapcount:1 mapping: (null) index:0x0
[Mon May 25 05:29:33 2015] flags: 0x20047580004000(head)
[Mon May 25 05:29:33 2015] page dumped because: VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page))
[Mon May 25 05:29:33 2015] ------------[ cut here ]------------
[Mon May 25 05:29:33 2015] kernel BUG at mm/migrate.c:1661!
[Mon May 25 05:29:33 2015] invalid opcode: 0000 [#1] SMP
In this case it was triggered by running tcpdump but it's not necessary
reproducible on all systems.
sudo tcpdump -i bond0.100 'tcp port 4242' -c 100000000000 -w 4242.pcap
Compound pages cannot be migrated and it was not expected that such pages
be marked for NUMA balancing. This did not take into account that drivers
such as net/packet/af_packet.c may insert compound pages into userspace
with vm_insert_page. This patch tells the NUMA balancing protection
scanner to skip all VM_MIXEDMAP mappings which avoids the possibility that
compound pages are marked for migration.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Jovi Zhangwei <jovi@cloudflare.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If zs_create_pool()->create_handle_cache()->kmem_cache_create() or
pool->name allocation fails, zs_create_pool()->destroy_handle_cache()
will dereference the NULL pool->handle_cachep.
Modify destroy_handle_cache() to avoid this.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On -rt, the VM_BUG_ON(!irqs_disabled()) triggers inside the memcg
swapout path because the spin_lock_irq(&mapping->tree_lock) in the
caller doesn't actually disable the hardware interrupts - which is fine,
because on -rt the tophalves run in process context and so we are still
safe from preemption while updating the statistics.
Remove the VM_BUG_ON() but keep the comment of what we rely on.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Clark Williams <williams@redhat.com>
Cc: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit d5e616fc1c ("checkpatch: add a few more --fix corrections")
broke the GLOBAL_INITIALISERS test with bad parentheses and optional
leading spaces.
Fix it.
Signed-off-by: Joe Perches <joe@perches.com>
Reported-by: Bandan Das <bsd@makefile.in>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clear zram disk io accounting when resetting the zram device. Otherwise
the residual io accounting stat will affect the diskstat in the next
zram active cycle.
Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When trimming memcg consumption excess (see memory.high), we call
try_to_free_mem_cgroup_pages without checking if we are allowed to sleep
in the current context, which can result in a deadlock. Fix this.
Fixes: 241994ed86 ("mm: memcontrol: default hierarchy interface for memory")
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Izumi found the following oops when hot re-adding a node:
BUG: unable to handle kernel paging request at ffffc90008963690
IP: __wake_up_bit+0x20/0x70
Oops: 0000 [#1] SMP
CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80
Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015
task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000
RIP: 0010:[<ffffffff810dff80>] [<ffffffff810dff80>] __wake_up_bit+0x20/0x70
RSP: 0018:ffff880017b97be8 EFLAGS: 00010246
RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9
RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648
RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800
R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000
FS: 00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0
Call Trace:
unlock_page+0x6d/0x70
generic_write_end+0x53/0xb0
xfs_vm_write_end+0x29/0x80 [xfs]
generic_perform_write+0x10a/0x1e0
xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs]
xfs_file_write_iter+0x79/0x120 [xfs]
__vfs_write+0xd4/0x110
vfs_write+0xac/0x1c0
SyS_write+0x58/0xd0
system_call_fastpath+0x12/0x76
Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 <48> 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48
RIP [<ffffffff810dff80>] __wake_up_bit+0x20/0x70
RSP <ffff880017b97be8>
CR2: ffffc90008963690
Reproduce method (re-add a node)::
Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic)
This seems an use-after-free problem, and the root cause is
zone->wait_table was not set to *NULL* after free it in
try_offline_node.
When hot re-add a node, we will reuse the pgdat of it, so does the zone
struct, and when add pages to the target zone, it will init the zone
first (including the wait_table) if the zone is not initialized. The
judgement of zone initialized is based on zone->wait_table:
static inline bool zone_is_initialized(struct zone *zone)
{
return !!zone->wait_table;
}
so if we do not set the zone->wait_table to *NULL* after free it, the
memory hotplug routine will skip the init of new zone when hot re-add
the node, and the wait_table still points to the freed memory, then we
will access the invalid address when trying to wake up the waiting
people after the i/o operation with the page is done, such as mentioned
above.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reported-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
Reviewed by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The latest version of modinfo fails to compile score architecture
targets with the following error.
FATAL: The relocation at __ex_table+0x634 references
section "__ex_table" which is not executable, IOW
the kernel will fault if it ever tries to
jump to it. Something is seriously wrong
and should be fixed.
The probem is caused by a bad label in an __ex_table entry.
Acked-by: Lennox Wu <lennox.wu@gmail.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>