Fix a TCP loss recovery performance bug raised recently on the netdev
list, in two threads:
(i) July 26, 2017: netdev thread "TCP fast retransmit issues"
(ii) July 26, 2017: netdev thread:
"[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
outstanding TLP retransmission"
The basic problem is that incoming TCP packets that did not indicate
forward progress could cause the xmit timer (TLP or RTO) to be rearmed
and pushed back in time. In certain corner cases this could result in
the following problems noted in these threads:
- Repeated ACKs coming in with bogus SACKs corrupted by middleboxes
could cause TCP to repeatedly schedule TLPs forever. We kept
sending TLPs after every ~200ms, which elicited bogus SACKs, which
caused more TLPs, ad infinitum; we never fired an RTO to fill in
the holes.
- Incoming data segments could, in some cases, cause us to reschedule
our RTO or TLP timer further out in time, for no good reason. This
could cause repeated inbound data to result in stalls in outbound
data, in the presence of packet loss.
This commit fixes these bugs by changing the TLP and RTO ACK
processing to:
(a) Only reschedule the xmit timer once per ACK.
(b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the
ACK indicates sufficient forward progress (a packet was
cumulatively ACKed, or we got a SACK for a packet that was sent
before the most recent retransmit of the write queue head).
This brings us back into closer compliance with the RFCs, since, as
the comment for tcp_rearm_rto() notes, we should only restart the RTO
timer after forward progress on the connection. Previously we were
restarting the xmit timer even in these cases where there was no
forward progress.
As a side benefit, this commit simplifies and speeds up the TCP timer
arming logic. We had been calling inet_csk_reset_xmit_timer() three
times on normal ACKs that cumulatively acknowledged some data:
1) Once near the top of tcp_ack() to switch from TLP timer to RTO:
if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
tcp_rearm_rto(sk);
2) Once in tcp_clean_rtx_queue(), to update the RTO:
if (flag & FLAG_ACKED) {
tcp_rearm_rto(sk);
3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO
to TLP:
if (icsk->icsk_pending == ICSK_TIME_RETRANS)
tcp_schedule_loss_probe(sk);
This commit, by only rescheduling the xmit timer once per ACK,
simplifies the code and reduces CPU overhead.
This commit was tested in an A/B test with Google web server
traffic. SNMP stats and request latency metrics were within noise
levels, substantiating that for normal web traffic patterns this is a
rare issue. This commit was also tested with packetdrill tests to
verify that it fixes the timer behavior in the corner cases discussed
in the netdev threads mentioned above.
This patch is a bug fix patch intended to be queued for -stable
relases.
Fixes: 6ba8a3b19e ("tcp: Tail loss probe (TLP)")
Reported-by: Klavs Klavsen <kl@vsen.dk>
Reported-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Have tcp_schedule_loss_probe() base the TLP scheduling decision based
on when the RTO *should* fire. This is to enable the upcoming xmit
timer fix in this series, where tcp_schedule_loss_probe() cannot
assume that the last timer installed was an RTO timer (because we are
no longer doing the "rearm RTO, rearm RTO, rearm TLP" dance on every
ACK). So tcp_schedule_loss_probe() must independently figure out when
an RTO would want to fire.
In the new TLP implementation following in this series, we cannot
assume that icsk_timeout was set based on an RTO; after processing a
cumulative ACK the icsk_timeout we see can be from a previous TLP or
RTO. So we need to independently recalculate the RTO time (instead of
reading it out of icsk_timeout). Removing this dependency on the
nature of icsk_timeout makes things a little easier to reason about
anyway.
Note that the old and new code should be equivalent, since they are
both saying: "if the RTO is in the future, but at an earlier time than
the normal TLP time, then set the TLP timer to fire when the RTO would
have fired".
Fixes: 6ba8a3b19e ("tcp: Tail loss probe (TLP)")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pure refactor. This helper will be required in the xmit timer fix
later in the patch series. (Because the TLP logic will want to make
this calculation.)
Fixes: 6ba8a3b19e ("tcp: Tail loss probe (TLP)")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit c2ed1880fd ("net: ipv6: check route protocol when
deleting routes"), ipv6 route checks rt protocol when trying to
remove a rt entry.
It introduced a side effect causing 'ip -6 route flush cache' not
to work well. When flushing caches with iproute, all route caches
get dumped from kernel then removed one by one by sending DELROUTE
requests to kernel for each cache.
The thing is iproute sends the request with the cache whose proto
is set with RTPROT_REDIRECT by rt6_fill_node() when kernel dumps
it. But in kernel the rt_cache protocol is still 0, which causes
the cache not to be matched and removed.
So the real reason is rt6i_protocol in the route is not set when
it is allocated. As David Ahern's suggestion, this patch is to
set rt6i_protocol properly in the route when it is installed and
remove the codes setting rtm_protocol according to rt6i_flags in
rt6_fill_node.
This is also an improvement to keep rt6i_protocol consistent with
rtm_protocol.
Fixes: c2ed1880fd ("net: ipv6: check route protocol when deleting routes")
Reported-by: Jianlin Shi <jishi@redhat.com>
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge misc fixes from Andrew Morton:
"15 fixes"
[ This does not merge the "fortify: use WARN instead of BUG for now"
patch, which needs a bit of extra work to build cleanly with all
configurations. Arnd is on it. - Linus ]
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
ocfs2: don't clear SGID when inheriting ACLs
mm: allow page_cache_get_speculative in interrupt context
userfaultfd: non-cooperative: flush event_wqh at release time
ipc: add missing container_of()s for randstruct
cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
userfaultfd_zeropage: return -ENOSPC in case mm has gone
mm: take memory hotplug lock within numa_zonelist_order_handler()
mm/page_io.c: fix oops during block io poll in swapin path
zram: do not free pool->size_class
kthread: fix documentation build warning
kasan: avoid -Wmaybe-uninitialized warning
userfaultfd: non-cooperative: notify about unmap of destination during mremap
mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries
pid: kill pidhash_size in pidhash_init()
mm/hugetlb.c: __get_user_pages ignores certain follow_hugetlb_page errors
- Fix a device ID of Hisilicon Hip07/08 in the ACPI APD (AMD SoC)
driver (Hanjun Guo).
- Fix list corruption (introduced during the 4.11 cycle) in the ACPI
LPSS (Intel SoC) driver (Hans de Goede).
- Fix PCC mailbox handling code crash during initialization when
PCCT is not present and PCC channel 0 is requested (Hoan Tran).
- Fix a WDAT watchdog initialization issue causing platform device
creation to fail due to partially overlapping address ranges in
resources (Ryan Kennedy).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJZg3FsAAoJEILEb/54YlRxmowQAIjFSnojQPMu1Gd8qSvnO5j/
5XqE1ohkb/0sfkY6uLPX/mDU7fEZHr+rz3bLVgU7BPumcqQ9rxi8P+1M0zl3ccIe
KxGaRF0zL3TVnLzzxcQUprdCLBVJH+bu7uiNMvFq0pLrKCKMT1SdVwG9HsZeDsQD
IdpACI4UkdQ840KwOXePpvlmOI03ThHZAcJn9TrmupA90teF/j/pqdV25A8+yWo3
4Yxmh7Exd1wW5B6IFMMxmM8WhOB8Q0X50Wehsg2Yi16HddjI9IEwg+KnsgRRiodw
L+eDQgg6wUNJ2swVu7G/7KdvA4BDV5ZYbEo6gc0NZAhs0pTm4M6kMwa4i0OQ572r
WzhFrHfIuqyDMzSPJp9l5D38wgmIykMkX2dYKMi+sxnyi3ZgagAOlmYbmsPzhFWL
TrEhd7fJ687xtC991mLfNyWM7cr7mpFJL+eB9sOospBmF/xjFBlwvtNZI6dYhoDh
ON6E3QQ6/G/5XQQPF+SKBrMNllCs+N1hefvZiG9SD2i5u+m31Vd3WLDHoqHMcASu
oX7+PuvasBJNAVM/lMtbKtXcyHNfYRvzWBqkMkQPNIkmMfdXgI8g8ZNpj2R+onfd
RPJFp/IsF8FSjLA4Q/BMduLd/qjz7wZdIjuZZO/9eFOHVh8wBOt2ZfaFUjA00OmE
lKG8r1pui+P6e4o8OkO1
=ZD/q
-----END PGP SIGNATURE-----
Merge tag 'acpi-4.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix two issues in the ACPI SoC drivers (Intel LPSS and AMD APD),
a crash in the PCC mailbox initialization code and a WDAT watchdog
initialization failure.
Specifics:
- Fix a device ID of Hisilicon Hip07/08 in the ACPI APD (AMD SoC)
driver (Hanjun Guo).
- Fix list corruption (introduced during the 4.11 cycle) in the ACPI
LPSS (Intel SoC) driver (Hans de Goede).
- Fix PCC mailbox handling code crash during initialization when PCCT
is not present and PCC channel 0 is requested (Hoan Tran).
- Fix a WDAT watchdog initialization issue causing platform device
creation to fail due to partially overlapping address ranges in
resources (Ryan Kennedy)"
* tag 'acpi-4.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: APD: Fix HID for Hisilicon Hip07/08
mailbox: pcc: Fix crash when request PCC channel 0
ACPI / watchdog: Fix init failure with overlapping register regions
ACPI / LPSS: Only call pwm_add_table() for the first PWM controller
- Fix the handling of the scaling_cur_freq cpufreq policy attribute
on x86 systems with the MPERF/APERF registers present to make it
behave more as expected after recent changes (Rafael Wysocki).
- Drop a leftover callback from the intel_pstate driver which also
prevents the cpuinfo_cur_freq cpufreq policy attribute from being
incorrectly exposed when intel_pstate works in the active mode
(Rafael Wysocki).
- Add a missing piece describing the cpuinfo_cur_freq policy
attribute to cpufreq documentation (Rafael Wysocki).
- Fix up a recently added part of the Thunderbolt driver to avoid
aborting system suspends if its mailbox commands time out (Rafael
Wysocki).
- Update device runtime PM framework documentation to reflect the
current behavior of the code (Johan Hovold).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJZg3DiAAoJEILEb/54YlRxvgkP/RV5djXAUeUpoMnzJCnP7jpB
bYvzqIXN2olcZPNP3hNLM7W1yRAWiwL9F1aFF2apJPyMQ8DW7iBqltAxQY+h2Ggc
lLvQcT5NRCFbIlsRadBL3TMZMFmtYqBD9Q8C+LLgQDWmm3mgG8o0kFJdGWkOMHg7
4FaqTuF1np8ARd1jX3E4QGCVaSqIEKN9xMo7jzsWIHXruag3tJb92ryROVcq3uO7
5WspJSbVMKS0E6jbK3RQkJ9T/3U453p8JMDJNQfOUFr1KQNfAikyV3U0ZoQA6Hjk
1tmJE9v7jq75BDj0Dtfskl8EBFA4Sk+Pegyb3/hB8MR/oRzCWRAe9rVW5LweckcC
Q1Q+SDKcVKVIEQa3GZOH1nBfHP/pIMQb0+WGMIq614rr1Ug9IuF4cCn2AnzkIFLE
lpkVch3mCfrCwAKmsvYug5isHdazlS93MY9dOZr4D1lewgFp1piShsBSK3RFBT8F
Cu3lYTzBi/MJ+8T0vZysHVEkCNu3emrXwccjzubW1rSNINGgRF5AV9F2HwGYzDHe
KD3Kw1KH2LZEWZAlR6W711pBJny4INENPH0XE85hZ4Ryi+2SN7zvsA/PM0rGjqBq
TcuFc3jN+DMwWoOJryY8701gE8SMz+8Vam/B191XFUkdZSHFk1WWILAu/OfTZCRo
bLYhORgyemWPluAzzcsx
=j2UJ
-----END PGP SIGNATURE-----
Merge tag 'pm-4.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix two cpufreq issues, one introduced recently and one related
to recent changes, fix cpufreq documentation, fix up recently added
code in the Thunderbolt driver and update runtime PM framework
documentation.
Specifics:
- Fix the handling of the scaling_cur_freq cpufreq policy attribute
on x86 systems with the MPERF/APERF registers present to make it
behave more as expected after recent changes (Rafael Wysocki).
- Drop a leftover callback from the intel_pstate driver which also
prevents the cpuinfo_cur_freq cpufreq policy attribute from being
incorrectly exposed when intel_pstate works in the active mode
(Rafael Wysocki).
- Add a missing piece describing the cpuinfo_cur_freq policy
attribute to cpufreq documentation (Rafael Wysocki).
- Fix up a recently added part of the Thunderbolt driver to avoid
aborting system suspends if its mailbox commands time out (Rafael
Wysocki).
- Update device runtime PM framework documentation to reflect the
current behavior of the code (Johan Hovold)"
* tag 'pm-4.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thunderbolt: icm: Ignore mailbox errors in icm_suspend()
cpufreq: x86: Make scaling_cur_freq behave more as expected
PM / runtime: Document new pm_runtime_set_suspended() constraint
cpufreq: docs: Add missing cpuinfo_cur_freq description
cpufreq: intel_pstate: Drop ->get from intel_pstate structure
Here are some new device ids for v4.13-rc4.
All have been in linux-next with no reported issues.
Signed-off-by: Johan Hovold <johan@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCAAvFiEEHszNKQClByu0A+9RQQ3kT97htJUFAlmDSXARHGpvaGFuQGtl
cm5lbC5vcmcACgkQQQ3kT97htJWgoA/+LGO5SSJZTjGlqilC90idyoa1Ee1CtDCY
Vuc/95wTFGcTs5i2QPt94/Pma8uxRsKI2zu1rFxef6jJjiSoRUsUaSjyJwaDkuEj
YkeHUFcAzS8Gu+8U5Pz8DsB60Kw8pR+rS3VJsyOdBSUWLw/t6AVK3HagiejdqXXn
L3lYbc8rohDFsj7UjdNd9Sgv5vS3pBRO1+lyAu0OqX8UpOUmndPdwbsmuQbcFDSE
F/YI+oCxrbqLYma52buKBLNRmj5eF6QU0mjVJIfo1vFzx/zsGq64m7K7PVZ3F25H
j354jmdONuuStqbLgRz3RkPHQK3ootQ7NEcyxYdN/Rs7bfWlnkfbMPK7LC9BCYwt
w2Gx87nhBBiGLfP0GZ5qytwHCOYXVzxecwy4u1Byf4bt5w2J6CC1qYF1yUbK5Dxj
jZUXA9Hd0bLHtlEldm5IcMlUeahseGHyYvH7OaXZXk/Uq4YhFH6vcTkuAQSJ0cTD
KZgkfHymL/ukWORX5cZT6iU9M1pDdmDXpCz7h35FgaVEKNHmX62WX9SzgAyvCO3a
lOSoVS//QyjaOSIboTm6a2NFE9YL4+5OTbooAj0qcU7vf5P3Lj2t3pm6sn2tpcvs
YN6IzNbpWI8iGUUeRc88bT0ZSx6qf6B0kgmJQvnqn4z/fQOq1LfYgaPRxWVfXqGO
K9rBZPH6svU=
=bRO0
-----END PGP SIGNATURE-----
Merge tag 'usb-serial-4.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial into usb-linus
Johan writes:
USB-serial fixes for v4.13-rc4
Here are some new device ids for v4.13-rc4.
All have been in linux-next with no reported issues.
Signed-off-by: Johan Hovold <johan@kernel.org>
- fix TT sync flag inconsistency problems, which can lead to excess packets,
by Linus Luessing
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlmB3ikWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoUCUEADIvPfBxzon12xBV4JbBpTMnIwl
09heUwE1FxCj2UoTV2QJ+O9ZQb1yc7ARJksymP30T96q9bNlDiNFVquXcuQLA+ba
aoSowejLfTHLvOlP3Wrik5PVLBdKqeBpg7zYNtAuNI09DSryey1ipaC4Xemq9ZmC
8Hjam9RAQh5FV3RCeviTzpusXt31/xjkTx3vrf1WGx2uY/MSKNwwfPcRAyfG+Ye5
9kxpR8sBwRxJ+6+MEWRBjqi2Lk/BuD4urViAwGj88uDkKKByXT/pfULD3foTgx0d
smZWWMun9jQl4XDk321oOp5HOqg29IFcF/mwuWQIvBDO8AJBRndYHw07lx0cB5tD
I8PzTG5+kE3QXBeTBEUAuszaavDI9DKVdnr6h/8VbeEXIr73WHA0VixnTIv3zgI0
AiYHIrvUstDB7Cl5457blKbljgzXMN1ssBtD/UMNFhrEUKet9qIZskwlLyhPt2s3
qti70ZoURPz9ve57pNgZDZUR5lvvUhuSNUD4ZfGt5BC+YErX59jAswrRXHU5Zx/i
V5pSsRQ2BfxZUZrnmKONg3RW+hmsJJYOSzNXiO5/0zeVuorju7C+0L2goONX2bpr
fEzICfDJi+IXmYke0AAjiJq7KwobU703Nd3rT8eovjVlSSJj+mdyYSVYQF9dD2uE
DdtwpJh37PwbHOeQqg==
=a4HS
-----END PGP SIGNATURE-----
Merge tag 'batadv-net-for-davem-20170802' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here is a batman-adv bugfix:
- fix TT sync flag inconsistency problems, which can lead to excess packets,
by Linus Luessing
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Felipe writes:
usb: fixes for v4.13-rc4
Another fix for isochronous transfers on dwc3. This time around, we're
making sure that we use correct PIDs in all transfer sizes.
MSM PHY driver got a fix for the use of devm_regulator_bulk_get() API
which will avoid kernel crashes.
Renesas DRD driver got 3 fixes: a fix on giveback, a fix for proper
controller programming and the removal of set-but-never-used variable.
- Yet another race with VM destruction plugged
- A set of small vgic fixes
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAlmDOZ4VHG1hcmMuenlu
Z2llckBhcm0uY29tAAoJECPQ0LrRPXpDnwQP/ji7d7bbWybmeuWKXm8TYt6CNPBr
JC57OQHytU1+ccagjeUPaAgUAqWSQWOwfbqPuS1afshJu1ZZsN851QNmcr5W6bXQ
EwjICGWcAdKSMLnhRDdf+uXQbFSkkcaxsXnJWjmD0EE8ylaXCCkdV278xhTx0hVO
yiov/xWNzLMa3O3W/l4SIKQ4UNyeQ7f+Od1Vkf7iUpaaFcz6s3RNsPAOwc7Thq7L
eLk4SgXMLDNHz5HwdLoOp3RwQsNe9A7uh05z9VOav0YAyLG5vf29JhMroCO/4P/x
1HgWvpGwzxAUqUcHbW4FSOsbydEltlWMdfSoAF7BtPCLudmmsxfMJJlK3FKLvV+P
MsO2FdXiz6/ZLI9/ds7YJDOfc1cGSVK07Efx+SLXD5FAnkFYq4SElmI4sK8BWc5U
Dugw3j8u9M6jTiVKBioFo7yRT5iHG9O0A+6DtsMQ0iREfLolC7EEzpfGB/vWpKk5
BbdwDUiu6JxXEBJJqyP948iGfuIrikAx2mcf3dmEHVKhEnKi+MN/H5c9BKGAJ3sf
Fm1xF99IG+m3L3zXiukQ11OqsCx32kqLiXaejVttscFg/C33OYmGOhsk066hHCzx
LtoxrHUl7rwF+ssnKHq4Az2mYzZtquEq13O5tgleQcVwH/2+6EgWDf3mDSy1em0h
5IXgqk+PsaQ5n0jL
=TMR9
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-for-v4.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm
KVM/ARM Fixes for v4.13-rc4
- Yet another race with VM destruction plugged
- A set of small vgic fixes
Commit 8fba54aebb ("fuse: direct-io: don't dirty ITER_BVEC pages") fixes
the ITER_BVEC page deadlock for direct io in fuse by checking in
fuse_direct_io(), whether the page is a bvec page or not, before locking
it. However, this check is missed when the "async_dio" mount option is
enabled. In this case, set_page_dirty_lock() is called from the req->end
callback in request_end(), when the fuse thread is returning from userspace
to respond to the read request. This will cause the same deadlock because
the bvec condition is not checked in this path.
Here is the stack of the deadlocked thread, while returning from userspace:
[13706.656686] INFO: task glusterfs:3006 blocked for more than 120 seconds.
[13706.657808] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[13706.658788] glusterfs D ffffffff816c80f0 0 3006 1
0x00000080
[13706.658797] ffff8800d6713a58 0000000000000086 ffff8800d9ad7000
ffff8800d9ad5400
[13706.658799] ffff88011ffd5cc0 ffff8800d6710008 ffff88011fd176c0
7fffffffffffffff
[13706.658801] 0000000000000002 ffffffff816c80f0 ffff8800d6713a78
ffffffff816c790e
[13706.658803] Call Trace:
[13706.658809] [<ffffffff816c80f0>] ? bit_wait_io_timeout+0x80/0x80
[13706.658811] [<ffffffff816c790e>] schedule+0x3e/0x90
[13706.658813] [<ffffffff816ca7e5>] schedule_timeout+0x1b5/0x210
[13706.658816] [<ffffffff81073ffb>] ? gup_pud_range+0x1db/0x1f0
[13706.658817] [<ffffffff810668fe>] ? kvm_clock_read+0x1e/0x20
[13706.658819] [<ffffffff81066909>] ? kvm_clock_get_cycles+0x9/0x10
[13706.658822] [<ffffffff810f5792>] ? ktime_get+0x52/0xc0
[13706.658824] [<ffffffff816c6f04>] io_schedule_timeout+0xa4/0x110
[13706.658826] [<ffffffff816c8126>] bit_wait_io+0x36/0x50
[13706.658828] [<ffffffff816c7d06>] __wait_on_bit_lock+0x76/0xb0
[13706.658831] [<ffffffffa0545636>] ? lock_request+0x46/0x70 [fuse]
[13706.658834] [<ffffffff8118800a>] __lock_page+0xaa/0xb0
[13706.658836] [<ffffffff810c8500>] ? wake_atomic_t_function+0x40/0x40
[13706.658838] [<ffffffff81194d08>] set_page_dirty_lock+0x58/0x60
[13706.658841] [<ffffffffa054d968>] fuse_release_user_pages+0x58/0x70 [fuse]
[13706.658844] [<ffffffffa0551430>] ? fuse_aio_complete+0x190/0x190 [fuse]
[13706.658847] [<ffffffffa0551459>] fuse_aio_complete_req+0x29/0x90 [fuse]
[13706.658849] [<ffffffffa05471e9>] request_end+0xd9/0x190 [fuse]
[13706.658852] [<ffffffffa0549126>] fuse_dev_do_write+0x336/0x490 [fuse]
[13706.658854] [<ffffffffa054963e>] fuse_dev_write+0x6e/0xa0 [fuse]
[13706.658857] [<ffffffff812a9ef3>] ? security_file_permission+0x23/0x90
[13706.658859] [<ffffffff81205300>] do_iter_readv_writev+0x60/0x90
[13706.658862] [<ffffffffa05495d0>] ? fuse_dev_splice_write+0x350/0x350
[fuse]
[13706.658863] [<ffffffff812062a1>] do_readv_writev+0x171/0x1f0
[13706.658866] [<ffffffff810b3d00>] ? try_to_wake_up+0x210/0x210
[13706.658868] [<ffffffff81206361>] vfs_writev+0x41/0x50
[13706.658870] [<ffffffff81206496>] SyS_writev+0x56/0xf0
[13706.658872] [<ffffffff810257a1>] ? syscall_trace_leave+0xf1/0x160
[13706.658874] [<ffffffff816cbb2e>] system_call_fastpath+0x12/0x71
Fix this by making should_dirty a fuse_io_priv parameter that can be
checked in fuse_aio_complete_req().
Reported-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
There is a small chance that the compiler could generate separate loads
for the dist->propbaser which could be modified from another CPU. As we
want to make sure we atomically update the entire value, and don't race
with other updates, guarantee that the cmpxchg operation compares
against the original value.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Christoffer Dall <cdall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
------------[ cut here ]------------
WARNING: CPU: 5 PID: 2288 at arch/x86/kvm/vmx.c:11124 nested_vmx_vmexit+0xd64/0xd70 [kvm_intel]
CPU: 5 PID: 2288 Comm: qemu-system-x86 Not tainted 4.13.0-rc2+ #7
RIP: 0010:nested_vmx_vmexit+0xd64/0xd70 [kvm_intel]
Call Trace:
vmx_check_nested_events+0x131/0x1f0 [kvm_intel]
? vmx_check_nested_events+0x131/0x1f0 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x5dd/0x1be0 [kvm]
? vmx_vcpu_load+0x1be/0x220 [kvm_intel]
? kvm_arch_vcpu_load+0x62/0x230 [kvm]
kvm_vcpu_ioctl+0x340/0x700 [kvm]
? kvm_vcpu_ioctl+0x340/0x700 [kvm]
? __fget+0xfc/0x210
do_vfs_ioctl+0xa4/0x6a0
? __fget+0x11d/0x210
SyS_ioctl+0x79/0x90
do_syscall_64+0x8f/0x750
? trace_hardirqs_on_thunk+0x1a/0x1c
entry_SYSCALL64_slow_path+0x25/0x25
This can be reproduced by booting L1 guest w/ 'noapic' grub parameter, which
means that tells the kernel to not make use of any IOAPICs that may be present
in the system.
Actually external_intr variable in nested_vmx_vmexit() is the req_int_win
variable passed from vcpu_enter_guest() which means that the L0's userspace
requests an irq window. I observed the scenario (!kvm_cpu_has_interrupt(vcpu) &&
L0's userspace reqeusts an irq window) is true, so there is no interrupt which
L1 requires to inject to L2, we should not attempt to emualte "Acknowledge
interrupt on exit" for the irq window requirement in this scenario.
This patch fixes it by not attempt to emulate "Acknowledge interrupt on exit"
if there is no L1 requirement to inject an interrupt to L2.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[Added code comment to make it obvious that the behavior is not correct.
We should do a userspace exit with open interrupt window instead of the
nested VM exit. This patch still improves the behavior, so it was
accepted as a (temporary) workaround.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The commit b8b9c974af ("usb: renesas_usbhs: gadget: disable all eps
when the driver stops") causes the unused-but-set-variable warning.
But, if the usbhsg_ep_disable() will return non-zero value, udc/core.c
doesn't clear the ep->enabled flag. So, this driver should not return
non-zero value, if the pipe is zero because this means the pipe is
already disabled. Otherwise, the ep->enabled flag is never cleared
when the usbhsg_ep_disable() is called by the renesas_usbhs driver first.
Fixes: b8b9c974af ("usb: renesas_usbhs: gadget: disable all eps when the driver stops")
Fixes: 11432050f0 ("usb: renesas_usbhs: gadget: fix NULL pointer dereference in ep_disable()")
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
The latest HW manual (Rev.0.55) shows us this UGCTRL2.VBUSSEL bit.
If the bit sets to 1, the VBUS drive is controlled by phy related
registers (called "UCOM Registers" on the manual). Since R-Car Gen3
environment will control VBUS by phy-rcar-gen3-usb2 driver,
the UGCTRL2.VBUSSEL bit should be set to 1. So, this patch fixes
the register's value. Otherwise, even if the ID pin indicates to
peripheral, the R-Car will output USBn_PWEN to 1 when a host driver
is running.
Fixes: de18757e27 ("usb: renesas_usbhs: add R-Car Gen3 power control"
Cc: <stable@vger.kernel.org> # v4.6+
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
The regulator_bulk_data pointer passed to devm_regulator_bulk_get()
is used to store the client handles for the regulators, which
is later used by devm_regulator_bulk_release() to free the
regulators.
Passing a local array as is done here means the memory used to
store the handles is freed causing the handles to be corrupted,
resulting in a crash when devm_regulator_bulk_release() tries to
free them.
Fix this my moving the array inside of the msm_otg structure.
Signed-off-by: Rajendra Nayak <rnayak@codeaurora.org>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
According to the gadget.h, a "complete" function will always be called
with interrupts disabled. However, sometimes usb3_request_done() function
is called with interrupts enabled. So, this function should be held
by spin_lock_irqsave() to disable interruption. Also, this driver has
to call spin_unlock() to avoid spinlock recursion by this driver before
calling usb_gadget_giveback_request().
Reported-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
Tested-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
Fixes: 746bfe63bb ("usb: gadget: renesas_usb3: add support for Renesas USB3.0 peripheral controller")
Cc: <stable@vger.kernel.org> # v4.5+
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
The PIDs for Isochronous data transfers are incorrect
for high bandwidth IN endpoints when the request length
is less than EP wMaxPacketSize.
As per spec correct PIDs for ISOC data transfers are:
1) For request length <= maxpacket
- DATA0,
2) For maxpacket < length <= (2 * maxpacket)
- DATA1, DATA0
3) For (2 * maxpacket) < length <= (3 * maxpacket)
- DATA2, DATA1, DATA0.
But driver always sets PCM fields based on wMaxPacketSize
due to which DATA2 happens even for small requests.
Fix this by setting the PCM field of trb->size depending
on request length rather than fixing it to the value
depending on wMaxPacketSize.
Ideally it shouldn't give any issues as dwc3 will send
0-length packet for next IN token if host sends (even
after receiving a short packet). Windows seems to ignore
this but with MacOS frame loss observed when using f_uvc.
Signed-off-by: Manu Gautam <mgautam@codeaurora.org>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
The commit 304419d8a7 ("mmc: core: Allocate per-request data using the
block layer core") refactored mechanism of queue handling caused
mmc_init_request() can be called just after mmc_cleanup_queue() caused null
pointer dereference.
Another commit bbdc74dc19 ("mmc: block: Prevent new req entering queue
after its cleanup") tried to fix the problem. However it actually miss one
corner case.
We could still reproduce the issue mentioned with these steps:
(1) insert a SD card and mount it
(2) hotplug it, so it will leave md->usage still be counted
(3) reboot the system which will sync data and umount the card
[Unable to handle kernel NULL pointer dereference at virtual address
00000000
[user pgtable: 4k pages, 48-bit VAs, pgd = ffff80007bab3000
[[0000000000000000] *pgd=000000007a828003, *pud=0000000078dce003,
*pmd=000000007aab6003, *pte=0000000000000000
[Internal error: Oops: 96000007 [#1] PREEMPT SMP
[Modules linked in:
[CPU: 3 PID: 3507 Comm: umount Tainted: G W
4.13.0-rc1-next-20170720-00012-g9d9bf45 #33
[Hardware name: Firefly-RK3399 Board (DT)
[task: ffff80007a1de200 task.stack: ffff80007a01c000
[PC is at mmc_init_request+0x14/0xc4
[LR is at alloc_request_size+0x4c/0x74
[pc : [<ffff0000087d7150>] lr : [<ffff000008378fe0>] pstate: 600001c5
[sp : ffff80007a01f8f0
....
[[<ffff0000087d7150>] mmc_init_request+0x14/0xc4
[[<ffff000008378fe0>] alloc_request_size+0x4c/0x74
[[<ffff00000817ac28>] mempool_create_node+0xb8/0x17c
[[<ffff00000837aadc>] blk_init_rl+0x9c/0x120
[[<ffff000008396580>] blkg_alloc+0x110/0x234
[[<ffff000008396ac8>] blkg_create+0x424/0x468
[[<ffff00000839877c>] blkg_lookup_create+0xd8/0x14c
[[<ffff0000083796bc>] generic_make_request_checks+0x368/0x3b0
[[<ffff00000837b050>] generic_make_request+0x1c/0x240
So mmc_blk_put wouldn't calling blk_cleanup_queue which actually the
QUEUE_FLAG_DYING and QUEUE_FLAG_BYPASS should stay. Block core expect
blk_queue_bypass_{start, end} internally to bypass/drain the queue before
actually dying the queue, so it didn't expose API to set the queue bypass.
I think we should set QUEUE_FLAG_BYPASS whenever queue is removed, although
the md->usage is still counted, as no dispatch queue could be found then.
Fixes: 304419d8a7 ("mmc: core: Allocate per-request data using the block layer core")
Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
When the device is non removable, the card detect signal is often used
for another purpose i.e. muxed to another SoC peripheral or used as a
GPIO. It could lead to wrong behaviors depending the default value of
this signal if not muxed to the SDHCI controller.
Fixes: bb5f8ea4d5 ("mmc: sdhci-of-at91: introduce driver for the Atmel SDMMC")
Signed-off-by: Ludovic Desroches <ludovic.desroches@microchip.com>
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Add one more model to the Chromebook DMI quirk to make it working again.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=194945
Fixes: 2a8209fa68 ("pinctrl: cherryview: Extend the Chromebook DMI quirk to Intel_Strago systems")
Reported-by: mail@abhishek.geek.nz
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
A check is performed on the ipad/opad in the safexcel_hmac_sha1_setkey
function, but the index used by the loop doing it is wrong. It is
currently the size of the state array while it should be the size of a
sha1 state. This patch fixes it.
Fixes: 1b44c5a60c ("crypto: inside-secure - add SafeXcel EIP197 crypto engine driver")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The safexcel_hmac_sha1_setkey function checks if an invalidation command
should be issued, i.e. when the context ipad/opad change. This checks is
done after filling the ipad/opad which and it can't be true. The patch
fixes this by moving the check before the ipad/opad memcpy operations.
Fixes: 1b44c5a60c ("crypto: inside-secure - add SafeXcel EIP197 crypto engine driver")
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stable fix:
- Fix EXCHANGE_ID corrupt verifier issue
Other fix:
- Fix double frees in nfs4_test_session_trunk()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlmCNJAACgkQ18tUv7Cl
QOvlKhAAjjKMtdwYDV7B3SJvDfa5KNH9nNaMjKoD/OzWavDL7jRVRUWayeJiy4sf
wJAMsG280ztag1Mr/OnClESZThWNrHTnKjq/P8kuefJz+pPvWP+AaQ/8QSM/rulM
Y7qfh+FZb2t7lK80VAZTXEi4bvQzlkIV2285a34VIUm3VTpwl3HvltkBMLFztDSb
sdaQh+N48FOYdG9RMncxpzFfRvUILzP9GFMg6WAFAcbr1wSsT9MfLqd+ANgOidE8
X2Ch9l0jN51m1dFGmMUZ5HMmuhUh2m50SXhmUOTpS7fj+k11OWsV2IIWHKPk/LjI
5E0aUPDU2TOE9noLSfkguAyxvqAGeAHPDIE7QM9vTGgktzXWGzh++ODeSzkIDp+5
iZ+cYLkme1lOg1nEqj1/SrIZXHP7ZCvK8iqNgoqT8rIp6zE1tLPnNAbRDmnwC/VH
PrfINATV8llN/3vFojWJfD1enDe3+dALPINLVQOjwjafq6d5/hzRdCGqWpCDjmlU
esDoCkoWGUjadIr6EZGtDAzSZgafsUx6d6QnGtkIUsz1d0FIXH6holB5EKaQpOLf
dVb9iO5R2EcVP+WvB3KjA5y6MCNvoVqebxMvLBPCYsu3fI8fQ0sgSjgSJXi0fRzg
G7DUsKcBGVKarDMXTUReL2G6lN7h0f5EsM1WnA9KF0TV9aah/ZE=
=Ddq9
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.13-4' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"Two fixes from Trond this time, now that he's back from his vacation.
The first is a stable fix for the EXCHANGE_ID issue on the mailing
list, and the other fixes a double-free situation that he found at the
same time.
Stable fix:
- Fix EXCHANGE_ID corrupt verifier issue
Other fix:
- Fix double frees in nfs4_test_session_trunk()"
* tag 'nfs-for-4.13-4' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFSv4: Fix double frees in nfs4_test_session_trunk()
NFSv4: Fix EXCHANGE_ID corrupt verifier issue
This fixes a potential buffer overflow in isdn_net.c caused by an
unbounded strcpy.
[ ISDN seems to be effectively unmaintained, and the I4L driver in
particular is long deprecated, but in case somebody uses this..
- Linus ]
Signed-off-by: Jiten Thakkar <jitenmt@gmail.com>
Signed-off-by: Annie Cherkaev <annie.cherk@gmail.com>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently a bug in the sci_clk_get implementation causes it to always
return a clock belonging to the last device in the static list of clock
data. This is due to a bug in the init code that causes the array
used by sci_clk_get to only be populated with the clocks for the last
device, as each device overwrites the entire array with its own clocks.
Fix this by calculating the actual number of clocks for the SoC, and
allocating the whole array in one go. Also, we don't need the handle
to the init data array anymore after doing this, instead we can
just compare the dev_id / clk_id against the registered clocks and
use binary search for speed.
Signed-off-by: Tero Kristo <t-kristo@ti.com>
Reported-by: Dave Gerlach <d-gerlach@ti.com>
Fixes: b745c0794e ("clk: keystone: Add sci-clk driver support")
Cc: Nishanth Menon <nm@ti.com>
Tested-by: Franklin Cooper <fcooper@ti.com>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.
Fix the problem by moving posix_acl_update_mode() out of ocfs2_set_acl()
into ocfs2_iop_set_acl(). That way the function will not be called when
inheriting ACLs which is what we want as it prevents SGID bit clearing
and the mode has been properly set by posix_acl_create() anyway. Also
posix_acl_chmod() that is calling ocfs2_set_acl() takes care of updating
mode itself.
Fixes: 073931017b ("posix_acl: Clear SGID bit when setting file permissions")
Link: http://lkml.kernel.org/r/20170801141252.19675-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kernel panic when calling the IRQ-safe __get_user_pages_fast in NMI
handler.
The bug was introduced by commit 2947ba054a ("x86/mm/gup: Switch GUP
to the generic get_user_page_fast() implementation").
The original x86 __get_user_page_fast used plain get_page() or
page_ref_add(). However, the generic __get_user_page_fast uses
page_cache_get_speculative(), which has VM_BUG_ON(in_interrupt()).
There is no reason to prevent page_cache_get_speculative from using in
interrupt context. According to the author, putting a BUG_ON there is
just because the code is not verifying correctness of interrupt races.
I did some tests in interrupt context. There is no issue found.
Removing VM_BUG_ON(in_interrupt()) for page_cache_get_speculative().
Link: http://lkml.kernel.org/r/1501609146-59730-1-git-send-email-kan.liang@intel.com
Fixes: 2947ba054a ("x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation")
Signed-off-by: Kan Liang <kan.liang@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ying Huang <ying.huang@intel.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There may still be threads waiting on event_wqh at the time the
userfault file descriptor is closed. Flush the events wait-queue to
prevent waiting threads from hanging.
Link: http://lkml.kernel.org/r/1501398127-30419-1-git-send-email-rppt@linux.vnet.ibm.com
Fixes: 9cd75c3cd4 ("userfaultfd: non-cooperative: add ability to report
non-PF events from uffd descriptor")
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When building with the randstruct gcc plugin, the layout of the IPC
structs will be randomized, which requires any sub-structure accesses to
use container_of(). The proc display handlers were missing the needed
container_of()s since the iterator is passing in the top-level struct
kern_ipc_perm.
This would lead to crashes when running the "lsipc" program after the
system had IPC registered (e.g. after starting up Gnome):
general protection fault: 0000 [#1] PREEMPT SMP
...
RIP: 0010:shm_add_rss_swap.isra.1+0x13/0xa0
...
Call Trace:
sysvipc_shm_proc_show+0x5e/0x150
sysvipc_proc_show+0x1a/0x30
seq_read+0x2e9/0x3f0
...
Link: http://lkml.kernel.org/r/20170730205950.GA55841@beast
Fixes: 3859a271a0 ("randstruct: Mark various structs for randomization")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In codepaths that use the begin/retry interface for reading
mems_allowed_seq with irqs disabled, there exists a race condition that
stalls the patch process after only modifying a subset of the
static_branch call sites.
This problem manifested itself as a deadlock in the slub allocator,
inside get_any_partial. The loop reads mems_allowed_seq value (via
read_mems_allowed_begin), performs the defrag operation, and then
verifies the consistency of mem_allowed via the read_mems_allowed_retry
and the cookie returned by xxx_begin.
The issue here is that both begin and retry first check if cpusets are
enabled via cpusets_enabled() static branch. This branch can be
rewritted dynamically (via cpuset_inc) if a new cpuset is created. The
x86 jump label code fully synchronizes across all CPUs for every entry
it rewrites. If it rewrites only one of the callsites (specifically the
one in read_mems_allowed_retry) and then waits for the
smp_call_function(do_sync_core) to complete while a CPU is inside the
begin/retry section with IRQs off and the mems_allowed value is changed,
we can hang.
This is because begin() will always return 0 (since it wasn't patched
yet) while retry() will test the 0 against the actual value of the seq
counter.
The fix is to use two different static keys: one for begin
(pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we
first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
always return a valid seqcount if are enabling cpusets. Similarly, when
disabling cpusets via cpuset_dec(), we first ensure that callers of
cpuset_mems_allowed_retry() will start ignoring the seqcount value
before we let cpuset_mems_allowed_begin() return 0.
The relevant stack traces of the two stuck threads:
CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4
Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
RIP: smp_call_function_many+0x1f9/0x260
Call Trace:
smp_call_function+0x3b/0x70
on_each_cpu+0x2f/0x90
text_poke_bp+0x87/0xd0
arch_jump_label_transform+0x93/0x100
__jump_label_update+0x77/0x90
jump_label_update+0xaa/0xc0
static_key_slow_inc+0x9e/0xb0
cpuset_css_online+0x70/0x2e0
online_css+0x2c/0xa0
cgroup_apply_control_enable+0x27f/0x3d0
cgroup_mkdir+0x2b7/0x420
kernfs_iop_mkdir+0x5a/0x80
vfs_mkdir+0xf6/0x1a0
SyS_mkdir+0xb7/0xe0
entry_SYSCALL_64_fastpath+0x18/0xad
...
CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4
Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
task: ffff8818087c0000 task.stack: ffffc90000030000
RIP: int3+0x39/0x70
Call Trace:
<#DB> ? ___slab_alloc+0x28b/0x5a0
<EOE> ? copy_process.part.40+0xf7/0x1de0
__slab_alloc.isra.80+0x54/0x90
copy_process.part.40+0xf7/0x1de0
copy_process.part.40+0xf7/0x1de0
kmem_cache_alloc_node+0x8a/0x280
copy_process.part.40+0xf7/0x1de0
_do_fork+0xe7/0x6c0
_raw_spin_unlock_irq+0x2d/0x60
trace_hardirqs_on_caller+0x136/0x1d0
entry_SYSCALL_64_fastpath+0x5/0xad
do_syscall_64+0x27/0x350
SyS_clone+0x19/0x20
do_syscall_64+0x60/0x350
entry_SYSCALL64_slow_path+0x25/0x25
Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
Fixes: 46e700abc4 ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
Reported-by: Cliff Spradlin <cspradlin@waymo.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the non-cooperative userfaultfd case, the process exit may race with
outstanding mcopy_atomic called by the uffd monitor. Returning -ENOSPC
instead of -EINVAL when mm is already gone will allow uffd monitor to
distinguish this case from other error conditions.
Unfortunately I overlooked userfaultfd_zeropage when updating
userfaultd_copy().
Link: http://lkml.kernel.org/r/1501136819-21857-1-git-send-email-rppt@linux.vnet.ibm.com
Fixes: 96333187ab ("userfaultfd_copy: return -ENOSPC in case mm has gone")
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andre Wild reported the following warning:
WARNING: CPU: 2 PID: 1205 at kernel/cpu.c:240 lockdep_assert_cpus_held+0x4c/0x60
Modules linked in:
CPU: 2 PID: 1205 Comm: bash Not tainted 4.13.0-rc2-00022-gfd2b2c57ec20 #10
Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
task: 00000000701d8100 task.stack: 0000000073594000
Krnl PSW : 0704f00180000000 0000000000145e24 (lockdep_assert_cpus_held+0x4c/0x60)
...
Call Trace:
lockdep_assert_cpus_held+0x42/0x60)
stop_machine_cpuslocked+0x62/0xf0
build_all_zonelists+0x92/0x150
numa_zonelist_order_handler+0x102/0x150
proc_sys_call_handler.isra.12+0xda/0x118
proc_sys_write+0x34/0x48
__vfs_write+0x3c/0x178
vfs_write+0xbc/0x1a0
SyS_write+0x66/0xc0
system_call+0xc4/0x2b0
locks held by bash/1205:
#0: (sb_writers#4){.+.+.+}, at: vfs_write+0xa6/0x1a0
#1: (zl_order_mutex){+.+...}, at: numa_zonelist_order_handler+0x44/0x150
#2: (zonelists_mutex){+.+...}, at: numa_zonelist_order_handler+0xf4/0x150
Last Breaking-Event-Address:
lockdep_assert_cpus_held+0x48/0x60
This can be easily triggered with e.g.
echo n > /proc/sys/vm/numa_zonelist_order
In commit 3f906ba236 ("mm/memory-hotplug: switch locking to a percpu
rwsem") memory hotplug locking was changed to fix a potential deadlock.
This also switched the stop_machine() invocation within
build_all_zonelists() to stop_machine_cpuslocked() which now expects
that online cpus are locked when being called.
This assumption is not true if build_all_zonelists() is being called
from numa_zonelist_order_handler().
In order to fix this simply add a mem_hotplug_begin()/mem_hotplug_done()
pair to numa_zonelist_order_handler().
Link: http://lkml.kernel.org/r/20170726111738.38768-1-heiko.carstens@de.ibm.com
Fixes: 3f906ba236 ("mm/memory-hotplug: switch locking to a percpu rwsem")
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reported-by: Andre Wild <wild@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kerneldoc comment for kthread_create() had an incorrect argument
name, leading to a warning in the docs build.
Correct it, and make one more small step toward a warning-free build.
Link: http://lkml.kernel.org/r/20170724135916.7f486c6f@lwn.net
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gcc-7 produces this warning:
mm/kasan/report.c: In function 'kasan_report':
mm/kasan/report.c:351:3: error: 'info.first_bad_addr' may be used uninitialized in this function [-Werror=maybe-uninitialized]
print_shadow_for_address(info->first_bad_addr);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/kasan/report.c:360:27: note: 'info.first_bad_addr' was declared here
The code seems fine as we only print info.first_bad_addr when there is a
shadow, and we always initialize it in that case, but this is relatively
hard for gcc to figure out after the latest rework.
Adding an intialization to the most likely value together with the other
struct members shuts up that warning.
Fixes: b235b9808664 ("kasan: unify report headers")
Link: https://patchwork.kernel.org/patch/9641417/
Link: http://lkml.kernel.org/r/20170725152739.4176967-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Alexander Potapenko <glider@google.com>
Suggested-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When mremap is called with MREMAP_FIXED it unmaps memory at the
destination address without notifying userfaultfd monitor.
If the destination were registered with userfaultfd, the monitor has no
way to distinguish between the old and new ranges and to properly relate
the page faults that would occur in the destination region.
Fixes: 897ab3e0c4 ("userfaultfd: non-cooperative: add event for memory unmaps")
Link: http://lkml.kernel.org/r/1500276876-3350-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nadav Amit identified a theoritical race between page reclaim and
mprotect due to TLB flushes being batched outside of the PTL being held.
He described the race as follows:
CPU0 CPU1
---- ----
user accesses memory using RW PTE
[PTE now cached in TLB]
try_to_unmap_one()
==> ptep_get_and_clear()
==> set_tlb_ubc_flush_pending()
mprotect(addr, PROT_READ)
==> change_pte_range()
==> [ PTE non-present - no flush ]
user writes using cached RW PTE
...
try_to_unmap_flush()
The same type of race exists for reads when protecting for PROT_NONE and
also exists for operations that can leave an old TLB entry behind such
as munmap, mremap and madvise.
For some operations like mprotect, it's not necessarily a data integrity
issue but it is a correctness issue as there is a window where an
mprotect that limits access still allows access. For munmap, it's
potentially a data integrity issue although the race is massive as an
munmap, mmap and return to userspace must all complete between the
window when reclaim drops the PTL and flushes the TLB. However, it's
theoritically possible so handle this issue by flushing the mm if
reclaim is potentially currently batching TLB flushes.
Other instances where a flush is required for a present pte should be ok
as either the page lock is held preventing parallel reclaim or a page
reference count is elevated preventing a parallel free leading to
corruption. In the case of page_mkclean there isn't an obvious path
that userspace could take advantage of without using the operations that
are guarded by this patch. Other users such as gup as a race with
reclaim looks just at PTEs. huge page variants should be ok as they
don't race with reclaim. mincore only looks at PTEs. userfault also
should be ok as if a parallel reclaim takes place, it will either fault
the page back in or read some of the data before the flush occurs
triggering a fault.
Note that a variant of this patch was acked by Andy Lutomirski but this
was for the x86 parts on top of his PCID work which didn't make the 4.13
merge window as expected. His ack is dropped from this version and
there will be a follow-on patch on top of PCID that will include his
ack.
[akpm@linux-foundation.org: tweak comments]
[akpm@linux-foundation.org: fix spello]
Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: <stable@vger.kernel.org> [v4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit 3d375d7859 ("mm: update callers to use HASH_ZERO flag"),
drop unused pidhash_size in pidhash_init().
Link: http://lkml.kernel.org/r/1500389267-49222-1-git-send-email-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Pavel Tatashin <Pasha.Tatashin@Oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 9a291a7c94 ("mm/hugetlb: report -EHWPOISON not -EFAULT when
FOLL_HWPOISON is specified") causes __get_user_pages to ignore certain
errors from follow_hugetlb_page. After such error, __get_user_pages
subsequently calls faultin_page on the same VMA and start address that
follow_hugetlb_page failed on instead of returning the error immediately
as it should.
In follow_hugetlb_page, when hugetlb_fault returns a value covered under
VM_FAULT_ERROR, follow_hugetlb_page returns it without setting nr_pages
to 0 as __get_user_pages expects in this case, which causes the
following to happen in __get_user_pages: the "while (nr_pages)" check
succeeds, we skip the "if (!vma..." check because we got a VMA the last
time around, we find no page with follow_page_mask, and we call
faultin_page, which calls hugetlb_fault for the second time.
This issue also slightly changes how __get_user_pages works. Before, it
only returned error if it had made no progress (i = 0). But now,
follow_hugetlb_page can clobber "i" with an error code since its new
return path doesn't check for progress. So if "i" is nonzero before a
failing call to follow_hugetlb_page, that indication of progress is lost
and __get_user_pages can return error even if some pages were
successfully pinned.
To fix this, change follow_hugetlb_page so that it updates nr_pages,
allowing __get_user_pages to fail immediately and restoring the "error
only if no progress" behavior to __get_user_pages.
Tested that __get_user_pages returns when expected on error from
hugetlb_fault in follow_hugetlb_page.
Fixes: 9a291a7c94 ("mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified")
Link: http://lkml.kernel.org/r/1500406795-58462-1-git-send-email-daniel.m.jordan@oracle.com
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Punit Agrawal <punit.agrawal@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: James Morse <james.morse@arm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: zhong jiang <zhongjiang@huawei.com>
Cc: <stable@vger.kernel.org> [4.12.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The host physical addresses of L1's Virtual APIC Page and Posted
Interrupt descriptor are loaded into the VMCS02. The CPU may write
to these pages via their host physical address while L2 is running,
bypassing address-translation-based dirty tracking (e.g. EPT write
protection). Mark them dirty on every exit from L2 to prevent them
from getting out of sync with dirty tracking.
Also mark the virtual APIC page and the posted interrupt descriptor
dirty when KVM is virtualizing posted interrupt processing.
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
According to the Intel SDM, software cannot rely on the current VMCS to be
coherent after a VMXOFF or shutdown. So this is a valid way to handle VMCS12
flushes.
24.11.1 Software Use of Virtual-Machine Control Structures
...
If a logical processor leaves VMX operation, any VMCSs active on
that logical processor may be corrupted (see below). To prevent
such corruption of a VMCS that may be used either after a return
to VMX operation or on another logical processor, software should
execute VMCLEAR for that VMCS before executing the VMXOFF instruction
or removing power from the processor (e.g., as part of a transition
to the S3 and S4 power states).
...
This fixes a "suspicious rcu_dereference_check() usage!" warning during
kvm_vm_release() because nested_release_vmcs12() calls
kvm_vcpu_write_guest_page() without holding kvm->srcu.
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Since the current implementation of VMCS12 does a memcpy in and out
of guest memory, we do not need current_vmcs12 and current_vmcs12_page
anymore. current_vmptr is enough to read and write the VMCS12.
And David Matlack noted:
This patch also fixes dirty tracking (memslot->dirty_bitmap) of the
VMCS12 page by using kvm_write_guest. nested_release_page() only marks
the struct page dirty.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[Added David Matlack's note and nested_release_page_clean() fix.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>