Commit Graph

650103 Commits

Author SHA1 Message Date
Wei Wang
25776aa943 net: Remove __sk_dst_reset() in tcp_v6_connect()
Remove __sk_dst_reset() in the failure handling because __sk_dst_reset()
will eventually get called when sk is released. No need to handle it in
the protocol specific connect call.
This is also to make the code path consistent with ipv4.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 14:04:38 -05:00
Wei Wang
065263f40f net/tcp-fastopen: refactor cookie check logic
Refactor the cookie check logic in tcp_send_syn_data() into a function.
This function will be called else where in later changes.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 14:04:38 -05:00
David S. Miller
c0d9665f08 Merge branch 'bnxt_en-rtnl-fixes'
Michael Chan says:

====================
bnxt_en: Fix RTNL lock usage in bnxt_sp_task().

There are 2 function calls from bnxt_sp_task() that have buggy RTNL
usage.  These 2 functions take RTNL lock under some conditions, but
some callers (such as open, ethtool) have already taken RTNL.  These
3 patches fix the issue by making it clear that callers must take
RTNL.  If the caller is bnxt_sp_task() which does not automatically
take RTNL, we add a common scheme for bnxt_sp_task() to call these
functions properly under RTNL.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:27:14 -05:00
Michael Chan
90c694bb71 bnxt_en: Fix RTNL lock usage on bnxt_get_port_module_status().
bnxt_get_port_module_status() calls bnxt_update_link() which expects
RTNL to be held.  In bnxt_sp_task() that does not hold RTNL, we need to
call it with a prior call to bnxt_rtnl_lock_sp() and the call needs to
be moved to the end of bnxt_sp_task().

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:27:13 -05:00
Michael Chan
0eaa24b971 bnxt_en: Fix RTNL lock usage on bnxt_update_link().
bnxt_update_link() is called from multiple code paths.  Most callers,
such as open, ethtool, already hold RTNL.  Only the caller bnxt_sp_task()
does not.  So it is a bug to take RTNL inside bnxt_update_link().

Fix it by removing the RTNL inside bnxt_update_link().  The function
now expects the caller to always hold RTNL.

In bnxt_sp_task(), call bnxt_rtnl_lock_sp() before calling
bnxt_update_link().  We also need to move the call to the end of
bnxt_sp_task() since it will be clearing the BNXT_STATE_IN_SP_TASK bit.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:27:13 -05:00
Michael Chan
a551ee94ea bnxt_en: Fix bnxt_reset() in the slow path task.
In bnxt_sp_task(), we set a bit BNXT_STATE_IN_SP_TASK so that bnxt_close()
will synchronize and wait for bnxt_sp_task() to finish.  Some functions
in bnxt_sp_task() require us to clear BNXT_STATE_IN_SP_TASK and then
acquire rtnl_lock() to prevent race conditions.

There are some bugs related to this logic. This patch refactors the code
to have common bnxt_rtnl_lock_sp() and bnxt_rtnl_unlock_sp() to handle
the RTNL and the clearing/setting of the bit.  Multiple functions will
need the same logic.  We also need to move bnxt_reset() to the end of
bnxt_sp_task().  Functions that clear BNXT_STATE_IN_SP_TASK must be the
last functions to be called in bnxt_sp_task().  The common scheme will
handle the condition properly.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:27:12 -05:00
Linus Torvalds
49e555a932 virtio, vhost: fixes
ARM DMA fixes
 vhost vsock bugfix
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJYh9ZlAAoJECgfDbjSjVRptqsIAJ03KvIYEUH2+kWr21UWWoFk
 wYGGtf2Voexnj6SVcY6JMTG4J0PF5y1K9XL1/Z0kXFQWy+PCqp4bdE7PUpxs8PlB
 ShapIoMTU7dfSL2cFfrYb3Nx4hFiube3HXSYz5LhGUBwWT7v5j9fRodoNZjyi3rs
 faNKR9psNHFpkSgRTJKVnj53/dU+HcSP1S+x9qpkS2bYHvLA2vQo/FaKg/M8i9jt
 JzCcX/RbOij9DlHgzT64cFTnIZVekauUsQAJcs8e/SHhwm7CLAlrVIrjcG4H4mvg
 XIKcL1YJKVrJzjnDcezSkfQpc+oPn2t4Qk+VOcnqbHPsg23Hr5Rj8krixCl6XmQ=
 =Yeqz
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio/vhost fixes from Michael Tsirkin:

 - ARM DMA fixes

 - vhost vsock bugfix

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  vring: Force use of DMA API for ARM-based systems with legacy devices
  virtio_mmio: Set DMA masks appropriately
  vhost/vsock: handle vhost_vq_init_access() error
2017-01-25 10:25:36 -08:00
hayeswang
a9c54ad2c7 r8152: fix the wrong spelling
Replace rumtime with runtime.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:24:34 -05:00
Jason Baron
56d806222a tcp: correct memory barrier usage in tcp_check_space()
sock_reset_flag() maps to __clear_bit() not the atomic version clear_bit().
Thus, we need smp_mb(), smp_mb__after_atomic() is not sufficient.

Fixes: 3c7151275c ("tcp: add memory barriers to write space paths")
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:23:36 -05:00
Andrew Lunn
d234559952 Doc: DT: bindings: net: dsa: marvell.txt: Tabification
Replace spaces with tabs. Fix indentation to be multiples of tabs, not
a mixture or tabs and spaces.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:20:00 -05:00
David S. Miller
cca316f356 Merge branch 'bpf-tracepoints'
Daniel Borkmann says:

====================
BPF tracepoints

This set adds tracepoints to BPF for better introspection and
debugging. The first two patches are prerequisite for the actual
third patch that adds the tracepoints. I think the first two are
small and straight forward enough that they could ideally go via
net-next, but I'm also open to other suggestions on how to route
them in case that's not applicable (it would reduce potential
merge conflicts on BPF side, though). For details, please see
individual patches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:17:48 -05:00
Daniel Borkmann
a67edbf4fb bpf: add initial bpf tracepoints
This work adds a number of tracepoints to paths that are either
considered slow-path or exception-like states, where monitoring or
inspecting them would be desirable.

For bpf(2) syscall, tracepoints have been placed for main commands
when they succeed. In XDP case, tracepoint is for exceptions, that
is, f.e. on abnormal BPF program exit such as unknown or XDP_ABORTED
return code, or when error occurs during XDP_TX action and the packet
could not be forwarded.

Both have been split into separate event headers, and can be further
extended. Worst case, if they unexpectedly should get into our way in
future, they can also removed [1]. Of course, these tracepoints (like
any other) can be analyzed by eBPF itself, etc. Example output:

  # ./perf record -a -e bpf:* sleep 10
  # ./perf script
  sock_example  6197 [005]   283.980322:      bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
  sock_example  6197 [005]   283.980721:       bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
  sock_example  6197 [005]   283.988423:   bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
  sock_example  6197 [005]   283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
  [...]
  sock_example  6197 [005]   288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
       swapper     0 [005]   289.338243:    bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER

  [1] https://lwn.net/Articles/705270/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:17:47 -05:00
Daniel Borkmann
0fe0559179 lib, traceevent: add PRINT_HEX_STR variant
Add support for the __print_hex_str() macro that was added for
tracing, so that user space tools such as perf can understand
it as well.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:17:47 -05:00
Daniel Borkmann
2acae0d5b0 trace: add variant without spacing in trace_print_hex_seq
For upcoming tracepoint support for BPF, we want to dump the program's
tag. Format should be similar to __print_hex(), but without spacing.
Add a __print_hex_str() variant for exactly that purpose that reuses
trace_print_hex_seq().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:17:47 -05:00
Eric Dumazet
60b1af3300 tcp: reduce skb overhead in selected places
tcp_add_backlog() can use skb_condense() helper to get better
gains and less SKB_TRUESIZE() magic. This only happens when socket
backlog has to be used.

Some attacks involve specially crafted out of order tiny TCP packets,
clogging the ofo queue of (many) sockets.
Then later, expensive collapse happens, trying to copy all these skbs
into single ones.
This unfortunately does not work if each skb has no neighbor in TCP
sequence order.

By using skb_condense() if the skb could not be coalesced to a prior
one, we defeat these kind of threats, potentially saving 4K per skb
(or more, since this is one page fragment).

A typical NAPI driver allocates gro packets with GRO_MAX_HEAD bytes
in skb->head, meaning the copy done by skb_condense() is limited to
about 200 bytes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:13:31 -05:00
David S. Miller
716dcaebed mlx5-updates-2017-24-01
The first seven patches from Or Gerlitz in this series further enhances
 the mlx5 SRIOV switchdev mode to support offloading IPv6 tunnels using the
 TC tunnel key set (encap) and unset (decap) actions.
 
 Or Gerlitz says:
 ========================
 As part of doing this change, few cleanups are done in the IPv4 code,
 later we move to use the full tunnel key info provided to the driver as
 the key for our internal hashing which is used to identify cases where
 the same tunnel is used for encapsulating multiple flows. As done in the
 IPv4 case, the control path for offloading IPv6 tunnels uses route/neigh
 lookups and construction of the IPv6 tunnel headers on the encap path and
 matching on the outer hears in the decap path.
 
 The last patch of the series enlarges the HW FDB size for the switchdev mode,
 so it has now room to contain offloaded flows as many as min(max number
 of HW flow counters supported, max HW table size supported).
 ========================
 
 Next to Or's series you can find several patches handling several topics.
 
 From Mohamad, add support for SRIOV VF min rate guarantee by using the
 TSAR BW share weights mechanism.
 
 From Or, Two patches to enable Eth VFs to query their min-inline value for
 user-space.
 for that we move a mlx5 low level min inline helper function from mlx5
 ethernet driver into the core driver and then use it in mlx5_ib to expose
 the inline mode to rdma applications through libmlx5.
 
 From Kamal Heib, Reduce memory consumption on kdump kernel.
 
 From Shaker Daibes, code reuse in CQE compression control logic
 
 Thanks,
 Saeed.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJYh7FNAAoJEEg/ir3gV/o+TjsIAL1e92+5eutBS9ZvhMARi+Tc
 c2V9V8bG8W1RWWTvx1G0aU4nNjWsr5L8Q8gzqpwhrQITBfgpWd+hlnxQCucyhxC3
 AC1qQ+AKREe/C+25D+WJRq34/61ZHEH2rbKZvpZ1O8SuicVPbcvJ9eM+wOEDxwwX
 u5C5kWQ0HRtCcnFiiOYkB+0CQPH7m3+ZzZek+jDowrexHMSE+yl8ZNtaSTX9c9QN
 bE2cPiCVZd7ufKPIwY8LWHBryyl7sh5P+NqzD633OeiqP/pkZsW9A+czyt+d330f
 6XTKOS1PCD+TfHE0sZJT4VMCjICMHrOFbNRZuwcxJQ6NfmwIJZfskX4NLbyGQTI=
 =vF7U
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2017-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2017-24-01

The first seven patches from Or Gerlitz in this series further enhances
the mlx5 SRIOV switchdev mode to support offloading IPv6 tunnels using the
TC tunnel key set (encap) and unset (decap) actions.

Or Gerlitz says:
========================
As part of doing this change, few cleanups are done in the IPv4 code,
later we move to use the full tunnel key info provided to the driver as
the key for our internal hashing which is used to identify cases where
the same tunnel is used for encapsulating multiple flows. As done in the
IPv4 case, the control path for offloading IPv6 tunnels uses route/neigh
lookups and construction of the IPv6 tunnel headers on the encap path and
matching on the outer hears in the decap path.

The last patch of the series enlarges the HW FDB size for the switchdev mode,
so it has now room to contain offloaded flows as many as min(max number
of HW flow counters supported, max HW table size supported).
========================

Next to Or's series you can find several patches handling several topics.

From Mohamad, add support for SRIOV VF min rate guarantee by using the
TSAR BW share weights mechanism.

From Or, Two patches to enable Eth VFs to query their min-inline value for
user-space.
for that we move a mlx5 low level min inline helper function from mlx5
ethernet driver into the core driver and then use it in mlx5_ib to expose
the inline mode to rdma applications through libmlx5.

From Kamal Heib, Reduce memory consumption on kdump kernel.

From Shaker Daibes, code reuse in CQE compression control logic
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 12:49:58 -05:00
Dan Carpenter
a08ef4768f tipc: uninitialized return code in tipc_setsockopt()
We shuffled some code around and added some new case statements here and
now "res" isn't initialized on all paths.

Fixes: 01fd12bb18 ("tipc: make replicast a user selectable option")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 12:41:34 -05:00
Jamal Hadi Salim
1045ba77a5 net sched actions: Add support for user cookies
Introduce optional 128-bit action cookie.
Like all other cookie schemes in the networking world (eg in protocols
like http or existing kernel fib protocol field, etc) the idea is to save
user state that when retrieved serves as a correlator. The kernel
_should not_ intepret it.  The user can store whatever they wish in the
128 bits.

Sample exercise(showing variable length use of cookie)

.. create an accept action with cookie a1b2c3d4
sudo $TC actions add action ok index 1 cookie a1b2c3d4

.. dump all gact actions..
sudo $TC -s actions ls action gact

    action order 0: gact action pass
     random type none pass val 0
     index 1 ref 1 bind 0 installed 5 sec used 5 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0
    cookie a1b2c3d4

.. bind the accept action to a filter..
sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 1

... send some traffic..
$ ping 127.0.0.1 -c 3
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.020 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.027 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.038 ms

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 12:37:04 -05:00
Xin Long
5207f39963 sctp: sctp gso should set feature with NETIF_F_SG when calling skb_segment
Now sctp gso puts segments into skb's frag_list, then processes these
segments in skb_segment. But skb_segment handles them only when gs is
enabled, as it's in the same branch with skb's frags.

Although almost all the NICs support sg other than some old ones, but
since commit 1e16aa3ddf ("net: gso: use feature flag argument in all
protocol gso handlers"), features &= skb->dev->hw_enc_features, and
xfrm_output_gso call skb_segment with features = 0, which means sctp
gso would call skb_segment with sg = 0, and skb_segment would not work
as expected.

This patch is to fix it by setting features param with NETIF_F_SG when
calling skb_segment so that it can go the right branch to process the
skb's frag_list.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 12:28:33 -05:00
Xin Long
6f29a13061 sctp: sctp_addr_id2transport should verify the addr before looking up assoc
sctp_addr_id2transport is a function for sockopt to look up assoc by
address. As the address is from userspace, it can be a v4-mapped v6
address. But in sctp protocol stack, it always handles a v4-mapped
v6 address as a v4 address. So it's necessary to convert it to a v4
address before looking up assoc by address.

This patch is to fix it by calling sctp_verify_addr in which it can do
this conversion before calling sctp_endpoint_lookup_assoc, just like
what sctp_sendmsg and __sctp_connect do for the address from users.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 12:26:55 -05:00
Christoph Hellwig
493611ebd6 xfs: extsize hints are not unlikely in xfs_bmap_btalloc
With COW files they are the hotpath, just like for files with the
extent size hint attribute.  We really shouldn't micro-manage anything
but failure cases with unlikely.

Additionally Arnd Bergmann recently reported that one of these two
unlikely annotations causes link failures together with an upcoming
kernel instrumentation patch, so let's get rid of it ASAP.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-25 08:59:43 -08:00
Brian Foster
5a93790d4e xfs: remove racy hasattr check from attr ops
xfs_attr_[get|remove]() have unlocked attribute fork checks to optimize
away a lock cycle in cases where the fork does not exist or is otherwise
empty. This check is not safe, however, because an attribute fork short
form to extent format conversion includes a transient state that causes
the xfs_inode_hasattr() check to fail. Specifically,
xfs_attr_shortform_to_leaf() creates an empty extent format attribute
fork and then adds the existing shortform attributes to it.

This means that lookup of an existing xattr can spuriously return
-ENOATTR when racing against a setxattr that causes the associated
format conversion. This was originally reproduced by an untar on a
particularly configured glusterfs volume, but can also be reproduced on
demand with properly crafted xattr requests.

The format conversion occurs under the exclusive ilock. xfs_attr_get()
and xfs_attr_remove() already have the proper locking and checks further
down in the functions to handle this situation correctly. Drop the
unlocked checks to avoid the spurious failure and rely on the existing
logic.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-25 07:53:43 -08:00
Christoph Hellwig
76d771b4cb xfs: use per-AG reservations for the finobt
Currently we try to rely on the global reserved block pool for block
allocations for the free inode btree, but I have customer reports
(fairly complex workload, need to find an easier reproducer) where that
is not enough as the AG where we free an inode that requires a new
finobt block is entirely full.  This causes us to cancel a dirty
transaction and thus a file system shutdown.

I think the right way to guard against this is to treat the finot the same
way as the refcount btree and have a per-AG reservations for the possible
worst case size of it, and the patch below implements that.

Note that this could increase mount times with large finobt trees.  In
an ideal world we would have added a field for the number of finobt
fields to the AGI, similar to what we did for the refcount blocks.
We should do add it next time we rev the AGI or AGF format by adding
new fields.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-25 07:49:35 -08:00
Christoph Hellwig
4dfa2b8411 xfs: only update mount/resv fields on success in __xfs_ag_resv_init
Try to reserve the blocks first and only then update the fields in
or hanging off the mount structure.  This way we can call __xfs_ag_resv_init
again after a previous failure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-25 07:49:34 -08:00
Hans de Goede
fd25ea2909 Revert "ACPI / video: Add force_native quirk for HP Pavilion dv6"
Revert commit 6276e53fa8 (ACPI / video: Add force_native quirk for
HP Pavilion dv6).

In the commit message for the quirk this revert removes I wrote:

"Note that there are quite a few HP Pavilion dv6 variants, some
woth ATI and some with NVIDIA hybrid gfx, both seem to need this
quirk to have working backlight control. There are also some versions
with only Intel integrated gfx, these may not need this quirk, but it
should not hurt there."

Unfortunately that seems wrong, I've already received 2 reports of
this commit causing regressions on some dv6 variants (at least one
of which actually has a nvidia GPU). So it seems that HP has made a
mess here by using the same model-name both in marketing and in the
DMI data for many different variants. Some of which need
acpi_backlight=native for functional backlight control (as the
quirk this commit reverts was doing), where as others are broken by
it. So lets get back to the old sitation so as to avoid regressing
on models which used to work without any kernel cmdline arguments
before.

Fixes: 6276e53fa8 (ACPI / video: Add force_native quirk for HP Pavilion dv6)
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-01-25 14:04:13 +01:00
Jani Nikula
45d9f43911 Merge tag 'gvt-fixes-2017-01-25' of https://github.com/01org/gvt-linux into drm-intel-fixes
gvt-fixes-2017-01-25

- re-enable shadow batch buffer for security that was falsely turned off.
- kvmgt/mdev typo fix for correct ABI
- gvt mail list change

Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2017-01-25 13:05:39 +02:00
Daniele Ceraolo Spurio
2f5db26c2e drm/i915: reinstate call to trace_i915_vma_bind
The call went away in:

commit 3b16525cc4
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Aug 4 16:32:25 2016 +0100

    drm/i915: Split insertion/binding of an object into the VM

It is useful to have this trace as it pairs nicely with the vma_unbind
one to track vma activity.
Added inside the i915_vma_bind function (was outside before) to keep a
similar placement as trace_i915_vma_unbind.

v2: print bind_flags instead of flags (Chris)

Fixes: 3b16525cc4 ("drm/i915: Split insertion/binding of an object into the VM")
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1484949083-11430-1-git-send-email-daniele.ceraolospurio@intel.com
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
(cherry picked from commit 6146e6da5c)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2017-01-25 12:57:22 +02:00
Chris Wilson
6f0f02dc56 drm/i915: Move atomic state free from out of fence release
Fences are required to support being released from under an atomic context.
The drm_atomic_state struct may take a mutex when being released and so
we cannot drop a reference to the drm_atomic_state from the fence release
path directly, and so we need to defer that unreference to a worker.

[  326.576697] WARNING: CPU: 2 PID: 366 at kernel/sched/core.c:7737 __might_sleep+0x5d/0x80
[  326.576816] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffffc0359549>] intel_breadcrumbs_signaler+0x59/0x270 [i915]
[  326.576818] Modules linked in: rfcomm fuse snd_hda_codec_hdmi bnep snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer input_leds led_class snd punit_atom_debug btusb btrtl btbcm btintel intel_rapl bluetooth i915 drm_kms_helper syscopyarea sysfillrect iwlwifi sysimgblt soundcore fb_sys_fops mei_txe cfg80211 drm pwm_lpss_platform pwm_lpss pinctrl_cherryview fjes acpi_pad parport_pc ppdev parport autofs4
[  326.576899] CPU: 2 PID: 366 Comm: i915/signal:0 Tainted: G     U          4.10.0-rc3-patser+ #5030
[  326.576902] Hardware name:                  /NUC5PPYB, BIOS PYBSWCEL.86A.0031.2015.0601.1712 06/01/2015
[  326.576905] Call Trace:
[  326.576920]  dump_stack+0x4d/0x6d
[  326.576926]  __warn+0xc0/0xe0
[  326.576931]  warn_slowpath_fmt+0x5a/0x80
[  326.577004]  ? intel_breadcrumbs_signaler+0x59/0x270 [i915]
[  326.577075]  ? intel_breadcrumbs_signaler+0x59/0x270 [i915]
[  326.577079]  __might_sleep+0x5d/0x80
[  326.577087]  mutex_lock+0x1b/0x40
[  326.577133]  drm_property_free_blob+0x1e/0x80 [drm]
[  326.577167]  ? drm_property_destroy+0xe0/0xe0 [drm]
[  326.577200]  drm_mode_object_unreference+0x5c/0x70 [drm]
[  326.577233]  drm_property_unreference_blob+0xe/0x10 [drm]
[  326.577260]  __drm_atomic_helper_crtc_destroy_state+0x14/0x40 [drm_kms_helper]
[  326.577278]  drm_atomic_helper_crtc_destroy_state+0x10/0x20 [drm_kms_helper]
[  326.577352]  intel_crtc_destroy_state+0x9/0x10 [i915]
[  326.577388]  drm_atomic_state_default_clear+0xea/0x1d0 [drm]
[  326.577462]  intel_atomic_state_clear+0xd/0x20 [i915]
[  326.577497]  drm_atomic_state_clear+0x1a/0x30 [drm]
[  326.577532]  __drm_atomic_state_free+0x13/0x60 [drm]
[  326.577607]  intel_atomic_commit_ready+0x6f/0x78 [i915]
[  326.577670]  i915_sw_fence_release+0x3a/0x50 [i915]
[  326.577733]  dma_i915_sw_fence_wake+0x39/0x80 [i915]
[  326.577741]  dma_fence_signal+0xda/0x120
[  326.577812]  ? intel_breadcrumbs_signaler+0x59/0x270 [i915]
[  326.577884]  intel_breadcrumbs_signaler+0xb1/0x270 [i915]
[  326.577889]  kthread+0x127/0x130
[  326.577961]  ? intel_engine_remove_wait+0x1a0/0x1a0 [i915]
[  326.577964]  ? kthread_stop+0x120/0x120
[  326.577970]  ret_from_fork+0x22/0x30

Fixes: c004a90b72 ("drm/i915: Restore nonblocking awaits for modesetting")
Reported-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: http://patchwork.freedesktop.org/patch/msgid/20170123212939.30345-1-chris@chris-wilson.co.uk
Cc: <drm-intel-fixes@lists.freedesktop.org> # v4.10-rc1+
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
(cherry picked from commit eb955eee27)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2017-01-25 12:42:58 +02:00
Ander Conselvan de Oliveira
6d1d427a4e drm/i915: Check for NULL atomic state in intel_crtc_disable_noatomic()
In intel_crtc_disable_noatomic(), bail on a failure to allocate an
atomic state to avoid a NULL pointer dereference.

Fixes: 4a80655827 ("drm/i915: Pass atomic state to crtc enable/disable functions")
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: intel-gfx@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v4.9+
Signed-off-by: Ander Conselvan de Oliveira <ander.conselvan.de.oliveira@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1484922525-6131-4-git-send-email-ander.conselvan.de.oliveira@intel.com
(cherry picked from commit 31bb2ef97e)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2017-01-25 12:42:50 +02:00
Ander Conselvan de Oliveira
3781bd6e7d drm/i915: Fix calculation of rotated x and y offsets for planar formats
Parameters tile_size, tile_width and tile_height were passed in the
wrong order to _intel_adjust_tile_offset() when calculating the rotated
offsets.

This doesn't fix any user visible bug, since for packed formats new
and old offset are the same and the rotated offsets are within a tile
before they are fed to _intel_adjust_tile_offset(). In that case, the
offsets are unchanged. That is not true for planar formats, but those
are currently not supported.

Fixes: 66a2d927cb ("drm/i915: Make intel_adjust_tile_offset() work for linear buffers")
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Sivakumar Thulasimani <sivakumar.thulasimani@intel.com>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: intel-gfx@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v4.9+
Signed-off-by: Ander Conselvan de Oliveira <ander.conselvan.de.oliveira@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1484922525-6131-3-git-send-email-ander.conselvan.de.oliveira@intel.com
(cherry picked from commit 46a1bd2895)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2017-01-25 12:42:41 +02:00
Ander Conselvan de Oliveira
21d6e0bde5 drm/i915: Don't init hpd polling for vlv and chv from runtime_suspend()
An error in the condition for avoiding the call to intel_hpd_poll_init()
for valleyview and cherryview from intel_runtime_suspend() caused it to
be called unconditionally. Fix it.

Fixes: 19625e85c6 ("drm/i915: Enable polling when we don't have hpd")
Cc: stable@vger.kernel.org
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Lyude <cpaul@redhat.com>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: intel-gfx@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v4.9+
Signed-off-by: Ander Conselvan de Oliveira <ander.conselvan.de.oliveira@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1484922525-6131-2-git-send-email-ander.conselvan.de.oliveira@intel.com
(cherry picked from commit 04313b00b7)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2017-01-25 12:42:36 +02:00
Ander Conselvan de Oliveira
c34f078675 drm/i915: Don't leak edid in intel_crt_detect_ddc()
In the path where intel_crt_detect_ddc() detects a CRT, if would return
true without freeing the edid.

Fixes: a2bd1f541f ("drm/i915: check whether we actually received an edid in detect_ddc")
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: intel-gfx@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v3.6+
Signed-off-by: Ander Conselvan de Oliveira <ander.conselvan.de.oliveira@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1484922525-6131-1-git-send-email-ander.conselvan.de.oliveira@intel.com
(cherry picked from commit c96b63a6a7)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2017-01-25 12:42:30 +02:00
Chris Wilson
a38a7bd176 drm/i915: Release temporary load-detect state upon switching
After we call drm_atomic_commit() on the load-detect state, we can free
our local reference. Upon restore, we only apply and free the previous state.

Fixes: 0853695c3b ("drm: Add reference counting to drm_atomic_state")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: <drm-intel-fixes@lists.freedesktop.org> # v4.10-rc1+
Link: http://patchwork.freedesktop.org/patch/msgid/20170119113749.2517-1-chris@chris-wilson.co.uk
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
(cherry picked from commit 7abbd11f34)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2017-01-25 12:42:21 +02:00
Clint Taylor
27892bbdc9 drm/i915: prevent crash with .disable_display parameter
The .disable_display parameter was causing a fatal crash when fbdev
was dereferenced during driver init.

V1: protection in i915_drv.c
V2: Moved protection to intel_fbdev.c

Fixes: 43cee31434 ("drm/i915/fbdev: Limit the global async-domain synchronization")
Testcase: igt/drv_module_reload/basic-no-display
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Clint Taylor <clinton.a.taylor@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1484775523-29428-1-git-send-email-clinton.a.taylor@intel.com
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: <stable@vger.kernel.org> # v4.8+
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
(cherry picked from commit 5b8cd0755f)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2017-01-25 12:42:16 +02:00
Chris Wilson
b78671591a drm/i915: Avoid drm_atomic_state_put(NULL) in intel_display_resume
intel_display_resume() may be called without an atomic state to restore,
i.e. dev_priv->modeset_reset_restore state is NULL. One such case is
following a lid open/close event and the forced modeset in
intel_lid_notify().

Reported-by: Stefan Seyfried <stefan.seyfried@googlemail.com>
Tested-by: Stefan Seyfried <stefan.seyfried@googlemail.com>
Fixes: 0853695c3b ("drm: Add reference counting to drm_atomic_state")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: <drm-intel-fixes@lists.freedesktop.org> # v4.10-rc1+
Link: http://patchwork.freedesktop.org/patch/msgid/20170115125825.18597-1-chris@chris-wilson.co.uk
Reviewed-by: Ander Conselvan de Oliveira <conselvan2@gmail.com>
(cherry picked from commit 3c5e37f169)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2017-01-25 12:41:00 +02:00
Zhenyu Wang
ba7addcd80 MAINTAINERS: update new mail list for intel gvt driver
We've moved to lists.freedesktop.org from lists.01.org.
Update info in MAINTAINERS.

Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
2017-01-25 10:30:02 +08:00
Alex Williamson
7283accfae drm/i915/gvt: Fix kmem_cache_create() name
According to kmem_cache_sanity_check(), spaces are not allowed in the
name of a cache and results in a kernel oops with CONFIG_DEBUG_VM.
Convert to underscores.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
2017-01-25 10:28:34 +08:00
Alex Williamson
bdbfd5196d drm/i915/gvt/kvmgt: mdev ABI is available_instances, not available_instance
Per the ABI specification[1], each mdev_supported_types entry should
have an available_instances, with an "s", not available_instance.

[1] Documentation/ABI/testing/sysfs-bus-vfio-mdev

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
2017-01-25 10:28:34 +08:00
Fabio Estevam
3feb479cea Revert "thermal: thermal_hwmon: Convert to hwmon_device_register_with_info()"
This reverts commit 7611fb6806 ("thermal: thermal_hwmon: Convert to
hwmon_device_register_with_info()").

Pavel Machek reported breakage in the Nokia N900 due to this commit.

We can revisit a proper fix for the warning later.

Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
2017-01-25 09:51:08 +08:00
Linus Torvalds
883af14e67 Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "26 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits)
  MAINTAINERS: add Dan Streetman to zbud maintainers
  MAINTAINERS: add Dan Streetman to zswap maintainers
  mm: do not export ioremap_page_range symbol for external module
  mn10300: fix build error of missing fpu_save()
  romfs: use different way to generate fsid for BLOCK or MTD
  frv: add missing atomic64 operations
  mm, page_alloc: fix premature OOM when racing with cpuset mems update
  mm, page_alloc: move cpuset seqcount checking to slowpath
  mm, page_alloc: fix fast-path race with cpuset update or removal
  mm, page_alloc: fix check for NULL preferred_zone
  kernel/panic.c: add missing \n
  fbdev: color map copying bounds checking
  frv: add atomic64_add_unless()
  mm/mempolicy.c: do not put mempolicy before using its nodemask
  radix-tree: fix private list warnings
  Documentation/filesystems/proc.txt: add VmPin
  mm, memcg: do not retry precharge charges
  proc: add a schedule point in proc_pid_readdir()
  mm: alloc_contig: re-allow CMA to compact FS pages
  mm/slub.c: trace free objects at KERN_INFO
  ...
2017-01-24 16:54:39 -08:00
Dan Streetman
aab45453ff MAINTAINERS: add Dan Streetman to zbud maintainers
Add myself as zbud maintainer.

Link: http://lkml.kernel.org/r/20170124221705.26523-1-ddstreet@ieee.org
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24 16:26:15 -08:00
Dan Streetman
534c9dc982 MAINTAINERS: add Dan Streetman to zswap maintainers
Add myself as zswap maintainer.

Link: http://lkml.kernel.org/r/20170124212200.19052-1-ddstreet@ieee.org
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Acked-by: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24 16:26:14 -08:00
zhong jiang
3277953de2 mm: do not export ioremap_page_range symbol for external module
Recently, I've found cases in which ioremap_page_range was used
incorrectly, in external modules, leading to crashes.  This can be
partly attributed to the fact that ioremap_page_range is lower-level,
with fewer protections, as compared to the other functions that an
external module would typically call.  Those include:

     ioremap_cache
     ioremap_nocache
     ioremap_prot
     ioremap_uc
     ioremap_wc
     ioremap_wt

...each of which wraps __ioremap_caller, which in turn provides a safer
way to achieve the mapping.

Therefore, stop EXPORT-ing ioremap_page_range.

Link: http://lkml.kernel.org/r/1485173220-29010-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Suggested-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24 16:26:14 -08:00
Randy Dunlap
3705ccfdd1 mn10300: fix build error of missing fpu_save()
When CONFIG_FPU is not enabled on arch/mn10300, <asm/switch_to.h> causes
a build error with a call to fpu_save():

  kernel/built-in.o: In function `.L410':
  core.c:(.sched.text+0x28a): undefined reference to `fpu_save'

Fix this by including <asm/fpu.h> in <asm/switch_to.h> so that an empty
static inline fpu_save() is defined.

Link: http://lkml.kernel.org/r/dc421c4f-4842-4429-1b99-92865c2f24b6@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24 16:26:14 -08:00
Coly Li
f598f82e20 romfs: use different way to generate fsid for BLOCK or MTD
Commit 8a59f5d252 ("fs/romfs: return f_fsid for statfs(2)") generates
a 64bit id from sb->s_bdev->bd_dev.  This is only correct when romfs is
defined with CONFIG_ROMFS_ON_BLOCK.  If romfs is only defined with
CONFIG_ROMFS_ON_MTD, sb->s_bdev is NULL, referencing sb->s_bdev->bd_dev
will triger an oops.

Richard Weinberger points out that when CONFIG_ROMFS_BACKED_BY_BOTH=y,
both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD are defined.
Therefore when calling huge_encode_dev() to generate a 64bit id, I use
the follow order to choose parameter,

- CONFIG_ROMFS_ON_BLOCK defined
  use sb->s_bdev->bd_dev
- CONFIG_ROMFS_ON_BLOCK undefined and CONFIG_ROMFS_ON_MTD defined
  use sb->s_dev when,
- both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD undefined
  leave id as 0

When CONFIG_ROMFS_ON_MTD is defined and sb->s_mtd is not NULL, sb->s_dev
is set to a device ID generated by MTD_BLOCK_MAJOR and mtd index,
otherwise sb->s_dev is 0.

This is a try-best effort to generate a uniq file system ID, if all the
above conditions are not meet, f_fsid of this romfs instance will be 0.
Generally only one romfs can be built on single MTD block device, this
method is enough to identify multiple romfs instances in a computer.

Link: http://lkml.kernel.org/r/1482928596-115155-1-git-send-email-colyli@suse.de
Signed-off-by: Coly Li <colyli@suse.de>
Reported-by: Nong Li <nongli1031@gmail.com>
Tested-by: Nong Li <nongli1031@gmail.com>
Cc: Richard Weinberger <richard.weinberger@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24 16:26:14 -08:00
Sudip Mukherjee
4180c4c170 frv: add missing atomic64 operations
Some more atomic64 operations were missing and as a result frv
allmodconfig was failing.  Add the missing operations.

Link: http://lkml.kernel.org/r/1485193844-12850-1-git-send-email-sudip.mukherjee@codethink.co.uk
Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24 16:26:14 -08:00
Vlastimil Babka
e47483bca2 mm, page_alloc: fix premature OOM when racing with cpuset mems update
Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode
triggers OOM killer in few seconds, despite lots of free memory.  The
test attempts to repeatedly fault in memory in one process in a cpuset,
while changing allowed nodes of the cpuset between 0 and 1 in another
process.

The problem comes from insufficient protection against cpuset changes,
which can cause get_page_from_freelist() to consider all zones as
non-eligible due to nodemask and/or current->mems_allowed.  This was
masked in the past by sufficient retries, but since commit 682a3385e7
("mm, page_alloc: inline the fast path of the zonelist iterator") we fix
the preferred_zoneref once, and don't iterate over the whole zonelist in
further attempts, thus the only eligible zones might be placed in the
zonelist before our starting point and we always miss them.

A previous patch fixed this problem for current->mems_allowed.  However,
cpuset changes also update the task's mempolicy nodemask.  The fix has
two parts.  We have to repeat the preferred_zoneref search when we
detect cpuset update by way of seqcount, and we have to check the
seqcount before considering OOM.

[akpm@linux-foundation.org: fix typo in comment]
Link: http://lkml.kernel.org/r/20170120103843.24587-5-vbabka@suse.cz
Fixes: c33d6c06f6 ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Ganapatrao Kulkarni <gpkulkarni@gmail.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24 16:26:14 -08:00
Vlastimil Babka
5ce9bfef1d mm, page_alloc: move cpuset seqcount checking to slowpath
This is a preparation for the following patch to make review simpler.
While the primary motivation is a bug fix, this also simplifies the fast
path, although the moved code is only enabled when cpusets are in use.

Link: http://lkml.kernel.org/r/20170120103843.24587-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Ganapatrao Kulkarni <gpkulkarni@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24 16:26:14 -08:00
Vlastimil Babka
16096c25bf mm, page_alloc: fix fast-path race with cpuset update or removal
Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode
triggers OOM killer in few seconds, despite lots of free memory.  The
test attempts to repeatedly fault in memory in one process in a cpuset,
while changing allowed nodes of the cpuset between 0 and 1 in another
process.

One possible cause is that in the fast path we find the preferred
zoneref according to current mems_allowed, so that it points to the
middle of the zonelist, skipping e.g.  zones of node 1 completely.  If
the mems_allowed is updated to contain only node 1, we never reach it in
the zonelist, and trigger OOM before checking the cpuset_mems_cookie.

This patch fixes the particular case by redoing the preferred zoneref
search if we switch back to the original nodemask.  The condition is
also slightly changed so that when the last non-root cpuset is removed,
we don't miss it.

Note that this is not a full fix, and more patches will follow.

Link: http://lkml.kernel.org/r/20170120103843.24587-3-vbabka@suse.cz
Fixes: 682a3385e7 ("mm, page_alloc: inline the fast path of the zonelist iterator")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Ganapatrao Kulkarni <gpkulkarni@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24 16:26:14 -08:00
Vlastimil Babka
ea57485af8 mm, page_alloc: fix check for NULL preferred_zone
Patch series "fix premature OOM regression in 4.7+ due to cpuset races".

This is v2 of my attempt to fix the recent report based on LTP cpuset
stress test [1].  The intention is to go to stable 4.9 LTSS with this,
as triggering repeated OOMs is not nice.  That's why the patches try to
be not too intrusive.

Unfortunately why investigating I found that modifying the testcase to
use per-VMA policies instead of per-task policies will bring the OOM's
back, but that seems to be much older and harder to fix problem.  I have
posted a RFC [2] but I believe that fixing the recent regressions has a
higher priority.

Longer-term we might try to think how to fix the cpuset mess in a better
and less error prone way.  I was for example very surprised to learn,
that cpuset updates change not only task->mems_allowed, but also
nodemask of mempolicies.  Until now I expected the parameter to
alloc_pages_nodemask() to be stable.  I wonder why do we then treat
cpusets specially in get_page_from_freelist() and distinguish HARDWALL
etc, when there's unconditional intersection between mempolicy and
cpuset.  I would expect the nodemask adjustment for saving overhead in
g_p_f(), but that clearly doesn't happen in the current form.  So we
have both crazy complexity and overhead, AFAICS.

[1] https://lkml.kernel.org/r/CAFpQJXUq-JuEP=QPidy4p_=FN0rkH5Z-kfB4qBvsf6jMS87Edg@mail.gmail.com
[2] https://lkml.kernel.org/r/7c459f26-13a6-a817-e508-b65b903a8378@suse.cz

This patch (of 4):

Since commit c33d6c06f6 ("mm, page_alloc: avoid looking up the first
zone in a zonelist twice") we have a wrong check for NULL preferred_zone,
which can theoretically happen due to concurrent cpuset modification.  We
check the zoneref pointer which is never NULL and we should check the zone
pointer.  Also document this in first_zones_zonelist() comment per Michal
Hocko.

Fixes: c33d6c06f6 ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Link: http://lkml.kernel.org/r/20170120103843.24587-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Ganapatrao Kulkarni <gpkulkarni@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24 16:26:14 -08:00