Commit Graph

44260 Commits

Author SHA1 Message Date
Jozsef Kadlecsik
722a94519a netfilter: ipset: Separate memsize calculation code into dedicated function
Hash types already has it's memsize calculation code in separate
functions. Clean up and do the same for *bitmap* and *list* sets.

Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.

Suggested-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2016-11-10 13:28:44 +01:00
Jozsef Kadlecsik
bec810d973 netfilter: ipset: Improve skbinfo get/init helpers
Use struct ip_set_skbinfo in struct ip_set_ext instead of open
coded fields and assign structure members in get/init helpers
instead of copying members one by one. Explicitly note that
struct ip_set_skbinfo must be padded to prevent non-aligned
access in the extension blob.

Ported from a patch proposed by Sergey Popovich <popovich_sergei@mail.ua>.

Suggested-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2016-11-10 13:28:42 +01:00
Eric Dumazet
f522a5fcf6 tcp: remove unaligned accesses from tcp_get_info()
After commit 6ed46d1247 ("sock_diag: align nlattr properly when
needed"), tcp_get_info() gets 64bit aligned memory, so we can avoid
the unaligned helpers.

Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 22:51:33 -05:00
David Ahern
9b6c14d51b net: tcp response should set oif only if it is L3 master
Lorenzo noted an Android unit test failed due to e0d56fdd7342:
"The expectation in the test was that the RST replying to a SYN sent to a
closed port should be generated with oif=0. In other words it should not
prefer the interface where the SYN came in on, but instead should follow
whatever the routing table says it should do."

Revert the change to ip_send_unicast_reply and tcp_v6_send_response such
that the oif in the flow is set to the skb_iif only if skb_iif is an L3
master.

Fixes: e0d56fdd73 ("net: l3mdev: remove redundant calls")
Reported-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 22:32:10 -05:00
David S. Miller
d401c1d1e8 This feature and cleanup patchset includes the following changes:
- netlink and code cleanups by Sven Eckelmann (3 patches)
 
  - Cleanup and minor fixes by Linus Luessing (3 patches)
 
  - Speed up multicast update intervals, by Linus Luessing
 
  - Avoid (re)broadcast in meshes for some easy cases,
    by Linus Luessing
 
  - Clean up tx return state handling, by Sven Eckelmann (6 patches)
 
  - Fix some special mac address handling cases, by Sven Eckelmann
    (3 patches)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdBQJYI6GFFhxzd0BzaW1vbnd1bmRlcmxpY2guZGUACgkQoSvjmEKS
 nqHyEw/9GkYNRQJOk1JMuW0cDvj9uWqoendvXRNPVkCvqh4gjX4o+aQaeyumv1/v
 eYqpslQWmSsrIlGQ6UGCegzyzZ7jXo6ZijM7wvz2bWwB2C0NzUGlBBCzOeA6Bui/
 3Fq+Xmx0Xcf5+c82YmrLor/yYp4FIFTao4+a80vHzQeI/Hg8RuJTOFJdtVNV3JPP
 VrfzMAPLLXJPPKHjt1PN3lfANWqX6nWLUMhHBNkMpYB+mMdyaCve6X+MxPF+WYBH
 wBO8spU35chW7dp8HOncof5nRDv2xVHWs6TN2kdJ762YrZ1oL0GXwWXViKhWskSQ
 QEeOLboyj3IuwPsxOQOLQEbAMrp6jqj3L/6lYWRkV2U6Bbi8EYdozW8L3utxMcvA
 Dft8D2U5JAD5ja0VUFyGhwNaBFien2B9JSEwsyOLtUbaQSASNyvym75WrN2Ey/d7
 JhBzUt6Iwh8RNJylY3nG5OkoNnyXYv3VrQLsIW4QTHc8Um9eaiOeFHtuAi6WNBtI
 HgMwPcdErNbmPd3w9OM6kk6aBQ/DTUK/7CNUKYVoGDayGxKYGDwqhoog9zm18wrt
 wc/TtdIY+q95hgm8fDCJefrnkaIxDJrVtChs30N/pJ24MeKcHuibop3HzxIngze2
 zPZTuXRKA2VSt79+EV4KORAutexi1WQIN7nRH1a8zMsYfyMKG8Y=
 =1xrj
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20161108-v2' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
pull request for net-next: batman-adv 2016-11-08 v2

This feature and cleanup patchset includes the following changes:

 - netlink and code cleanups by Sven Eckelmann (3 patches)

 - Cleanup and minor fixes by Linus Luessing (3 patches)

 - Speed up multicast update intervals, by Linus Luessing

 - Avoid (re)broadcast in meshes for some easy cases,
   by Linus Luessing

 - Clean up tx return state handling, by Sven Eckelmann (6 patches)

 - Fix some special mac address handling cases, by Sven Eckelmann
   (3 patches)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 22:15:28 -05:00
Eric Dumazet
149d6ad836 net: napi_hash_add() is no longer exported
There are no more users except from net/core/dev.c
napi_hash_add() can now be static.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 21:16:05 -05:00
David Lebrun
a149e7c7ce ipv6: sr: add support for SRH injection through setsockopt
This patch adds support for per-socket SRH injection with the setsockopt
system call through the IPPROTO_IPV6, IPV6_RTHDR options.
The SRH is pushed through the ipv6_push_nfrag_opts function.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun
613fa3ca9e ipv6: add source address argument for ipv6_push_nfrag_opts
This patch prepares for insertion of SRH through setsockopt().
The new source address argument is used when an HMAC field is
present in the SRH, which must be filled. The HMAC signature
process requires the source address as input text.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun
9baee83406 ipv6: sr: add calls to verify and insert HMAC signatures
This patch enables the verification of the HMAC signature for transiting
SR-enabled packets, and its insertion on encapsulated/injected SRH.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun
4f4853dc1c ipv6: sr: implement API to control SR HMAC structure
This patch provides an implementation of the genetlink commands
to associate a given HMAC key identifier with an hashing algorithm
and a secret.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun
bf355b8d2c ipv6: sr: add core files for SR HMAC support
This patch adds the necessary functions to compute and check the HMAC signature
of an SR-enabled packet. Two HMAC algorithms are supported: hmac(sha1) and
hmac(sha256).

In order to avoid dynamic memory allocation for each HMAC computation,
a per-cpu ring buffer is allocated for this purpose.

A new per-interface sysctl called seg6_require_hmac is added, allowing a
user-defined policy for processing HMAC-signed SR-enabled packets.
A value of -1 means that the HMAC field will always be ignored.
A value of 0 means that if an HMAC field is present, its validity will
be enforced (the packet is dropped is the signature is incorrect).
Finally, a value of 1 means that any SR-enabled packet that does not
contain an HMAC signature or whose signature is incorrect will be dropped.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun
6c8702c60b ipv6: sr: add support for SRH encapsulation and injection with lwtunnels
This patch creates a new type of interfaceless lightweight tunnel (SEG6),
enabling the encapsulation and injection of SRH within locally emitted
packets and forwarded packets.

>From a configuration viewpoint, a seg6 tunnel would be configured as follows:

  ip -6 ro ad fc00::1/128 encap seg6 mode encap segs fc42::1,fc42::2,fc42::3 dev eth0

Any packet whose destination address is fc00::1 would thus be encapsulated
within an outer IPv6 header containing the SRH with three segments, and would
actually be routed to the first segment of the list. If `mode inline' was
specified instead of `mode encap', then the SRH would be directly inserted
after the IPv6 header without outer encapsulation.

The inline mode is only available if CONFIG_IPV6_SEG6_INLINE is enabled. This
feature was made configurable because direct header insertion may break
several mechanisms such as PMTUD or IPSec AH.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun
915d7e5e59 ipv6: sr: add code base for control plane support of SR-IPv6
This patch adds the necessary hooks and structures to provide support
for SR-IPv6 control plane, essentially the Generic Netlink commands
that will be used for userspace control over the Segment Routing
kernel structures.

The genetlink commands provide control over two different structures:
tunnel source and HMAC data. The tunnel source is the source address
that will be used by default when encapsulating packets into an
outer IPv6 header + SRH. If the tunnel source is set to :: then an
address of the outgoing interface will be selected as the source.

The HMAC commands currently just return ENOTSUPP and will be implemented
in a future patch.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David Lebrun
1ababeba4a ipv6: implement dataplane support for rthdr type 4 (Segment Routing Header)
Implement minimal support for processing of SR-enabled packets
as described in
https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-02.

This patch implements the following operations:
- Intermediate segment endpoint: incrementation of active segment and rerouting.
- Egress for SR-encapsulated packets: decapsulation of outer IPv6 header + SRH
  and routing of inner packet.
- Cleanup flag support for SR-inlined packets: removal of SRH if we are the
  penultimate segment endpoint.

A per-interface sysctl seg6_enabled is provided, to accept/deny SR-enabled
packets. Default is deny.

This patch does not provide support for HMAC-signed packets.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:40:06 -05:00
David S. Miller
9fa684ec86 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains a larger than usual batch of Netfilter
fixes for your net tree. This series contains a mixture of old bugs and
recently introduced bugs, they are:

1) Fix a crash when using nft_dynset with nft_set_rbtree, which doesn't
   support the set element updates from the packet path. From Liping
   Zhang.

2) Fix leak when nft_expr_clone() fails, from Liping Zhang.

3) Fix a race when inserting new elements to the set hash from the
   packet path, also from Liping.

4) Handle segmented TCP SIP packets properly, basically avoid that the
   INVITE in the allow header create bogus expectations by performing
   stricter SIP message parsing, from Ulrich Weber.

5) nft_parse_u32_check() should return signed integer for errors, from
   John Linville.

6) Fix wrong allocation instead of connlabels, allocate 16 instead of
   32 bytes, from Florian Westphal.

7) Fix compilation breakage when building the ip_vs_sync code with
   CONFIG_OPTIMIZE_INLINING on x86, from Arnd Bergmann.

8) Destroy the new set if the transaction object cannot be allocated,
   also from Liping Zhang.

9) Use device to route duplicated packets via nft_dup only when set by
   the user, otherwise packets may not follow the right route, again
   from Liping.

10) Fix wrong maximum genetlink attribute definition in IPVS, from
    WANG Cong.

11) Ignore untracked conntrack objects from xt_connmark, from Florian
    Westphal.

12) Allow to use conntrack helpers that are registered NFPROTO_UNSPEC
    via CT target, otherwise we cannot use the h.245 helper, from
    Florian.

13) Revisit garbage collection heuristic in the new workqueue-based
    timer approach for conntrack to evict objects earlier, again from
    Florian.

14) Fix crash in nf_tables when inserting an element into a verdict map,
    from Liping Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:38:18 -05:00
Mathias Krause
f567e950bf rtnl: reset calcit fptr in rtnl_unregister()
To avoid having dangling function pointers left behind, reset calcit in
rtnl_unregister(), too.

This is no issue so far, as only the rtnl core registers a netlink
handler with a calcit hook which won't be unregistered, but may become
one if new code makes use of the calcit hook.

Fixes: c7ac8679be ("rtnetlink: Compute and store minimum ifinfo...")
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 20:18:19 -05:00
Asbjørn Sloth Tønnesen
3f9b9770b4 net: l2tp: fix negative assignment to unsigned int
recv_seq, send_seq and lns_mode mode are all defined as
unsigned int foo:1;

Signed-off-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:55:36 -05:00
Asbjørn Sloth Tønnesen
57ceb8611d net: l2tp: cleanup: remove redundant condition
These assignments follow this pattern:

	unsigned int foo:1;
	struct nlattr *nla = info->attrs[bar];

	if (nla)
		foo = nla_get_flag(nla); /* expands to: foo = !!nla */

This could be simplified to: if (nla) foo = 1;
but lets just remove the condition and use the macro,

	foo = nla_get_flag(nla);

Signed-off-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:55:36 -05:00
Asbjørn Sloth Tønnesen
97b7af097e net: l2tp: netlink: l2tp_nl_tunnel_send: set UDP6 checksum flags
This patch causes the proper attribute flags to be set,
in the case that IPv6 UDP checksums are disabled, so that
userspace ie. `ip l2tp show tunnel` knows about it.

Signed-off-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:55:36 -05:00
Asbjørn Sloth Tønnesen
7ff516ffe4 net: l2tp: only set L2TP_ATTR_UDP_CSUM if AF_INET
Only set L2TP_ATTR_UDP_CSUM in l2tp_nl_tunnel_send()
when it's running over IPv4.

This prepares the code to also have IPv6 specific attributes.

Signed-off-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:55:36 -05:00
David Ahern
9d1a6c4ea4 net: icmp_route_lookup should use rt dev to determine L3 domain
icmp_send is called in response to some event. The skb may not have
the device set (skb->dev is NULL), but it is expected to have an rt.
Update icmp_route_lookup to use the rt on the skb to determine L3
domain.

Fixes: 613d09b30f ("net: Use VRF device index for lookups on TX")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:49:39 -05:00
Eric Dumazet
d61d072e87 net-gro: avoid reorders
Receiving a GSO packet in dev_gro_receive() is not uncommon
in stacked devices, or devices partially implementing LRO/GRO
like bnx2x. GRO is implementing the aggregation the device
was not able to do itself.

Current code causes reorders, like in following case :

For a given flow where sender sent 3 packets P1,P2,P3,P4

Receiver might receive P1 as a single packet, stored in GRO engine.

Then P2-P4 are received as a single GSO packet, immediately given to
upper stack, while P1 is held in GRO engine.

This patch will make sure P1 is given to upper stack, then P2-P4
immediately after.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 18:48:54 -05:00
Arnd Bergmann
56a62e2218 netfilter: conntrack: fix NF_REPEAT handling
gcc correctly identified a theoretical uninitialized variable use:

net/netfilter/nf_conntrack_core.c: In function 'nf_conntrack_in':
net/netfilter/nf_conntrack_core.c:1125:14: error: 'l4proto' may be used uninitialized in this function [-Werror=maybe-uninitialized]

This could only happen when we 'goto out' before looking up l4proto,
and then enter the retry, implying that l3proto->get_l4proto()
returned NF_REPEAT. This does not currently get returned in any
code path and probably won't ever happen, but is not good to
rely on.

Moving the repeat handling up a little should have the same
behavior as today but avoids the warning by making that case
impossible to enter.

[ I have mangled this original patch to remove the check for tmpl, we
  should inconditionally jump back to the repeat label in case we hit
  NF_REPEAT instead. I have also moved the comment that explains this
  where it belongs. --pablo ]

Fixes: 08733a0cb7 ("netfilter: handle NF_REPEAT from nf_conntrack_in()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-10 00:19:33 +01:00
Arnd Bergmann
30f5815848 udp: provide udp{4,6}_lib_lookup for nf_socket_ipv{4,6}
Since commit ca065d0cf8 ("udp: no longer use SLAB_DESTROY_BY_RCU")
the udp6_lib_lookup and udp4_lib_lookup functions are only
provided when it is actually possible to call them.

However, moving the callers now caused a link error:

net/built-in.o: In function `nf_sk_lookup_slow_v6':
(.text+0x131a39): undefined reference to `udp6_lib_lookup'
net/ipv4/netfilter/nf_socket_ipv4.o: In function `nf_sk_lookup_slow_v4':
nf_socket_ipv4.c:(.text.nf_sk_lookup_slow_v4+0x114): undefined reference to `udp4_lib_lookup'

This extends the #ifdef so we also provide the functions when
CONFIG_NF_SOCKET_IPV4 or CONFIG_NF_SOCKET_IPV6, respectively
are set.

Fixes: 8db4c5be88 ("netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-10 00:05:14 +01:00
Davide Caratti
0e54d2179f netfilter: conntrack: simplify init/uninit of L4 protocol trackers
modify registration and deregistration of layer-4 protocol trackers to
facilitate inclusion of new elements into the current list of builtin
protocols. Both builtin (TCP, UDP, ICMP) and non-builtin (DCCP, GRE, SCTP,
UDPlite) layer-4 protocol trackers usually register/deregister themselves
using consecutive calls to nf_ct_l4proto_{,pernet}_{,un}register(...).
This sequence is interrupted and rolled back in case of error; in order to
simplify addition of builtin protocols, the input of the above functions
has been modified to allow registering/unregistering multiple protocols.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-09 23:49:25 +01:00
Liping Zhang
4e24877e61 netfilter: nf_tables: simplify the basic expressions' init routine
Some basic expressions are built into nf_tables.ko, such as nft_cmp,
nft_lookup, nft_range and so on. But these basic expressions' init
routine is a little ugly, too many goto errX labels, and we forget
to call nft_range_module_exit in the exit routine, although it is
harmless.

Acctually, the init and exit routines of these basic expressions
are same, i.e. do nft_register_expr in the init routine and do
nft_unregister_expr in the exit routine.

So it's better to arrange them into an array and deal with them
together.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-09 23:42:23 +01:00
Pablo Neira Ayuso
f86dab3aa6 netfilter: nft_hash: get random bytes if seed is not specified
If the user doesn't specify a seed, generate one at configuration time.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-09 23:41:09 +01:00
Hadar Hen Zion
75bfbca01e net/sched: act_tunnel_key: Add UDP dst port option
The current tunnel set action supports only IP addresses and key
options. Add UDP dst port option.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:41:55 -05:00
Hadar Hen Zion
24ba898d43 net/dst: Add dst port to dst_metadata utility functions
Add dst port parameter to __ip_tun_set_dst and __ipv6_tun_set_dst
utility functions.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:41:54 -05:00
Hadar Hen Zion
f4d997fd61 net/sched: cls_flower: Add UDP port to tunnel parameters
The current IP tunneling classification supports only IP addresses and key.
Enhance UDP based IP tunneling classification parameters by adding UDP
src and dst port.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:41:54 -05:00
Hadar Hen Zion
519d10521c net/sched: cls_flower: Allow setting encapsulation fields as used key
When encapsulation field is set, mark it as used key for the flow
dissector. This will be used by offloading drivers.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:41:54 -05:00
Hadar Hen Zion
9ce183b4c4 net/sched: act_tunnel_key: add helper inlines to access tcf_tunnel_key
Needed for drivers to pick the relevant action when offloading tunnel
key act.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:41:53 -05:00
Lorenzo Colitti
35b80733b3 net: core: add missing check for uid_range in rule_exists.
Without this check, it is not possible to create two rules that
are identical except for their UID ranges. For example:

root@net-test:/# ip rule add prio 1000 lookup 300
root@net-test:/# ip rule add prio 1000 uidrange 100-200 lookup 300
RTNETLINK answers: File exists
root@net-test:/# ip rule add prio 1000 uidrange 100-199 lookup 100
root@net-test:/# ip rule add prio 1000 uidrange 200-299 lookup 200
root@net-test:/# ip rule add prio 1000 uidrange 300-399 lookup 100
RTNETLINK answers: File exists

Tested: https://android-review.googlesource.com/#/c/299980/
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:28:10 -05:00
Maciej Żenczykowski
fb56be83e4 net-ipv6: on device mtu change do not add mtu to mtu-less routes
Routes can specify an mtu explicitly or inherit the mtu from
the underlying device - this inheritance is implemented in
dst->ops->mtu handlers ip6_mtu() and ip6_blackhole_mtu().

Currently changing the mtu of a device adds mtu explicitly
to routes using that device.

ie.
  # ip link set dev lo mtu 65536
  # ip -6 route add local 2000::1 dev lo
  # ip -6 route get 2000::1
  local 2000::1 dev lo  table local  src ...  metric 1024  pref medium

  # ip link set dev lo mtu 65535
  # ip -6 route get 2000::1
  local 2000::1 dev lo  table local  src ...  metric 1024  mtu 65535 pref medium

  # ip link set dev lo mtu 65536
  # ip -6 route get 2000::1
  local 2000::1 dev lo  table local  src ...  metric 1024  mtu 65536 pref medium

  # ip -6 route del local 2000::1

After this patch the route entry no longer changes unless it already has an mtu.
There is no need: this inheritance is already done in ip6_mtu()

  # ip link set dev lo mtu 65536
  # ip -6 route add local 2000::1 dev lo
  # ip -6 route add local 2000::2 dev lo mtu 2000
  # ip -6 route get 2000::1; ip -6 route get 2000::2
  local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
  local 2000::2 dev lo  table local  src ...  metric 1024  mtu 2000 pref medium

  # ip link set dev lo mtu 65535
  # ip -6 route get 2000::1; ip -6 route get 2000::2
  local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
  local 2000::2 dev lo  table local  src ...  metric 1024  mtu 2000 pref medium

  # ip link set dev lo mtu 1501
  # ip -6 route get 2000::1; ip -6 route get 2000::2
  local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
  local 2000::2 dev lo  table local  src ...  metric 1024  mtu 1501 pref medium

  # ip link set dev lo mtu 65536
  # ip -6 route get 2000::1; ip -6 route get 2000::2
  local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
  local 2000::2 dev lo  table local  src ...  metric 1024  mtu 65536 pref medium

  # ip -6 route del local 2000::1
  # ip -6 route del local 2000::2

This is desirable because changing device mtu and then resetting it
to the previous value shouldn't change the user visible routing table.

Signed-off-by: Maciej Żenczykowski <maze@google.com>
CC: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:19:32 -05:00
Soheil Hassas Yeganeh
3023898b7d sock: fix sendmmsg for partial sendmsg
Do not send the next message in sendmmsg for partial sendmsg
invocations.

sendmmsg assumes that it can continue sending the next message
when the return value of the individual sendmsg invocations
is positive. It results in corrupting the data for TCP,
SCTP, and UNIX streams.

For example, sendmmsg([["abcd"], ["efgh"]]) can result in a stream
of "aefgh" if the first sendmsg invocation sends only the first
byte while the second sendmsg goes through.

Datagram sockets either send the entire datagram or fail, so
this patch affects only sockets of type SOCK_STREAM and
SOCK_SEQPACKET.

Fixes: 228e548e60 ("net: Add sendmmsg socket system call")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:18:12 -05:00
Eric Dumazet
67db3e4bfb tcp: no longer hold ehash lock while calling tcp_get_info()
We had various problems in the past in tcp_get_info() and used
specific synchronization to avoid deadlocks.

We would like to add more instrumentation points for TCP, and
avoiding grabing socket lock in tcp_getinfo() was too costly.

Being able to lock the socket allows to provide consistent set
of fields.

inet_diag_dump_icsk() can make sure ehash locks are not
held any more when tcp_get_info() is called.

We can remove syncp added in commit d654976cbf
("tcp: fix a potential deadlock in tcp_get_info()"), but we need
to use lock_sock_fast() instead of spin_lock_bh() since TCP input
path can now be run from process context.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:02:27 -05:00
Eric Dumazet
ccbf3bfaee tcp: shortcut listeners in tcp_get_info()
Being lockless in tcp_get_info() is hard, because we need to add
specific synchronization in TCP fast path, like seqcount.

Following patch will change inet_diag_dump_icsk() to no longer
hold any lock for non listeners, so that we can properly acquire
socket lock in get_tcp_info() and let it return more consistent counters.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:02:27 -05:00
Sowmini Varadhan
117d15bbfd RDS: TCP: start multipath acceptor loop at 0
The for() loop in rds_tcp_accept_one() assumes that the 0'th
rds_tcp_conn_path is UP and starts multipath accepts at index 1.
But this assumption may not always be true: if the 0'th path
has failed (ERROR or DOWN state) an incoming connection request
should be used to resurrect this path.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 12:47:49 -05:00
Sowmini Varadhan
1ac507d4ff RDS: TCP: report addr/port info based on TCP socket in rds-info
The socket argument passed to rds_tcp_tc_info() is a PF_RDS socket,
so it is incorrect to report the address port info based on
rds_getname() as part of TCP state report.

Invoke inet_getname() for the t_sock associated with the
rds_tcp_connection instead.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 12:47:49 -05:00
Liping Zhang
58c78e104d netfilter: nf_tables: fix oops when inserting an element into a verdict map
Dalegaard says:
 The following ruleset, when loaded with 'nft -f bad.txt'
 ----snip----
 flush ruleset
 table ip inlinenat {
   map sourcemap {
     type ipv4_addr : verdict;
   }

   chain postrouting {
     ip saddr vmap @sourcemap accept
   }
 }
 add chain inlinenat test
 add element inlinenat sourcemap { 100.123.10.2 : jump test }
 ----snip----

 results in a kernel oops:
 BUG: unable to handle kernel paging request at 0000000000001344
 IP: [<ffffffffa07bf704>] nf_tables_check_loops+0x114/0x1f0 [nf_tables]
 [...]
 Call Trace:
  [<ffffffffa07c2aae>] ? nft_data_init+0x13e/0x1a0 [nf_tables]
  [<ffffffffa07c1950>] nft_validate_register_store+0x60/0xb0 [nf_tables]
  [<ffffffffa07c74b5>] nft_add_set_elem+0x545/0x5e0 [nf_tables]
  [<ffffffffa07bfdd0>] ? nft_table_lookup+0x30/0x60 [nf_tables]
  [<ffffffff8132c630>] ? nla_strcmp+0x40/0x50
  [<ffffffffa07c766e>] nf_tables_newsetelem+0x11e/0x210 [nf_tables]
  [<ffffffff8132c400>] ? nla_validate+0x60/0x80
  [<ffffffffa030d9b4>] nfnetlink_rcv+0x354/0x5a7 [nfnetlink]

Because we forget to fill the net pointer in bind_ctx, so dereferencing
it may cause kernel crash.

Reported-by: Dalegaard <dalegaard@gmail.com>
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-08 23:53:39 +01:00
Florian Westphal
e0df8cae6c netfilter: conntrack: refine gc worker heuristics
Nicolas Dichtel says:
  After commit b87a2f9199 ("netfilter: conntrack: add gc worker to
  remove timed-out entries"), netlink conntrack deletion events may be
  sent with a huge delay.

Nicolas further points at this line:

  goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);

and indeed, this isn't optimal at all.  Rationale here was to ensure that
we don't block other work items for too long, even if
nf_conntrack_htable_size is huge.  But in order to have some guarantee
about maximum time period where a scan of the full conntrack table
completes we should always use a fixed slice size, so that once every
N scans the full table has been examined at least once.

We also need to balance this vs. the case where the system is either idle
(i.e., conntrack table (almost) empty) or very busy (i.e. eviction happens
from packet path).

So, after some discussion with Nicolas:

1. want hard guarantee that we scan entire table at least once every X s
-> need to scan fraction of table (get rid of upper bound)

2. don't want to eat cycles on idle or very busy system
-> increase interval if we did not evict any entries

3. don't want to block other worker items for too long
-> make fraction really small, and prefer small scan interval instead

4. Want reasonable short time where we detect timed-out entry when
system went idle after a burst of traffic, while not doing scans
all the time.
-> Store next gc scan in worker, increasing delays when no eviction
happened and shrinking delay when we see timed out entries.

The old gc interval is turned into a max number, scans can now happen
every jiffy if stale entries are present.

Longest possible time period until an entry is evicted is now 2 minutes
in worst case (entry expires right after it was deemed 'not expired').

Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-08 23:53:38 +01:00
Florian Westphal
6114cc516d netfilter: conntrack: fix CT target for UNSPEC helpers
Thomas reports its not possible to attach the H.245 helper:

iptables -t raw -A PREROUTING -p udp -j CT --helper H.245
iptables: No chain/target/match by that name.
xt_CT: No such helper "H.245"

This is because H.245 registers as NFPROTO_UNSPEC, but the CT target
passes NFPROTO_IPV4/IPV6 to nf_conntrack_helper_try_module_get.

We should treat UNSPEC as wildcard and ignore the l3num instead.

Reported-by: Thomas Woerner <twoerner@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-08 23:53:37 +01:00
Florian Westphal
fb9c9649a1 netfilter: connmark: ignore skbs with magic untracked conntrack objects
The (percpu) untracked conntrack entries can end up with nonzero connmarks.

The 'untracked' conntrack objects are merely a way to distinguish INVALID
(i.e. protocol connection tracker says payload doesn't meet some
requirements or packet was never seen by the connection tracking code)
from packets that are intentionally not tracked (some icmpv6 types such as
neigh solicitation, or by using 'iptables -j CT --notrack' option).

Untracked conntrack objects are implementation detail, we might as well use
invalid magic address instead to tell INVALID and UNTRACKED apart.

Check skb->nfct for untracked dummy and behave as if skb->nfct is NULL.

Reported-by: XU Tianwen <evan.xu.tianwen@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-08 23:53:36 +01:00
WANG Cong
8fbfef7f50 ipvs: use IPVS_CMD_ATTR_MAX for family.maxattr
family.maxattr is the max index for policy[], the size of
ops[] is determined with ARRAY_SIZE().

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-08 23:53:30 +01:00
Sven Eckelmann
93bbaab455 batman-adv: Reject unicast packet with zero/mcast dst address
An unicast batman-adv packet cannot be transmitted to a multicast or zero
mac address. So reject incoming packets which still have these classes of
addresses as destination mac address in the outer ethernet header.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:36 +01:00
Sven Eckelmann
92eef520d7 batman-adv: Disallow zero and mcast src address for mgmt frames
The routing check for management frames is validating the source mac
address in the outer ethernet header. It rejects every source mac address
which is a broadcast address. But it also has to reject the zero-mac
address and multicast mac addresses.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:35 +01:00
Sven Eckelmann
9f75c8e1c8 batman-adv: Disallow mcast src address for data frames
The routing checks are validating the source mac address of the outer
ethernet header. They reject every source mac address which is a broadcast
address. But they also have to reject any multicast mac addresses.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
[sw@simonwunderlich.de: fix commit message typo]
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:35 +01:00
Sven Eckelmann
7d72d174c7 batman-adv: Remove dev_queue_xmit return code exception
No caller of batadv_send_skb_to_orig is expecting the results to be -1
(-EPERM) anymore when the skbuff was not consumed. They will instead expect
that the skbuff is always consumed. Having such return code filter is
therefore not needed anymore.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:34 +01:00
Sven Eckelmann
b91a2543b4 batman-adv: Consume skb in receive handlers
Receiving functions in Linux consume the supplied skbuff. Doing the same in
the batadv_rx_handler functions makes the behavior more similar to the rest
of the Linux network code.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08 19:02:34 +01:00
Alexander Duyck
fd0285a39b fib_trie: Correct /proc/net/route off by one error
The display of /proc/net/route has had a couple issues due to the fact that
when I originally rewrote most of fib_trie I made it so that the iterator
was tracking the next value to use instead of the current.

In addition it had an off by 1 error where I was tracking the first piece
of data as position 0, even though in reality that belonged to the
SEQ_START_TOKEN.

This patch updates the code so the iterator tracks the last reported
position and key instead of the next expected position and key.  In
addition it shifts things so that all of the leaves start at 1 instead of
trying to report leaves starting with offset 0 as being valid.  With these
two issues addressed this should resolve any off by one errors that were
present in the display of /proc/net/route.

Fixes: 25b97c016b ("ipv4: off-by-one in continuation handling in /proc/net/route")
Cc: Andy Whitcroft <apw@canonical.com>
Reported-by: Jason Baron <jbaron@akamai.com>
Tested-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 20:40:27 -05:00
David Ahern
5d41ce29e3 net: icmp6_send should use dst dev to determine L3 domain
icmp6_send is called in response to some event. The skb may not have
the device set (skb->dev is NULL), but it is expected to have a dst set.
Update icmp6_send to use the dst on the skb to determine L3 domain.

Fixes: ca254490c8 ("net: Add VRF support to IPv6 stack")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 20:30:19 -05:00
Soheil Hassas Yeganeh
f5f99309fa sock: do not set sk_err in sock_dequeue_err_skb
Do not set sk_err when dequeuing errors from the error queue.
Doing so results in:
a) Bugs: By overwriting existing sk_err values, it possibly
   hides legitimate errors. It is also incorrect when local
   errors are queued with ip_local_error. That happens in the
   context of a system call, which already returns the error
   code.
b) Inconsistent behavior: When there are pending errors on
   the error queue, sk_err is sometimes 0 (e.g., for
   the first timestamp on the error queue) and sometimes
   set to an error code (after dequeuing the first
   timestamp).
c) Suboptimality: Setting sk_err to ENOMSG on simple
   TX timestamps can abort parallel reads and writes.

Removing this line doesn't break userspace. This is because
userspace code cannot rely on sk_err for detecting whether
there is something on the error queue. Except for ICMP messages
received for UDP and RAW, sk_err is not set at enqueue time,
and as a result sk_err can be 0 while there are plenty of
errors on the error queue.

For ICMP packets in UDP and RAW, sk_err is set when they are
enqueued on the error queue, but that does not result in aborting
reads and writes. For such cases, sk_err is only readable via
getsockopt(SO_ERROR) which will reset the value of sk_err on
its own. More importantly, prior to this patch,
recvmsg(MSG_ERRQUEUE) has a race on setting sk_err (i.e.,
sk_err is set by sock_dequeue_err_skb without atomic ops or
locks) which can store 0 in sk_err even when we have ICMP
messages pending. Removing this line from sock_dequeue_err_skb
eliminates that race.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 20:29:10 -05:00
Jesper Dangaard Brouer
84c46dd865 qdisc: catch misconfig of attaching qdisc to tx_queue_len zero device
It is a clear misconfiguration to attach a qdisc to a device with
tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred,
htb, plug and sfb) inherit/copy this value as their queue length.

Why should the kernel catch such a misconfiguration?  Because prior to
introducing the IFF_NO_QUEUE device flag, userspace found a loophole
in the qdisc config system that allowed them to achieve the equivalent
of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from
a device.  The loophole on older kernels is setting tx_queue_len=0,
*prior* to device qdisc init (the config time is significant, simply
setting tx_queue_len=0 doesn't trigger the loophole).

This loophole is currently used by Docker[1] to get better performance
and scalability out of the veth device.  The Docker developers were
warned[1] that they needed to adjust the tx_queue_len if ever
attaching a qdisc.  The OpenShift project didn't remember this warning
and attached a qdisc, this were caught and fixed in[2].

[1] https://github.com/docker/libcontainer/pull/193
[2] https://github.com/openshift/origin/pull/11126

Instead of fixing every userspace program that used this loophole, and
forgot to reset the tx_queue_len, prior to attaching a qdisc.  Let's
catch the misconfiguration on the kernel side.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 20:15:55 -05:00
Jesper Dangaard Brouer
1159708432 net/qdisc: IFF_NO_QUEUE drivers should use consistent TX queue len
The flag IFF_NO_QUEUE marks virtual device drivers that doesn't need a
default qdisc attached, given they will be backed by physical device,
that already have a qdisc attached for pushback.

It is still supported to attach a qdisc to a IFF_NO_QUEUE device, as
this can be useful for difference policy reasons (e.g. bandwidth
limiting containers).  For this to work, the tx_queue_len need to have
a sane value, because some qdiscs inherit/copy the tx_queue_len
(namely, pfifo, bfifo, gred, htb, plug and sfb).

Commit a813104d92 ("IFF_NO_QUEUE: Fix for drivers not calling
ether_setup()") caught situations where some drivers didn't initialize
tx_queue_len.  The problem with the commit was choosing 1 as the
fallback value.

A qdisc queue length of 1 causes more harm than good, because it
creates hard to debug situations for userspace. It gives userspace a
false sense of a working config after attaching a qdisc.  As low
volume traffic (that doesn't activate the qdisc policy) works,
like ping, while traffic that e.g. needs shaping cannot reach the
configured policy levels, given the queue length is too small.

This patch change the value to DEFAULT_TX_QUEUE_LEN, given other
IFF_NO_QUEUE devices (that call ether_setup()) also use this value.

Fixes: a813104d92 ("IFF_NO_QUEUE: Fix for drivers not calling ether_setup()")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 20:15:55 -05:00
Jesper Dangaard Brouer
d0a81f67cd net: make default TX queue length a defined constant
The default TX queue length of Ethernet devices have been a magic
constant of 1000, ever since the initial git import.

Looking back in historical trees[1][2] the value used to be 100,
with the same comment "Ethernet wants good queues". The commit[3]
that changed this from 100 to 1000 didn't describe why, but from
conversations with Robert Olsson it seems that it was changed
when Ethernet devices went from 100Mbit/s to 1Gbit/s, because the
link speed increased x10 the queue size were also adjusted.  This
value later caused much heartache for the bufferbloat community.

This patch merely moves the value into a defined constant.

[1] https://git.kernel.org/cgit/linux/kernel/git/davem/netdev-vger-cvs.git/
[2] https://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/
[3] https://git.kernel.org/tglx/history/c/98921832c232

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 20:15:55 -05:00
Anna Schumaker
bb29dd8433 SUNRPC: Fix suspicious RCU usage
We need to hold the rcu_read_lock() when calling rcu_dereference(),
otherwise we can't guarantee that the object being dereferenced still
exists.

Fixes: 39e5d2df ("SUNRPC search xprt switch for sockaddr")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-07 14:35:59 -05:00
Paolo Abeni
7c13f97ffd udp: do fwd memory scheduling on dequeue
A new argument is added to __skb_recv_datagram to provide
an explicit skb destructor, invoked under the receive queue
lock.
The UDP protocol uses such argument to perform memory
reclaiming on dequeue, so that the UDP protocol does not
set anymore skb->desctructor.
Instead explicit memory reclaiming is performed at close() time and
when skbs are removed from the receive queue.
The in kernel UDP protocol users now need to call a
skb_recv_udp() variant instead of skb_recv_datagram() to
properly perform memory accounting on dequeue.

Overall, this allows acquiring only once the receive queue
lock on dequeue.

Tested using pktgen with random src port, 64 bytes packet,
wire-speed on a 10G link as sender and udp_sink as the receiver,
using an l4 tuple rxhash to stress the contention, and one or more
udp_sink instances with reuseport.

nr sinks	vanilla		patched
1		440		560
3		2150		2300
6		3650		3800
9		4450		4600
12		6250		6450

v1 -> v2:
 - do rmem and allocated memory scheduling under the receive lock
 - do bulk scheduling in first_packet_length() and in udp_destruct_sock()
 - avoid the typdef for the dequeue callback

Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:24:41 -05:00
Paolo Abeni
ad959036a7 net/sock: add an explicit sk argument for ip_cmsg_recv_offset()
So that we can use it even after orphaining the skbuff.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:24:41 -05:00
Marcelo Ricardo Leitner
7233bc84a3 sctp: assign assoc_id earlier in __sctp_connect
sctp_wait_for_connect() currently already holds the asoc to keep it
alive during the sleep, in case another thread release it. But Andrey
Konovalov and Dmitry Vyukov reported an use-after-free in such
situation.

Problem is that __sctp_connect() doesn't get a ref on the asoc and will
do a read on the asoc after calling sctp_wait_for_connect(), but by then
another thread may have closed it and the _put on sctp_wait_for_connect
will actually release it, causing the use-after-free.

Fix is, instead of doing the read after waiting for the connect, do it
before so, and avoid this issue as the socket is still locked by then.
There should be no issue on returning the asoc id in case of failure as
the application shouldn't trust on that number in such situations
anyway.

This issue doesn't exist in sctp_sendmsg() path.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:18:37 -05:00
David Ahern
cd2c0f4540 net: Update raw socket bind to consider l3 domain
Binding a raw socket to a local address fails if the socket is bound
to an L3 domain:

    $ vrf-test  -s -l 10.100.1.2 -R -I red
    error binding socket: 99: Cannot assign requested address

Update raw_bind to look consider if sk_bound_dev_if is bound to an L3
domain and use inet_addr_type_table to lookup the address.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:14:34 -05:00
Linus Torvalds
fb415f222c Fixes for some recent regressions including fallout from the vmalloc'd
stack change (after which we can no longer encrypt stuff on the stack).
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYHNwpAAoJECebzXlCjuG++DMP/3mUUAF09DfFR/EHl7knDT1f
 kZ53UVHYzr02w0wXfwxVLlp2H7TdSAufgsSvPT6qksA3eY7gL6nJ9zHkl+Nv5yCx
 y6vsFWjO1QEUWFOZWCKcmT2dAI3Ddt9IhK13pfZEKN1XKvK2zWB16HEVzSg6fR2K
 NwHlpMnQUI4HWThURzwTZb1M5YhxRCAnyiv8BTAAPjbEfzPzdL7j3jxwqtH8bOWp
 qIcDDvjC744b9zy0YuAEY/NyGBhYZPdM6gWsBBes1TRzBWUL9qsUYTWDJTmg/F1l
 Or0Jz7CUEN9uOHLGnkATPDc+eBg9YFV+bSsSnJu1/W4Er7dX1Af/lol79zEp/Zw1
 Snd9FelSPj3vxmYAFTCLnHRTRgsyiDhbbb7gVrzH9bxnCrRNR6p2kY018s1Cl9Td
 uWQoNNFQwwnYxWYEeZdO5PgX+pcgoCzhHACNk5oA93YaBE0GuLHHugwwIrYE8TM1
 1iY20sLC5lJcnPqxdgnoprZnnHMuL6rx5KRbvBeflNZ4huK2PIcPJyeB83XH6s12
 G67PjJ0rfWzSBF14O/ZtQA6he+kXvnH3pKqpNnaMiBxZZ2J8E1eQvrKTLLIwmtlP
 18KKJpZIzh7jTTZ/99nAMAt/BGw97P9TToLdnI8dCxYygHEaywpEYtcsE8IWFAvA
 3XkS5QdlJhhAaAUUYBXy
 =oPbZ
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.9-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfixes from Bruce Fields:
 "Fixes for some recent regressions including fallout from the vmalloc'd
  stack change (after which we can no longer encrypt stuff on the
  stack)"

* tag 'nfsd-4.9-1' of git://linux-nfs.org/~bfields/linux:
  nfsd: Fix general protection fault in release_lock_stateid()
  svcrdma: backchannel cannot share a page for send and rcv buffers
  sunrpc: fix some missing rq_rbuffer assignments
  sunrpc: don't pass on-stack memory to sg_set_buf
  nfsd: move blocked lock handling under a dedicated spinlock
2016-11-04 20:12:10 -07:00
Lorenzo Colitti
e2d118a1cb net: inet: Support UID-based routing in IP protocols.
- Use the UID in routing lookups made by protocol connect() and
  sendmsg() functions.
- Make sure that routing lookups triggered by incoming packets
  (e.g., Path MTU discovery) take the UID of the socket into
  account.
- For packets not associated with a userspace socket, (e.g., ping
  replies) use UID 0 inside the user namespace corresponding to
  the network namespace the socket belongs to. This allows
  all namespaces to apply routing and iptables rules to
  kernel-originated traffic in that namespaces by matching UID 0.
  This is better than using the UID of the kernel socket that is
  sending the traffic, because the UID of kernel sockets created
  at namespace creation time (e.g., the per-processor ICMP and
  TCP sockets) is the UID of the user that created the socket,
  which might not be mapped in the namespace.

Tested: compiles allnoconfig, allyesconfig, allmodconfig
Tested: https://android-review.googlesource.com/253302
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04 14:45:23 -04:00
Lorenzo Colitti
622ec2c9d5 net: core: add UID to flows, rules, and routes
- Define a new FIB rule attributes, FRA_UID_RANGE, to describe a
  range of UIDs.
- Define a RTA_UID attribute for per-UID route lookups and dumps.
- Support passing these attributes to and from userspace via
  rtnetlink. The value INVALID_UID indicates no UID was
  specified.
- Add a UID field to the flow structures.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04 14:45:23 -04:00
Lorenzo Colitti
86741ec254 net: core: Add a UID field to struct sock.
Protocol sockets (struct sock) don't have UIDs, but most of the
time, they map 1:1 to userspace sockets (struct socket) which do.

Various operations such as the iptables xt_owner match need
access to the "UID of a socket", and do so by following the
backpointer to the struct socket. This involves taking
sk_callback_lock and doesn't work when there is no socket
because userspace has already called close().

Simplify this by adding a sk_uid field to struct sock whose value
matches the UID of the corresponding struct socket. The semantics
are as follows:

1. Whenever sk_socket is non-null: sk_uid is the same as the UID
   in sk_socket, i.e., matches the return value of sock_i_uid.
   Specifically, the UID is set when userspace calls socket(),
   fchown(), or accept().
2. When sk_socket is NULL, sk_uid is defined as follows:
   - For a socket that no longer has a sk_socket because
     userspace has called close(): the previous UID.
   - For a cloned socket (e.g., an incoming connection that is
     established but on which userspace has not yet called
     accept): the UID of the socket it was cloned from.
   - For a socket that has never had an sk_socket: UID 0 inside
     the user namespace corresponding to the network namespace
     the socket belongs to.

Kernel sockets created by sock_create_kern are a special case
of #1 and sk_uid is the user that created them. For kernel
sockets created at network namespace creation time, such as the
per-processor ICMP and TCP sockets, this is the user that created
the network namespace.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04 14:45:22 -04:00
WANG Cong
00ffc1ba02 genetlink: fix a memory leak on error path
In __genl_register_family(), when genl_validate_assign_mc_groups()
fails, we forget to free the memory we possibly allocate for
family->attrbuf.

Note, some callers call genl_unregister_family() to clean up
on error path, it doesn't work because the family is inserted
to the global list in the nearly last step.

Cc: Jakub Kicinski <kubakici@wp.pl>
Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:52:29 -04:00
Eric Dumazet
990ff4d844 ipv6: dccp: add missing bind_conflict to dccp_ipv6_mapped
While fuzzing kernel with syzkaller, Andrey reported a nasty crash
in inet6_bind() caused by DCCP lacking a required method.

Fixes: ab1e0a13d7 ("[SOCK] proto: Add hashinfo member to struct proto")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:50:27 -04:00
Simon Horman
5976c5f45c net/sched: cls_flower: Support matching on SCTP ports
Support matching on SCTP ports in the same way that matching
on TCP and UDP ports is already supported.

Example usage:

tc qdisc add dev eth0 ingress

tc filter add dev eth0 protocol ip parent ffff: \
        flower indev eth0 ip_proto sctp dst_port 80 \
        action drop

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:26:39 -04:00
Eric Dumazet
1aa9d1a0e7 ipv6: dccp: fix out of bound access in dccp_v6_err()
dccp_v6_err() does not use pskb_may_pull() and might access garbage.

We only need 4 bytes at the beginning of the DCCP header, like TCP,
so the 8 bytes pulled in icmpv6_notify() are more than enough.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:16:51 -04:00
Eric Dumazet
93636d1f1f netlink: netlink_diag_dump() runs without locks
A recent commit removed locking from netlink_diag_dump() but forgot
one error case.

=====================================
[ BUG: bad unlock balance detected! ]
4.9.0-rc3+ #336 Not tainted
-------------------------------------
syz-executor/4018 is trying to release lock ([   36.220068] nl_table_lock
) at:
[<ffffffff82dc8683>] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182
but there are no more locks to release!

other info that might help us debug this:
3 locks held by syz-executor/4018:
 #0: [   36.220068]  (
sock_diag_mutex[   36.220068] ){+.+.+.}
, at: [   36.220068] [<ffffffff82c3873b>] sock_diag_rcv+0x1b/0x40
 #1: [   36.220068]  (
sock_diag_table_mutex[   36.220068] ){+.+.+.}
, at: [   36.220068] [<ffffffff82c38e00>] sock_diag_rcv_msg+0x140/0x3a0
 #2: [   36.220068]  (
nlk->cb_mutex[   36.220068] ){+.+.+.}
, at: [   36.220068] [<ffffffff82db6600>] netlink_dump+0x50/0xac0

stack backtrace:
CPU: 1 PID: 4018 Comm: syz-executor Not tainted 4.9.0-rc3+ #336
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 ffff8800645df688 ffffffff81b46934 ffffffff84eb3e78 ffff88006ad85800
 ffffffff82dc8683 ffffffff84eb3e78 ffff8800645df6b8 ffffffff812043ca
 dffffc0000000000 ffff88006ad85ff8 ffff88006ad85fd0 00000000ffffffff
Call Trace:
 [<     inline     >] __dump_stack lib/dump_stack.c:15
 [<ffffffff81b46934>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
 [<ffffffff812043ca>] print_unlock_imbalance_bug+0x17a/0x1a0
kernel/locking/lockdep.c:3388
 [<     inline     >] __lock_release kernel/locking/lockdep.c:3512
 [<ffffffff8120cfd8>] lock_release+0x8e8/0xc60 kernel/locking/lockdep.c:3765
 [<     inline     >] __raw_read_unlock ./include/linux/rwlock_api_smp.h:225
 [<ffffffff83fc001a>] _raw_read_unlock+0x1a/0x30 kernel/locking/spinlock.c:255
 [<ffffffff82dc8683>] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182
 [<ffffffff82db6947>] netlink_dump+0x397/0xac0 net/netlink/af_netlink.c:2110

Fixes: ad20207432 ("netlink: Use rhashtable walk interface in diag dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:16:51 -04:00
Eric Dumazet
6706a97fec dccp: fix out of bound access in dccp_v4_err()
dccp_v4_err() does not use pskb_may_pull() and might access garbage.

We only need 4 bytes at the beginning of the DCCP header, like TCP,
so the 8 bytes pulled in icmp_socket_deliver() are more than enough.

This patch might allow to process more ICMP messages, as some routers
are still limiting the size of reflected bytes to 28 (RFC 792), instead
of extended lengths (RFC 1812 4.3.2.3)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:16:51 -04:00
Eric Dumazet
346da62cc1 dccp: do not send reset to already closed sockets
Andrey reported following warning while fuzzing with syzkaller

WARNING: CPU: 1 PID: 21072 at net/dccp/proto.c:83 dccp_set_state+0x229/0x290
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 21072 Comm: syz-executor Not tainted 4.9.0-rc1+ #293
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 ffff88003d4c7738 ffffffff81b474f4 0000000000000003 dffffc0000000000
 ffffffff844f8b00 ffff88003d4c7804 ffff88003d4c7800 ffffffff8140c06a
 0000000041b58ab3 ffffffff8479ab7d ffffffff8140beae ffffffff8140cd00
Call Trace:
 [<     inline     >] __dump_stack lib/dump_stack.c:15
 [<ffffffff81b474f4>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
 [<ffffffff8140c06a>] panic+0x1bc/0x39d kernel/panic.c:179
 [<ffffffff8111125c>] __warn+0x1cc/0x1f0 kernel/panic.c:542
 [<ffffffff8111144c>] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
 [<ffffffff8389e5d9>] dccp_set_state+0x229/0x290 net/dccp/proto.c:83
 [<ffffffff838a0aa2>] dccp_close+0x612/0xc10 net/dccp/proto.c:1016
 [<ffffffff8316bf1f>] inet_release+0xef/0x1c0 net/ipv4/af_inet.c:415
 [<ffffffff82b6e89e>] sock_release+0x8e/0x1d0 net/socket.c:570
 [<ffffffff82b6e9f6>] sock_close+0x16/0x20 net/socket.c:1017
 [<ffffffff815256ad>] __fput+0x29d/0x720 fs/file_table.c:208
 [<ffffffff81525bb5>] ____fput+0x15/0x20 fs/file_table.c:244
 [<ffffffff811727d8>] task_work_run+0xf8/0x170 kernel/task_work.c:116
 [<     inline     >] exit_task_work include/linux/task_work.h:21
 [<ffffffff8111bc53>] do_exit+0x883/0x2ac0 kernel/exit.c:828
 [<ffffffff811221fe>] do_group_exit+0x10e/0x340 kernel/exit.c:931
 [<ffffffff81143c94>] get_signal+0x634/0x15a0 kernel/signal.c:2307
 [<ffffffff81054aad>] do_signal+0x8d/0x1a30 arch/x86/kernel/signal.c:807
 [<ffffffff81003a05>] exit_to_usermode_loop+0xe5/0x130
arch/x86/entry/common.c:156
 [<     inline     >] prepare_exit_to_usermode arch/x86/entry/common.c:190
 [<ffffffff81006298>] syscall_return_slowpath+0x1a8/0x1e0
arch/x86/entry/common.c:259
 [<ffffffff83fc1a62>] entry_SYSCALL_64_fastpath+0xc0/0xc2
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled

Fix this the same way we did for TCP in commit 565b7b2d2e
("tcp: do not send reset to already closed sockets")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:16:51 -04:00
Eric Dumazet
c3f24cfb3e dccp: do not release listeners too soon
Andrey Konovalov reported following error while fuzzing with syzkaller :

IPv4: Attempt to release alive inet socket ffff880068e98940
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88006b9e0000 task.stack: ffff880068770000
RIP: 0010:[<ffffffff819ead5f>]  [<ffffffff819ead5f>]
selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639
RSP: 0018:ffff8800687771c8  EFLAGS: 00010202
RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a
RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010
RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002
R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940
R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000
FS:  00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0
Stack:
 ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007
 ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480
 ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0
Call Trace:
 [<ffffffff819c8dd5>] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317
 [<ffffffff82c2a9e7>] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81
 [<ffffffff82b81e60>] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460
 [<ffffffff838bbf12>] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873
 [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0
net/ipv4/ip_input.c:216
 [<     inline     >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
 [<     inline     >] NF_HOOK ./include/linux/netfilter.h:255
 [<ffffffff8306abd2>] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
 [<     inline     >] dst_input ./include/net/dst.h:507
 [<ffffffff83068500>] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
 [<     inline     >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
 [<     inline     >] NF_HOOK ./include/linux/netfilter.h:255
 [<ffffffff8306b82f>] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
 [<ffffffff82bd9fb7>] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
 [<ffffffff82bdb19a>] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
 [<ffffffff82bdb493>] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
 [<ffffffff82bdb6b8>] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
 [<ffffffff8241fc75>] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
 [<ffffffff82421b5a>] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
 [<     inline     >] new_sync_write fs/read_write.c:499
 [<ffffffff8151bd44>] __vfs_write+0x334/0x570 fs/read_write.c:512
 [<ffffffff8151f85b>] vfs_write+0x17b/0x500 fs/read_write.c:560
 [<     inline     >] SYSC_write fs/read_write.c:607
 [<ffffffff81523184>] SyS_write+0xd4/0x1a0 fs/read_write.c:599
 [<ffffffff83fc02c1>] entry_SYSCALL_64_fastpath+0x1f/0xc2

It turns out DCCP calls __sk_receive_skb(), and this broke when
lookups no longer took a reference on listeners.

Fix this issue by adding a @refcounted parameter to __sk_receive_skb(),
so that sock_put() is used only when needed.

Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:16:50 -04:00
Eric Dumazet
79d8665b95 tcp: fix return value for partial writes
After my commit, tcp_sendmsg() might restart its loop after
processing socket backlog.

If sk_err is set, we blindly return an error, even though we
copied data to user space before.

We should instead return number of bytes that could be copied,
otherwise user space might resend data and corrupt the stream.

This might happen if another thread is using recvmsg(MSG_ERRQUEUE)
to process timestamps.

Issue was diagnosed by Soheil and Willem, big kudos to them !

Fixes: d41a69f1d3 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:12:06 -04:00
Lance Richardson
9ee6c5dc81 ipv4: allow local fragmentation in ip_finish_output_gso()
Some configurations (e.g. geneve interface with default
MTU of 1500 over an ethernet interface with 1500 MTU) result
in the transmission of packets that exceed the configured MTU.
While this should be considered to be a "bad" configuration,
it is still allowed and should not result in the sending
of packets that exceed the configured MTU.

Fix by dropping the assumption in ip_finish_output_gso() that
locally originated gso packets will never need fragmentation.
Basic testing using iperf (observing CPU usage and bandwidth)
have shown no measurable performance impact for traffic not
requiring fragmentation.

Fixes: c7ba65d7b6 ("net: ip: push gso skb forwarding handling down the stack")
Reported-by: Jan Tluka <jtluka@redhat.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:10:26 -04:00
Willem de Bruijn
dbd1759e6a ipv6: on reassembly, record frag_max_size
IP6CB and IPCB have a frag_max_size field. In IPv6 this field is
filled in when packets are reassembled by the connection tracking
code. Also fill in when reassembling in the input path, to expose
it through cmsg IPV6_RECVFRAGSIZE in all cases.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:41:11 -04:00
Willem de Bruijn
0cc0aa614b ipv6: add IPV6_RECVFRAGSIZE cmsg
When reading a datagram or raw packet that arrived fragmented, expose
the maximum fragment size if recorded to allow applications to
estimate receive path MTU.

At this point, the field is only recorded when ipv6 connection
tracking is enabled. A follow-up patch will record this field also
in the ipv6 input path.

Tested using the test for IP_RECVFRAGSIZE plus

  ip netns exec to ip addr add dev veth1 fc07::1/64
  ip netns exec from ip addr add dev veth0 fc07::2/64

  ip netns exec to ./recv_cmsg_recvfragsize -6 -u -p 6000 &
  ip netns exec from nc -q 1 -u fc07::1 6000 < payload

Both with and without enabling connection tracking

  ip6tables -A INPUT -m state --state NEW -p udp -j LOG

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:41:11 -04:00
Willem de Bruijn
70ecc24841 ipv4: add IP_RECVFRAGSIZE cmsg
The IP stack records the largest fragment of a reassembled packet
in IPCB(skb)->frag_max_size. When reading a datagram or raw packet
that arrived fragmented, expose the value to allow applications to
estimate receive path MTU.

Tested:
  Sent data over a veth pair of which the source has a small mtu.
  Sent data using netcat, received using a dedicated process.

  Verified that the cmsg IP_RECVFRAGSIZE is returned only when
  data arrives fragmented, and in that cases matches the veth mtu.

    ip link add veth0 type veth peer name veth1

    ip netns add from
    ip netns add to

    ip link set dev veth1 netns to
    ip netns exec to ip addr add dev veth1 192.168.10.1/24
    ip netns exec to ip link set dev veth1 up

    ip link set dev veth0 netns from
    ip netns exec from ip addr add dev veth0 192.168.10.2/24
    ip netns exec from ip link set dev veth0 up
    ip netns exec from ip link set dev veth0 mtu 1300
    ip netns exec from ethtool -K veth0 ufo off

    dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload

    ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 &
    ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload

  using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:41:11 -04:00
Eric Dumazet
ac9e70b17e tcp: fix potential memory corruption
Imagine initial value of max_skb_frags is 17, and last
skb in write queue has 15 frags.

Then max_skb_frags is lowered to 14 or smaller value.

tcp_sendmsg() will then be allowed to add additional page frags
and eventually go past MAX_SKB_FRAGS, overflowing struct
skb_shared_info.

Fixes: 5f74f82ea3 ("net:Add sysctl_max_skb_frags")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Cc: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:33:30 -04:00
Cyrill Gorcunov
9999370fae net: ip, raw_diag -- Use jump for exiting from nested loop
I managed to miss that sk_for_each is called under "for"
cycle so need to use goto here to return matching socket.

CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:25:26 -04:00
Cyrill Gorcunov
cd05a0eca8 net: ip, raw_diag -- Fix socket leaking for destroy request
In raw_diag_destroy the helper raw_sock_get returns
with sock_hold call, so we have to put it then.

CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:25:26 -04:00
WANG Cong
14135f30e3 inet: fix sleeping inside inet_wait_for_connect()
Andrey reported this kernel warning:

  WARNING: CPU: 0 PID: 4608 at kernel/sched/core.c:7724
  __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
  do not call blocking ops when !TASK_RUNNING; state=1 set at
  [<ffffffff811f5a5c>] prepare_to_wait+0xbc/0x210
  kernel/sched/wait.c:178
  Modules linked in:
  CPU: 0 PID: 4608 Comm: syz-executor Not tainted 4.9.0-rc2+ #320
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
   ffff88006625f7a0 ffffffff81b46914 ffff88006625f818 0000000000000000
   ffffffff84052960 0000000000000000 ffff88006625f7e8 ffffffff81111237
   ffff88006aceac00 ffffffff00001e2c ffffed000cc4beff ffffffff84052960
  Call Trace:
   [<     inline     >] __dump_stack lib/dump_stack.c:15
   [<ffffffff81b46914>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
   [<ffffffff81111237>] __warn+0x1a7/0x1f0 kernel/panic.c:550
   [<ffffffff8111132c>] warn_slowpath_fmt+0xac/0xd0 kernel/panic.c:565
   [<ffffffff811922fc>] __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
   [<     inline     >] slab_pre_alloc_hook mm/slab.h:393
   [<     inline     >] slab_alloc_node mm/slub.c:2634
   [<     inline     >] slab_alloc mm/slub.c:2716
   [<ffffffff81508da0>] __kmalloc_track_caller+0x150/0x2a0 mm/slub.c:4240
   [<ffffffff8146be14>] kmemdup+0x24/0x50 mm/util.c:113
   [<ffffffff8388b2cf>] dccp_feat_clone_sp_val.part.5+0x4f/0xe0 net/dccp/feat.c:374
   [<     inline     >] dccp_feat_clone_sp_val net/dccp/feat.c:1141
   [<     inline     >] dccp_feat_change_recv net/dccp/feat.c:1141
   [<ffffffff8388d491>] dccp_feat_parse_options+0xaa1/0x13d0 net/dccp/feat.c:1411
   [<ffffffff83894f01>] dccp_parse_options+0x721/0x1010 net/dccp/options.c:128
   [<ffffffff83891280>] dccp_rcv_state_process+0x200/0x15b0 net/dccp/input.c:644
   [<ffffffff838b8a94>] dccp_v4_do_rcv+0xf4/0x1a0 net/dccp/ipv4.c:681
   [<     inline     >] sk_backlog_rcv ./include/net/sock.h:872
   [<ffffffff82b7ceb6>] __release_sock+0x126/0x3a0 net/core/sock.c:2044
   [<ffffffff82b7d189>] release_sock+0x59/0x1c0 net/core/sock.c:2502
   [<     inline     >] inet_wait_for_connect net/ipv4/af_inet.c:547
   [<ffffffff8316b2a2>] __inet_stream_connect+0x5d2/0xbb0 net/ipv4/af_inet.c:617
   [<ffffffff8316b8d5>] inet_stream_connect+0x55/0xa0 net/ipv4/af_inet.c:656
   [<ffffffff82b705e4>] SYSC_connect+0x244/0x2f0 net/socket.c:1533
   [<ffffffff82b72dd4>] SyS_connect+0x24/0x30 net/socket.c:1514
   [<ffffffff83fbf701>] entry_SYSCALL_64_fastpath+0x1f/0xc2
  arch/x86/entry/entry_64.S:209

Unlike commit 26cabd3125
("sched, net: Clean up sk_wait_event() vs. might_sleep()"), the
sleeping function is called before schedule_timeout(), this is indeed
a bug. Fix this by moving the wait logic to the new API, it is similar
to commit ff960a7317
("netdev, sched/wait: Fix sleeping inside wait event").

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 15:18:07 -04:00
Pablo Neira Ayuso
08733a0cb7 netfilter: handle NF_REPEAT from nf_conntrack_in()
NF_REPEAT is only needed from nf_conntrack_in() under a very specific
case required by the TCP protocol tracker, we can handle this case
without returning to the core hook path. Handling of NF_REPEAT from the
nf_reinject() is left untouched.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 11:53:00 +01:00
Pablo Neira Ayuso
26dfab7216 netfilter: merge nf_iterate() into nf_hook_slow()
nf_iterate() has become rather simple, we can integrate this code into
nf_hook_slow() to reduce the amount of LOC in the core path.

However, we still need nf_iterate() around for nf_queue packet handling,
so move this function there where we only need it. I think it should be
possible to refactor nf_queue code to get rid of it definitely, but
given this is slow path anyway, let's have a look this later.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 11:52:59 +01:00
Pablo Neira Ayuso
01886bd91f netfilter: remove hook_entries field from nf_hook_state
This field is only useful for nf_queue, so store it in the
nf_queue_entry structure instead, away from the core path. Pass
hook_head to nf_hook_slow().

Since we always have a valid entry on the first iteration in
nf_iterate(), we can use 'do { ... } while (entry)' loop instead.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 11:52:58 +01:00
Pablo Neira Ayuso
c63cbc4604 netfilter: use switch() to handle verdict cases from nf_hook_slow()
Use switch() for verdict handling and add explicit handling for
NF_STOLEN and other non-conventional verdicts.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 11:52:57 +01:00
Pablo Neira Ayuso
0e5a1c7eb3 netfilter: nf_tables: use hook state from xt_action_param structure
Don't copy relevant fields from hook state structure, instead use the
one that is already available in struct xt_action_param.

This patch also adds a set of new wrapper functions to fetch relevant
hook state structure fields.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 11:52:34 +01:00
Pablo Neira Ayuso
613dbd9572 netfilter: x_tables: move hook state into xt_action_param structure
Place pointer to hook state in xt_action_param structure instead of
copying the fields that we need. After this change xt_action_param fits
into one cacheline.

This patch also adds a set of new wrapper functions to fetch relevant
hook state structure fields.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 10:56:21 +01:00
Pablo Neira Ayuso
06fd3a392b netfilter: deprecate NF_STOP
NF_STOP is only used by br_netfilter these days, and it can be emulated
with a combination of NF_STOLEN plus explicit call to the ->okfn()
function as Florian suggests.

To retain binary compatibility with userspace nf_queue application, we
have to keep NF_STOP around, so libnetfilter_queue userspace userspace
applications still work if they use NF_STOP for some exotic reason.

Out of tree modules using NF_STOP would break, but we don't care about
those.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 10:56:17 +01:00
Pablo Neira Ayuso
1610a73c41 netfilter: kill NF_HOOK_THRESH() and state->tresh
Patch c5136b15ea ("netfilter: bridge: add and use br_nf_hook_thresh")
introduced br_nf_hook_thresh().

Replace NF_HOOK_THRESH() by br_nf_hook_thresh from
br_nf_forward_finish(), so we have no more callers for this macro.

As a result, state->thresh and explicit thresh parameter in the hook
state structure is not required anymore. And we can get rid of
skip-hook-under-thresh loop in nf_iterate() in the core path that is
only used by br_netfilter to search for the filter hook.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 10:56:12 +01:00
Pablo Neira Ayuso
d2be66f685 netfilter: remove comments that predate rcu days
We cannot block/sleep on nf_iterate because netfilter runs under rcu
read lock these days, where blocking is well-known to be illegal. So
let's remove these old comments.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 10:56:05 +01:00
Pablo Neira Ayuso
b250a7fc3b netfilter: get rid of useless debugging from core
This patch remove compile time code to catch inconventional verdicts.
We have better ways to handle this case these days, eg. pr_debug() but
even though I don't think this is useful at all, so let's remove this.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03 10:55:54 +01:00
Tom Herbert
1913540a13 ila: Fix crash caused by rhashtable changes
commit ca26893f05 ("rhashtable: Add rhlist interface")
added a field to rhashtable_iter so that length became 56 bytes
and would exceed the size of args in netlink_callback (which is
48 bytes). The netlink diag dump function already has been
allocating a iter structure and storing the pointed to that
in the args of netlink_callback. ila_xlat also uses
rhahstable_iter but is still putting that directly in
the arg block. Now since rhashtable_iter size is increased
we are overwriting beyond the structure. The next field
happens to be cb_mutex pointer in netlink_sock and hence the crash.

Fix is to alloc the rhashtable_iter and save it as pointer
in arg.

Tested:

  modprobe ila
  ./ip ila add loc 3333:0:0:0 loc_match 2222:0:0:1,
  ./ip ila list  # NO crash now

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:26:02 -04:00
Cyrill Gorcunov
3de864f8c9 net: ip, diag -- Adjust raw_abort to use unlocked __udp_disconnect
While being preparing patches for killing raw sockets via
diag netlink interface I noticed that my runs are stuck:

 | [root@pcs7 ~]# cat /proc/`pidof ss`/stack
 | [<ffffffff816d1a76>] __lock_sock+0x80/0xc4
 | [<ffffffff816d206a>] lock_sock_nested+0x47/0x95
 | [<ffffffff8179ded6>] udp_disconnect+0x19/0x33
 | [<ffffffff8179b517>] raw_abort+0x33/0x42
 | [<ffffffff81702322>] sock_diag_destroy+0x4d/0x52

which has not been the case before. I narrowed it down to the commit

 | commit 286c72deab
 | Author: Eric Dumazet <edumazet@google.com>
 | Date:   Thu Oct 20 09:39:40 2016 -0700
 |
 |     udp: must lock the socket in udp_disconnect()

where we start locking the socket for different reason.

So the raw_abort escaped the renaming and we have to
fix this typo using __udp_disconnect instead.

Fixes: 286c72deab ("udp: must lock the socket in udp_disconnect()")
CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:25:16 -04:00
Eric Dumazet
2331ccc5b3 tcp: enhance tcp collapsing
As Ilya Lesokhin suggested, we can collapse two skbs at retransmit
time even if the skb at the right has fragments.

We simply have to use more generic skb_copy_bits() instead of
skb_copy_from_linear_data() in tcp_collapse_retrans()

Also need to guard this skb_copy_bits() in case there is nothing to
copy, otherwise skb_put() could panic if left skb has frags.

Tested:

Used following packetdrill test

// Establish a connection.
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 8>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
+.100 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

   +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
   +0 write(4, ..., 200) = 200
   +0 > P. 1:201(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 201:401(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 401:601(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 601:801(200) ack 1
+.001 write(4, ..., 200) = 200
   +0 > P. 801:1001(200) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1001:1101(100) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1101:1201(100) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1201:1301(100) ack 1
+.001 write(4, ..., 100) = 100
   +0 > P. 1301:1401(100) ack 1

+.100 < . 1:1(0) ack 1 win 257 <nop,nop,sack 1001:1401>
// Check that TCP collapse works :
   +0 > P. 1:1001(1000) ack 1

Reported-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:21:36 -04:00
Eli Cooper
4fd19c15de ip6_udp_tunnel: remove unused IPCB related codes
Some IPCB fields are currently set in udp_tunnel6_xmit_skb(), which are
never used before it reaches ip6tunnel_xmit(), and past that point the
control buffer is no longer interpreted as IPCB.

This clears these unused IPCB related codes. Currently there is no skb
scrubbing in ip6_udp_tunnel, otherwise IPCB(skb)->opt might need to be
cleared for IPv4 packets, as shown in 5146d1f151
("tunnel: Clear IPCB(skb)->opt before dst_link_failure called").

Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:18:36 -04:00
Xin Long
e4ff952a7e sctp: clean up sctp_packet_transmit
After adding sctp gso, sctp_packet_transmit is a quite big function now.

This patch is to extract the codes for packing packet to sctp_packet_pack
from sctp_packet_transmit, and add some comments, simplify the err path by
freeing auth chunk when freeing packet chunk_list in out path and freeing
head skb early if it fails to pack packet.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:03:13 -04:00
Roi Dayan
13fa876ebd net/sched: cls_flower: merge filter delete/destroy common code
Move common code from fl_delete and fl_detroy to __fl_delete.

Signed-off-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:00:47 -04:00
Roi Dayan
a1a8f7fe92 net/sched: cls_flower: add missing unbind call when destroying flows
tcf_unbind was called in fl_delete but was missing in fl_destroy when
force deleting flows.

Fixes: 77b9900ef5 ('tc: introduce Flower classifier')
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 15:00:47 -04:00
David S. Miller
4cb551a100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree. This includes better integration with the routing subsystem for
nf_tables, explicit notrack support and smaller updates. More
specifically, they are:

1) Add fib lookup expression for nf_tables, from Florian Westphal. This
   new expression provides a native replacement for iptables addrtype
   and rp_filter matches. This is more flexible though, since we can
   populate the kernel flowi representation to inquire fib to
   accomodate new usecases, such as RTBH through skb mark.

2) Introduce rt expression for nf_tables, from Anders K. Pedersen. This
   new expression allow you to access skbuff route metadata, more
   specifically nexthop and classid fields.

3) Add notrack support for nf_tables, to skip conntracking, requested by
   many users already.

4) Add boilerplate code to allow to use nf_log infrastructure from
   nf_tables ingress.

5) Allow to mangle pkttype from nf_tables prerouting chain, to emulate
   the xtables cluster match, from Liping Zhang.

6) Move socket lookup code into generic nf_socket_* infrastructure so
   we can provide a native replacement for the xtables socket match.

7) Make sure nfnetlink_queue data that is updated on every packets is
   placed in a different cache from read-only data, from Florian Westphal.

8) Handle NF_STOLEN from nf_tables core, also from Florian Westphal.

9) Start round robin number generation in nft_numgen from zero,
   instead of n-1, for consistency with xtables statistics match,
   patch from Liping Zhang.

10) Set GFP_NOWARN flag in skbuff netlink allocations in nfnetlink_log,
    given we retry with a smaller allocation on failure, from Calvin Owens.

11) Cleanup xt_multiport to use switch(), from Gao feng.

12) Remove superfluous check in nft_immediate and nft_cmp, from
    Liping Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02 14:57:47 -04:00
Florian Westphal
886bc50348 netfilter: nf_queue: place volatile data in own cacheline
As the comment indicates, the data at the end of nfqnl_instance struct is
written on every queue/dequeue, so it should reside in its own cacheline.

Before this change, 'lock' was in first cacheline so we dirtied both.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-01 20:50:33 +01:00