Commit Graph

38554 Commits

Author SHA1 Message Date
Sowmini Varadhan
467fa15356 RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.
Register pernet subsys init/stop functions that will set up
and tear down per-net RDS-TCP listen endpoints. Unregister
pernet subusys functions on 'modprobe -r' to clean up these
end points.

Enable keepalive on both accept and connect socket endpoints.
The keepalive timer expiration will ensure that client socket
endpoints will be removed as appropriate from the netns when
an interface is removed from a namespace.

Register a device notifier callback that will clean up all
sockets (and thus avoid the need to wait for keepalive timeout)
when the loopback device is unregistered from the netns indicating
that the netns is getting deleted.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07 11:29:58 -07:00
Sowmini Varadhan
d5a8ac28a7 RDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net
Open the sockets calling sock_create_kern() with the correct struct net
pointer, and use that struct net pointer when verifying the
address passed to rds_bind().

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07 11:29:57 -07:00
Nikolay Aleksandrov
786c2077ec bridge: netlink: account for the IFLA_BRPORT_PROXYARP_WIFI attribute size and policy
The attribute size wasn't accounted for in the get_slave_size() callback
(br_port_get_slave_size) when it was introduced, so fix it now. Also add
a policy entry for it in br_port_policy.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 842a9ae08a ("bridge: Extend Proxy ARP design to allow optional rules for Wi-Fi")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-06 23:54:42 -07:00
Nikolay Aleksandrov
355b9f9df1 bridge: netlink: account for the IFLA_BRPORT_PROXYARP attribute size and policy
The attribute size wasn't accounted for in the get_slave_size() callback
(br_port_get_slave_size) when it was introduced, so fix it now. Also add
a policy entry for it in br_port_policy.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 958501163d ("bridge: Add support for IEEE 802.11 Proxy ARP")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-06 23:54:42 -07:00
Oleg Nesterov
7ba8bd75dd net: pktgen: don't abuse current->state in pktgen_thread_worker()
Commit 1fbe4b46ca "net: pktgen: kill the Wait for kthread_stop
code in pktgen_thread_worker()" removed (in particular) the final
__set_current_state(TASK_RUNNING) and I didn't notice the previous
set_current_state(TASK_INTERRUPTIBLE). This triggers the warning
in __might_sleep() after return.

Afaics, we can simply remove both set_current_state()'s, and we
could do this a long ago right after ef87979c27 "pktgen: better
scheduler friendliness" which changed pktgen_thread_worker() to
use wait_event_interruptible_timeout().

Reported-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-06 23:52:44 -07:00
Roopa Prabhu
3dcb615e68 af_mpls: add null dev check in find_outdev
This patch adds null dev check for the 'cfg->rc_via_table ==
NEIGH_LINK_TABLE or dev_get_by_index() failed' case

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-06 22:03:58 -07:00
Dan Carpenter
5a9348b54d mpls: small cleanup in inet/inet6_fib_lookup_dev()
We recently changed this code from returning NULL to returning ERR_PTR.
There are some left over NULL assignments which we can remove.  We can
preserve the error code from ip_route_output() instead of always
returning -ENODEV.  Also these functions use a mix of gotos and direct
returns.  There is no cleanup necessary so I changed the gotos to
direct returns.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-06 21:56:55 -07:00
Herbert Xu
a0a2a66024 net: Fix skb_set_peeked use-after-free bug
The commit 738ac1ebb9 ("net: Clone
skb before setting peeked flag") introduced a use-after-free bug
in skb_recv_datagram.  This is because skb_set_peeked may create
a new skb and free the existing one.  As it stands the caller will
continue to use the old freed skb.

This patch fixes it by making skb_set_peeked return the new skb
(or the old one if unchanged).

Fixes: 738ac1ebb9 ("net: Clone skb before setting peeked flag")
Reported-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Brenden Blanco <bblanco@plumgrid.com>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-06 21:55:47 -07:00
Jakub Pawlowski
cb92205bad Bluetooth: fix MGMT_EV_NEW_LONG_TERM_KEY event
This patch fixes how MGMT_EV_NEW_LONG_TERM_KEY event is build. Right now
val vield is filled with only 1 byte, instead of whole value. This bug
was introduced in
commit 1fc62c526a ("Bluetooth: Fix exposing full value of shortened LTKs")

Before that patch, if you paired with device using bluetoothd using simple
pairing, and then restarted bluetoothd, you would be able to re-connect,
but device would fail to establish encryption and would terminate
connection. After this patch connecting after bluetoothd restart works
fine.

Signed-off-by: Jakub Pawlowski <jpawlowski@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-06 16:36:03 +02:00
Joe Stringer
f58e5aa7b8 netfilter: conntrack: Use flags in nf_ct_tmpl_alloc()
The flags were ignored for this function when it was introduced. Also
fix the style problem in kzalloc.

Fixes: 0838aa7fc (netfilter: fix netns dependencies with conntrack
templates)
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-05 10:56:43 +02:00
David S. Miller
9dc20a6496 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next, they are:

1) A couple of cleanups for the netfilter core hook from Eric Biederman.

2) Net namespace hook registration, also from Eric. This adds a dependency with
   the rtnl_lock. This should be fine by now but we have to keep an eye on this
   because if we ever get the per-subsys nfnl_lock before rtnl we have may
   problems in the future. But we have room to remove this in the future by
   propagating the complexity to the clients, by registering hooks for the init
   netns functions.

3) Update nf_tables to use the new net namespace hook infrastructure, also from
   Eric.

4) Three patches to refine and to address problems from the new net namespace
   hook infrastructure.

5) Switch to alternate jumpstack in xtables iff the packet is reentering. This
   only applies to a very special case, the TEE target, but Eric Dumazet
   reports that this is slowing down things for everyone else. So let's only
   switch to the alternate jumpstack if the tee target is in used through a
   static key. This batch also comes with offline precalculation of the
   jumpstack based on the callchain depth. From Florian Westphal.

6) Minimal SCTP multihoming support for our conntrack helper, from Michal
   Kubecek.

7) Reduce nf_bridge_info per skbuff scratchpad area to 32 bytes, from Florian
   Westphal.

8) Fix several checkpatch errors in bridge netfilter, from Bernhard Thaler.

9) Get rid of useless debug message in ip6t_REJECT, from Subash Abhinov.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-04 23:57:45 -07:00
Simon Wunderlich
27a4d5efd4 batman-adv: initialize up/down values when adding a gateway
Without this initialization, gateways which actually announce up/down
bandwidth of 0/0 could be added. If these nodes get purged via
_batadv_purge_orig() later, the gw_node structure does not get removed
since batadv_gw_node_delete() updates the gw_node with up/down
bandwidth of 0/0, and the updating function then discards the change
and does not free gw_node.

This results in leaking the gw_node structures, which references other
structures: gw_node -> orig_node -> orig_node_ifinfo -> hardif. When
removing the interface later, the open reference on the hardif may cause
hangs with the infamous "unregister_netdevice: waiting for mesh1 to
become free. Usage count = 1" message.

Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-05 00:31:47 +02:00
Marek Lindner
ef72706a05 batman-adv: protect tt_local_entry from concurrent delete events
The tt_local_entry deletion performed in batadv_tt_local_remove() was neither
protecting against simultaneous deletes nor checking whether the element was
still part of the list before calling hlist_del_rcu().

Replacing the hlist_del_rcu() call with batadv_hash_remove() provides adequate
protection via hash spinlocks as well as an is-element-still-in-hash check to
avoid 'blind' hash removal.

Fixes: 068ee6e204 ("batman-adv: roaming handling mechanism redesign")
Reported-by: alfonsname@web.de
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-05 00:31:47 +02:00
Marek Lindner
354136bcc3 batman-adv: fix kernel crash due to missing NULL checks
batadv_softif_vlan_get() may return NULL which has to be verified
by the caller.

Fixes: 35df3b298f ("batman-adv: fix TT VLAN inconsistency on VLAN re-add")
Reported-by: Ryan Thompson <ryan@eero.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-05 00:31:46 +02:00
Antonio Quartulli
f202a666e9 batman-adv: avoid DAT to mess up LAN state
When a node running DAT receives an ARP request from the LAN for the
first time, it is likely that this node will request the ARP entry
through the distributed ARP table (DAT) in the mesh.

Once a DAT reply is received the asking node must check if the MAC
address for which the IP address has been asked is local. If it is, the
node must drop the ARP reply bceause the client should have replied on
its own locally.

Forwarding this reply means fooling any L2 bridge (e.g. Ethernet
switches) lying between the batman-adv node and the LAN. This happens
because the L2 bridge will think that the client sending the ARP reply
lies somewhere in the mesh, while this node is sitting in the same LAN.

Reported-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-05 00:31:46 +02:00
Subash Abhinov Kasiviswanathan
a6cd379b4d netfilter: ip6t_REJECT: Remove debug messages from reject_tg6()
Make it similar to reject_tg() in ipt_REJECT.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-04 11:12:51 +02:00
Robert Shearman
a6affd24f4 mpls: Use definition for reserved label checks
In multiple locations there are checks for whether the label in hand
is a reserved label or not using the arbritray value of 16. Factor
this out into a #define for better maintainability and for
documentation.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 22:35:00 -07:00
Robert Shearman
0335f5b500 ipv4: apply lwtunnel encap for locally-generated packets
lwtunnel encap is applied for forwarded packets, but not for
locally-generated packets. This is because the output function is not
overridden in __mkroute_output, unlike it is in __mkroute_input.

The lwtunnel state is correctly set on the rth through the call to
rt_set_nexthop, so all that needs to be done is to override the dst
output function to be lwtunnel_output if there is lwtunnel state
present and it requires output redirection.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 22:26:14 -07:00
Robert Shearman
abf7c1c540 lwtunnel: set skb protocol and dev
In the locally-generated packet path skb->protocol may not be set and
this is required for the lwtunnel encap in order to get the lwtstate.

This would otherwise have been set by ip_output or ip6_output so set
skb->protocol prior to calling the lwtunnel encap
function. Additionally set skb->dev in case it is needed further down
the transmit path.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 22:26:13 -07:00
Eric Dumazet
10e2eb878f udp: fix dst races with multicast early demux
Multicast dst are not cached. They carry DST_NOCACHE.

As mentioned in commit f886497212 ("ipv4: fix dst race in
sk_dst_get()"), these dst need special care before caching them
into a socket.

Caching them is allowed only if their refcnt was not 0, ie we
must use atomic_inc_not_zero()

Also, we must use READ_ONCE() to fetch sk->sk_rx_dst, as mentioned
in commit d0c294c53a ("tcp: prevent fetching dst twice in early demux
code")

Fixes: 421b3885bf ("udp: ipv4: Add udp early demux")
Tested-by: Gregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Gregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz>
Reported-by: Alex Gartrell <agartrell@fb.com>
Cc: Michal Kubeček <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 22:16:50 -07:00
Nikolay Aleksandrov
58da018053 bridge: mdb: fix vlan_enabled access when vlans are not configured
Instead of trying to access br->vlan_enabled directly use the provided
helper br_vlan_enabled().

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 16:20:51 -07:00
Daniel Borkmann
a5c90b29e5 act_bpf: properly support late binding of bpf action to a classifier
Since the introduction of the BPF action in d23b8ad8ab ("tc: add BPF
based action"), late binding was not working as expected. I.e. setting
the action part for a classifier only via 'bpf index <num>', where <num>
is the index of an existing action, is being rejected by the kernel due
to other missing parameters.

It doesn't make sense to require these parameters such as BPF opcodes
etc, as they are not going to be used anyway: in this case, they're just
allocated/parsed and then freed again w/o doing anything meaningful.

Instead, parse and verify the remaining parameters *after* the test on
tcf_hash_check(), when we really know that we're dealing with creation
of a new action or replacement of an existing one and where late binding
is thus irrelevant.

After patch, test case is now working:

  FOO="1,6 0 0 4294967295,"
  tc actions add action bpf bytecode "$FOO"
  tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action bpf index 1
  tc actions show action bpf
    action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
    index 1 ref 2 bind 1
  tc filter show dev foo
    filter protocol all pref 49152 bpf
    filter protocol all pref 49152 bpf handle 0x1 flowid 1:1 bytecode '1,6 0 0 4294967295'
    action order 1: bpf bytecode '1,6 0 0 4294967295' default-action pipe
    index 1 ref 2 bind 1

Late binding of a BPF action can be useful for preloading maps (e.g. before
they hit traffic) in case of eBPF programs, or to share a single eBPF action
with multiple classifiers.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 16:05:56 -07:00
Satish Ashok
e44deb2f0c bridge: mdb: add/del entry on all vlans if vlan_filter is enabled and vid is 0
Before this patch when a vid was not specified, the entry was added with
vid 0 which is useless when vlan_filtering is enabled. This patch makes
the entry to be added on all configured vlans when vlan filtering is
enabled and respectively deleted from all, if the entry vid is 0.
This is also closer to the way fdb works with regard to vid 0 and vlan
filtering.

Example:
Setup:
$ bridge vlan add vid 256 dev eth4
$ bridge vlan add vid 1024 dev eth4
$ bridge vlan add vid 64 dev eth3
$ bridge vlan add vid 128 dev eth3
$ bridge vlan
port	vlan ids
eth3	 1 PVID Egress Untagged
	 64
	 128

eth4	 1 PVID Egress Untagged
	 256
	 1024
$ echo 1 > /sys/class/net/br0/bridge/vlan_filtering

Before:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1
$ bridge mdb
dev br0 port eth3 grp 239.0.0.1 temp

After:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1
$ bridge mdb
dev br0 port eth3 grp 239.0.0.1 temp vid 1
dev br0 port eth3 grp 239.0.0.1 temp vid 128
dev br0 port eth3 grp 239.0.0.1 temp vid 64

Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 15:43:35 -07:00
Dan Carpenter
468b732b6f rds: fix an integer overflow test in rds_info_getsockopt()
"len" is a signed integer.  We check that len is not negative, so it
goes from zero to INT_MAX.  PAGE_SIZE is unsigned long so the comparison
is type promoted to unsigned long.  ULONG_MAX - 4095 is a higher than
INT_MAX so the condition can never be true.

I don't know if this is harmful but it seems safe to limit "len" to
INT_MAX - 4095.

Fixes: a8c879a7ee ('RDS: Info and stats')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 15:20:16 -07:00
Toshiaki Makita
6678053092 bridge: Don't segment multiple tagged packets on bridge device
Bridge devices don't need to segment multiple tagged packets since thier
ports can segment them.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 14:24:50 -07:00
WANG Cong
636dba8e12 act_mirred: avoid calling tcf_hash_release() when binding
When we share an action within a filter, the bind refcnt
should increase, therefore we should not call tcf_hash_release().

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 14:13:28 -07:00
Glenn Griffin
3576fd794b openvswitch: Fix L4 checksum handling when dealing with IP fragments
openvswitch modifies the L4 checksum of a packet when modifying
the ip address. When an IP packet is fragmented only the first
fragment contains an L4 header and checksum. Prior to this change
openvswitch would modify all fragments, modifying application data
in non-first fragments, causing checksum failures in the
reassembled packet.

Signed-off-by: Glenn Griffin <ggriffin.kernel@gmail.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 14:03:08 -07:00
Daniel Borkmann
ba7591d8b2 ebpf: add skb->hash to offset map for usage in {cls, act}_bpf or filters
Add skb->hash to the __sk_buff offset map, so it can be accessed from
an eBPF program. We currently already do this for classic BPF filters,
but not yet on eBPF, it might be useful as a demuxer in combination with
helpers like bpf_clone_redirect(), toy example:

  __section("cls-lb") int ingress_main(struct __sk_buff *skb)
  {
    unsigned int which = 3 + (skb->hash & 7);
    /* bpf_skb_store_bytes(skb, ...); */
    /* bpf_l{3,4}_csum_replace(skb, ...); */
    bpf_clone_redirect(skb, which, 0);
    return -1;
  }

I was thinking whether to add skb_get_hash(), but then concluded the
raw skb->hash seems fine in this case: we can directly access the hash
w/o extra eBPF helper function call, it's filled out by many NICs on
ingress, and in case the entropy level would not be sufficient, people
can still implement their own specific sw fallback hash mix anyway.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-02 17:20:47 -07:00
Eric Dumazet
3d0e0af406 fq_codel: explicitly reset flows in ->reset()
Alex reported the following crash when using fq_codel
with htb:

  crash> bt
  PID: 630839  TASK: ffff8823c990d280  CPU: 14  COMMAND: "tc"
   [... snip ...]
   #8 [ffff8820ceec17a0] page_fault at ffffffff8160a8c2
      [exception RIP: htb_qlen_notify+24]
      RIP: ffffffffa0841718  RSP: ffff8820ceec1858  RFLAGS: 00010282
      RAX: 0000000000000000  RBX: 0000000000000000  RCX: ffff88241747b400
      RDX: ffff88241747b408  RSI: 0000000000000000  RDI: ffff8811fb27d000
      RBP: ffff8820ceec1868   R8: ffff88120cdeff24   R9: ffff88120cdeff30
      R10: 0000000000000bd4  R11: ffffffffa0840919  R12: ffffffffa0843340
      R13: 0000000000000000  R14: 0000000000000001  R15: ffff8808dae5c2e8
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #9 [...] qdisc_tree_decrease_qlen at ffffffff81565375
  #10 [...] fq_codel_dequeue at ffffffffa084e0a0 [sch_fq_codel]
  #11 [...] fq_codel_reset at ffffffffa084e2f8 [sch_fq_codel]
  #12 [...] qdisc_destroy at ffffffff81560d2d
  #13 [...] htb_destroy_class at ffffffffa08408f8 [sch_htb]
  #14 [...] htb_put at ffffffffa084095c [sch_htb]
  #15 [...] tc_ctl_tclass at ffffffff815645a3
  #16 [...] rtnetlink_rcv_msg at ffffffff81552cb0
  [... snip ...]

As Jamal pointed out, there is actually no need to call dequeue
to purge the queued skb's in reset, data structures can be just
reset explicitly. Therefore, we reset everything except config's
and stats, so that we would have a fresh start after device flipping.

Fixes: 4b549a2ef4 ("fq_codel: Fair Queue Codel AQM")
Reported-by: Alex Gartrell <agartrell@fb.com>
Cc: Alex Gartrell <agartrell@fb.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
[xiyou.wangcong@gmail.com: added codel_vars_init() and qdisc_qstats_backlog_dec()]
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-02 17:20:01 -07:00
David S. Miller
5510b3c2a1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	arch/s390/net/bpf_jit_comp.c
	drivers/net/ethernet/ti/netcp_ethss.c
	net/bridge/br_multicast.c
	net/ipv4/ip_fragment.c

All four conflicts were cases of simple overlapping
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 23:52:20 -07:00
Linus Torvalds
7c764cec37 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Must teardown SR-IOV before unregistering netdev in igb driver, from
    Alex Williamson.

 2) Fix ipv6 route unreachable crash in IPVS, from Alex Gartrell.

 3) Default route selection in ipv4 should take the prefix length, table
    ID, and TOS into account, from Julian Anastasov.

 4) sch_plug must have a reset method in order to purge all buffered
    packets when the qdisc is reset, likewise for sch_choke, from WANG
    Cong.

 5) Fix deadlock and races in slave_changelink/br_setport in bridging.
    From Nikolay Aleksandrov.

 6) mlx4 bug fixes (wrong index in port even propagation to VFs,
    overzealous BUG_ON assertion, etc.) from Ido Shamay, Jack
    Morgenstein, and Or Gerlitz.

 7) Turn off klog message about SCTP userspace interface compat that
    makes no sense at all, from Daniel Borkmann.

 8) Fix unbounded restarts of inet frag eviction process, causing NMI
    watchdog soft lockup messages, from Florian Westphal.

 9) Suspend/resume fixes for r8152 from Hayes Wang.

10) Fix busy loop when MSG_WAITALL|MSG_PEEK is used in TCP recv, from
    Sabrina Dubroca.

11) Fix performance regression when removing a lot of routes from the
    ipv4 routing tables, from Alexander Duyck.

12) Fix device leak in AF_PACKET, from Lars Westerhoff.

13) AF_PACKET also has a header length comparison bug due to signedness,
    from Alexander Drozdov.

14) Fix bug in EBPF tail call generation on x86, from Daniel Borkmann.

15) Memory leaks, TSO stats, watchdog timeout and other fixes to
    thunderx driver from Sunil Goutham and Thanneeru Srinivasulu.

16) act_bpf can leak memory when replacing programs, from Daniel
    Borkmann.

17) WOL packet fixes in gianfar driver, from Claudiu Manoil.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (79 commits)
  stmmac: fix missing MODULE_LICENSE in stmmac_platform
  gianfar: Enable device wakeup when appropriate
  gianfar: Fix suspend/resume for wol magic packet
  gianfar: Fix warning when CONFIG_PM off
  act_pedit: check binding before calling tcf_hash_release()
  net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket
  net: sched: fix refcount imbalance in actions
  r8152: reset device when tx timeout
  r8152: add pre_reset and post_reset
  qlcnic: Fix corruption while copying
  act_bpf: fix memory leaks when replacing bpf programs
  net: thunderx: Fix for crash while BGX teardown
  net: thunderx: Add PCI driver shutdown routine
  net: thunderx: Fix crash when changing rss with mutliple traffic flows
  net: thunderx: Set watchdog timeout value
  net: thunderx: Wakeup TXQ only if CQE_TX are processed
  net: thunderx: Suppress alloc_pages() failure warnings
  net: thunderx: Fix TSO packet statistic
  net: thunderx: Fix memory leak when changing queue count
  net: thunderx: Fix RQ_DROP miscalculation
  ...
2015-07-31 17:10:56 -07:00
Tom Herbert
be26849bfb ipv6: Disable flowlabel state ranges by default
Per RFC6437 stateful flow labels (e.g. labels set by flow label manager)
cannot "disturb" nodes taking part in stateless flow labels. While the
ranges only reduce the flow label entropy by one bit, it is conceivable
that this might bias the algorithm on some routers causing a load
imbalance. For best results on the Internet we really need the full
20 bits.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 17:07:11 -07:00
Tom Herbert
42240901f7 ipv6: Implement different admin modes for automatic flow labels
Change the meaning of net.ipv6.auto_flowlabels to provide a mode for
automatic flow labels generation. There are four modes:

0: flow labels are disabled
1: flow labels are enabled, sockets can opt-out
2: flow labels are allowed, sockets can opt-in
3: flow labels are enabled and enforced, no opt-out for sockets

np->autoflowlabel is initialized according to the sysctl value.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 17:07:11 -07:00
Tom Herbert
67800f9b1f ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel
We can't call skb_get_hash here since the packet is not complete to do
flow_dissector. Create hash based on flowi6 instead.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 17:07:11 -07:00
Tom Herbert
f70ea018da net: Add functions to get skb->hash based on flow structures
Add skb_get_hash_flowi6 and skb_get_hash_flowi4 which derive an sk_buff
hash from flowi6 and flowi4 structures respectively. These functions
can be called when creating a packet in the output path where the new
sk_buff does not yet contain a fully formed packet that is parsable by
flow dissector.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 17:07:11 -07:00
Florian Fainelli
04ff53f96a net: dsa: Add netconsole support
Add support for using DSA slave network devices with netconsole, which
requires us to allocate and free custom netpoll instances and invoke the
parent network device poll controller callback.

In order for netconsole to work, we need to construct the DSA tag, but
not queue the skb for transmission on the master network device xmit
function.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:45:37 -07:00
Florian Fainelli
4ed70ce9f0 net: dsa: Refactor transmit path to eliminate duplication
All tagging protocols do the same thing: increment device statistics,
make room for the tag to be inserted, create the tag, invoke the parent
network device transmit function.

In order to prepare for adding netpoll support, which requires the tag
creation, but not using the parent network device transmit function, do
some little refactoring which eliminates duplication between the 4
tagging protocols supported.

We need to return a sk_buff pointer back to the caller because the tag
specific transmit function may have to reallocate the original skb (e.g:
tag_trailer.c) and this is the one we should be transmitting, not the
original sk_buff we were passed.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:45:37 -07:00
Joe Perches
85b1d8bbfd br2684: Remove unnecessary formatting macros b1 and bs
Use vsprintf extension %pI4 instead.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:25:52 -07:00
WANG Cong
5175f7106c act_pedit: check binding before calling tcf_hash_release()
When we share an action within a filter, the bind refcnt
should increase, therefore we should not call tcf_hash_release().

Fixes: 1a29321ed0 ("net_sched: act: Dont increment refcnt on replace")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:22:34 -07:00
Roopa Prabhu
bf21563acc af_mpls: fix undefined reference to ip6_route_output
Undefined reference to ip6_route_output and ip_route_output
was reported with CONFIG_INET=n and CONFIG_IPV6=n.

This patch uses ipv6_stub_impl.ipv6_dst_lookup instead of
ip6_route_output. And wraps affected code under
IS_ENABLED(CONFIG_INET) and IS_ENABLED(CONFIG_IPV6).

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:21:30 -07:00
Roopa Prabhu
343d60aada ipv6: change ipv6_stub_impl.ipv6_dst_lookup to take net argument
This patch adds net argument to ipv6_stub_impl.ipv6_dst_lookup
for use cases where sk is not available (like mpls).
sk appears to be needed to get the namespace 'net' and is optional
otherwise. This patch series changes ipv6_stub_impl.ipv6_dst_lookup
to take net argument. sk remains optional.

All callers of ipv6_stub_impl.ipv6_dst_lookup have been modified
to pass net. I have modified them to use already available
'net' in the scope of the call. I can change them to
sock_net(sk) to avoid any unintended change in behaviour if sock
namespace is different. They dont seem to be from code inspection.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:21:30 -07:00
Alexei Starovoitov
d3aa45ce6b bpf: add helpers to access tunnel metadata
Introduce helpers to let eBPF programs attached to TC manipulate tunnel metadata:
bpf_skb_[gs]et_tunnel_key(skb, key, size, flags)
skb: pointer to skb
key: pointer to 'struct bpf_tunnel_key'
size: size of 'struct bpf_tunnel_key'
flags: room for future extensions

First eBPF program that uses these helpers will allocate per_cpu
metadata_dst structures that will be used on TX.
On RX metadata_dst is allocated by tunnel driver.

Typical usage for TX:
struct bpf_tunnel_key tkey;
... populate tkey ...
bpf_skb_set_tunnel_key(skb, &tkey, sizeof(tkey), 0);
bpf_clone_redirect(skb, vxlan_dev_ifindex, 0);

RX:
struct bpf_tunnel_key tkey = {};
bpf_skb_get_tunnel_key(skb, &tkey, sizeof(tkey), 0);
... lookup or redirect based on tkey ...

'struct bpf_tunnel_key' will be extended in the future by adding
elements to the end and the 'size' argument will indicate which fields
are populated, thereby keeping backwards compatibility.
The 'flags' argument may be used as well when the 'size' is not enough or
to indicate completely different layout of bpf_tunnel_key.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:20:22 -07:00
Jon Paul Maloy
440d8963cd tipc: clean up link creation
We simplify the link creation function tipc_link_create() and the way
the link struct it is connected to the node struct. In particular, we
remove the duplicate initialization of some fields which are anyway set
in tipc_link_reset().

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:15 -07:00
Jon Paul Maloy
9073fb8be3 tipc: use temporary, non-protected skb queue for bundle reception
Currently, when we extract small messages from a message bundle, or
when many messages have accumulated in the link arrival queue, those
messages are added one by one to the lock protected link input queue.
This may increase contention with the reader of that queue, in
the function tipc_sk_rcv().

This commit introduces a temporary, unprotected input queue in
tipc_link_rcv() for such cases. Only when the arrival queue has been
emptied, and the function is ready to return, does it splice the whole
temporary queue into the real input queue.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:15 -07:00
Jon Paul Maloy
23d8335d78 tipc: remove implicit message delivery in node_unlock()
After the most recent changes, all access calls to a link which
may entail addition of messages to the link's input queue are
postpended by an explicit call to tipc_sk_rcv(), using a reference
to the correct queue.

This means that the potentially hazardous implicit delivery, using
tipc_node_unlock() in combination with a binary flag and a cached
queue pointer, now has become redundant.

This commit removes this implicit delivery mechanism both for regular
data messages and for binding table update messages.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:14 -07:00
Jon Paul Maloy
598411d70f tipc: make resetting of links non-atomic
In order to facilitate future improvements to the locking structure, we
want to make resetting and establishing of links non-atomic. I.e., the
functions tipc_node_link_up() and tipc_node_link_down() should be called
from outside the node lock context, and grab/release the node lock
themselves. This requires that we can freeze the link state from the
moment it is set to RESETTING or PEER_RESET in one lock context until
it is set to RESET or ESTABLISHING in a later context. The recently
introduced link FSM makes this possible, so we are now ready to introduce
the above change.

This commit implements this.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:14 -07:00
Jon Paul Maloy
cf148816ac tipc: move received discovery data evaluation inside node.c
The node lock is currently grabbed and and released in the function
tipc_disc_rcv() in the file discover.c. As a preparation for the next
commits, we need to move this node lock handling, along with the code
area it is covering, to node.c.

This commit introduces this change.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:14 -07:00
Jon Paul Maloy
662921cd0a tipc: merge link->exec_mode and link->state into one FSM
Until now, we have been handling link failover and synchronization
by using an additional link state variable, "exec_mode". This variable
is not independent of the link FSM state, something causing a risk of
inconsistencies, apart from the fact that it clutters the code.

The conditions are now in place to define a new link FSM that covers
all existing use cases, including failover and synchronization, and
eliminate the "exec_mode" field altogether. The FSM must also support
non-atomic resetting of links, which will be introduced later.

The new link FSM is shown below, with 7 states and 8 events.
Only events leading to state change are shown as edges.

+------------------------------------+
|RESET_EVT                           |
|                                    |
|                             +--------------+
|           +-----------------|   SYNCHING   |-----------------+
|           |FAILURE_EVT      +--------------+   PEER_RESET_EVT|
|           |                  A            |                  |
|           |                  |            |                  |
|           |                  |            |                  |
|           |                  |SYNCH_      |SYNCH_            |
|           |                  |BEGIN_EVT   |END_EVT           |
|           |                  |            |                  |
|           V                  |            V                  V
|    +-------------+          +--------------+          +------------+
|    |  RESETTING  |<---------|  ESTABLISHED |--------->| PEER_RESET |
|    +-------------+ FAILURE_ +--------------+ PEER_    +------------+
|           |        EVT        |    A         RESET_EVT       |
|           |                   |    |                         |
|           |                   |    |                         |
|           |    +--------------+    |                         |
|  RESET_EVT|    |RESET_EVT          |ESTABLISH_EVT            |
|           |    |                   |                         |
|           |    |                   |                         |
|           V    V                   |                         |
|    +-------------+          +--------------+        RESET_EVT|
+--->|    RESET    |--------->| ESTABLISHING |<----------------+
     +-------------+ PEER_    +--------------+
      |           A  RESET_EVT       |
      |           |                  |
      |           |                  |
      |FAILOVER_  |FAILOVER_         |FAILOVER_
      |BEGIN_EVT  |END_EVT           |BEGIN_EVT
      |           |                  |
      V           |                  |
     +-------------+                 |
     | FAILINGOVER |<----------------+
     +-------------+

These changes are fully backwards compatible.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:14 -07:00
Jon Paul Maloy
5045f7b900 tipc: move protocol message sending away from link FSM
The implementation of the link FSM currently takes decisions about and
sends out link protocol messages. This is unnecessary, since such
actions are not the result of any link state change, and are even
decided based on non-FSM state information ("silent_intv_cnt").

We now move the sending of unicast link protocol messages to the
function tipc_link_timeout(), and the initial broadcast synchronization
message to tipc_node_link_up(). The latter is done because a link
instance should not need to know whether it is the first or second
link to a destination. Such information is now restricted to and
handled by the link aggregation layer in node.c

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:14 -07:00
Jon Paul Maloy
6e498158a8 tipc: move link synch and failover to link aggregation level
Link failover and synchronization have until now been handled by the
links themselves, forcing them to have knowledge about and to access
parallel links in order to make the two algorithms work correctly.

In this commit, we move the control part of this functionality to the
link aggregation level in node.c, which is the right location for this.
As a result, the two algorithms become easier to follow, and the link
implementation becomes simpler.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:14 -07:00
Jon Paul Maloy
66996b6c47 tipc: extend node FSM
In the next commit, we will move link synch/failover orchestration to
the link aggregation level. In order to do this, we first need to extend
the node FSM with two more states, NODE_SYNCHING and NODE_FAILINGOVER,
plus four new events to enter and leave those states.

This commit introduces this change, without yet making use of it.
The node FSM now looks as follows:

                           +-----------------------------------------+
                           |                            PEER_DOWN_EVT|
                           |                                         |
  +------------------------+----------------+                        |
  |SELF_DOWN_EVT           |                |                        |
  |                        |                |                        |
  |              +-----------+          +-----------+                |
  |              |NODE_      |          |NODE_      |                |
  |   +----------|FAILINGOVER|<---------|SYNCHING   |------------+   |
  |   |SELF_     +-----------+ FAILOVER_+-----------+    PEER_   |   |
  |   |DOWN_EVT   |         A  BEGIN_EVT A         |     DOWN_EVT|   |
  |   |           |         |            |         |             |   |
  |   |           |         |            |         |             |   |
  |   |           |FAILOVER_|FAILOVER_   |SYNCH_   |SYNCH_       |   |
  |   |           |END_EVT  |BEGIN_EVT   |BEGIN_EVT|END_EVT      |   |
  |   |           |         |            |         |             |   |
  |   |           |         |            |         |             |   |
  |   |           |        +--------------+        |             |   |
  |   |           +------->|   SELF_UP_   |<-------+             |   |
  |   |   +----------------|   PEER_UP    |------------------+   |   |
  |   |   |SELF_DOWN_EVT   +--------------+     PEER_DOWN_EVT|   |   |
  |   |   |                   A          A                   |   |   |
  |   |   |                   |          |                   |   |   |
  |   |   |        PEER_UP_EVT|          |SELF_UP_EVT        |   |   |
  |   |   |                   |          |                   |   |   |
  V   V   V                   |          |                   V   V   V
+------------+       +-----------+    +-----------+       +------------+
|SELF_DOWN_  |       |SELF_UP_   |    |PEER_UP_   |       |PEER_DOWN   |
|PEER_LEAVING|<------|PEER_COMING|    |SELF_COMING|------>|SELF_LEAVING|
+------------+ SELF_ +-----------+    +-----------+ PEER_ +------------+
       |       DOWN_EVT       A          A          DOWN_EVT     |
       |                      |          |                       |
       |                      |          |                       |
       |           SELF_UP_EVT|          |PEER_UP_EVT            |
       |                      |          |                       |
       |                      |          |                       |
       |PEER_DOWN_EVT       +--------------+        SELF_DOWN_EVT|
       +------------------->|  SELF_DOWN_  |<--------------------+
                            |  PEER_DOWN   |
                            +--------------+

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:13 -07:00
Jon Paul Maloy
655fb243b8 tipc: reverse call order for link_reset()->node_link_down()
In many cases the call order when a link is reset goes as follows:
tipc_node_xx()->tipc_link_reset()->tipc_node_link_down()

This is not the right order if we want the node to be in control,
so in this commit we change the order to:
tipc_node_xx()->tipc_node_link_down()->tipc_link_reset()

The fact that tipc_link_reset() now is called from only one
location with a well-defined state will also facilitate later
simplifications of tipc_link_reset() and the link FSM.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:13 -07:00
Jon Paul Maloy
6144a996a6 tipc: move all link_reset() calls to link aggregation level
In line with our effort to let the node level have full control over
its links, we want to move all link reset calls from link.c to node.c.
Some of the calls can be moved by simply moving the calling function,
when this is the right thing to do. For the remaining calls we use
the now established technique of returning a TIPC_LINK_DOWN_EVT
flag from tipc_link_rcv(), whereafter we perform the reset call when
the call returns.

This change serves as a preparation for the coming commits.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:13 -07:00
Jon Paul Maloy
cbeb83ca68 tipc: eliminate function tipc_link_activate()
The function tipc_link_activate() is redundant, since it mostly performs
settings that have already been done in a preceding tipc_link_reset().

There are three exceptions to this:
- The actual state change to TIPC_LINK_WORKING. This should anyway be done
  in the FSM, and not in a separate function.
- Registration of the link with the bearer. This should be done by the
  node, since we don't want the link to have any knowledge about its
  specific bearer.
- Call to tipc_node_link_up() for user access registration. With the new
  role distribution between link aggregation and link level this becomes
  the wrong call order; tipc_node_link_up() should instead be called
  directly as a result of a TIPC_LINK_UP event, hence by the node itself.

This commit implements those changes.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:13 -07:00
David S. Miller
29a3060aa7 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2015-07-30

Here's a set of Bluetooth & 802.15.4 patches intended for the 4.3 kernel.

 - Cleanups & fixes to mac802154
 - Refactoring of Intel Bluetooth HCI driver
 - Various coding style fixes to Bluetooth HCI drivers
 - Support for Intel Lightning Peak Bluetooth devices
 - Generic class code in interface descriptor in btusb to match more HW
 - Refactoring of Bluetooth HS code together with a new config option
 - Support for BCM4330B1 Broadcom UART controller

Let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 16:16:43 -07:00
Sowmini Varadhan
8a68173691 net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket
The newsk returned by sk_clone_lock should hold a get_net()
reference if, and only if, the parent is not a kernel socket
(making this similar to sk_alloc()).

E.g,. for the SYN_RECV path, tcp_v4_syn_recv_sock->..inet_csk_clone_lock
sets up the syn_recv newsk from sk_clone_lock. When the parent (listen)
socket is a kernel socket (defined in sk_alloc() as having
sk_net_refcnt == 0), then the newsk should also have a 0 sk_net_refcnt
and should not hold a get_net() reference.

Fixes: 26abe14379 ("net: Modify sk_alloc to not reference count the
      netns of kernel sockets.")
Acked-by: Eric Dumazet <edumazet@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 15:59:12 -07:00
Hangbin Liu
8013d1d7ea net/ipv6: add sysctl option accept_ra_min_hop_limit
Commit 6fd99094de ("ipv6: Don't reduce hop limit for an interface")
disabled accept hop limit from RA if it is smaller than the current hop
limit for security stuff. But this behavior kind of break the RFC definition.

RFC 4861, 6.3.4.  Processing Received Router Advertisements
   A Router Advertisement field (e.g., Cur Hop Limit, Reachable Time,
   and Retrans Timer) may contain a value denoting that it is
   unspecified.  In such cases, the parameter should be ignored and the
   host should continue using whatever value it is already using.

   If the received Cur Hop Limit value is non-zero, the host SHOULD set
   its CurHopLimit variable to the received value.

So add sysctl option accept_ra_min_hop_limit to let user choose the minimum
hop limit value they can accept from RA. And set default to 1 to meet RFC
standards.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 15:56:40 -07:00
Daniel Borkmann
28e6b67f0b net: sched: fix refcount imbalance in actions
Since commit 55334a5db5 ("net_sched: act: refuse to remove bound action
outside"), we end up with a wrong reference count for a tc action.

Test case 1:

  FOO="1,6 0 0 4294967295,"
  BAR="1,6 0 0 4294967294,"
  tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 \
     action bpf bytecode "$FOO"
  tc actions show action bpf
    action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
    index 1 ref 1 bind 1
  tc actions replace action bpf bytecode "$BAR" index 1
  tc actions show action bpf
    action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
    index 1 ref 2 bind 1
  tc actions replace action bpf bytecode "$FOO" index 1
  tc actions show action bpf
    action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
    index 1 ref 3 bind 1

Test case 2:

  FOO="1,6 0 0 4294967295,"
  tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok
  tc actions show action gact
    action order 0: gact action pass
    random type none pass val 0
     index 1 ref 1 bind 1
  tc actions add action drop index 1
    RTNETLINK answers: File exists [...]
  tc actions show action gact
    action order 0: gact action pass
     random type none pass val 0
     index 1 ref 2 bind 1
  tc actions add action drop index 1
    RTNETLINK answers: File exists [...]
  tc actions show action gact
    action order 0: gact action pass
     random type none pass val 0
     index 1 ref 3 bind 1

What happens is that in tcf_hash_check(), we check tcf_common for a given
index and increase tcfc_refcnt and conditionally tcfc_bindcnt when we've
found an existing action. Now there are the following cases:

  1) We do a late binding of an action. In that case, we leave the
     tcfc_refcnt/tcfc_bindcnt increased and are done with the ->init()
     handler. This is correctly handeled.

  2) We replace the given action, or we try to add one without replacing
     and find out that the action at a specific index already exists
     (thus, we go out with error in that case).

In case of 2), we have to undo the reference count increase from
tcf_hash_check() in the tcf_hash_check() function. Currently, we fail to
do so because of the 'tcfc_bindcnt > 0' check which bails out early with
an -EPERM error.

Now, while commit 55334a5db5 prevents 'tc actions del action ...' on an
already classifier-bound action to drop the reference count (which could
then become negative, wrap around etc), this restriction only accounts for
invocations outside a specific action's ->init() handler.

One possible solution would be to add a flag thus we possibly trigger
the -EPERM ony in situations where it is indeed relevant.

After the patch, above test cases have correct reference count again.

Fixes: 55334a5db5 ("net_sched: act: refuse to remove bound action outside")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 14:20:39 -07:00
Alexander Aring
5857d1dbae Bluetooth: 6lowpan: Fix possible race
This patch fix a possible race after calling register_netdev. After
calling netdev_register it could be possible that netdev_ops callbacks
use the uninitialized private data of lowpan_dev. By moving the
initialization of this data before netdev_register we can be sure that
initialized private data is be used after netdev_register.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-30 14:11:36 +02:00
Lennert Buytenhek
c22ff7b4e7 mac802154: Fix memory corruption with global deferred transmit state.
When transmitting a packet via a mac802154 driver that can sleep in
its transmit function, mac802154 defers the call to the driver's
transmit function to a per-device workqueue.

However, mac802154 uses a single global work_struct for this, which
means that if you have more than one registered mac802154 interface
in the system, and you transmit on more than one of them at the same
time, you'll very easily cause memory corruption.

This patch moves the deferred transmit processing state from global
variables to struct ieee802154_local, and this seems to fix the memory
corruption issue.

Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-30 14:08:55 +02:00
Dan Carpenter
1a727c6361 netfilter: nf_conntrack: checking for IS_ERR() instead of NULL
We recently changed this from nf_conntrack_alloc() to nf_ct_tmpl_alloc()
so the error handling needs to changed to check for NULL instead of
IS_ERR().

Fixes: 0838aa7fcf ('netfilter: fix netns dependencies with conntrack templates')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-30 14:04:19 +02:00
Christophe JAILLET
54c9ee3992 Bluetooth: cmtp: Do not use list_for_each_safe when not needed
There is no need to use the safe version of list_for_each here.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-30 13:50:35 +02:00
Bernhard Thaler
f4b3eee727 netfilter: bridge: do not initialize statics to 0 or NULL
Fix checkpatch.pl "ERROR: do not initialise statics to 0 or NULL" for
all statics explicitly initialized to 0.

Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-30 13:46:04 +02:00
Florian Westphal
72b1e5e4ca netfilter: bridge: reduce nf_bridge_info to 32 bytes again
We can use union for most of the temporary cruft (original ipv4/ipv6
address, source mac, physoutdev) since they're used during different
stages of br netfilter traversal.

Also get rid of the last two ->mask users.

Shrinks struct from 48 to 32 on 64bit arch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-30 13:37:42 +02:00
Arron Wang
df9b89c7e4 Bluetooth: Move create/accept phy link completed callback to amp.c
To avoid amp module hooks from hci_event.c

Signed-off-by: Arron Wang <arron.wang@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-30 13:37:22 +02:00
Arron Wang
b3d3914006 Bluetooth: Move amp assoc read/write completed callback to amp.c
To avoid amp module hooks from hci_event.c

Signed-off-by: Arron Wang <arron.wang@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-30 13:37:22 +02:00
Arron Wang
839278823c Bluetooth: Move get info completed callback to a2mp.c
To avoid a2mp module hooks from hci_event.c and send
getinfo response operation only required by a2mp module,
we can move this callback to a2mp.c

Signed-off-by: Arron Wang <arron.wang@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-30 13:37:22 +02:00
Arron Wang
a77a6a14e5 Bluetooth: Move high speed specific event under BT_HS option
Signed-off-by: Arron Wang <arron.wang@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-30 13:31:59 +02:00
Arron Wang
244bc37759 Bluetooth: Add BT_HS config option
Move A2MP Module under BT_HS config option and allow
the user have flexible option to choose the feature only
they need

a2mp_discover_amp() & a2mp_channel_create() are a2mp module
entry point for master and slave, and this is dynamic
invoked depends on the userspace or remote request, then
we defined their implementation depends on BT_HS config

Signed-off-by: Arron Wang <arron.wang@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-30 13:31:59 +02:00
Michal Kubeček
d7ee351904 netfilter: nf_ct_sctp: minimal multihoming support
Currently nf_conntrack_proto_sctp module handles only packets between
primary addresses used to establish the connection. Any packets between
secondary addresses are classified as invalid so that usual firewall
configurations drop them. Allowing HEARTBEAT and HEARTBEAT-ACK chunks to
establish a new conntrack would allow traffic between secondary
addresses to pass through. A more sophisticated solution based on the
addresses advertised in the initial handshake (and possibly also later
dynamic address addition and removal) would be much harder to implement.
Moreover, in general we cannot assume to always see the initial
handshake as it can be routed through a different path.

The patch adds two new conntrack states:

  SCTP_CONNTRACK_HEARTBEAT_SENT  - a HEARTBEAT chunk seen but not acked
  SCTP_CONNTRACK_HEARTBEAT_ACKED - a HEARTBEAT acked by HEARTBEAT-ACK

State transition rules:

- HEARTBEAT_SENT responds to usual chunks the same way as NONE (so that
  the behaviour changes as little as possible)
- HEARTBEAT_ACKED responds to usual chunks the same way as ESTABLISHED
  does, except the resulting state is HEARTBEAT_ACKED rather than
  ESTABLISHED
- previously existing states except NONE are preserved when HEARTBEAT or
  HEARTBEAT-ACK is seen
- NONE (in the initial direction) changes to HEARTBEAT_SENT on HEARTBEAT
  and to CLOSED on HEARTBEAT-ACK
- HEARTBEAT_SENT changes to HEARTBEAT_ACKED on HEARTBEAT-ACK in the
  reply direction
- HEARTBEAT_SENT and HEARTBEAT_ACKED are preserved on HEARTBEAT and
  HEARTBEAT-ACK otherwise

Normally, vtag is set from the INIT chunk for the reply direction and
from the INIT-ACK chunk for the originating direction (i.e. each of
these defines vtag value for the opposite direction). For secondary
conntracks, we can't rely on seeing INIT/INIT-ACK and even if we have
seen them, we would need to connect two different conntracks. Therefore
simplified logic is applied: vtag of first packet in each direction
(HEARTBEAT in the originating and HEARTBEAT-ACK in reply direction) is
saved and all following packets in that direction are compared with this
saved value. While INIT and INIT-ACK define vtag for the opposite
direction, vtags extracted from HEARTBEAT and HEARTBEAT-ACK are always
for their direction.

Default timeout values for new states are

  HEARTBEAT_SENT: 30 seconds (default hb_interval)
  HEARTBEAT_ACKED: 210 seconds (hb_interval * path_max_retry + max_rto)

(We cannot expect to see the shutdown sequence so that, unlike
ESTABLISHED, the HEARTBEAT_ACKED timeout shouldn't be too long.)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-30 12:59:25 +02:00
Pablo Neira Ayuso
f0ad462189 netfilter: nf_conntrack: silence warning on falling back to vmalloc()
Since 88eab472ec ("netfilter: conntrack: adjust nf_conntrack_buckets default
value"), the hashtable can easily hit this warning. We got reports from users
that are getting this message in a quite spamming fashion, so better silence
this.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Florian Westphal <fw@strlen.de>
2015-07-30 12:32:55 +02:00
Daniel Borkmann
f4eaed28c7 act_bpf: fix memory leaks when replacing bpf programs
We currently trigger multiple memory leaks when replacing bpf
actions, besides others:

  comm "tc", pid 1909, jiffies 4294851310 (age 1602.796s)
  hex dump (first 32 bytes):
    01 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00  ................
    18 b0 98 6d 00 88 ff ff 00 00 00 00 00 00 00 00  ...m............
  backtrace:
    [<ffffffff817e623e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff8120a22d>] __vmalloc_node_range+0x1bd/0x2c0
    [<ffffffff8120a37a>] __vmalloc+0x4a/0x50
    [<ffffffff811a8d0a>] bpf_prog_alloc+0x3a/0xa0
    [<ffffffff816c0684>] bpf_prog_create+0x44/0xa0
    [<ffffffffa09ba4eb>] tcf_bpf_init+0x28b/0x3c0 [act_bpf]
    [<ffffffff816d7001>] tcf_action_init_1+0x191/0x1b0
    [<ffffffff816d70a2>] tcf_action_init+0x82/0xf0
    [<ffffffff816d4d12>] tcf_exts_validate+0xb2/0xc0
    [<ffffffffa09b5838>] cls_bpf_modify_existing+0x98/0x340 [cls_bpf]
    [<ffffffffa09b5cd6>] cls_bpf_change+0x1a6/0x274 [cls_bpf]
    [<ffffffff816d56e5>] tc_ctl_tfilter+0x335/0x910
    [<ffffffff816b9145>] rtnetlink_rcv_msg+0x95/0x240
    [<ffffffff816df34f>] netlink_rcv_skb+0xaf/0xc0
    [<ffffffff816b909e>] rtnetlink_rcv+0x2e/0x40
    [<ffffffff816deaaf>] netlink_unicast+0xef/0x1b0

Issue is that the old content from tcf_bpf is allocated and needs
to be released when we replace it. We seem to do that since the
beginning of act_bpf on the filter and insns, later on the name as
well.

Example test case, after patch:

  # FOO="1,6 0 0 4294967295,"
  # BAR="1,6 0 0 4294967294,"
  # tc actions add action bpf bytecode "$FOO" index 2
  # tc actions show action bpf
   action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
   index 2 ref 1 bind 0
  # tc actions replace action bpf bytecode "$BAR" index 2
  # tc actions show action bpf
   action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
   index 2 ref 1 bind 0
  # tc actions replace action bpf bytecode "$FOO" index 2
  # tc actions show action bpf
   action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
   index 2 ref 1 bind 0
  # tc actions del action bpf index 2
  [...]
  # echo "scan" > /sys/kernel/debug/kmemleak
  # cat /sys/kernel/debug/kmemleak | grep "comm \"tc\"" | wc -l
  0

Fixes: d23b8ad8ab ("tc: add BPF based action")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 23:56:22 -07:00
Thomas Graf
dcc38c033b openvswitch: Re-add CONFIG_OPENVSWITCH_VXLAN
This readds the config option CONFIG_OPENVSWITCH_VXLAN to avoid a
hard dependency of OVS on VXLAN. It moves the VXLAN config compat
code to vport-vxlan.c and allows compliation as a module.

Fixes: 614732eaa1 ("openvswitch: Use regular VXLAN net_device device")
Fixes: 2661371ace ("openvswitch: fix compilation when vxlan is a module")
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 23:03:10 -07:00
Eric Dumazet
c8507fb235 ipv6: flush nd cache on IFF_NOARP change
This patch is the IPv6 equivalent of commit
6c8b4e3ff8 ("arp: flush arp cache on IFF_NOARP change")

Without it, we keep buggy neighbours in the cache, with destination
MAC address equal to our own MAC address.

Tested:
 tcpdump -i eth0 -s 0 ip6 -n -e &
 ip link set dev eth0 arp off
 ping6 remote   // sends buggy frames
 ip link set dev eth0 arp on
 ping6 remote   // should work once kernel is patched

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Mario Fanelli <mariofanelli@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 23:01:39 -07:00
Bogdan Hamciuc
1ce4b2f44d net: pktgen: Remove unused 'allocated_skbs' field
Field pktgen_dev.allocated_skbs had been written to, but never read
from. The number of allocated skbs can be deduced anyway, from the total
number of sent packets and the 'clone_skb' param.

Signed-off-by: Bogdan Hamciuc <bogdan.hamciuc@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 23:00:39 -07:00
Bogdan Hamciuc
879c7220e8 net: pktgen: Observe needed_headroom of the device
Allocate enough space so as not to force the outgoing net device to do
skb_realloc_headroom().

Signed-off-by: Bogdan Hamciuc <bogdan.hamciuc@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 23:00:39 -07:00
Thomas Graf
92a99bf3ba lwtunnel: Make lwtun_encaps[] static
Any external user should use the registration API instead of
accessing this directly.

Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 22:59:39 -07:00
Tom Herbert
877d1f6291 net: Set sk_txhash from a random number
This patch creates sk_set_txhash and eliminates protocol specific
inet_set_txhash and ip6_set_txhash. sk_set_txhash simply sets a
random number instead of performing flow dissection. sk_set_txash
is also allowed to be called multiple times for the same socket,
we'll need this when redoing the hash for negative routing advice.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 22:44:04 -07:00
Jon Maloy
5a4c355229 tipc: fix bug in broadcast synch message create function
In commit d999297c3d
("tipc: reduce locking scope during packet reception") we introduced
a new function tipc_build_bcast_sync_msg(), which carries initial
synchronization data between two nodes at first contact and at
re-contact. In this function, we missed to add synchronization data,
with the effect that the broadcast link endpoints will fail to
synchronize correctly at re-contact between a running and a restarted
node. All other cases work as intended.

With this commit, we fix this bug.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 16:48:16 -07:00
Tobias Klauser
73d0fcf2f4 packet: remove handling of tx_ring from prb_shutdown_retire_blk_timer()
Follow e8e85cc5eb ("packet: remove handling of tx_ring") and remove
the tx_ring parameter from prb_shutdown_retire_blk_timer() as it is only
called with tx_ring = 0.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 15:11:12 -07:00
Nikolay Aleksandrov
7ae90a4f96 bridge: mdb: fix delmdb state in the notification
Since mdb states were introduced when deleting an entry the state was
left as it was set in the delete request from the user which leads to
the following output when doing a monitor (for example):
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
$ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 temp
^^^
Note the "temp" state in the delete notification which is wrong since
the entry was permanent, the state in a delete is always reported as
"temp" regardless of the real state of the entry.

After this patch:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
$ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent

There's one important note to make here that the state is actually not
matched when doing a delete, so one can delete a permanent entry by
stating "temp" in the end of the command, I've chosen this fix in order
not to break user-space tools which rely on this (incorrect) behaviour.

So to give an example after this patch and using the wrong state:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
$ bridge mdb del dev br0 port eth3 grp 239.0.0.1 temp
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent

Note the state of the entry that got deleted is correct in the
notification.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: ccb1c31a7a ("bridge: add flags to distinguish permanent mdb entires")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 15:02:30 -07:00
Satish Ashok
544586f742 bridge: mcast: give fast leave precedence over multicast router and querier
When fast leave is configured on a bridge port and an IGMP leave is
received for a group, the group is not deleted immediately if there is
a router detected or if multicast querier is configured.
Ideally the group should be deleted immediately when fast leave is
configured.

Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 14:57:05 -07:00
Toshiaki Makita
df356d5e81 bridge: Fix network header pointer for vlan tagged packets
There are several devices that can receive vlan tagged packets with
CHECKSUM_PARTIAL like tap, possibly veth and xennet.
When (multiple) vlan tagged packets with CHECKSUM_PARTIAL are forwarded
by bridge to a device with the IP_CSUM feature, they end up with checksum
error because before entering bridge, the network header is set to
ETH_HLEN (not including vlan header length) in __netif_receive_skb_core(),
get_rps_cpu(), or drivers' rx functions, and nobody fixes the pointer later.

Since the network header is exepected to be ETH_HLEN in flow-dissection
and hash-calculation in RPS in rx path, and since the header pointer fix
is needed only in tx path, set the appropriate network header on forwarding
packets.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 12:20:16 -07:00
Alexander Drozdov
dbd46ab412 packet: tpacket_snd(): fix signed/unsigned comparison
tpacket_fill_skb() can return a negative value (-errno) which
is stored in tp_len variable. In that case the following
condition will be (but shouldn't be) true:

tp_len > dev->mtu + dev->hard_header_len

as dev->mtu and dev->hard_header_len are both unsigned.

That may lead to just returning an incorrect EMSGSIZE errno
to the user.

Fixes: 52f1454f62 ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case")
Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 00:09:58 -07:00
Eric Dumazet
11c91ef98f arp: filter NOARP neighbours for SIOCGARP
When arp is off on a device, and ioctl(SIOCGARP) is queried,
a buggy answer is given with MAC address of the device, instead
of the mac address of the destination/gateway.

We filter out NUD_NOARP neighbours for /proc/net/arp,
we must do the same for SIOCGARP ioctl.

Tested:

lpaa23:~# ./arp 10.246.7.190
MAC=00:01:e8:22:cb:1d      // correct answer

lpaa23:~# ip link set dev eth0 arp off
lpaa23:~# cat /proc/net/arp   # check arp table is now 'empty'
IP address       HW type     Flags       HW address    Mask     Device
lpaa23:~# ./arp 10.246.7.190
MAC=00:1a:11:c3:0d:7f   // buggy answer before patch (this is eth0 mac)

After patch :

lpaa23:~# ip link set dev eth0 arp off
lpaa23:~# ./arp 10.246.7.190
ioctl(SIOCGARP) failed: No such device or address

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Vytautas Valancius <valas@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-28 23:41:24 -07:00
David Ward
865b804244 net/ipv4: suppress NETDEV_UP notification on address lifetime update
This notification causes the FIB to be updated, which is not needed
because the address already exists, and more importantly it may undo
intentional changes that were made to the FIB after the address was
originally added. (As a point of comparison, when an address becomes
deprecated because its preferred lifetime expired, a notification on
this chain is not generated.)

The motivation for this commit is fixing an incompatibility between
DHCP clients which set and update the address lifetime according to
the lease, and a commercial VPN client which replaces kernel routes
in a way that outbound traffic is sent only through the tunnel (and
disconnects if any further route changes are detected via netlink).

Signed-off-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-28 23:38:13 -07:00
Nikolay Aleksandrov
76b91c32dd bridge: stp: when using userspace stp stop kernel hello and hold timers
These should be handled only by the respective STP which is in control.
They become problematic for devices with limited resources with many
ports because the hold_timer is per port and fires each second and the
hello timer fires each 2 seconds even though it's global. While in
user-space STP mode these timers are completely unnecessary so it's better
to keep them off.
Also ensure that when the bridge is up these timers are started only when
running with kernel STP.

Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-28 23:33:20 -07:00
Linus Torvalds
d8132e08d2 NFS client bugfixes for Linux 4.2
Highlights include:
 
 Stable patches:
 - Fix a situation where the client uses the wrong (zero) stateid.
 - Fix a memory leak in nfs_do_recoalesce
 
 Bugfixes:
 - Plug a memory leak when ->prepare_layoutcommit fails
 - Fix an Oops in the NFSv4 open code
 - Fix a backchannel deadlock
 - Fix a livelock in sunrpc when sendmsg fails due to low memory availability
 - Don't revalidate the mapping if both size and change attr are up to date
 - Ensure we don't miss a file extension when doing pNFS
 - Several fixes to handle NFSv4.1 sequence operation status bits correctly
 - Several pNFS layout return bugfixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVt6RGAAoJEGcL54qWCgDyiDIP/2+fUM7Tc1llCxYbM2WLC6Ar
 34v5yVwO96MqhI4L2mXB5FJvr4LP2/EZ4ZExMcf4ymT7pgJnjFK4nEv9IHUSy6xb
 ea+oS9GjvFSeGdkukJLRniNER5/ZG3GWkojlHNJCgByoIVRK4ISXF/qL9w2sedGw
 +5ejvjqie9NmBnBXMq8DRlU+kXhVYCF6E9qWATwUNK5Eq2eeQnDbA2w9ACSBVK3W
 LhCvZi0eBq7krSbHob018PmlQ0VPvmYwk5xL4d//FvcaNj/utk82VjAZCdKOK1sH
 qn8hcKgVeVko/3jwcUp6m3zAkKZ1IX/XaXJeHbosnKG/g0vy3hQirpa/g2iDTQ4H
 NXOSwcsd6syReZDZbQTxbvaSOp5ACxZAQKYLnlPerJ/hMpXDQCEAwyeAFKzEaKz4
 FfF0VJF+30w9PJk3wgk2DF66xbYVfHyvrLtVcb/ki8gb91cH09i+nFFSSfHQBMLh
 +ciHg7rOyXnbXoCaW9fBvONz2sCYDwbHATmhpWWZIx/3UTDf5owxHFa3BFDgGKnD
 jyiPjMh6I3JUE+Qm1zwInsfsskBKRSl2BdJgTHBGY5ODuQGF/sogOmvgbrT7Ox3t
 kbL8nzCydqLixM+4aw61nYakZqgDsKNER5Ggr+lkv4AZ2dH6IeP2IZjuoHLLylvZ
 dyqHwpCjoUtmYAUr166U
 =wlUD
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable patches:
   - Fix a situation where the client uses the wrong (zero) stateid.
   - Fix a memory leak in nfs_do_recoalesce

  Bugfixes:
   - Plug a memory leak when ->prepare_layoutcommit fails
   - Fix an Oops in the NFSv4 open code
   - Fix a backchannel deadlock
   - Fix a livelock in sunrpc when sendmsg fails due to low memory
     availability
   - Don't revalidate the mapping if both size and change attr are up to
     date
   - Ensure we don't miss a file extension when doing pNFS
   - Several fixes to handle NFSv4.1 sequence operation status bits
     correctly
   - Several pNFS layout return bugfixes"

* tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
  nfs: Fix an oops caused by using other thread's stack space in ASYNC mode
  nfs: plug memory leak when ->prepare_layoutcommit fails
  SUNRPC: Report TCP errors to the caller
  sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable.
  NFSv4.2: handle NFS-specific llseek errors
  NFS: Don't clear desc->pg_moreio in nfs_do_recoalesce()
  NFS: Fix a memory leak in nfs_do_recoalesce
  NFS: nfs_mark_for_revalidate should always set NFS_INO_REVAL_PAGECACHE
  NFS: Remove the "NFS_CAP_CHANGE_ATTR" capability
  NFS: Set NFS_INO_REVAL_PAGECACHE if the change attribute is uninitialised
  NFS: Don't revalidate the mapping if both size and change attr are up to date
  NFSv4/pnfs: Ensure we don't miss a file extension
  NFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_locked
  SUNRPC: xprt_complete_bc_request must also decrement the free slot count
  SUNRPC: Fix a backchannel deadlock
  pNFS: Don't throw out valid layout segments
  pNFS: pnfs_roc_drain() fix a race with open
  pNFS: Fix races between return-on-close and layoutreturn.
  pNFS: pnfs_roc_drain should return 'true' when sleeping
  pNFS: Layoutreturn must invalidate all existing layout segments.
  ...
2015-07-28 09:37:44 -07:00
Lars Westerhoff
158cd4af8d packet: missing dev_put() in packet_do_bind()
When binding a PF_PACKET socket, the use count of the bound interface is
always increased with dev_hold in dev_get_by_{index,name}.  However,
when rebound with the same protocol and device as in the previous bind
the use count of the interface was not decreased.  Ultimately, this
caused the deletion of the interface to fail with the following message:

unregister_netdevice: waiting for dummy0 to become free. Usage count = 1

This patch moves the dev_put out of the conditional part that was only
executed when either the protocol or device changed on a bind.

Fixes: 902fefb82e ('packet: improve socket create/bind latency in some cases')
Signed-off-by: Lars Westerhoff <lars.westerhoff@newtec.eu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 15:38:58 -07:00
Trond Myklebust
f580dd0428 SUNRPC: Report TCP errors to the caller
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-27 17:56:57 -04:00
Alexander Duyck
1513069edc fib_trie: Drop unnecessary calls to leaf_pull_suffix
It was reported that update_suffix was taking a long time on systems where
a large number of leaves were attached to a single node.  As it turns out
fib_table_flush was calling update_suffix for each leaf that didn't have all
of the aliases stripped from it.  As a result, on this large node removing
one leaf would result in us calling update_suffix for every other leaf on
the node.

The fix is to just remove the calls to leaf_pull_suffix since they are
redundant as we already have a call in resize that will go through and
update the suffix length for the node before we exit out of
fib_table_flush or fib_table_flush_external.

Reported-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 14:29:11 -07:00
NeilBrown
743c69e7c0 sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable.
The networking layer does not reliably report the distinction between
a non-block write failing because:
 1/ the queue is too full already and
 2/ a memory allocation attempt failed.

The distinction is important because in the first case it is
appropriate to retry as soon as the socket reports that it is
writable, and in the second case a small delay is required as the
socket will most likely report as writable but kmalloc could still
fail.

sk_stream_wait_memory() exhibits this distinction nicely, setting
'vm_wait' if a small wait is needed.  However in the non-blocking case
it always returns -EAGAIN no matter the cause of the failure.  This
-EAGAIN call get all the way to sunrpc.

The sunrpc layer expects EAGAIN to indicate the first cause, and
ENOBUFS to indicate the second.  Various documentation suggests that
this is not unreasonable, but does not guarantee the desired error
codes.

The result of getting -EAGAIN when -ENOBUFS is expected is that the
send is tried again in a tight loop and soft lockups are reported.

so: add tests after calls to xs_sendpages() to translate -EAGAIN into
-ENOBUFS if the socket is writable.  This cannot happen inside
xs_sendpages() as the test for "is socket writable" is different
between TCP and UDP.

With this change, the tight loop retrying xs_sendpages() becomes a
loop which only retries every 250ms, and so will not trigger a
soft-lockup warning.

It is possible that the write did fail because the queue was too full
and by the time xs_sendpages() completed, the queue was writable
again.  In this case an extra 250ms delay is inserted that isn't
really needed.  This circumstance suggests a degree of congestion so a
delay is not necessarily a bad thing, and it can only cause a single
250ms delay, not a series of them.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-27 11:16:56 -04:00
Dan Carpenter
e11f40b935 lwtunnel: use kfree_skb() instead of vanilla kfree()
kfree_skb() is correct here.

Fixes: ffce41962e ('lwtunnel: support dst output redirect function')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 01:24:59 -07:00
Eric Dumazet
99d7662a04 tcp: tso: allow deferring under reordering state
While doing experiments with reordering resilience, we found
linux senders were not able to send at full speed under reordering,
because every incoming SACK was releasing one MSS.

This patch removes the limitation, as we did for CWR state
in commit a0ea700e40 ("tcp: tso: allow CA_CWR state in
tcp_tso_should_defer()")

Neal Cardwell had a concern about limited transmit so
Yuchung conducted experiments on GFE and found nothing
worth adding an extra check on fast path :

  if (icsk->icsk_ca_state == TCP_CA_Disorder &&
      tcp_sk(sk)->reordering == sysctl_tcp_reordering)
          goto send_now;

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 01:23:20 -07:00
Martin KaFai Lau
8d6c31bf57 ipv6: Avoid rt6_probe() taking writer lock in the fast path
The patch checks neigh->nud_state before acquiring the writer lock.
Note that rt6_probe() is only used in CONFIG_IPV6_ROUTER_PREF.

40 udpflood processes and a /64 gateway route are used.
The gateway has NUD_PERMANENT.  Each of them is run for 30s.
At the end, the total number of finished sendto():

Before: 55M
After: 95M

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
CC: Julian Anastasov <ja@ssi.bg>
CC: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 01:08:25 -07:00
Martin KaFai Lau
990edb428c ipv6: Re-arrange code in rt6_probe()
It is a prep work for the next patch to remove write_lock
from rt6_probe().

1. Reduce the number of if(neigh) check.  From 4 to 1.
2. Bring the write_(un)lock() closer to the operations that the
   lock is protecting.

Hopefully, the above make rt6_probe() more readable.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 01:08:25 -07:00
Sabrina Dubroca
dfbafc9953 tcp: fix recv with flags MSG_WAITALL | MSG_PEEK
Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called
with flags = MSG_WAITALL | MSG_PEEK.

sk_wait_data waits for sk_receive_queue not empty, but in this case,
the receive queue is not empty, but does not contain any skb that we
can use.

Add a "last skb seen on receive queue" argument to sk_wait_data, so
that it sleeps until the receive queue has new skbs.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461
Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258
Reported-by: Enrico Scholz <rh-bugzilla@ensc.de>
Reported-by: Dan Searle <dan@censornet.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 01:06:53 -07:00
Nicolas Dichtel
5a6228a0b4 lwtunnel: change prototype of lwtunnel_state_get()
It saves some lines and simplify a bit the code when the state is returning
by this function. It's also useful to handle a NULL entry.

To avoid too long lines, I've also renamed lwtunnel_state_get() and
lwtunnel_state_put() to lwtstate_get() and lwtstate_put().

CC: Thomas Graf <tgraf@suug.ch>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 01:02:49 -07:00
Nicolas Dichtel
d943659508 ipv6: copy lwtstate in ip6_rt_copy_init()
We need to copy this field (ip6_rt_cache_alloc() and ip6_rt_pcpu_alloc()
use ip6_rt_copy_init() to build a dst).

CC: Thomas Graf <tgraf@suug.ch>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Fixes: 19e42e4515 ("ipv6: support for fib route lwtunnel encap attributes")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 01:02:49 -07:00
Nicolas Dichtel
6673a9f4e3 ipv6: use lwtunnel_output6() only if flag redirect is set
This function make sense only when LWTUNNEL_STATE_OUTPUT_REDIRECT is set.
The check is already done in IPv4.

CC: Thomas Graf <tgraf@suug.ch>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Fixes: 74a0f2fe8e ("ipv6: rt6_info output redirect to tunnel output")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 01:00:51 -07:00