Commit Graph

42139 Commits

Author SHA1 Message Date
Nikolay Aleksandrov
31ca0458a6 net: bridge: fix old ioctl unlocked net device walk
get_bridge_ifindices() is used from the old "deviceless" bridge ioctl
calls which aren't called with rtnl held. The comment above says that it is
called with rtnl but that is not really the case.
Here's a sample output from a test ASSERT_RTNL() which I put in
get_bridge_ifindices and executed "brctl show":
[  957.422726] RTNL: assertion failed at net/bridge//br_ioctl.c (30)
[  957.422925] CPU: 0 PID: 1862 Comm: brctl Tainted: G        W  O
4.6.0-rc4+ #157
[  957.423009] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.8.1-20150318_183358- 04/01/2014
[  957.423009]  0000000000000000 ffff880058adfdf0 ffffffff8138dec5
0000000000000400
[  957.423009]  ffffffff81ce8380 ffff880058adfe58 ffffffffa05ead32
0000000000000001
[  957.423009]  00007ffec1a444b0 0000000000000400 ffff880053c19130
0000000000008940
[  957.423009] Call Trace:
[  957.423009]  [<ffffffff8138dec5>] dump_stack+0x85/0xc0
[  957.423009]  [<ffffffffa05ead32>]
br_ioctl_deviceless_stub+0x212/0x2e0 [bridge]
[  957.423009]  [<ffffffff81515beb>] sock_ioctl+0x22b/0x290
[  957.423009]  [<ffffffff8126ba75>] do_vfs_ioctl+0x95/0x700
[  957.423009]  [<ffffffff8126c159>] SyS_ioctl+0x79/0x90
[  957.423009]  [<ffffffff8163a4c0>] entry_SYSCALL_64_fastpath+0x23/0xc1

Since it only reads bridge ifindices, we can use rcu to safely walk the net
device list. Also remove the wrong rtnl comment above.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-05 23:32:30 -04:00
Ian Campbell
dedc58e067 VSOCK: do not disconnect socket when peer has shutdown SEND only
The peer may be expecting a reply having sent a request and then done a
shutdown(SHUT_WR), so tearing down the whole socket at this point seems
wrong and breaks for me with a client which does a SHUT_WR.

Looking at other socket family's stream_recvmsg callbacks doing a shutdown
here does not seem to be the norm and removing it does not seem to have
had any adverse effects that I can see.

I'm using Stefan's RFC virtio transport patches, I'm unsure of the impact
on the vmci transport.

Signed-off-by: Ian Campbell <ian.campbell@docker.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Cc: Andy King <acking@vmware.com>
Cc: Dmitry Torokhov <dtor@vmware.com>
Cc: Jorgen Hansen <jhansen@vmware.com>
Cc: Adit Ranadive <aditr@vmware.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-05 23:31:29 -04:00
James Morris
a6926cc989 Merge branch 'stable-4.7' of git://git.infradead.org/users/pcmoore/selinux into next 2016-05-06 09:31:34 +10:00
Phil Turnbull
eda3fc50da netfilter: nfnetlink_acct: validate NFACCT_QUOTA parameter
If a quota bit is set in NFACCT_FLAGS but the NFACCT_QUOTA parameter is
missing then a NULL pointer dereference is triggered. CAP_NET_ADMIN is
required to trigger the bug.

Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:47:08 +02:00
Pablo Neira Ayuso
cb39ad8b8e netfilter: nf_tables: allow set names up to 32 bytes
Currently, we support set names of up to 16 bytes, get this aligned
with the maximum length we can use in ipset to make it easier when
considering migration to nf_tables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:51 +02:00
Pablo Neira Ayuso
d7cdf81657 netfilter: x_tables: get rid of old and inconsistent debugging
The dprintf() and duprintf() functions are enabled at compile time,
these days we have better runtime debugging through pr_debug() and
static keys.

On top of this, this debugging is so old that I don't expect anyone
using this anymore, so let's get rid of this.

IP_NF_ASSERT() is still left in place, although this needs that
NETFILTER_DEBUG is enabled, I think these assertions provide useful
context information when reading the code.

Note that ARP_NF_ASSERT() has been removed as there is no user of
this.

Kill also DEBUG_ALLOW_ALL and a couple of pr_error() and pr_debug()
spots that are inconsistently placed in the code.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:51 +02:00
Pablo Neira Ayuso
3b78155b1b openvswitch: __nf_ct_l{3,4}proto_find() always return a valid pointer
If the protocol is not natively supported, this assigns generic protocol
tracker so we can always assume a valid pointer after these calls.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Joe Stringer <joe@ovn.org>
2016-05-05 16:39:50 +02:00
Pablo Neira Ayuso
71d8c47fc6 netfilter: conntrack: introduce clash resolution on insertion race
This patch introduces nf_ct_resolve_clash() to resolve race condition on
conntrack insertions.

This is particularly a problem for connection-less protocols such as
UDP, with no initial handshake. Two or more packets may race to insert
the entry resulting in packet drops.

Another problematic scenario are packets enqueued to userspace via
NFQUEUE after the raw table, that make it easier to trigger this
race.

To resolve this, the idea is to reset the conntrack entry to the one
that won race. Packet and bytes counters are also merged.

The 'insert_failed' stats still accounts for this situation, after
this patch, the drop counter is bumped whenever we drop packets, so we
can watch for unresolved clashes.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:50 +02:00
Pablo Neira Ayuso
ba76738c03 netfilter: conntrack: introduce nf_ct_acct_update()
Introduce a helper function to update conntrack counters.
__nf_ct_kill_acct() was unnecessarily subtracting skb_network_offset()
that is expected to be zero from the ipv4/ipv6 hooks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:49 +02:00
Pablo Neira Ayuso
4b4ceb9dbf netfilter: conntrack: __nf_ct_l4proto_find() always returns valid pointer
Remove unnecessary check for non-nul pointer in destroy_conntrack()
given that __nf_ct_l4proto_find() returns the generic protocol tracker
if the protocol is not supported.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:48 +02:00
Florian Westphal
3e86638e9a netfilter: conntrack: consider ct netns in early_drop logic
When iterating, skip conntrack entries living in a different netns.

We could ignore netns and kill some other non-assured one, but it
has two problems:

- a netns can kill non-assured conntracks in other namespace
- we would start to 'over-subscribe' the affected/overlimit netns.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:48 +02:00
Florian Westphal
56d52d4892 netfilter: conntrack: use a single hashtable for all namespaces
We already include netns address in the hash and compare the netns pointers
during lookup, so even if namespaces have overlapping addresses entries
will be spread across the table.

Assuming 64k bucket size, this change saves 0.5 mbyte per namespace on a
64bit system.

NAT bysrc and expectation hash is still per namespace, those will
changed too soon.

Future patch will also make conntrack object slab cache global again.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:47 +02:00
Florian Westphal
1b8c8a9f64 netfilter: conntrack: make netns address part of hash
Once we place all conntracks into a global hash table we want them to be
spread across entire hash table, even if namespaces have overlapping ip
addresses.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:46 +02:00
Florian Westphal
e0c7d47221 netfilter: conntrack: check netns when comparing conntrack objects
Once we place all conntracks in the same hash table we must also compare
the netns pointer to skip conntracks that belong to a different namespace.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:46 +02:00
Florian Westphal
245cfdcaba netfilter: conntrack: small refactoring of conntrack seq_printf
The iteration process is lockless, so we test if the conntrack object is
eligible for printing (e.g. is AF_INET) after obtaining the reference
count.

Once we put all conntracks into same hash table we might see more
entries that need to be skipped.

So add a helper and first perform the test in a lockless fashion
for fast skip.

Once we obtain the reference count, just repeat the check.

Note that this refactoring also includes a missing check for unconfirmed
conntrack entries due to slab rcu object re-usage, so they need to be
skipped since they are not part of the listing.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:45 +02:00
Florian Westphal
868043485e netfilter: conntrack: use nf_ct_key_equal() in more places
This prepares for upcoming change that places all conntracks into a
single, global table.  For this to work we will need to also compare
net pointer during lookup.  To avoid open-coding such check use the
nf_ct_key_equal helper and then later extend it to also consider net_eq.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:44 +02:00
Florian Westphal
88b68bc523 netfilter: conntrack: don't attempt to iterate over empty table
Once we place all conntracks into same table iteration becomes more
costly because the table contains conntracks that we are not interested
in (belonging to other netns).

So don't bother scanning if the current namespace has no entries.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:44 +02:00
Florian Westphal
5e3c61f981 netfilter: conntrack: fix lookup race during hash resize
When resizing the conntrack hash table at runtime via
echo 42 > /sys/module/nf_conntrack/parameters/hashsize, we are racing with
the conntrack lookup path -- reads can happen in parallel and nothing
prevents readers from observing a the newly allocated hash but the old
size (or vice versa).

So access to hash[bucket] can trigger OOB read access in case the table got
expanded and we saw the new size but the old hash pointer (or it got shrunk
and we got new hash ptr but the size of the old and larger table):

kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN
CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.6.0-rc2+ #107
[..]
Call Trace:
[<ffffffff822c3d6a>] ? nf_conntrack_tuple_taken+0x12a/0xe90
[<ffffffff822c3ac1>] ? nf_ct_invert_tuplepr+0x221/0x3a0
[<ffffffff8230e703>] get_unique_tuple+0xfb3/0x2760

Use generation counter to obtain the address/length of the same table.

Also add a synchronize_net before freeing the old hash.
AFAICS, without it we might access ct_hash[bucket] after ct_hash has been
freed, provided that lockless reader got delayed by another event:

CPU1			CPU2
seq_begin
seq_retry
<delay>			resize occurs
			free oldhash
for_each(oldhash[size])

Note that resize is only supported in init_netns, it took over 2 minutes
of constant resizing+flooding to produce the warning, so this isn't a
big problem in practice.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:43 +02:00
Florian Westphal
2cf1234807 netfilter: conntrack: keep BH enabled during lookup
No need to disable BH here anymore:

stats are switched to _ATOMIC variant (== this_cpu_inc()), which
nowadays generates same code as the non _ATOMIC NF_STAT, at least on x86.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:39:43 +02:00
Florian Westphal
1ad8f48df6 netfilter: nftables: add connlabel set support
Conntrack labels are currently sized depending on the iptables
ruleset, i.e. if we're asked to test or set bits 1, 2, and 65 then we
would allocate enough room to store at least bit 65.

However, with nft, the input is just a register with arbitrary runtime
content.

We therefore ask for the upper ceiling we currently have, which is
enough room to store 128 bits.

Alternatively, we could alter nf_connlabel_replace to increase
net->ct.label_words at run time, but since 128 bits is not that
big we'd only save sizeof(long) so it doesn't seem worth it for now.

This follows a similar approach that xtables 'connlabel'
match uses, so when user inputs

    ct label set bar

then we will set the bit used by the 'bar' label and leave the rest alone.

This is done by passing the sreg content to nf_connlabels_replace
as both value and mask argument.
Labels (bits) already set thus cannot be re-set to zero, but
this is not supported by xtables connlabel match either.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05 16:27:59 +02:00
Eric Dumazet
777c6ae57e tcp: two more missing bh disable
percpu_counter only have protection against preemption.

TCP stack uses them possibly from BH, so we need BH protection
in contexts that could be run in process context

Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 23:47:54 -04:00
Eric Dumazet
614bdd4d6e tcp: must block bh in __inet_twsk_hashdance()
__inet_twsk_hashdance() might be called from process context,
better block BH before acquiring bind hash and established locks

Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 16:55:11 -04:00
Eric Dumazet
46cc6e4976 tcp: fix lockdep splat in tcp_snd_una_update()
tcp_snd_una_update() and tcp_rcv_nxt_update() call
u64_stats_update_begin() either from process context or BH handler.

This triggers a lockdep splat on 32bit & SMP builds.

We could add u64_stats_update_begin_bh() variant but this would
slow down 32bit builds with useless local_disable_bh() and
local_enable_bh() pairs, since we own the socket lock at this point.

I add sock_owned_by_me() helper to have proper lockdep support
even on 64bit builds, and new u64_stats_update_begin_raw()
and u64_stats_update_end_raw methods.

Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Reported-by: Fabio Estevam <festevam@gmail.com>
Diagnosed-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Fabio Estevam <fabio.estevam@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 16:55:11 -04:00
David S. Miller
32b583a0cb Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2016-05-04

1) The flowcache can hit an OOM condition if too
   many entries are in the gc_list. Fix this by
   counting the entries in the gc_list and refuse
   new allocations if the value is too high.

2) The inner headers are invalid after a xfrm transformation,
   so reset the skb encapsulation field to ensure nobody tries
   access the inner headers. Otherwise tunnel devices stacked
   on top of xfrm may build the outer headers based on wrong
   informations.

3) Add pmtu handling to vti, we need it to report
   pmtu informations for local generated packets.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 16:35:31 -04:00
David S. Miller
5332174a83 Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:

====================
pull request: batman-adv 20160504

In this pull request you have:
- two changes to the MAINTAINERS file where one marks our mailing list
  as moderated and the other adds a missing documentation file
- kernel-doc fixes
- code refactoring and various cleanups
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 16:21:08 -04:00
Kangjie Lu
5f8e44741f net: fix infoleak in rtnetlink
The stack object “map” has a total size of 32 bytes. Its last 4
bytes are padding generated by compiler. These padding bytes are
not initialized and sent out via “nla_put”.

Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 16:19:42 -04:00
Kangjie Lu
b8670c09f3 net: fix infoleak in llc
The stack object “info” has a total size of 12 bytes. Its last byte
is padding which is not initialized and leaked via “put_cmsg”.

Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 16:18:48 -04:00
Florian Westphal
9b36627ace net: remove dev->trans_start
previous patches removed all direct accesses to dev->trans_start,
so change the netif_trans_update helper to update trans_start of
netdev queue 0 instead and then remove trans_start from struct net_device.

AFAICS a lot of the netif_trans_update() invocations are now useless
because they occur in ndo_start_xmit and driver doesn't set LLTX
(i.e. stack already took care of the update).

As I can't test any of them it seems better to just leave them alone.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:16:50 -04:00
Florian Westphal
860e9538a9 treewide: replace dev->trans_start update with helper
Replace all trans_start updates with netif_trans_update helper.
change was done via spatch:

struct net_device *d;
@@
- d->trans_start = jiffies
+ netif_trans_update(d)

Compile tested only.

Cc: user-mode-linux-devel@lists.sourceforge.net
Cc: linux-xtensa@linux-xtensa.org
Cc: linux1394-devel@lists.sourceforge.net
Cc: linux-rdma@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: MPT-FusionLinux.pdl@broadcom.com
Cc: linux-scsi@vger.kernel.org
Cc: linux-can@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-omap@vger.kernel.org
Cc: linux-hams@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: linux-wireless@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: devel@driverdev.osuosl.org
Cc: b.a.t.m.a.n@lists.open-mesh.org
Cc: linux-bluetooth@vger.kernel.org
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Felipe Balbi <felipe.balbi@linux.intel.com>
Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:16:49 -04:00
Arnd Bergmann
8bf42e9e51 gre6: add Kconfig dependency for NET_IPGRE_DEMUX
The ipv6 gre implementation was cleaned up to share more code
with the ipv4 version, but it can be enabled even when NET_IPGRE_DEMUX
is disabled, resulting in a link error:

net/built-in.o: In function `gre_rcv':
:(.text+0x17f5d0): undefined reference to `gre_parse_header'
ERROR: "gre_parse_header" [net/ipv6/ip6_gre.ko] undefined!

This adds a Kconfig dependency to prevent that now invalid
configuration.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 308edfdf15 ("gre6: Cleanup GREv6 receive path, call common GRE functions")
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:12:36 -04:00
Jiri Benc
125372faa4 gre: receive also TEB packets for lwtunnels
For ipgre interfaces in collect metadata mode, receive also traffic with
encapsulated Ethernet headers. The lwtunnel users are supposed to sort this
out correctly. This allows to have mixed Ethernet + L3-only traffic on the
same lwtunnel interface. This is the same way as VXLAN-GPE behaves.

To keep backwards compatibility and prevent any surprises, gretap interfaces
have priority in receiving packets with Ethernet headers.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:11:32 -04:00
Jiri Benc
244a797bdc gre: move iptunnel_pull_header down to ipgre_rcv
This will allow to make the pull dependent on the tunnel type.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:11:31 -04:00
Jiri Benc
00b2034029 gre: remove superfluous pskb_may_pull
The call to gre_parse_header is either followed by iptunnel_pull_header, or
in the case of ICMP error path, the actual header is not accessed at all.

In the first case, iptunnel_pull_header will call pskb_may_pull anyway and
it's pointless to do it twice. The only difference is what call will fail
with what error code but the net effect is still the same in all call sites.

In the second case, pskb_may_pull is pointless, as skb->data is at the outer
IP header and not at the GRE header.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:11:31 -04:00
Alexander Duyck
b1dc497b28 net: Fix netdev_fix_features so that TSO_MANGLEID is only available with TSO
This change makes it so that we will strip the TSO_MANGLEID bit if TSO is
not present.  This way we will also handle ECN correctly of TSO is not
present.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 13:32:27 -04:00
Alexander Duyck
36c983824b gso: Only allow GSO_PARTIAL if we can checksum the inner protocol
This patch addresses a possible issue that can occur if we get into any odd
corner cases where we support TSO for a given protocol but not the checksum
or scatter-gather offload.  There are few drivers floating around that
setup their tunnels this way and by enforcing the checksum piece we can
avoid mangling any frames.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 13:32:27 -04:00
Alexander Duyck
d7fb5a8049 gso: Do not perform partial GSO if number of partial segments is 1 or less
In the event that the number of partial segments is equal to 1 we don't
really need to perform partial segmentation offload.  As such we should
skip multiplying the MSS and instead just clear the partial_segs value
since it will not provide any gain to advertise the frame as being GSO when
it is a single frame.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 13:32:26 -04:00
Jiri Benc
f132ae7c46 gre: change gre_parse_header to return the header length
It's easier for gre_parse_header to return the header length instead of
filing it into a parameter. That way, the callers that don't care about the
header length can just check whether the returned value is lower than zero.

In gre_err, the tunnel header must not be pulled. See commit b7f8fe251e
("gre: do not pull header in ICMP error processing") for details.

This patch reduces the conflict between the mentioned commit and commit
95f5c64c3c ("gre: Move utility functions to common headers").

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 12:44:45 -04:00
Eric Dumazet
d4011239f4 tcp: guarantee forward progress in tcp_sendmsg()
Under high rx pressure, it is possible tcp_sendmsg() never has a
chance to allocate an skb and loop forever as sk_flush_backlog()
would always return true.

Fix this by calling sk_flush_backlog() only if one skb had been
allocated and filled before last backlog check.

Fixes: d41a69f1d3 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 12:44:36 -04:00
David S. Miller
cba6532100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv4/ip_gre.c

Minor conflicts between tunnel bug fixes in net and
ipv6 tunnel cleanups in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 00:52:29 -04:00
Christophe Ricard
1c53855f6b nfc: nci: Add nci_nfcc_loopback to the nci core
For test purpose, provide the generic nci loopback function.

Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-05-04 01:48:16 +02:00
Christophe Ricard
9b8d1a4cf2 nfc: nci: Add an additional parameter to identify a connection id
According to NCI specification, destination type and destination
specific parameters shall uniquely identify a single destination
for the Logical Connection.

Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-05-04 01:43:21 +02:00
Christophe Ricard
de5ea8517c nfc: nci: Fix nci_core_conn_close
nci_core_conn_close was not retrieving a conn_info using the correct
connection id.

Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-05-04 01:42:31 +02:00
Christophe Ricard
18836029d8 nfc: nci: Fix nci_core_conn_create to allowing empty destination
NCI_CORE_CONN_CREATE may not have any destination type parameter.

Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-05-04 01:41:03 +02:00
Nicolas Dichtel
79e8dc8b80 ipv6/ila: fix nlsize calculation for lwtunnel
The handler 'ila_fill_encap_info' adds one attribute: ILA_ATTR_LOCATOR.

Fixes: 65d7ab8de5 ("net: Identifier Locator Addressing module")
CC: Tom Herbert <tom@herbertland.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 16:21:33 -04:00
Wei Wang
26879da587 ipv6: add new struct ipcm6_cookie
In the sendmsg function of UDP, raw, ICMP and l2tp sockets, we use local
variables like hlimits, tclass, opt and dontfrag and pass them to corresponding
functions like ip6_make_skb, ip6_append_data and xxx_push_pending_frames.
This is not a good practice and makes it hard to add new parameters.
This fix introduces a new struct ipcm6_cookie similar to ipcm_cookie in
ipv4 and include the above mentioned variables. And we only pass the
pointer to this structure to corresponding functions. This makes it easier
to add new parameters in the future and makes the function cleaner.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 16:08:14 -04:00
Sowmini Varadhan
bd7c5f983f RDS: TCP: Synchronize accept() and connect() paths on t_conn_lock.
An arbitration scheme for duelling SYNs is implemented as part of
commit 241b271952 ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()") which ensures that both nodes
involved will arrive at the same arbitration decision. However, this
needs to be synchronized with an outgoing SYN to be generated by
rds_tcp_conn_connect(). This commit achieves the synchronization
through the t_conn_lock mutex in struct rds_tcp_connection.

The rds_conn_state is checked in rds_tcp_conn_connect() after acquiring
the t_conn_lock mutex.  A SYN is sent out only if the RDS connection is
not already UP (an UP would indicate that rds_tcp_accept_one() has
completed 3WH, so no SYN needs to be generated).

Similarly, the rds_conn_state is checked in rds_tcp_accept_one() after
acquiring the t_conn_lock mutex. The only acceptable states (to
allow continuation of the arbitration logic) are UP (i.e., outgoing SYN
was SYN-ACKed by peer after it sent us the SYN) or CONNECTING (we sent
outgoing SYN before we saw incoming SYN).

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 16:03:44 -04:00
Sowmini Varadhan
eb19284026 RDS:TCP: Synchronize rds_tcp_accept_one with rds_send_xmit when resetting t_sock
There is a race condition between rds_send_xmit -> rds_tcp_xmit
and the code that deals with resolution of duelling syns added
by commit 241b271952 ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()").

Specifically, we may end up derefencing a null pointer in rds_send_xmit
if we have the interleaving sequence:
           rds_tcp_accept_one                  rds_send_xmit

                                             conn is RDS_CONN_UP, so
    					 invoke rds_tcp_xmit

                                             tc = conn->c_transport_data
        rds_tcp_restore_callbacks
            /* reset t_sock */
    					 null ptr deref from tc->t_sock

The race condition can be avoided without adding the overhead of
additional locking in the xmit path: have rds_tcp_accept_one wait
for rds_tcp_xmit threads to complete before resetting callbacks.
The synchronization can be done in the same manner as rds_conn_shutdown().
First set the rds_conn_state to something other than RDS_CONN_UP
(so that new threads cannot get into rds_tcp_xmit()), then wait for
RDS_IN_XMIT to be cleared in the conn->c_flags indicating that any
threads in rds_tcp_xmit are done.

Fixes: 241b271952 ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()")
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 16:03:44 -04:00
Eric Dumazet
1d2077ac01 net: add __sock_wfree() helper
Hosts sending lot of ACK packets exhibit high sock_wfree() cost
because of cache line miss to test SOCK_USE_WRITE_QUEUE

We could move this flag close to sk_wmem_alloc but it is better
to perform the atomic_sub_and_test() on a clean cache line,
as it avoid one extra bus transaction.

skb_orphan_partial() can also have a fast track for packets that either
are TCP acks, or already went through another skb_orphan_partial()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 16:02:36 -04:00
Alexander Duyck
996e802187 net: Disable segmentation if checksumming is not supported
In the case of the mlx4 and mlx5 driver they do not support IPv6 checksum
offload for tunnels.  With this being the case we should disable GSO in
addition to the checksum offload features when we find that a device cannot
perform a checksum on a given packet type.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 16:00:54 -04:00
Jon Paul Maloy
10724cc7bb tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.

Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.

A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.

This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.

This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.

In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 15:51:16 -04:00