Commit Graph

61760 Commits

Author SHA1 Message Date
Ursula Braun
e82f2e31f5 net/smc: optimize consumer cursor updates
The SMC protocol requires to send a separate consumer cursor update,
if it cannot be piggybacked to updates of the producer cursor.
Currently the decision to send a separate consumer cursor update
just considers the amount of data already received by the socket
program. It does not consider the amount of data already arrived, but
not yet consumed by the receiver. Basing the decision on the
difference between already confirmed and already arrived data
(instead of difference between already confirmed and already consumed
data), may lead to a somewhat earlier consumer cursor update send in
fast unidirectional traffic scenarios, and thus to better throughput.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Suggested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-30 20:42:25 +09:00
Ursula Braun
0afff91c6f net/smc: add pnetid support
s390 hardware supports the definition of a so-call Physical NETwork
IDentifier (short PNETID) per network device port. These PNETIDS
can be used to identify network devices that are attached to the same
physical network (broadcast domain).

On s390 try to use the PNETID of the ethernet device port used for
initial connecting, and derive the IB device port used for SMC RDMA
traffic.

On platforms without PNETID support fall back to the existing
solution of a configured pnet table.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-30 20:42:25 +09:00
Ursula Braun
be6a3f38ff net/smc: determine port attributes independent from pnet table
For SMC it is important to know the current port state of RoCE devices.
Monitoring port states has been triggered, when a RoCE device was added
to the pnet table. To support future alternatives to the pnet table the
monitoring of ports is made independent of the existence of a pnet table.
It starts once the smc_ib_device is established.

Due to this change smc_ib_remember_port_attr() is now a local function
and shuffling its location and the location of its used functions
makes any forward references obsolete.

And the duplicate SMC_MAX_PORTS definition is removed.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-30 20:42:25 +09:00
Michal Hocko
d14b56f508 net: cleanup gfp mask in alloc_skb_with_frags
alloc_skb_with_frags uses __GFP_NORETRY for non-sleeping allocations
which is just a noop and a little bit confusing.

__GFP_NORETRY was added by ed98df3361 ("net: use __GFP_NORETRY for
high order allocations") to prevent from the OOM killer. Yet this was
not enough because fb05e7a89f ("net: don't wait for order-3 page
allocation") didn't want an excessive reclaim for non-costly orders
so it made it completely NOWAIT while it preserved __GFP_NORETRY in
place which is now redundant.

Drop the pointless __GFP_NORETRY because this function is used as
copy&paste source for other places.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-30 20:18:49 +09:00
Yafang Shao
ea5d0c3249 tcp: add new SNMP counter for drops when try to queue in rcv queue
When sk_rmem_alloc is larger than the receive buffer and we can't
schedule more memory for it, the skb will be dropped.

In above situation, if this skb is put into the ofo queue,
LINUX_MIB_TCPOFODROP is incremented to track it.

While if this skb is put into the receive queue, there's no record.
So a new SNMP counter is introduced to track this behavior.

LINUX_MIB_TCPRCVQDROP:  Number of packets meant to be queued in rcv queue
			but dropped because socket rcvbuf limit hit.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-30 18:43:53 +09:00
Yuchung Cheng
c860e997e9 tcp: fix Fast Open key endianness
Fast Open key could be stored in different endian based on the CPU.
Previously hosts in different endianness in a server farm using
the same key config (sysctl value) would produce different cookies.
This patch fixes it by always storing it as little endian to keep
same API for LE hosts.

Reported-by: Daniele Iamartino <danielei@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-30 18:40:46 +09:00
Simon Horman
0ed5269f9e net/sched: add tunnel option support to act_tunnel_key
Allow setting tunnel options using the act_tunnel_key action.

Options are expressed as class:type:data and multiple options
may be listed using a comma delimiter.

 # ip link add name geneve0 type geneve dstport 0 external
 # tc qdisc add dev eth0 ingress
 # tc filter add dev eth0 protocol ip parent ffff: \
     flower indev eth0 \
        ip_proto udp \
        action tunnel_key \
            set src_ip 10.0.99.192 \
            dst_ip 10.0.99.193 \
            dst_port 6081 \
            id 11 \
            geneve_opts 0102:80:00800022,0102:80:00800022 \
    action mirred egress redirect dev geneve0

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29 23:50:26 +09:00
Pieter Jansen van Vuuren
256c87c17c net: check tunnel option type in tunnel flags
Check the tunnel option type stored in tunnel flags when creating options
for tunnels. Thereby ensuring we do not set geneve, vxlan or erspan tunnel
options on interfaces that are not associated with them.

Make sure all users of the infrastructure set correct flags, for the BPF
helper we have to set all bits to keep backward compatibility.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29 23:50:26 +09:00
Simon Horman
9d7298cd1d net/sched: act_tunnel_key: add extended ack support
Add extended ack support for the tunnel key action by using NL_SET_ERR_MSG
during validation of user input.

Cc: Alexander Aring <aring@mojatatu.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29 23:50:26 +09:00
Simon Horman
a1165b5919 net/sched: act_tunnel_key: disambiguate metadata dst error cases
Metadata may be NULL for one of two reasons:
* Missing user input
* Failure to allocate the metadata dst

Disambiguate these case by returning -EINVAL for the former and -ENOMEM
for the latter rather than -EINVAL for both cases.

This is in preparation for using extended ack to provide more information
to users when parsing their input.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29 23:50:26 +09:00
Christoph Hellwig
e88958e636 net: handle NULL ->poll gracefully
The big aio poll revert broke various network protocols that don't
implement ->poll as a patch in the aio poll serie removed sock_no_poll
and made the common code handle this case.

Reported-by: syzbot+57727883dbad76db2ef0@syzkaller.appspotmail.com
Reported-by: syzbot+cdb0d3176b53d35ad454@syzkaller.appspotmail.com
Reported-by: syzbot+2c7e8f74f8b2571c87e8@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Fixes: a11e1d432b ("Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-29 06:51:51 -07:00
Xin Long
b0e9a2fe3f sctp: add support for SCTP_REUSE_PORT sockopt
This feature is actually already supported by sk->sk_reuse which can be
set by socket level opt SO_REUSEADDR. But it's not working exactly as
RFC6458 demands in section 8.1.27, like:

  - This option only supports one-to-one style SCTP sockets
  - This socket option must not be used after calling bind()
    or sctp_bindx().

Besides, SCTP_REUSE_PORT sockopt should be provided for user's programs.
Otherwise, the programs with SCTP_REUSE_PORT from other systems will not
work in linux.

To separate it from the socket level version, this patch adds 'reuse' in
sctp_sock and it works pretty much as sk->sk_reuse, but with some extra
setup limitations that are needed when it is being enabled.

"It should be noted that the behavior of the socket-level socket option
to reuse ports and/or addresses for SCTP sockets is unspecified", so it
leaves SO_REUSEADDR as is for the compatibility.

Note that the name SCTP_REUSE_PORT is somewhat confusing, as its
functionality is nearly identical to SO_REUSEADDR, but with some
extra restrictions. Here it uses 'reuse' in sctp_sock instead of
'reuseport'. As for sk->sk_reuseport support for SCTP, it will be
added in another patch.

Thanks to Neil to make this clear.

v1->v2:
  - add sctp_sk->reuse to separate it from the socket level version.
v2->v3:
  - improve changelog according to Marcelo's suggestion.

Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29 22:20:55 +09:00
David S. Miller
0933cc294f Merge tag 'mac80211-for-davem-2018-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:

====================
Just three fixes:
 * fix HT operation in mesh mode
 * disable preemption in control frame TX
 * check nla_parse_nested() return values
   where missing (two places)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29 22:09:26 +09:00
Shakeel Butt
e699e2c6a6 net, mm: account sock objects to kmemcg
Currently the kernel accounts the memory for network traffic through
mem_cgroup_[un]charge_skmem() interface. However the memory accounted
only includes the truesize of sk_buff which does not include the size of
sock objects. In our production environment, with opt-out kmem
accounting, the sock kmem caches (TCP[v6], UDP[v6], RAW[v6], UNIX) are
among the top most charged kmem caches and consume a significant amount
of memory which can not be left as system overhead. So, this patch
converts the kmem caches of all sock objects to SLAB_ACCOUNT.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29 21:56:27 +09:00
Omer Efrat
a421775058 mac80211: use BIT_ULL for NL80211_STA_INFO_* attribute types
The BIT macro uses unsigned long which some architectures handle as 32 bit
and therefore might cause macro's shift to overflow when used on a value
equals or larger than 32 (NL80211_STA_INFO_RX_DURATION and afterwards).

Since 'filled' member in station_info changed to u64, BIT_ULL macro
should be used with all NL80211_STA_INFO_* attribute types instead of BIT
to prevent future possible bugs when one will use BIT macro for higher
attributes by mistake.

This commit cleans up all usages of BIT macro with the above field
in mac80211 by changing it to BIT_ULL instead.

Signed-off-by: Omer Efrat <omer.efrat@tandemg.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-06-29 09:53:09 +02:00
Omer Efrat
397c657a06 cfg80211: use BIT_ULL for NL80211_STA_INFO_* attribute types
The BIT macro uses unsigned long which some architectures handle as 32 bit
and therefore might cause macro's shift to overflow when used on a value
equals or larger than 32 (NL80211_STA_INFO_RX_DURATION and afterwards).

Since 'filled' member in station_info changed to u64, BIT_ULL macro
should be used with all NL80211_STA_INFO_* attribute types instead of BIT
to prevent future possible bugs when one will use BIT macro for higher
attributes by mistake.

This commit cleans up all usages of BIT macro with the above field
in cfg80211 by changing it to BIT_ULL instead. In addition, there are
some places which don't use BIT nor BIT_ULL macros so align those as well.

Signed-off-by: Omer Efrat <omer.efrat@tandemg.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-06-29 09:52:23 +02:00
Johannes Berg
f0c0407d2a mac80211: remove unnecessary NULL check
We don't need to check if he_oper is NULL before calling
ieee80211_verify_sta_he_mcs_support() as it - now - will
correctly check this itself. Remove the redundant check.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-06-29 09:51:39 +02:00
Gustavo A. R. Silva
47aa7861b9 mac80211: fix potential null pointer dereference
he_op is being dereferenced before it is null checked, hence there
is a potential null pointer dereference.

Fix this by moving the pointer dereference after he_op has been
properly null checked.

Notice that, currently, he_op is already being null checked before
calling this function at 4593:

4593	if (!he_oper ||
4594	    !ieee80211_verify_sta_he_mcs_support(sband, he_oper))
4595		ifmgd->flags |= IEEE80211_STA_DISABLE_HE;

but in case ieee80211_verify_sta_he_mcs_support is ever called
without verifying he_oper is not null, we will end up having a
null pointer dereference. So, we better don't take any chances.

Addresses-Coverity-ID: 1470068 ("Dereference before null check")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-06-29 09:50:43 +02:00
Arnd Bergmann
fe0984d389 cfg80211: track time using boottime
The cfg80211 layer uses get_seconds() to read the current time
in its supend handling. This function is deprecated because of the 32-bit
time_t overflow, and it can cause unexpected behavior when the time
changes due to settimeofday() calls or leap second updates.

In many cases, we want to use monotonic time instead, however cfg80211
explicitly tracks the time spent in suspend, so this changes the
driver over to use ktime_get_boottime_seconds(), which is slightly
slower, but not used in a fastpath here.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-06-29 09:49:28 +02:00
Johannes Berg
95bca62fb7 nl80211: check nla_parse_nested() return values
At the very least we should check the return value if
nla_parse_nested() is called with a non-NULL policy.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-06-29 09:44:51 +02:00
Bob Copeland
188f60ab8e nl80211: relax ht operation checks for mesh
Commit 9757235f45, "nl80211: correct checks for
NL80211_MESHCONF_HT_OPMODE value") relaxed the range for the HT
operation field in meshconf, while also adding checks requiring
the non-greenfield and non-ht-sta bits to be set in certain
circumstances.  The latter bit is actually reserved for mesh BSSes
according to Table 9-168 in 802.11-2016, so in fact it should not
be set.

wpa_supplicant sets these bits because the mesh and AP code share
the same implementation, but authsae does not.  As a result, some
meshconf updates from authsae which set only the NONHT_MIXED
protection bits were being rejected.

In order to avoid breaking userspace by changing the rules again,
simply accept the values with or without the bits set, and mask
off the reserved bit to match the spec.

While in here, update the 802.11-2012 reference to 802.11-2016.

Fixes: 9757235f45 ("nl80211: correct checks for NL80211_MESHCONF_HT_OPMODE value")
Cc: Masashi Honma <masashi.honma@gmail.com>
Signed-off-by: Bob Copeland <bobcopeland@fb.com>
Reviewed-by: Masashi Honma <masashi.honma@gmail.com>
Reviewed-by: Masashi Honma <masashi.honma@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-06-29 09:39:30 +02:00
Denis Kenzior
e7441c9274 mac80211: disable BHs/preemption in ieee80211_tx_control_port()
On pre-emption enabled kernels the following print was being seen due to
missing local_bh_disable/local_bh_enable calls.  mac80211 assumes that
pre-emption is disabled in the data path.

    BUG: using smp_processor_id() in preemptible [00000000] code: iwd/517
    caller is __ieee80211_subif_start_xmit+0x144/0x210 [mac80211]
    [...]
    Call Trace:
    dump_stack+0x5c/0x80
    check_preemption_disabled.cold.0+0x46/0x51
    __ieee80211_subif_start_xmit+0x144/0x210 [mac80211]

Fixes: 9118064914 ("mac80211: Add support for tx_control_port")
Signed-off-by: Denis Kenzior <denkenz@gmail.com>
[commit message rewrite, fixes tag]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-06-29 09:39:08 +02:00
Tom Herbert
b6e71bdebb ila: Flush netlink command to clear xlat table
Add ILA_CMD_FLUSH netlink command to clear the ILA translation table.

Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29 11:32:55 +09:00
Tom Herbert
ad68147ef2 ila: Create main ila source file
Create a main ila file that contains the module initialization functions
as well as netlink definitions. Previously these were defined in
ila_xlat and ila_common. This approach allows better extensibility.

Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29 11:32:55 +09:00
Tom Herbert
b893281715 ila: Call library function alloc_bucket_locks
To allocate the array of bucket locks for the hash table we now
call library function alloc_bucket_spinlocks.

Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29 11:32:55 +09:00
Tom Herbert
f7a2ba5ab9 ila: Fix use of rhashtable walk in ila_xlat.c
Perform better EAGAIN handling, handle case where ila_dump_info
fails and we missed objects in the dump, and add a skip index
to skip over ila entires in a list on a rhashtable node that have
already been visited (by a previous call to ila_nl_dump).

Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-29 11:32:55 +09:00
David Ahern
4c79579b44 bpf: Change bpf_fib_lookup to return lookup status
For ACLs implemented using either FIB rules or FIB entries, the BPF
program needs the FIB lookup status to be able to drop the packet.
Since the bpf_fib_lookup API has not reached a released kernel yet,
change the return code to contain an encoding of the FIB lookup
result and return the nexthop device index in the params struct.

In addition, inform the BPF program of any post FIB lookup reason as
to why the packet needs to go up the stack.

The fib result for unicast routes must have an egress device, so remove
the check that it is non-NULL.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-06-29 00:02:02 +02:00
Linus Torvalds
a11e1d432b Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL
The poll() changes were not well thought out, and completely
unexplained.  They also caused a huge performance regression, because
"->poll()" was no longer a trivial file operation that just called down
to the underlying file operations, but instead did at least two indirect
calls.

Indirect calls are sadly slow now with the Spectre mitigation, but the
performance problem could at least be largely mitigated by changing the
"->get_poll_head()" operation to just have a per-file-descriptor pointer
to the poll head instead.  That gets rid of one of the new indirections.

But that doesn't fix the new complexity that is completely unwarranted
for the regular case.  The (undocumented) reason for the poll() changes
was some alleged AIO poll race fixing, but we don't make the common case
slower and more complex for some uncommon special case, so this all
really needs way more explanations and most likely a fundamental
redesign.

[ This revert is a revert of about 30 different commits, not reverted
  individually because that would just be unnecessarily messy  - Linus ]

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-28 10:40:47 -07:00
Flavio Leitner
9c4c325252 skbuff: preserve sock reference when scrubbing the skb.
The sock reference is lost when scrubbing the packet and that breaks
TSQ (TCP Small Queues) and XPS (Transmit Packet Steering) causing
performance impacts of about 50% in a single TCP stream when crossing
network namespaces.

XPS breaks because the queue mapping stored in the socket is not
available, so another random queue might be selected when the stack
needs to transmit something like a TCP ACK, or TCP Retransmissions.
That causes packet re-ordering and/or performance issues.

TSQ breaks because it orphans the packet while it is still in the
host, so packets are queued contributing to the buffer bloat problem.

Preserving the sock reference fixes both issues. The socket is
orphaned anyways in the receiving path before any relevant action
and on TX side the netfilter checks if the reference is local before
use it.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 22:21:32 +09:00
Flavio Leitner
f564650106 netfilter: check if the socket netns is correct.
Netfilter assumes that if the socket is present in the skb, then
it can be used because that reference is cleaned up while the skb
is crossing netns.

We want to change that to preserve the socket reference in a future
patch, so this is a preparation updating netfilter to check if the
socket netns matches before use it.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 22:21:32 +09:00
Roman Mashak
4305274153 net sched actions: avoid bitwise operation on signed value in pedit
Since char can be unsigned or signed, and bitwise operators may have
implementation-dependent results when performed on signed operands,
declare 'u8 *' operand instead.

Suggested-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 22:12:03 +09:00
Roman Mashak
95b0d2dc13 net sched actions: fix misleading text strings in pedit action
Change "tc filter pedit .." to "tc actions pedit .." in error
messages to clearly refer to pedit action.

Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 22:12:03 +09:00
Roman Mashak
6ff7586e38 net sched actions: use sizeof operator for buffer length
Replace constant integer with sizeof() to clearly indicate
the destination buffer length in skb_header_pointer() calls.

Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 22:12:03 +09:00
Roman Mashak
544377cd25 net sched actions: fix sparse warning
The variable _data in include/asm-generic/sections.h defines sections,
this causes sparse warning in pedit:

net/sched/act_pedit.c:293:35: warning: symbol '_data' shadows an earlier one
./include/asm-generic/sections.h:36:13: originally declared here

Therefore rename the variable.

Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 22:12:03 +09:00
Roman Mashak
80f0f574cc net sched actions: fix coding style in pedit action
Fix coding style issues in tc pedit action detected by the
checkpatch script.

Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 22:12:03 +09:00
Yousuk Seung
0a9fe5c375 netem: slotting with non-uniform distribution
Extend slotting with support for non-uniform distributions. This is
similar to netem's non-uniform distribution delay feature.

Commit f043efeae2f1 ("netem: support delivering packets in delayed
time slots") added the slotting feature to approximate the behaviors
of media with packet aggregation but only supported a uniform
distribution for delays between transmission attempts. Tests with TCP
BBR with emulated wifi links with non-uniform distributions produced
more useful results.

Syntax:
   slot dist DISTRIBUTION DELAY JITTER [packets MAX_PACKETS] \
      [bytes MAX_BYTES]

The syntax and use of the distribution table is the same as in the
non-uniform distribution delay feature. A file DISTRIBUTION must be
present in TC_LIB_DIR (e.g. /usr/lib/tc) containing numbers scaled by
NETEM_DIST_SCALE. A random value x is selected from the table and it
takes DELAY + ( x * JITTER ) as delay. Correlation between values is not
supported.

Examples:
  Normal distribution delay with mean = 800us and stdev = 100us.
  > tc qdisc add dev eth0 root netem slot dist normal 800us 100us

  Optionally set the max slot size in bytes and/or packets.
  > tc qdisc add dev eth0 root netem slot dist normal 800us 100us \
    bytes 64k packets 42

Signed-off-by: Yousuk Seung <ysseung@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 22:06:24 +09:00
Ursula Braun
24ac3a08e6 net/smc: rebuild nonblocking connect
The recent poll change may lead to stalls for non-blocking connecting
SMC sockets, since sock_poll_wait is no longer performed on the
internal CLC socket, but on the outer SMC socket.  kernel_connect() on
the internal CLC socket returns with -EINPROGRESS, but the wake up
logic does not work in all cases. If the internal CLC socket is still
in state TCP_SYN_SENT when polled, sock_poll_wait() from sock_poll()
does not sleep. It is supposed to sleep till the state of the internal
CLC socket switches to TCP_ESTABLISHED.

This problem triggered a redesign of the SMC nonblocking connect logic.
This patch introduces a connect worker covering all connect steps
followed by a wake up of socket waiters. It allows to get rid of all
delays and locks in smc_poll().

Fixes: c0129a0614 ("smc: convert to ->poll_mask")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 22:03:55 +09:00
Eric Dumazet
15ecbe94a4 tcp: add one more quick ack after after ECN events
Larry Brakmo proposal ( https://patchwork.ozlabs.org/patch/935233/
tcp: force cwnd at least 2 in tcp_cwnd_reduction) made us rethink
about our recent patch removing ~16 quick acks after ECN events.

tcp_enter_quickack_mode(sk, 1) makes sure one immediate ack is sent,
but in the case the sender cwnd was lowered to 1, we do not want
to have a delayed ack for the next packet we will receive.

Fixes: 522040ea5f ("tcp: do not aggressively quick ack after ECN events")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 22:01:04 +09:00
Masahiro Yamada
8e75887d32 bpfilter: include bpfilter_umh in assembly instead of using objcopy
What we want here is to embed a user-space program into the kernel.
Instead of the complex ELF magic, let's simply wrap it in the assembly
with the '.incbin' directive.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 21:39:16 +09:00
Doron Roberts-Kedes
977c7114eb strparser: Remove early eaten to fix full tcp receive buffer stall
On receving an incomplete message, the existing code stores the
remaining length of the cloned skb in the early_eaten field instead of
incrementing the value returned by __strp_recv. This defers invocation
of sock_rfree for the current skb until the next invocation of
__strp_recv, which returns early_eaten if early_eaten is non-zero.

This behavior causes a stall when the current message occupies the very
tail end of a massive skb, and strp_peek/need_bytes indicates that the
remainder of the current message has yet to arrive on the socket. The
TCP receive buffer is totally full, causing the TCP window to go to
zero, so the remainder of the message will never arrive.

Incrementing the value returned by __strp_recv by the amount otherwise
stored in early_eaten prevents stalls of this nature.

Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 21:37:26 +09:00
Guillaume Nault
a408194aa0 l2tp: define helper for parsing struct sockaddr_pppol2tp*
'sockaddr_len' is checked against various values when entering
pppol2tp_connect(), to verify its validity. It is used again later, to
find out which sockaddr structure was passed from user space. This
patch combines these two operations into one new function in order to
simplify pppol2tp_connect().

A new structure, l2tp_connect_info, is used to pass sockaddr data back
to pppol2tp_connect(), to avoid passing too many parameters to
l2tp_sockaddr_get_info(). Also, the first parameter is void* in order
to avoid casting between all sockaddr_* structures manually.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 16:06:50 +09:00
Eric Dumazet
242b1bbe51 tcp: remove one indentation level in tcp_create_openreq_child
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 16:02:31 +09:00
Masahiro Yamada
88e85a7daf bpfilter: check compiler capability in Kconfig
With the brand-new syntax extension of Kconfig, we can directly
check the compiler capability in the configuration phase.

If the cc-can-link.sh fails, the BPFILTER_UMH is automatically
hidden by the dependency.

I also deleted 'default n', which is no-op.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 13:36:39 +09:00
David S. Miller
0901441839 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree:

1) Missing netlink attribute validation in nf_queue, uncovered by KASAN,
   from Eric Dumazet.

2) Use pointer to sysctl table, save us 192 bytes of memory per netns.
   Also from Eric.

3) Possible use-after-free when removing conntrack helper modules due
   to missing synchronize RCU call. From Taehee Yoo.

4) Fix corner case in systcl writes to nf_log that lead to appending
   data to uninitialized buffer, from Jann Horn.

5) Jann Horn says we may indefinitely block other users of nf_log_mutex
   if a userspace access in proc_dostring() blocked e.g. due to a
   userfaultfd.

6) Fix garbage collection race for unconfirmed conntrack entries,
   from Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 13:32:44 +09:00
Zhen Lei
7284fdf39a esp6: fix memleak on error path in esp6_input
This ought to be an omission in e619492323 ("esp: Fix memleaks on error
paths."). The memleak on error path in esp6_input is similar to esp_input
of esp4.

Fixes: e619492323 ("esp: Fix memleaks on error paths.")
Fixes: 3f29770723 ("ipsec: check return value of skb_to_sgvec always")
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-06-27 17:32:11 +02:00
Roopa Prabhu
8e326289e3 neighbour: force neigh_invalidate when NUD_FAILED update is from admin
In systems where neigh gc thresh holds are set to high values,
admin deleted neigh entries (eg ip neigh flush or ip neigh del) can
linger around in NUD_FAILED state for a long time until periodic gc kicks
in. This patch forces neigh_invalidate when NUD_FAILED neigh_update is
from an admin.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-27 15:40:45 +09:00
Kees Cook
3463e51dc3 net/tls: Remove VLA usage on nonce
It looks like the prior VLA removal, commit b16520f749 ("net/tls: Remove
VLA usage"), and a new VLA addition, commit c46234ebb4 ("tls: RX path
for ktls"), passed in the night. This removes the newly added VLA, which
happens to have its bounds based on the same max value.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-27 10:39:52 +09:00
Jason A. Donenfeld
7c8f4e6dc3 fib_rules: match rules based on suppress_* properties too
Two rules with different values of suppress_prefix or suppress_ifgroup
are not the same. This fixes an -EEXIST when running:

   $ ip -4 rule add table main suppress_prefixlength 0

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Fixes: f9d4b0c1e9 ("fib_rules: move common handling of newrule delrule msgs into fib_nl2rule")
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-27 10:33:05 +09:00
Sowmini Varadhan
c809195f55 rds: clean up loopback rds_connections on netns deletion
The RDS core module creates rds_connections based on callbacks
from rds_loop_transport when sending/receiving packets to local
addresses.

These connections will need to be cleaned up when they are
created from a netns that is not init_net, and that netns is deleted.

Add the changes aligned with the changes from
commit ebeeb1ad9b ("rds: tcp: use rds_destroy_pending() to synchronize
netns/module teardown and rds connection/workq management") for
rds_loop_transport

Reported-and-tested-by: syzbot+4c20b3866171ce8441d2@syzkaller.appspotmail.com
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-27 10:11:03 +09:00
Florian Westphal
b36e4523d4 netfilter: nf_conncount: fix garbage collection confirm race
Yi-Hung Wei and Justin Pettit found a race in the garbage collection scheme
used by nf_conncount.

When doing list walk, we lookup the tuple in the conntrack table.
If the lookup fails we remove this tuple from our list because
the conntrack entry is gone.

This is the common cause, but turns out its not the only one.
The list entry could have been created just before by another cpu, i.e. the
conntrack entry might not yet have been inserted into the global hash.

The avoid this, we introduce a timestamp and the owning cpu.
If the entry appears to be stale, evict only if:
 1. The current cpu is the one that added the entry, or,
 2. The timestamp is older than two jiffies

The second constraint allows GC to be taken over by other
cpu too (e.g. because a cpu was offlined or napi got moved to another
cpu).

We can't pretend the 'doubtful' entry wasn't in our list.
Instead, when we don't find an entry indicate via IS_ERR
that entry was removed ('did not exist' or withheld
('might-be-unconfirmed').

This most likely also fixes a xt_connlimit imbalance earlier reported by
Dmitry Andrianov.

Cc: Dmitry Andrianov <dmitry.andrianov@alertme.com>
Reported-by: Justin Pettit <jpettit@vmware.com>
Reported-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-26 18:28:57 +02:00