Commit Graph

49182 Commits

Author SHA1 Message Date
Yafang Shao
986ffdfd08 net: sock: replace sk_state_load with inet_sk_state_load and remove sk_state_store
sk_state_load is only used by AF_INET/AF_INET6, so rename it to
inet_sk_state_load and move it into inet_sock.h.

sk_state_store is removed as it is not used any more.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20 14:00:25 -05:00
Yafang Shao
563e0bb0dc net: tracepoint: replace tcp_set_state tracepoint with inet_sock_set_state tracepoint
As sk_state is a common field for struct sock, so the state
transition tracepoint should not be a TCP specific feature.
Currently it traces all AF_INET state transition, so I rename this
tracepoint to inet_sock_set_state tracepoint with some minor changes and move it
into trace/events/sock.h.
We dont need to create a file named trace/events/inet_sock.h for this one single
tracepoint.

Two helpers are introduced to trace sk_state transition
    - void inet_sk_state_store(struct sock *sk, int newstate);
    - void inet_sk_set_state(struct sock *sk, int state);
As trace header should not be included in other header files,
so they are defined in sock.c.

The protocol such as SCTP maybe compiled as a ko, hence export
inet_sk_set_state().

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20 14:00:25 -05:00
Haishuang Yan
afb4c97d90 ip6_gre: fix potential memory leak in ip6erspan_rcv
If md is NULL, tun_dst must be freed, otherwise it will cause memory
leak.

Fixes: ef7baf5e08 ("ip6_gre: add ip6 erspan collect_md mode")
Cc: William Tu <u9012063@gmail.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20 13:56:39 -05:00
Haishuang Yan
50670b6ee9 ip_gre: fix potential memory leak in erspan_rcv
If md is NULL, tun_dst must be freed, otherwise it will cause memory
leak.

Fixes: 1a66a836da ("gre: add collect_md mode to ERSPAN tunnel")
Cc: William Tu <u9012063@gmail.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20 13:56:39 -05:00
Haishuang Yan
a7343211f0 ip6_gre: fix error path when ip6erspan_rcv failed
Same as ipv4 code, when ip6erspan_rcv call return PACKET_REJECT, we
should call icmpv6_send to send icmp unreachable message in error path.

Fixes: 5a963eb61b ("ip6_gre: Add ERSPAN native tunnel support")
Acked-by: William Tu <u9012063@gmail.com>
Cc: William Tu <u9012063@gmail.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20 13:51:46 -05:00
Haishuang Yan
dd8d5b8c5b ip_gre: fix error path when erspan_rcv failed
When erspan_rcv call return PACKET_REJECT, we shoudn't call ipgre_rcv to
process packets again, instead send icmp unreachable message in error
path.

Fixes: 84e54fe0a5 ("gre: introduce native tunnel support for ERSPAN")
Acked-by: William Tu <u9012063@gmail.com>
Cc: William Tu <u9012063@gmail.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20 13:51:46 -05:00
Haishuang Yan
293a1991cf ip6_gre: fix a pontential issue in ip6erspan_rcv
pskb_may_pull() can change skb->data, so we need to load ipv6h/ershdr at
the right place.

Fixes: 5a963eb61b ("ip6_gre: Add ERSPAN native tunnel support")
Cc: William Tu <u9012063@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20 13:48:39 -05:00
Andy Shevchenko
223b229b63 bridge: Use helpers to handle MAC address
Use
	%pM to print MAC
	mac_pton() to convert it from ASCII to binary format, and
	ether_addr_copy() to copy.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20 12:46:11 -05:00
Alexey Kodanev
53c81e95df ip6_vti: adjust vti mtu according to mtu of lower device
LTP/udp6_ipsec_vti tests fail when sending large UDP datagrams over
ip6_vti that require fragmentation and the underlying device has an
MTU smaller than 1500 plus some extra space for headers. This happens
because ip6_vti, by default, sets MTU to ETH_DATA_LEN and not updating
it depending on a destination address or link parameter. Further
attempts to send UDP packets may succeed because pmtu gets updated on
ICMPV6_PKT_TOOBIG in vti6_err().

In case the lower device has larger MTU size, e.g. 9000, ip6_vti works
but not using the possible maximum size, output packets have 1500 limit.

The above cases require manual MTU setup after ip6_vti creation. However
ip_vti already updates MTU based on lower device with ip_tunnel_bind_dev().

Here is the example when the lower device MTU is set to 9000:

  # ip a sh ltp_ns_veth2
      ltp_ns_veth2@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 ...
        inet 10.0.0.2/24 scope global ltp_ns_veth2
        inet6 fd00::2/64 scope global

  # ip li add vti6 type vti6 local fd00::2 remote fd00::1
  # ip li show vti6
      vti6@NONE: <POINTOPOINT,NOARP> mtu 1500 ...
        link/tunnel6 fd00::2 peer fd00::1

After the patch:
  # ip li add vti6 type vti6 local fd00::2 remote fd00::1
  # ip li show vti6
      vti6@NONE: <POINTOPOINT,NOARP> mtu 8832 ...
        link/tunnel6 fd00::2 peer fd00::1

Reported-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20 11:52:32 -05:00
Cong Wang
1df94c3c5d net_sched: properly check for empty skb array on error path
First, the check of &q->ring.queue against NULL is wrong, it
is always false. We should check the value rather than the address.

Secondly, we need the same check in pfifo_fast_reset() too,
as both ->reset() and ->destroy() are called in qdisc_destroy().

Fixes: c5ad119fb6 ("net: sched: pfifo_fast use skb_array")
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-19 14:13:12 -05:00
David S. Miller
748a709974 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2017-12-18

Here's the first bluetooth-next pull request for the 4.16 kernel.

 - hci_ll: multiple cleanups & fixes
 - Remove Gustavo Padovan from the MAINTAINERS file
 - Support BLE Adversing while connected (if the controller can do it)
 - DT updates for TI chips
 - Various other smaller cleanups & fixes

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-19 13:53:39 -05:00
Michael Chan
56f5aa77cd net: Disable GRO_HW when generic XDP is installed on a device.
Hardware should not aggregate any packets when generic XDP is installed.

Cc: Ariel Elior <Ariel.Elior@cavium.com>
Cc: everest-linux-l2@cavium.com
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-19 10:38:36 -05:00
Michael Chan
fb1f5f79ae net: Introduce NETIF_F_GRO_HW.
Introduce NETIF_F_GRO_HW feature flag for NICs that support hardware
GRO.  With this flag, we can now independently turn on or off hardware
GRO when GRO is on.  Previously, drivers were using NETIF_F_GRO to
control hardware GRO and so it cannot be independently turned on or
off without affecting GRO.

Hardware GRO (just like GRO) guarantees that packets can be re-segmented
by TSO/GSO to reconstruct the original packet stream.  Logically,
GRO_HW should depend on GRO since it a subset, but we will let
individual drivers enforce this dependency as they see fit.

Since NETIF_F_GRO is not propagated between upper and lower devices,
NETIF_F_GRO_HW should follow suit since it is a subset of GRO.  In other
words, a lower device can independent have GRO/GRO_HW enabled or disabled
and no feature propagation is required.  This will preserve the current
GRO behavior.  This can be changed later if we decide to propagate GRO/
GRO_HW/RXCSUM from upper to lower devices.

Cc: Ariel Elior <Ariel.Elior@cavium.com>
Cc: everest-linux-l2@cavium.com
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-19 10:38:36 -05:00
Tonghao Zhang
648845ab7e sock: Move the socket inuse to namespace.
In some case, we want to know how many sockets are in use in
different _net_ namespaces. It's a key resource metric.

This patch add a member in struct netns_core. This is a counter
for socket-inuse in the _net_ namespace. The patch will add/sub
counter in the sk_alloc, sk_clone_lock and __sk_free.

This patch will not counter the socket created in kernel.
It's not very useful for userspace to know how many kernel
sockets we created.

The main reasons for doing this are that:

1. When linux calls the 'do_exit' for process to exit, the functions
'exit_task_namespaces' and 'exit_task_work' will be called sequentially.
'exit_task_namespaces' may have destroyed the _net_ namespace, but
'sock_release' called in 'exit_task_work' may use the _net_ namespace
if we counter the socket-inuse in sock_release.

2. socket and sock are in pair. More important, sock holds the _net_
namespace. We counter the socket-inuse in sock, for avoiding holding
_net_ namespace again in socket. It's a easy way to maintain the code.

Signed-off-by: Martin Zhang <zhangjunweimartin@didichuxing.com>
Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-19 09:58:14 -05:00
Tonghao Zhang
08fc7f8140 sock: Change the netns_core member name.
Change the member name will make the code more readable.
This patch will be used in next patch.

Signed-off-by: Martin Zhang <zhangjunweimartin@didichuxing.com>
Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-19 09:58:14 -05:00
William Tu
d91e8db5b6 net: erspan: reload pointer after pskb_may_pull
pskb_may_pull() can change skb->data, so we need to re-load pkt_md
and ershdr at the right place.

Fixes: 94d7d8f292 ("ip6_gre: add erspan v2 support")
Fixes: f551c91de2 ("net: erspan: introduce erspan v2 for ip_gre")
Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-18 15:11:25 -05:00
William Tu
ae3e13373b net: erspan: fix wrong return value
If pskb_may_pull return failed, return PACKET_REJECT
instead of -ENOMEM.

Fixes: 94d7d8f292 ("ip6_gre: add erspan v2 support")
Fixes: f551c91de2 ("net: erspan: introduce erspan v2 for ip_gre")
Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-18 15:11:25 -05:00
Samuel Mendoza-Jonas
75e8e15635 net/ncsi: Don't take any action on HNCDSC AEN
The current HNCDSC handler takes the status flag from the AEN packet and
will update or change the current channel based on this flag and the
current channel status.

However the flag from the HNCDSC packet merely represents the host link
state. While the state of the host interface is potentially interesting
information it should not affect the state of the NCSI link. Indeed the
NCSI specification makes no mention of any recommended action related to
the host network controller driver state.

Update the HNCDSC handler to record the host network driver status but
take no other action.

Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Acked-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-18 14:50:11 -05:00
David S. Miller
c30abd5e40 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three sets of overlapping changes, two in the packet scheduler
and one in the meson-gxl PHY driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-16 22:11:55 -05:00
Linus Torvalds
d025fbf1a2 NFS client fixes for Linux 4.15-rc4
Stable bugfixes:
 - NFS: Avoid a BUG_ON() in nfs_commit_inode() by not waiting for a
        commit in the case that there were no commit requests.
 - SUNRPC: Fix a race in the receive code path
 
 Other fixes:
 - NFS: Fix a deadlock in nfs client initialization
 - xprtrdma: Fix a performance regression for small IOs
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlo0PdMACgkQ18tUv7Cl
 QOvlUg/+KoXWXNwItHIyyegYgRXcAPpaCtdnCjjOP6R9HEJ+clnLcaqDxdDKVWQ/
 oDvEcQcsBpywbUi7vVrvdar4mofwuyjXPpbcZPlDP1Ru4yyAlyylftwIuQW/nzdd
 vX2tZaVf+B9y1XvSD5NI+2EKWmp7MVrPdNhYxAB39TQZnAAvYDFHhywtZ0UR7vJt
 7YVcZoPtKUhg15jhCOr73eaCT0884/tlgedfd6DkDGR6bCtSQC2PySfqq9Lnnl/1
 ruDzzcgTARzSEzvta/uyBRspOLBHeeBhTdQUp79lMfekC4+68Tx6DFWnydIUttuE
 G7LphN6hfbJLF20U/ENb2H8v10WZsKvGEuxM+fp5PXGcIMSlX4qoJUe/egJFiiSL
 IaikgibvfiKmYSJvwdxTlOcr793X2Ej19HNciNjJQp4pviDOdZixgtGvVVHJBmh6
 LYzE5q9jgbW9wQXwTTeWHp/nyqL80NslX0UARYnS2Ua0B96GRCESXqCUFtxK6tKR
 wbYiHzKc4dOfSxpNlKI+FlX63m5oSAmTEii3ODsWZjObbwYHNX2Zqj2cVFiSLCpv
 ZXgmpNL+tL2zBWxPvn6rzYhpaXo++PqlHK7vv2QVBI6XM2J8ztpj5Wr5zneRoJaE
 ejk8nw/mR43bfdQuUGZRKh/Z+FTqL0/2WbDgJMXl09c+zRz7J2c=
 =XhEC
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "This has two stable bugfixes, one to fix a BUG_ON() when
  nfs_commit_inode() is called with no outstanding commit requests and
  another to fix a race in the SUNRPC receive codepath.

  Additionally, there are also fixes for an NFS client deadlock and an
  xprtrdma performance regression.

  Summary:

  Stable bugfixes:
   - NFS: Avoid a BUG_ON() in nfs_commit_inode() by not waiting for a
     commit in the case that there were no commit requests.
   - SUNRPC: Fix a race in the receive code path

  Other fixes:
   - NFS: Fix a deadlock in nfs client initialization
   - xprtrdma: Fix a performance regression for small IOs"

* tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  SUNRPC: Fix a race in the receive code path
  nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests
  xprtrdma: Spread reply processing over more CPUs
  nfs: fix a deadlock in nfs client initialization
2017-12-16 13:12:53 -08:00
Linus Torvalds
7a3c296ae0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Clamp timeouts to INT_MAX in conntrack, from Jay Elliot.

 2) Fix broken UAPI for BPF_PROG_TYPE_PERF_EVENT, from Hendrik
    Brueckner.

 3) Fix locking in ieee80211_sta_tear_down_BA_sessions, from Johannes
    Berg.

 4) Add missing barriers to ptr_ring, from Michael S. Tsirkin.

 5) Don't advertise gigabit in sh_eth when not available, from Thomas
    Petazzoni.

 6) Check network namespace when delivering to netlink taps, from Kevin
    Cernekee.

 7) Kill a race in raw_sendmsg(), from Mohamed Ghannam.

 8) Use correct address in TCP md5 lookups when replying to an incoming
    segment, from Christoph Paasch.

 9) Add schedule points to BPF map alloc/free, from Eric Dumazet.

10) Don't allow silly mtu values to be used in ipv4/ipv6 multicast, also
    from Eric Dumazet.

11) Fix SKB leak in tipc, from Jon Maloy.

12) Disable MAC learning on OVS ports of mlxsw, from Yuval Mintz.

13) SKB leak fix in skB_complete_tx_timestamp(), from Willem de Bruijn.

14) Add some new qmi_wwan device IDs, from Daniele Palmas.

15) Fix static key imbalance in ingress qdisc, from Jiri Pirko.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits)
  net: qcom/emac: Reduce timeout for mdio read/write
  net: sched: fix static key imbalance in case of ingress/clsact_init error
  net: sched: fix clsact init error path
  ip_gre: fix wrong return value of erspan_rcv
  net: usb: qmi_wwan: add Telit ME910 PID 0x1101 support
  pkt_sched: Remove TC_RED_OFFLOADED from uapi
  net: sched: Move to new offload indication in RED
  net: sched: Add TCA_HW_OFFLOAD
  net: aquantia: Increment driver version
  net: aquantia: Fix typo in ethtool statistics names
  net: aquantia: Update hw counters on hw init
  net: aquantia: Improve link state and statistics check interval callback
  net: aquantia: Fill in multicast counter in ndev stats from hardware
  net: aquantia: Fill ndev stat couters from hardware
  net: aquantia: Extend stat counters to 64bit values
  net: aquantia: Fix hardware DMA stream overload on large MRRS
  net: aquantia: Fix actual speed capabilities reporting
  sock: free skb in skb_complete_tx_timestamp on error
  s390/qeth: update takeover IPs after configuration change
  s390/qeth: lock IP table while applying takeover changes
  ...
2017-12-15 13:08:37 -08:00
Jiri Pirko
b59e6979a8 net: sched: fix static key imbalance in case of ingress/clsact_init error
Move static key increments to the beginning of the init function
so they pair 1:1 with decrements in ingress/clsact_destroy,
which is called in case ingress/clsact_init fails.

Fixes: 6529eaba33 ("net: sched: introduce tcf block infractructure")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 15:43:12 -05:00
Jiri Pirko
343723dd51 net: sched: fix clsact init error path
Since in qdisc_create, the destroy op is called when init fails, we
don't do cleanup in init and leave it up to destroy.
This fixes use-after-free when trying to put already freed block.

Fixes: 6e40cf2d4d ("net: sched: use extended variants of block_get/put in ingress and clsact qdiscs")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 15:43:12 -05:00
Trond Myklebust
90d91b0cd3 SUNRPC: Fix a race in the receive code path
We must ensure that the call to rpc_sleep_on() in xprt_transmit() cannot
race with the call to xprt_complete_rqst().

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=317
Fixes: ce7c252a8c ("SUNRPC: Add a separate spinlock to protect..")
Cc: stable@vger.kernel.org # 4.14+
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-12-15 14:31:56 -05:00
Chuck Lever
ccede75985 xprtrdma: Spread reply processing over more CPUs
Commit d8f532d20e ("xprtrdma: Invoke rpcrdma_reply_handler
directly from RECV completion") introduced a performance regression
for NFS I/O small enough to not need memory registration. In multi-
threaded benchmarks that generate primarily small I/O requests,
IOPS throughput is reduced by nearly a third. This patch restores
the previous level of throughput.

Because workqueues are typically BOUND (in particular ib_comp_wq,
nfsiod_workqueue, and rpciod_workqueue), NFS/RDMA workloads tend
to aggregate on the CPU that is handling Receive completions.

The usual approach to addressing this problem is to create a QP
and CQ for each CPU, and then schedule transactions on the QP
for the CPU where you want the transaction to complete. The
transaction then does not require an extra context switch during
completion to end up on the same CPU where the transaction was
started.

This approach doesn't work for the Linux NFS/RDMA client because
currently the Linux NFS client does not support multiple connections
per client-server pair, and the RDMA core API does not make it
straightforward for ULPs to determine which CPU is responsible for
handling Receive completions for a CQ.

So for the moment, record the CPU number in the rpcrdma_req before
the transport sends each RPC Call. Then during Receive completion,
queue the RPC completion on that same CPU.

Additionally, move all RPC completion processing to the deferred
handler so that even RPCs with simple small replies complete on
the CPU that sent the corresponding RPC Call.

Fixes: d8f532d20e ("xprtrdma: Invoke rpcrdma_reply_handler ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-12-15 14:31:50 -05:00
Haishuang Yan
c05fad5713 ip_gre: fix wrong return value of erspan_rcv
If pskb_may_pull return failed, return PACKET_REJECT instead of -ENOMEM.

Fixes: 84e54fe0a5 ("gre: introduce native tunnel support for ERSPAN")
Cc: William Tu <u9012063@gmail.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 14:10:39 -05:00
Xin Long
463118c34a sctp: support sysctl to allow users to use stream interleave
This is the last patch for support of stream interleave, after this patch,
users could enable stream interleave by systcl -w net.sctp.intl_enable=1.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 13:52:22 -05:00
Xin Long
107e242569 sctp: update mid instead of ssn when doing stream and asoc reset
When using idata and doing stream and asoc reset, setting ssn with
0 could only clear the 1st 16 bits of mid.

So to make this work for both data and idata, it sets mid with 0
instead of ssn, and also mid_uo for unordered idata also need to
be cleared, as said in section 2.3.2 of RFC8260.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 13:52:22 -05:00
Xin Long
ef4775e340 sctp: add stream interleave support in stream scheduler
As Marcelo said in the stream scheduler patch:

  Support for I-DATA chunks, also described in RFC8260, with user message
  interleaving is straightforward as it just requires the schedulers to
  probe for the feature and ignore datamsg boundaries when dequeueing.

All needs to do is just to ignore datamsg boundaries when dequeueing.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 13:52:22 -05:00
Xin Long
de60fe9105 sctp: implement handle_ftsn for sctp_stream_interleave
handle_ftsn is added as a member of sctp_stream_interleave, used to skip
ssn for data or mid for idata, called for SCTP_CMD_PROCESS_FWDTSN cmd.

sctp_handle_iftsn works for ifwdtsn, and sctp_handle_fwdtsn works for
fwdtsn. Note that different from sctp_handle_fwdtsn, sctp_handle_iftsn
could do stream abort pd.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 13:52:22 -05:00
Xin Long
47b20a8856 sctp: implement report_ftsn for sctp_stream_interleave
report_ftsn is added as a member of sctp_stream_interleave, used to
skip tsn from tsnmap, remove old events from reasm or lobby queue,
and abort pd for data or idata, called for SCTP_CMD_REPORT_FWDTSN
cmd and asoc reset.

sctp_report_iftsn works for ifwdtsn, and sctp_report_fwdtsn works
for fwdtsn. Note that sctp_report_iftsn doesn't do asoc abort_pd,
as stream abort_pd will be done when handling ifwdtsn. But when
ftsn is equal with ftsn, which means asoc reset, asoc abort_pd has
to be done.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 13:52:22 -05:00
Xin Long
0fc2ea922c sctp: implement validate_ftsn for sctp_stream_interleave
validate_ftsn is added as a member of sctp_stream_interleave, used to
validate ssn/chunk type for fwdtsn or mid (message id)/chunk type for
ifwdtsn, called in sctp_sf_eat_fwd_tsn, just as validate_data.

If this check fails, an abort packet will be sent, as said in section
2.3.1 of RFC8260.

As ifwdtsn and fwdtsn chunks have different length, it also defines
ftsn_chunk_len for sctp_stream_interleave to describe the chunk size.
Then it replaces all sizeof(struct sctp_fwdtsn_chunk) with
sctp_ftsnchk_len.

It also adds the process for ifwdtsn in rx path. As Marcelo pointed
out, there's no need to add event table for ifwdtsn, but just share
prsctp_chunk_event_table with fwdtsn's. It would drop fwdtsn chunk
for ifwdtsn and drop ifwdtsn chunk for fwdtsn by calling validate_ftsn
in sctp_sf_eat_fwd_tsn.

After this patch, the ifwdtsn can be accepted.

Note that this patch also removes the sctp.intl_enable check for
idata chunks in sctp_chunk_event_lookup, as it will do this check
in validate_data later.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 13:52:22 -05:00
Xin Long
8e0c3b73ce sctp: implement generate_ftsn for sctp_stream_interleave
generate_ftsn is added as a member of sctp_stream_interleave, used to
create fwdtsn or ifwdtsn chunk according to abandoned chunks, called
in sctp_retransmit and sctp_outq_sack.

sctp_generate_iftsn works for ifwdtsn, and sctp_generate_fwdtsn is
still used for making fwdtsn.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 13:52:21 -05:00
Xin Long
2d07a49ade sctp: add basic structures and make chunk function for ifwdtsn
sctp_ifwdtsn_skip, sctp_ifwdtsn_hdr and sctp_ifwdtsn_chunk are used to
define and parse I-FWD TSN chunk format, and sctp_make_ifwdtsn is a
function to build the chunk.

The I-FORWARD-TSN Chunk Format is defined in section 2.3.1 of RFC8260.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 13:52:21 -05:00
Yuval Mintz
428a68af3a net: sched: Move to new offload indication in RED
Let RED utilize the new internal flag, TCQ_F_OFFLOADED,
to mark a given qdisc as offloaded instead of using a dedicated
indication.

Also, change internal logic into looking at said flag when possible.

Fixes: 602f3baf22 ("net_sch: red: Add offload ability to RED qdisc")
Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 13:35:36 -05:00
Yuval Mintz
7a4fa29106 net: sched: Add TCA_HW_OFFLOAD
Qdiscs can be offloaded to HW, but current implementation isn't uniform.
Instead, qdiscs either pass information about offload status via their
TCA_OPTIONS or omit it altogether.

Introduce a new attribute - TCA_HW_OFFLOAD that would form a uniform
uAPI for the offloading status of qdiscs.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 13:35:36 -05:00
William Tu
94d7d8f292 ip6_gre: add erspan v2 support
Similar to support for ipv4 erspan, this patch adds
erspan v2 to ip6erspan tunnel.

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 12:34:00 -05:00
William Tu
f551c91de2 net: erspan: introduce erspan v2 for ip_gre
The patch adds support for erspan version 2.  Not all features are
supported in this patch.  The SGT (security group tag), GRA (timestamp
granularity), FT (frame type) are set to fixed value.  Only hardware
ID and direction are configurable.  Optional subheader is also not
supported.

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 12:34:00 -05:00
William Tu
1d7e2ed22f net: erspan: refactor existing erspan code
The patch refactors the existing erspan implementation in order
to support erspan version 2, which has additional metadata.  So, in
stead of having one 'struct erspanhdr' holding erspan version 1,
breaks it into 'struct erspan_base_hdr' and 'struct erspan_metadata'.

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 12:33:59 -05:00
Willem de Bruijn
35b99dffc3 sock: free skb in skb_complete_tx_timestamp on error
skb_complete_tx_timestamp must ingest the skb it is passed. Call
kfree_skb if the skb cannot be enqueued.

Fixes: b245be1f4d ("net-timestamp: no-payload only sysctl")
Fixes: 9ac25fc063 ("net: fix socket refcounting in skb_complete_tx_timestamp()")
Reported-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 11:30:36 -05:00
David S. Miller
8ce38aeb55 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2017-12-15

1) Currently we can add or update socket policies, but
   not clear them. Support clearing of socket policies
   too. From Lorenzo Colitti.

2) Add documentation for the xfrm device offload api.
   From Shannon Nelson.

3) Fix IPsec extended sequence numbers (ESN) for
   IPsec offloading. From Yossef Efraim.

4) xfrm_dev_state_add function returns success even for
   unsupported options, fix this to fail in such cases.
   From Yossef Efraim.

5) Remove a redundant xfrm_state assignment.
   From Aviv Heller.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 11:10:27 -05:00
David S. Miller
0f546ffcd0 Here are some batman-adv bugfixes:
- Initialize the fragment headers, by Sven Eckelmann
 
  - Fix a NULL check in BATMAN V, by Sven Eckelmann
 
  - Fix kernel doc for the time_setup() change, by Sven Eckelmann
 
  - Use the right lock in BATMAN IV OGM Update, by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlozsZQWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoZ1uD/94DURJf4b8kgUDjgMfuKZZyafH
 c4N8SI+3+gZExtgXbb48f91a9S5T7qIAm46c7nHNJpTLEhc/ucup3c06i8VjBc2r
 Tz37jnnjrHzfbw0p10zpLkdDlU0mLtEQBpNGQxru8h2x3gs2RBfDy/yX7MzzAp4m
 maZ3ub0wPAtlRKQ4OywTStP9ZP7BXj4/puCRrn3kPIKDc1Ahwh4dS4r4euSvvR6z
 2Qkfnpb3sbzh1Sa526GNAwgEo6hpssOugQyCIZUK4YJh1OfjnS+xJHWmfBkLw7q9
 OfkuwAjkZMivrcxK/lvdMJ6WltQUyWVkKxDuA6iAtwRV4tX5W5b+9OEToZlHuPid
 11bDXDfmWTfnpYM7uTT7tRbjFpvJ9X4ftUNHHtlL+h+VRz8Uf0xcteBIi7FxHHm5
 fRqMLj5J0WU5rVf2vKTz2XIi5dXWgy/bkoPzalMc8UxWVsKthr34UZmcPRDVvCja
 bwad1H+EEelx4tvPeWTEEhdhLFvalmfgRG3i/paqr547ETllmO462kdUTOmxDnvT
 76GZVsLeHBPrnZi/tqc230EyoZijiM/l41twOZPjb0cklvOEFYR0vs/BUfmvdM/3
 xlUF+CBaVnHcY3HDn7ywVICWP6jnG+LOQ4tOL6jjsKiqbEpPHRHuQ++MnZ2uRUgs
 5e/yBVxdEjo9RoLVsg==
 =US7g
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20171215' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are some batman-adv bugfixes:

 - Initialize the fragment headers, by Sven Eckelmann

 - Fix a NULL check in BATMAN V, by Sven Eckelmann

 - Fix kernel doc for the time_setup() change, by Sven Eckelmann

 - Use the right lock in BATMAN IV OGM Update, by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 11:02:11 -05:00
Sean Wang
f0af34317f net: dsa: mediatek: combine MediaTek tag with VLAN tag
In order to let MT7530 switch can recognize well those egress packets
having both special tag and VLAN tag, the information about the special
tag should be carried on the existing VLAN tag. On the other hand, it's
unnecessary for extra handling for ingress packets when VLAN tag is
present since it is able to put the VLAN tag after the special tag and
then follow the existing way to parse.

Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15 10:31:54 -05:00
Thiago Rafael Becker
bdcf0a423e kernel: make groups_sort calling a responsibility group_info allocators
In testing, we found that nfsd threads may call set_groups in parallel
for the same entry cached in auth.unix.gid, racing in the call of
groups_sort, corrupting the groups for that entry and leading to
permission denials for the client.

This patch:
 - Make groups_sort globally visible.
 - Move the call to groups_sort to the modifiers of group_info
 - Remove the call to groups_sort from set_groups

Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com
Signed-off-by: Thiago Rafael Becker <thiago.becker@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-14 16:00:49 -08:00
Eric Dumazet
4688eb7cf3 tcp: refresh tcp_mstamp from timers callbacks
Only the retransmit timer currently refreshes tcp_mstamp

We should do the same for delayed acks and keepalives.

Even if RFC 7323 does not request it, this is consistent to what linux
did in the past, when TS values were based on jiffies.

Fixes: 385e20706f ("tcp: use tp->tcp_mstamp in output path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Mike Maloney <maloney@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by:  Mike Maloney <maloney@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13 16:04:04 -05:00
Wei Wang
9ee11bd03c tcp: fix potential underestimation on rcv_rtt
When ms timestamp is used, current logic uses 1us in
tcp_rcv_rtt_update() when the real rcv_rtt is within 1 - 999us.
This could cause rcv_rtt underestimation.
Fix it by always using a min value of 1ms if ms timestamp is used.

Fixes: 645f4c6f2e ("tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps")
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13 16:01:17 -05:00
Yuchung Cheng
7268586baa tcp: pause Fast Open globally after third consecutive timeout
Prior to this patch, active Fast Open is paused on a specific
destination IP address if the previous connections to the
IP address have experienced recurring timeouts . But recent
experiments by Microsoft (https://goo.gl/cykmn7) and Mozilla
browsers indicate the isssue is often caused by broken middle-boxes
sitting close to the client. Therefore it is much better user
experience if Fast Open is disabled out-right globally to avoid
experiencing further timeouts on connections toward other
destinations.

This patch changes the destination-IP disablement to global
disablement if a connection experiencing recurring timeouts
or aborts due to timeout.  Repeated incidents would still
exponentially increase the pause time, starting from an hour.
This is extremely conservative but an unfortunate compromise to
minimize bad experience due to broken middle-boxes.

Reported-by: Dragana Damjanovic <ddamjanovic@mozilla.com>
Reported-by: Patrick McManus <mcmanus@ducksong.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13 15:51:12 -05:00
Willem de Bruijn
8d74e9f88d net: avoid skb_warn_bad_offload on IS_ERR
skb_warn_bad_offload warns when packets enter the GSO stack that
require skb_checksum_help or vice versa. Do not warn on arbitrary
bad packets. Packet sockets can craft many. Syzkaller was able to
demonstrate another one with eth_type games.

In particular, suppress the warning when segmentation returns an
error, which is for reasons other than checksum offload.

See also commit 36c9247449 ("net: WARN if skb_checksum_help() is
called on skb requiring segmentation") for context on this warning.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13 15:14:10 -05:00
Nikolay Aleksandrov
eb7935830d net: bridge: use rhashtable for fdbs
Before this patch the bridge used a fixed 256 element hash table which
was fine for small use cases (in my tests it starts to degrade
above 1000 entries), but it wasn't enough for medium or large
scale deployments. Modern setups have thousands of participants in a
single bridge, even only enabling vlans and adding a few thousand vlan
entries will cause a few thousand fdbs to be automatically inserted per
participating port. So we need to scale the fdb table considerably to
cope with modern workloads, and this patch converts it to use a
rhashtable for its operations thus improving the bridge scalability.
Tests show the following results (10 runs each), at up to 1000 entries
rhashtable is ~3% slower, at 2000 rhashtable is 30% faster, at 3000 it
is 2 times faster and at 30000 it is 50 times faster.
Obviously this happens because of the properties of the two constructs
and is expected, rhashtable keeps pretty much a constant time even with
10000000 entries (tested), while the fixed hash table struggles
considerably even above 10000.
As a side effect this also reduces the net_bridge struct size from 3248
bytes to 1344 bytes. Also note that the key struct is 8 bytes.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13 15:10:01 -05:00
Eric Dumazet
ec94c2696f tcp/dccp: avoid one atomic operation for timewait hashdance
First, rename __inet_twsk_hashdance() to inet_twsk_hashdance()

Then, remove one inet_twsk_put() by setting tw_refcnt to 3 instead
of 4, but adding a fat warning that we do not have the right to access
tw anymore after inet_twsk_hashdance()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13 14:33:10 -05:00