Commit Graph

47449 Commits

Author SHA1 Message Date
Jiri Pirko
8cc6251381 net: sched: cls_cgroup: no need to call tcf_exts_change for newly allocated struct
As the new struct just was allocated, so no need to use tcf_exts_change
to do atomic change, and we can just fill-up the unused exts struct
directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
6839da326d net: sched: cls_bpf: no need to call tcf_exts_change for newly allocated struct
As the prog struct was allocated right before cls_bpf_set_parms call,
no need to use tcf_exts_change to do atomic change, and we can just
fill-up the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
ff1f8ca080 net: sched: cls_basic: no need to call tcf_exts_change for newly allocated struct
As the f struct was allocated right before basic_set_parms call, no need
to use tcf_exts_change to do atomic change, and we can just fill-up
the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
a74cb36980 net: sched: cls_matchall: no need to call tcf_exts_change for newly allocated struct
As the head struct was allocated right before mall_set_parms call,
no need to use tcf_exts_change to do atomic change, and we can just
fill-up the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
94611bff6e net: sched: cls_fw: no need to call tcf_exts_change for newly allocated struct
As the f struct was allocated right before fw_set_parms call, no need
to use tcf_exts_change to do atomic change, and we can just fill-up
the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
455075292b net: sched: cls_flower: no need to call tcf_exts_change for newly allocated struct
As the f struct was allocated right before fl_set_parms call, no need
to use tcf_exts_change to do atomic change, and we can just fill-up
the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
1e5003af37 net: sched: cls_fw: rename fw_change_attrs function
Since the function name is misleading since it is not changing
anything, name it similarly to other cls.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
6a725c481d net: sched: cls_bpf: rename cls_bpf_modify_existing function
The name cls_bpf_modify_existing is highly misleading, as it indeed does
not modify anything existing. It does not modify at all.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
978dfd8d14 net: sched: use tcf_exts_has_actions instead of exts->nr_actions
For check in tcf_exts_dump use tcf_exts_has_actions helper instead
of exts->nr_actions for checking if there are any actions present.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
ec1a9cca0e net: sched: remove check for number of actions in tcf_exts_exec
Leave it to tcf_action_exec to return TC_ACT_OK in case there is no
action present.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:23 -07:00
Jiri Pirko
6fc6d06e53 net: sched: remove redundant helpers tcf_exts_is_predicative and tcf_exts_is_available
These two helpers are doing the same as tcf_exts_has_actions, so remove
them and use tcf_exts_has_actions instead.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:23 -07:00
Jiri Pirko
3bcc0cec81 net: sched: change names of action number helpers to be aligned with the rest
The rest of the helpers are named tcf_exts_*, so change the name of
the action number helpers to be aligned. While at it, change to inline
functions.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:23 -07:00
Jiri Pirko
4ebc1e3cfc net: sched: remove unneeded tcf_em_tree_change
Since tcf_em_tree_validate could be always called on a newly created
filter, there is no need for this change function.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:23 -07:00
Jiri Pirko
f7ebdff757 net: sched: sch_atm: use Qdisc_class_common structure
Even if it is only for classid now, use this common struct a be aligned
with the rest of the classful qdiscs.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:23 -07:00
Linus Torvalds
c63716ab4d A bunch of fixes and follow-ups for -rc1 Luminous patches: issues with
->reencode_message() and last minute RADOS semantic changes in v12.1.2.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJZhI6QAAoJEEp/3jgCEfOLeQYH/0b92RFzmsqwPI+U7iXdD06O
 r0EXbT5dydMngJkWz/i3jBX8cMBvZyNhBh77VPDYXoFUp8//8uv5w73BkXe8JE08
 +gLZU4oP/k7kl/YBYXgCcJYj7eIBFzqNvsWurKKHY/X3xrvEZ0HT+oub92xOUgRM
 IBnZb1gZ4TJQT1MxqKOwb5aqcxaXlrOGfX7Di0aU3PFQXj5VnBI25NUQF1bgd9+A
 MbhHpob6cbWZWzVdf0fTl28q9pStq4qggevRSM/5ZH/bETO8C80XYTuaPoLcQ0pY
 VfpwgWIAPwotw9KU7W+ane13BURw76+pWMHaUZgiJKRyuRMBOT/gaER+AUeuR1o=
 =XH7K
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.13-rc4' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A bunch of fixes and follow-ups for -rc1 Luminous patches: issues with
  ->reencode_message() and last minute RADOS semantic changes in
  v12.1.2"

* tag 'ceph-for-4.13-rc4' of git://github.com/ceph/ceph-client:
  libceph: make RECOVERY_DELETES feature create a new interval
  libceph: upmap semantic changes
  crush: assume weight_set != null imples weight_set_size > 0
  libceph: fallback for when there isn't a pool-specific choose_arg
  libceph: don't call ->reencode_message() more than once per message
  libceph: make encode_request_*() work with r_mempool requests
2017-08-04 10:15:11 -07:00
Willem de Bruijn
f214f915e7 tcp: enable MSG_ZEROCOPY
Enable support for MSG_ZEROCOPY to the TCP stack. TSO and GSO are
both supported. Only data sent to remote destinations is sent without
copying. Packets looped onto a local destination have their payload
copied to avoid unbounded latency.

Tested:
  A 10x TCP_STREAM between two hosts showed a reduction in netserver
  process cycles by up to 70%, depending on packet size. Systemwide,
  savings are of course much less pronounced, at up to 20% best case.

  msg_zerocopy.sh 4 tcp:

  without zerocopy
    tx=121792 (7600 MB) txc=0 zc=n
    rx=60458 (7600 MB)

  with zerocopy
    tx=286257 (17863 MB) txc=286257 zc=y
    rx=140022 (17863 MB)

  This test opens a pair of sockets over veth, one one calls send with
  64KB and optionally MSG_ZEROCOPY and on the other reads the initial
  bytes. The receiver truncates, so this is strictly an upper bound on
  what is achievable. It is more representative of sending data out of
  a physical NIC (when payload is not touched, either).

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:30 -07:00
Willem de Bruijn
a91dbff551 sock: ulimit on MSG_ZEROCOPY pages
Bound the number of pages that a user may pin.

Follow the lead of perf tools to maintain a per-user bound on memory
locked pages commit 789f90fcf6 ("perf_counter: per user mlock gift")

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:30 -07:00
Willem de Bruijn
4ab6c99d99 sock: MSG_ZEROCOPY notification coalescing
In the simple case, each sendmsg() call generates data and eventually
a zerocopy ready notification N, where N indicates the Nth successful
invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket.

TCP and corked sockets can cause send() calls to append new data to an
existing sk_buff and, thus, ubuf_info. In that case the notification
must hold a range. odify ubuf_info to store a inclusive range [N..N+m]
and add skb_zerocopy_realloc() to optionally extend an existing range.

Also coalesce notifications in this common case: if a notification
[1, 1] is about to be queued while [0, 0] is the queue tail, just modify
the head of the queue to read [0, 1].

Coalescing is limited to a few TSO frames worth of data to bound
notification latency.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:30 -07:00
Willem de Bruijn
1f8b977ab3 sock: enable MSG_ZEROCOPY
Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
skb_zerocopy_clone() wherever needed due to skb split, merge, resize
or clone.

Split skb_orphan_frags into two variants. The split, merge, .. paths
support reference counted zerocopy buffers, so do not do a deep copy.
Add skb_orphan_frags_rx for paths that may loop packets to receive
sockets. That is not allowed, as it may cause unbounded latency.
Deep copy all zerocopy copy buffers, ref-counted or not, in this path.

The exact locations to modify were chosen by exhaustively searching
through all code that might modify skb_frag references and/or the
the SKBTX_DEV_ZEROCOPY tx_flags bit.

The changes err on the safe side, in two ways.

(1) legacy ubuf_info paths virtio and tap are not modified. They keep
    a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
    still call skb_copy_ubufs and thus copy frags in this case.

(2) not all copies deep in the stack are addressed yet. skb_shift,
    skb_split and skb_try_coalesce can be refined to avoid copying.
    These are not in the hot path and this patch is hairy enough as
    is, so that is left for future refinement.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:30 -07:00
Willem de Bruijn
76851d1212 sock: add SOCK_ZEROCOPY sockopt
The send call ignores unknown flags. Legacy applications may already
unwittingly pass MSG_ZEROCOPY. Continue to ignore this flag unless a
socket opts in to zerocopy.

Introduce socket option SO_ZEROCOPY to enable MSG_ZEROCOPY processing.
Processes can also query this socket option to detect kernel support
for the feature. Older kernels will return ENOPROTOOPT.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:29 -07:00
Willem de Bruijn
52267790ef sock: add MSG_ZEROCOPY
The kernel supports zerocopy sendmsg in virtio and tap. Expand the
infrastructure to support other socket types. Introduce a completion
notification channel over the socket error queue. Notifications are
returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
blocking the send/recv path on receiving notifications.

Add reference counting, to support the skb split, merge, resize and
clone operations possible with SOCK_STREAM and other socket types.

The patch does not yet modify any datapaths.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:29 -07:00
Willem de Bruijn
3ece782693 sock: skb_copy_ubufs support for compound pages
Refine skb_copy_ubufs to support compound pages. With upcoming TCP
zerocopy sendmsg, such fragments may appear.

The existing code replaces each page one for one. Splitting each
compound page into an independent number of regular pages can result
in exceeding limit MAX_SKB_FRAGS if data is not exactly page aligned.

Instead, fill all destination pages but the last to PAGE_SIZE.
Split the existing alloc + copy loop into separate stages:
1. compute bytelength and minimum number of pages to store this.
2. allocate
3. copy, filling each page except the last to PAGE_SIZE bytes
4. update skb frag array

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:29 -07:00
Willem de Bruijn
98ba0bd550 sock: allocate skbs from optmem
Add sock_omalloc and sock_ofree to be able to allocate control skbs,
for instance for looping errors onto sk_error_queue.

The transmit budget (sk_wmem_alloc) is involved in transmit skb
shaping, most notably in TCP Small Queues. Using this budget for
control packets would impact transmission.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:29 -07:00
Neal Cardwell
df92c8394e tcp: fix xmit timer to only be reset if data ACKed/SACKed
Fix a TCP loss recovery performance bug raised recently on the netdev
list, in two threads:

(i)  July 26, 2017: netdev thread "TCP fast retransmit issues"
(ii) July 26, 2017: netdev thread:
     "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
     outstanding TLP retransmission"

The basic problem is that incoming TCP packets that did not indicate
forward progress could cause the xmit timer (TLP or RTO) to be rearmed
and pushed back in time. In certain corner cases this could result in
the following problems noted in these threads:

 - Repeated ACKs coming in with bogus SACKs corrupted by middleboxes
   could cause TCP to repeatedly schedule TLPs forever. We kept
   sending TLPs after every ~200ms, which elicited bogus SACKs, which
   caused more TLPs, ad infinitum; we never fired an RTO to fill in
   the holes.

 - Incoming data segments could, in some cases, cause us to reschedule
   our RTO or TLP timer further out in time, for no good reason. This
   could cause repeated inbound data to result in stalls in outbound
   data, in the presence of packet loss.

This commit fixes these bugs by changing the TLP and RTO ACK
processing to:

 (a) Only reschedule the xmit timer once per ACK.

 (b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the
     ACK indicates sufficient forward progress (a packet was
     cumulatively ACKed, or we got a SACK for a packet that was sent
     before the most recent retransmit of the write queue head).

This brings us back into closer compliance with the RFCs, since, as
the comment for tcp_rearm_rto() notes, we should only restart the RTO
timer after forward progress on the connection. Previously we were
restarting the xmit timer even in these cases where there was no
forward progress.

As a side benefit, this commit simplifies and speeds up the TCP timer
arming logic. We had been calling inet_csk_reset_xmit_timer() three
times on normal ACKs that cumulatively acknowledged some data:

1) Once near the top of tcp_ack() to switch from TLP timer to RTO:
        if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
               tcp_rearm_rto(sk);

2) Once in tcp_clean_rtx_queue(), to update the RTO:
        if (flag & FLAG_ACKED) {
               tcp_rearm_rto(sk);

3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO
   to TLP:
        if (icsk->icsk_pending == ICSK_TIME_RETRANS)
               tcp_schedule_loss_probe(sk);

This commit, by only rescheduling the xmit timer once per ACK,
simplifies the code and reduces CPU overhead.

This commit was tested in an A/B test with Google web server
traffic. SNMP stats and request latency metrics were within noise
levels, substantiating that for normal web traffic patterns this is a
rare issue. This commit was also tested with packetdrill tests to
verify that it fixes the timer behavior in the corner cases discussed
in the netdev threads mentioned above.

This patch is a bug fix patch intended to be queued for -stable
relases.

Fixes: 6ba8a3b19e ("tcp: Tail loss probe (TLP)")
Reported-by: Klavs Klavsen <kl@vsen.dk>
Reported-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 15:38:31 -07:00
Neal Cardwell
a2815817ff tcp: enable xmit timer fix by having TLP use time when RTO should fire
Have tcp_schedule_loss_probe() base the TLP scheduling decision based
on when the RTO *should* fire. This is to enable the upcoming xmit
timer fix in this series, where tcp_schedule_loss_probe() cannot
assume that the last timer installed was an RTO timer (because we are
no longer doing the "rearm RTO, rearm RTO, rearm TLP" dance on every
ACK). So tcp_schedule_loss_probe() must independently figure out when
an RTO would want to fire.

In the new TLP implementation following in this series, we cannot
assume that icsk_timeout was set based on an RTO; after processing a
cumulative ACK the icsk_timeout we see can be from a previous TLP or
RTO. So we need to independently recalculate the RTO time (instead of
reading it out of icsk_timeout). Removing this dependency on the
nature of icsk_timeout makes things a little easier to reason about
anyway.

Note that the old and new code should be equivalent, since they are
both saying: "if the RTO is in the future, but at an earlier time than
the normal TLP time, then set the TLP timer to fire when the RTO would
have fired".

Fixes: 6ba8a3b19e ("tcp: Tail loss probe (TLP)")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 15:38:30 -07:00
Neal Cardwell
e1a10ef7fa tcp: introduce tcp_rto_delta_us() helper for xmit timer fix
Pure refactor. This helper will be required in the xmit timer fix
later in the patch series. (Because the TLP logic will want to make
this calculation.)

Fixes: 6ba8a3b19e ("tcp: Tail loss probe (TLP)")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 15:38:30 -07:00
Ido Schimmel
a460aa8396 ipv6: fib: Add helpers to hold / drop a reference on rt6_info
Similar to commit 1c677b3d28 ("ipv4: fib: Add fib_info_hold() helper")
and commit b423cb1080 ("ipv4: fib: Export free_fib_info()") add an
helper to hold a reference on rt6_info and export rt6_release() to drop
it and potentially release the route.

This is needed so that drivers capable of FIB offload could hold a
reference on the route before queueing it for offload and drop it after
the route has been programmed to the device's tables.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 15:36:00 -07:00
Ido Schimmel
fc882fcff1 ipv6: Regenerate host route according to node pointer upon interface up
When an interface is brought back up, the kernel tries to restore the
host routes tied to its permanent addresses.

However, if the host route was removed from the FIB, then we need to
reinsert it. This is done by releasing the current dst and allocating a
new, so as to not reuse a dst with obsolete values.

Since this function is called under RTNL and using the same explanation
from the previous patch, we can test if the route is in the FIB by
checking its node pointer instead of its reference count.

Tested using the following script and Andrey's reproducer mentioned
in commit 8048ced9be ("net: ipv6: regenerate host route if moved to gc
list") and linked below:

$ ip link set dev lo up
$ ip link add dummy1 type dummy
$ ip -6 address add cafe::1/64 dev dummy1
$ ip link set dev lo down	# cafe::1/128 is removed
$ ip link set dev dummy1 up
$ ip link set dev lo up

The host route is correctly regenerated.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Link: http://lkml.kernel.org/r/CAAeHK+zSe82vc5gCRgr_EoUwiALPnWVdWJBPwJZBpbxYz=kGJw@mail.gmail.com
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 15:36:00 -07:00
Ido Schimmel
9217d8c2fe ipv6: Regenerate host route according to node pointer upon loopback up
When the loopback device is brought back up we need to check if the host
route attached to the address is still in the FIB and regenerate one in
case it's not.

Host routes using the loopback device are always inserted into and
removed from the FIB under RTNL (under which this function is called),
so we can test their node pointer instead of the reference count in
order to check if the route is in the FIB or not.

Tested using the following script from Nicolas mentioned in
commit a220445f9f ("ipv6: correctly add local routes when lo goes up"):

$ ip link add dummy1 type dummy
$ ip link set dummy1 up
$ ip link set lo down ; ip link set lo up

The host route is correctly regenerated.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 15:36:00 -07:00
Ido Schimmel
7483cea799 ipv6: fib: Unlink replaced routes from their nodes
When a route is deleted its node pointer is set to NULL to indicate it's
no longer linked to its node. Do the same for routes that are replaced.

This will later allow us to test if a route is still in the FIB by
checking its node pointer instead of its reference count.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 15:36:00 -07:00
Ido Schimmel
c5b12410fa ipv6: fib: Don't assume only nodes hold a reference on routes
The code currently assumes that only FIB nodes can hold a reference on
routes. Therefore, after fib6_purge_rt() has run and the route is no
longer present in any intermediate nodes, it's assumed that its
reference count would be 1 - taken by the node where it's currently
stored.

However, we're going to allow users other than the FIB to take a
reference on a route, so this assumption is no longer valid and the
BUG_ON() needs to be removed.

Note that purging only takes place if the initial reference count is
different than 1. I've left that check intact, as in the majority of
systems (where routes are only referenced by the FIB), it does actually
mean the route is present in intermediate nodes.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 15:36:00 -07:00
Ido Schimmel
61e4d01e16 ipv6: fib: Add offload indication to routes
Allow user space applications to see which routes are offloaded and
which aren't by setting the RTNH_F_OFFLOAD flag when dumping them.

To be consistent with IPv4, offload indication is provided on a
per-nexthop basis.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 15:36:00 -07:00
Ido Schimmel
e1ee0a5ba3 ipv6: fib: Dump tables during registration to FIB chain
Dump all the FIB tables in each net namespace upon registration to the
FIB notification chain so that the callee will have a complete view of
the tables.

The integrity of the dump is ensured by a per-table sequence counter
that is incremented (under write lock) whenever a route is added or
deleted from the table.

All the sequence counters are read (under each table's read lock) and
summed, prior and after the dump. In case the counters differ, then the
dump is either restarted or the registration fails.

While it's possible for a table to be modified after its counter has
been read, this isn't really a problem. In case it happened before it
was read the second time, then the comparison at the end will fail. If
it happened afterwards, then we're guaranteed to be notified about the
change, as the notification block is registered prior to the second
read.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 15:36:00 -07:00
Ido Schimmel
dcb18f762f ipv6: fib_rules: Dump rules during registration to FIB chain
Allow users of the FIB notification chain to receive a complete view of
the IPv6 FIB rules upon registration to the chain.

The integrity of the dump is ensured by a per-family sequence counter
that is incremented (under RTNL) whenever a rule is added or deleted.

All the sequence counters are read (under RTNL) and summed, prior and
after the dump. In case the counters differ, then the dump is either
restarted or the registration fails.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 15:36:00 -07:00
Ido Schimmel
df77fe4d98 ipv6: fib: Add in-kernel notifications for route add / delete
As with IPv4, allow listeners of the FIB notification chain to receive
notifications whenever a route is added, replaced or deleted. This is
done by placing calls to the FIB notification chain in the two lowest
level functions that end up performing these operations - namely,
fib6_add_rt2node() and fib6_del_route().

Unlike IPv4, APPEND notifications aren't sent as the kernel doesn't
distinguish between "append" (NLM_F_CREATE|NLM_F_APPEND) and "prepend"
(NLM_F_CREATE). If NLM_F_EXCL isn't set, duplicate routes are always
added after the existing duplicate routes.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 15:36:00 -07:00
Ido Schimmel
16ab6d7d4d ipv6: fib: Add FIB notifiers callbacks
We're about to add IPv6 FIB offload support, so implement the necessary
callbacks in IPv6 code, which will later allow us to add routes and
rules notifications.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 15:35:59 -07:00
Ido Schimmel
e3ea973159 ipv6: fib_rules: Check if rule is a default rule
As explained in commit 3c71006d15 ("ipv4: fib_rules: Check if rule is
a default rule"), drivers supporting IPv6 FIB offload need to be able to
sanitize the rules they don't support and potentially flush their
tables.

Add an IPv6 helper to check if a FIB rule is a default rule.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 15:35:59 -07:00
Ido Schimmel
1b2a444085 net: fib_rules: Implement notification logic in core
Unlike the routing tables, the FIB rules share a common core, so instead
of replicating the same logic for each address family we can simply dump
the rules and send notifications from the core itself.

To protect the integrity of the dump, a rules-specific sequence counter
is added for each address family and incremented whenever a rule is
added or deleted (under RTNL).

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 15:35:59 -07:00
Ido Schimmel
04b1d4e50e net: core: Make the FIB notification chain generic
The FIB notification chain is currently soley used by IPv4 code.
However, we're going to introduce IPv6 FIB offload support, which
requires these notification as well.

As explained in commit c3852ef7f2 ("ipv4: fib: Replay events when
registering FIB notifier"), upon registration to the chain, the callee
receives a full dump of the FIB tables and rules by traversing all the
net namespaces. The integrity of the dump is ensured by a per-namespace
sequence counter that is incremented whenever a change to the tables or
rules occurs.

In order to allow more address families to use the chain, each family is
expected to register its fib_notifier_ops in its pernet init. These
operations allow the common code to read the family's sequence counter
as well as dump its tables and rules in the given net namespace.

Additionally, a 'family' parameter is added to sent notifications, so
that listeners could distinguish between the different families.

Implement the common code that allows listeners to register to the chain
and for address families to register their fib_notifier_ops. Subsequent
patches will implement these operations in IPv6.

In the future, ipmr and ip6mr will be extended to provide these
notifications as well.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 15:35:59 -07:00
Xin Long
b91d532928 ipv6: set rt6i_protocol properly in the route when it is installed
After commit c2ed1880fd ("net: ipv6: check route protocol when
deleting routes"), ipv6 route checks rt protocol when trying to
remove a rt entry.

It introduced a side effect causing 'ip -6 route flush cache' not
to work well. When flushing caches with iproute, all route caches
get dumped from kernel then removed one by one by sending DELROUTE
requests to kernel for each cache.

The thing is iproute sends the request with the cache whose proto
is set with RTPROT_REDIRECT by rt6_fill_node() when kernel dumps
it. But in kernel the rt_cache protocol is still 0, which causes
the cache not to be matched and removed.

So the real reason is rt6i_protocol in the route is not set when
it is allocated. As David Ahern's suggestion, this patch is to
set rt6i_protocol properly in the route when it is installed and
remove the codes setting rtm_protocol according to rt6i_flags in
rt6_fill_node.

This is also an improvement to keep rt6i_protocol consistent with
rtm_protocol.

Fixes: c2ed1880fd ("net: ipv6: check route protocol when deleting routes")
Reported-by: Jianlin Shi <jishi@redhat.com>
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 15:10:18 -07:00
Xin Long
bb96dec745 sctp: remove the typedef sctp_auth_chunk_t
This patch is to remove the typedef sctp_auth_chunk_t, and
replace with struct sctp_auth_chunk in the places where it's
using this typedef.

It is also to use sizeof(variable) instead of sizeof(type).

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:45:47 -07:00
Xin Long
96f7ef4d58 sctp: remove the typedef sctp_authhdr_t
This patch is to remove the typedef sctp_authhdr_t, and
replace with struct sctp_authhdr in the places where it's
using this typedef.

It is also to use sizeof(variable) instead of sizeof(type).

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:45:47 -07:00
Xin Long
68d7546946 sctp: remove the typedef sctp_addip_chunk_t
This patch is to remove the typedef sctp_addip_chunk_t, and
replace with struct sctp_addip_chunk in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:45:47 -07:00
Xin Long
65205cc465 sctp: remove the typedef sctp_addiphdr_t
This patch is to remove the typedef sctp_addiphdr_t, and
replace with struct sctp_addiphdr in the places where it's
using this typedef.

It is also to use sizeof(variable) instead of sizeof(type).

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:45:47 -07:00
Xin Long
8b32f2348a sctp: remove the typedef sctp_addip_param_t
This patch is to remove the typedef sctp_addip_param_t, and
replace with struct sctp_addip_param in the places where it's
using this typedef.

It is to use sizeof(variable) instead of sizeof(type), and
also fix some indent problems.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:45:47 -07:00
Xin Long
65f7710543 sctp: remove the typedef sctp_cwrhdr_t
This patch is to remove the typedef sctp_cwrhdr_t, and
replace with struct sctp_cwrhdr in the places where it's
using this typedef.

It is also to use sizeof(variable) instead of sizeof(type).

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:45:47 -07:00
Xin Long
b515fd2759 sctp: remove the typedef sctp_ecne_chunk_t
This patch is to remove the typedef sctp_ecne_chunk_t, and
replace with struct sctp_ecne_chunk in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:45:47 -07:00
Xin Long
1fb6d83bd3 sctp: remove the typedef sctp_ecnehdr_t
This patch is to remove the typedef sctp_ecnehdr_t, and
replace with struct sctp_ecnehdr in the places where it's
using this typedef.

It is also to use sizeof(variable) instead of sizeof(type).

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:45:47 -07:00
Xin Long
2a49321677 sctp: remove the typedef sctp_error_t
This patch is to remove the typedef sctp_error_t, and replace
with enum sctp_error in the places where it's using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:45:47 -07:00
Xin Long
87caeba791 sctp: remove the typedef sctp_operr_chunk_t
This patch is to remove the typedef sctp_operr_chunk_t, and
replace with struct sctp_operr_chunk in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:45:46 -07:00
Xin Long
d8238d9dab sctp: remove the typedef sctp_errhdr_t
This patch is to remove the typedef sctp_errhdr_t, and replace
with struct sctp_errhdr in the places where it's using this
typedef.

It is also to use sizeof(variable) instead of sizeof(type).

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:45:46 -07:00
Xin Long
ac23e68133 sctp: fix the name of struct sctp_shutdown_chunk_t
This patch is to fix the name of struct sctp_shutdown_chunk_t
, replace with struct sctp_initack_chunk in the places where
it's using it.

It is also to fix some indent problem.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:45:46 -07:00
Xin Long
e61e4055b1 sctp: remove the typedef sctp_shutdownhdr_t
This patch is to remove the typedef sctp_shutdownhdr_t, and
replace with struct sctp_shutdownhdr in the places where it's
using this typedef.

It is also to use sizeof(variable) instead of sizeof(type).

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:45:46 -07:00
Eric Dumazet
2dda640040 net: fix keepalive code vs TCP_FASTOPEN_CONNECT
syzkaller was able to trigger a divide by 0 in TCP stack [1]

Issue here is that keepalive timer needs to be updated to not attempt
to send a probe if the connection setup was deferred using
TCP_FASTOPEN_CONNECT socket option added in linux-4.11

[1]
 divide error: 0000 [#1] SMP
 CPU: 18 PID: 0 Comm: swapper/18 Not tainted
 task: ffff986f62f4b040 ti: ffff986f62fa2000 task.ti: ffff986f62fa2000
 RIP: 0010:[<ffffffff8409cc0d>]  [<ffffffff8409cc0d>] __tcp_select_window+0x8d/0x160
 Call Trace:
  <IRQ>
  [<ffffffff8409d951>] tcp_transmit_skb+0x11/0x20
  [<ffffffff8409da21>] tcp_xmit_probe_skb+0xc1/0xe0
  [<ffffffff840a0ee8>] tcp_write_wakeup+0x68/0x160
  [<ffffffff840a151b>] tcp_keepalive_timer+0x17b/0x230
  [<ffffffff83b3f799>] call_timer_fn+0x39/0xf0
  [<ffffffff83b40797>] run_timer_softirq+0x1d7/0x280
  [<ffffffff83a04ddb>] __do_softirq+0xcb/0x257
  [<ffffffff83ae03ac>] irq_exit+0x9c/0xb0
  [<ffffffff83a04c1a>] smp_apic_timer_interrupt+0x6a/0x80
  [<ffffffff83a03eaf>] apic_timer_interrupt+0x7f/0x90
  <EOI>
  [<ffffffff83fed2ea>] ? cpuidle_enter_state+0x13a/0x3b0
  [<ffffffff83fed2cd>] ? cpuidle_enter_state+0x11d/0x3b0

Tested:

Following packetdrill no longer crashes the kernel

`echo 0 >/proc/sys/net/ipv4/tcp_timestamps`

// Cache warmup: send a Fast Open cookie request
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   +0 setsockopt(3, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
   +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation is now in progress)
   +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
 +.01 < S. 123:123(0) ack 1 win 14600 <mss 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
   +0 > . 1:1(0) ack 1
   +0 close(3) = 0
   +0 > F. 1:1(0) ack 1
   +0 < F. 1:1(0) ack 2 win 92
   +0 > .  2:2(0) ack 2

   +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
   +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
   +0 setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
 +.01 connect(4, ..., ...) = 0
   +0 setsockopt(4, SOL_TCP, TCP_KEEPIDLE, [5], 4) = 0
   +10 close(4) = 0

`echo 1 >/proc/sys/net/ipv4/tcp_timestamps`

Fixes: 19f6d3f3c8 ("net/tcp-fastopen: Add new API support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:34:51 -07:00
Neal Cardwell
d06c3583c2 tcp: remove extra POLL_OUT added for finished active connect()
Commit 45f119bf93 ("tcp: remove header prediction") introduced a
minor bug: the sk_state_change() and sk_wake_async() notifications for
a completed active connection happen twice: once in this new spot
inside tcp_finish_connect() and once in the existing code in
tcp_rcv_synsent_state_process() immediately after it calls
tcp_finish_connect(). This commit remoes the duplicate POLL_OUT
notifications.

Fixes: 45f119bf93 ("tcp: remove header prediction")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:30:47 -07:00
Sowmini Varadhan
840df162b3 rds: reduce memory footprint for RDS when transport is RDMA
RDS over IB does not use multipath RDS, so the array
of additional rds_conn_path structures is always superfluous
in this case. Reduce the memory footprint of the rds module
by making this a dynamic allocation predicated on whether
the transport is mp_capable.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Tested-by: Efrain Galaviz <efrain.galaviz@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:27:59 -07:00
Tonghao Zhang
93b1b31f87 ipv4: Introduce ipip_offload_init helper function.
It's convenient to init ipip offload. We will check
the return value, and print KERN_CRIT info on failure.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:27:07 -07:00
William Tu
eb48d68281 bpf: fix the printing of ifindex
Save the ifindex before it gets zeroed so the invalid
ifindex can be printed out.

Signed-off-by: William Tu <u9012063@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:25:26 -07:00
David S. Miller
f6c00a3bb8 This feature/cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
 
  - Remove unnecessary length qualifier, by Joe Perches
 
  - Remove too short %pM field width, by Sven Eckelmann
 
  - Remove return value handling from skb_put_data, by Sven Eckelmann
 
  - Spelling fixes, by Colin Ian King
 
  - Convert batman-adv.txt to reStructuredText, by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlmB364WHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoUL9EADc1EFFdHxbJdLbVL/FjkDkrwpL
 60exCnDNLZ06g2+XhViA/Lo5VMPjJTYvNVciG4RbphtZPGwpnXLO1mM+1EntzHLl
 Bkvd8pMS8SbhUIlJ4Ua8ret3GoIf2FLz3tEOr/LGO/aJ5iEOS1N9msI1nYK12E6V
 4pgLJiHztoLkoEnsCaB30iF/jxlCXKE+AG2LjD65zUuG95DPR0XGGkgdP0qomlvh
 kaSG7G4kVCQEsS2W+6TLqC+CKoO3uGGCF1wc4KDD3wgbUU0YCmqhzehrtstCku66
 ksNavOy0e0iA3bURo6md/aWWUyOV6wK2uV6QE5ef1gStgXQKYwXa6MzaHVRu8XYZ
 SrzfwKQRFDZXzouHTNJCNSeNhrPGKXPs6JSjeQDR1hzwZ0e+5xOct/mgp2VgHhqP
 v4xs0ZFcjOWPZ52Yy0kY0r/f4AKTwS20DJLSQaKEST1E0m4rlzpNUxi+/4T1wUSD
 LFtXjf9IonS1Weo6Ro5v6x2db4tXuX6pwmlCpfYcAdAK1FFyKIG7HHWw4UV7s85P
 5nNbZP/v6K8DsGVVD3I/HEIeoZyi2DnPzYgFeV8pMJy6gYggnu/axdgmd7mgDL3J
 aCEaL3rvSbkmnmq6QG/pC0VwXmTR9j945uEBjGICmPq1nzV2rvUt7KvEv1MWBH28
 Qv5VAUe8XsIiwKMgFA==
 =3oIU
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20170802' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This feature/cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich

 - Remove unnecessary length qualifier, by Joe Perches

 - Remove too short %pM field width, by Sven Eckelmann

 - Remove return value handling from skb_put_data, by Sven Eckelmann

 - Spelling fixes, by Colin Ian King

 - Convert batman-adv.txt to reStructuredText, by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:24:06 -07:00
David S. Miller
4d2bbb0ef8 Here is a batman-adv bugfix:
- fix TT sync flag inconsistency problems, which can lead to excess packets,
    by Linus Luessing
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlmB3ikWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoUCUEADIvPfBxzon12xBV4JbBpTMnIwl
 09heUwE1FxCj2UoTV2QJ+O9ZQb1yc7ARJksymP30T96q9bNlDiNFVquXcuQLA+ba
 aoSowejLfTHLvOlP3Wrik5PVLBdKqeBpg7zYNtAuNI09DSryey1ipaC4Xemq9ZmC
 8Hjam9RAQh5FV3RCeviTzpusXt31/xjkTx3vrf1WGx2uY/MSKNwwfPcRAyfG+Ye5
 9kxpR8sBwRxJ+6+MEWRBjqi2Lk/BuD4urViAwGj88uDkKKByXT/pfULD3foTgx0d
 smZWWMun9jQl4XDk321oOp5HOqg29IFcF/mwuWQIvBDO8AJBRndYHw07lx0cB5tD
 I8PzTG5+kE3QXBeTBEUAuszaavDI9DKVdnr6h/8VbeEXIr73WHA0VixnTIv3zgI0
 AiYHIrvUstDB7Cl5457blKbljgzXMN1ssBtD/UMNFhrEUKet9qIZskwlLyhPt2s3
 qti70ZoURPz9ve57pNgZDZUR5lvvUhuSNUD4ZfGt5BC+YErX59jAswrRXHU5Zx/i
 V5pSsRQ2BfxZUZrnmKONg3RW+hmsJJYOSzNXiO5/0zeVuorju7C+0L2goONX2bpr
 fEzICfDJi+IXmYke0AAjiJq7KwobU703Nd3rT8eovjVlSSJj+mdyYSVYQF9dD2uE
 DdtwpJh37PwbHOeQqg==
 =a4HS
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20170802' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here is a batman-adv bugfix:

 - fix TT sync flag inconsistency problems, which can lead to excess packets,
   by Linus Luessing
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:23:11 -07:00
Julia Lawall
e9e6c2329a X25: constify null_x25_address
null_x25_address is only used to access the string it contains, so it can
be const.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:13:51 -07:00
Ido Schimmel
475abbf1ef ipv4: fib: Set offload indication according to nexthop flags
We're going to have capable drivers indicate route offload using the
nexthop flags, but for non-multipath routes these flags aren't dumped to
user space.

Instead, set the offload indication in the route message flags.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-02 17:00:23 -07:00
Florian Fainelli
f613ed665b net: dsa: Add support for 64-bit statistics
DSA slave network devices maintain a pair of bytes and packets counters
for each directions, but these are not 64-bit capable. Re-use
pcpu_sw_netstats which contains exactly what we need for that purpose
and update the code path to report 64-bit capable statistics.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-02 16:49:09 -07:00
Yuchung Cheng
ed254971ed tcp: avoid setting cwnd to invalid ssthresh after cwnd reduction states
If the sender switches the congestion control during ECN-triggered
cwnd-reduction state (CA_CWR), upon exiting recovery cwnd is set to
the ssthresh value calculated by the previous congestion control. If
the previous congestion control is BBR that always keep ssthresh
to TCP_INIFINITE_SSTHRESH, cwnd ends up being infinite. The safe
step is to avoid assigning invalid ssthresh value when recovery ends.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-02 10:51:07 -07:00
WANG Cong
b2f9d432de flow_dissector: remove unused functions
They are introduced by commit f70ea018da
("net: Add functions to get skb->hash based on flow structures")
but never gets used in tree.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-02 10:50:03 -07:00
Eric Dumazet
5357f0bd4e tcp: tcp_data_queue() cleanup
Commit c13ee2a4f0 ("tcp: reindent two spots after prequeue removal")
removed code in tcp_data_queue().

We can go a little farther, removing an always true test,
and removing initializers for fragstolen and eaten variables.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-02 10:45:24 -07:00
Vivien Didelot
08f500610f net: dsa: rename switch EEE ops
To avoid confusion with the PHY EEE settings, rename the .set_eee and
.get_eee ops to respectively .set_mac_eee and .get_mac_eee.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 20:09:10 -07:00
Vivien Didelot
46587e4a31 net: dsa: remove PHY device argument from .set_eee
The DSA switch operations for EEE are only meant to configure a port's
MAC EEE settings. The port's PHY EEE settings are accessed by the DSA
layer and must be made available via a proper PHY driver.

In order to reduce this confusion, remove the phy_device argument from
the .set_eee operation.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 20:09:10 -07:00
Vivien Didelot
c48f7eb302 net: dsa: call phy_init_eee in DSA layer
All DSA drivers are calling phy_init_eee if eee_enabled is true.

Move up this statement in the DSA layer to simplify the DSA drivers.
qca8k does not require to cache the ethtool_eee structures from now on.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 20:09:10 -07:00
Vivien Didelot
7b9cc73843 net: dsa: PHY device is mandatory for EEE
The port's PHY and MAC are both implied in EEE. The current code does
not call the PHY operations if the related device is NULL. Change that
by returning -ENODEV if there's no PHY device attached to the interface.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 20:09:09 -07:00
K. Den
1bff8a0c1f gue: fix remcsum when GRO on and CHECKSUM_PARTIAL boundary is outer UDP
In the case that GRO is turned on and the original received packet is
CHECKSUM_PARTIAL, if the outer UDP header is exactly at the last
csum-unnecessary point, which for instance could occur if the packet
comes from another Linux guest on the same Linux host, we have to do
either remcsum_adjust or set up CHECKSUM_PARTIAL again with its
csum_start properly reset considering RCO.

However, since b7fe10e5eb ("gro: Fix remcsum offload to deal with frags
in GRO") that barrier in such case could be skipped if GRO turned on,
hence we pass over it and the inner L4 validation mistakenly reckons
it as a bad csum.

This patch makes remcsum_offload being reset at the same time of GRO
remcsum cleanup, so as to make it work in such case as before.

Fixes: b7fe10e5eb ("gro: Fix remcsum offload to deal with frags in GRO")
Signed-off-by: Koichiro Den <den@klaipeden.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 16:09:14 -07:00
Willem de Bruijn
c613c209c3 net: add skb_frag_foreach_page and use with kmap_atomic
Skb frags may contain compound pages. Various operations map frags
temporarily using kmap_atomic, but this function works on single
pages, not whole compound pages. The distinction is only relevant
for high mem pages that require temporary mappings.

Introduce a looping mechanism that for compound highmem pages maps
one page at a time, does not change behavior on other pages.
Use the loop in the kmap_atomic callers in net/core/skbuff.c.

Verified by triggering skb_copy_bits with

    tcpdump -n -c 100 -i ${DEV} -w /dev/null &
    netperf -t TCP_STREAM -H ${HOST}

  and by triggering __skb_checksum with

    ethtool -K ${DEV} tx off

  repeated the tests with looping on a non-highmem platform
  (x86_64) by making skb_frag_must_loop always return true.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 16:07:10 -07:00
yujuan.qi
40413955ee Cipso: cipso_v4_optptr enter infinite loop
in for(),if((optlen > 0) && (optptr[1] == 0)), enter infinite loop.

Test: receive a packet which the ip length > 20 and the first byte of ip option is 0, produce this issue

Signed-off-by: yujuan.qi <yujuan.qi@mediatek.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 15:31:23 -07:00
Tom Herbert
bbb03029a8 strparser: Generalize strparser
Generalize strparser from more than just being used in conjunction
with read_sock. strparser will also be used in the send path with
zero proxy. The primary change is to create strp_process function
that performs the critical processing on skbs. The documentation
is also updated to reflect the new uses.

Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 15:26:19 -07:00
Tom Herbert
20bf50de30 skbuff: Function to send an skbuf on a socket
Add skb_send_sock to send an skbuff on a socket within the kernel.
Arguments include an offset so that an skbuf might be sent in mulitple
calls (e.g. send buffer limit is hit).

Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 15:26:18 -07:00
Tom Herbert
306b13eb3c proto_ops: Add locked held versions of sendmsg and sendpage
Add new proto_ops sendmsg_locked and sendpage_locked that can be
called when the socket lock is already held. Correspondingly, add
kernel_sendmsg_locked and kernel_sendpage_locked as front end
functions.

These functions will be used in zero proxy so that we can take
the socket lock in a ULP sendmsg/sendpage and then directly call the
backend transport proto_ops functions.

Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 15:26:18 -07:00
David S. Miller
29fda25a2d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two minor conflicts in virtio_net driver (bug fix overlapping addition
of a helper) and MAINTAINERS (new driver edit overlapping revamp of
PHY entry).

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 10:07:50 -07:00
Julia Lawall
e6eeb287e3 Revert "l2tp: constify inet6_protocol structures"
This reverts commit d04916a48a.

inet6_add_protocol and inet6_del_protocol include casts that remove the
effect of the const annotation on their parameter, leading to possible
runtime crashes.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 10:03:17 -07:00
Julia Lawall
39294c3df2 Revert "ipv6: constify inet6_protocol structures"
This reverts commit 3a3a4e3054.

inet6_add_protocol and inet6_del_protocol include casts that remove the
effect of the const annotation on their parameter, leading to possible
runtime crashes.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 10:03:17 -07:00
Ilya Dryomov
ae78dd8139 libceph: make RECOVERY_DELETES feature create a new interval
This is needed so that the OSDs can regenerate the missing set at the
start of a new interval where support for recovery deletes changed.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-08-01 16:46:45 +02:00
Ilya Dryomov
f53b7665c8 libceph: upmap semantic changes
- apply both pg_upmap and pg_upmap_items
- allow bidirectional swap of pg-upmap-items

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-08-01 16:46:45 +02:00
Ilya Dryomov
c7ed1a4bf4 crush: assume weight_set != null imples weight_set_size > 0
Reflects ceph.git commit 5e8fa3e06b68fae1582c9230a3a8d1abc6146286.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-08-01 16:46:44 +02:00
Ilya Dryomov
e17e8969f5 libceph: fallback for when there isn't a pool-specific choose_arg
There is now a fallback to a choose_arg index of -1 if there isn't
a pool-specific choose_arg set.  If you create a per-pool weight-set,
that works for that pool.  Otherwise we try the compat/default one.  If
that doesn't exist either, then we use the normal CRUSH weights.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-08-01 16:46:44 +02:00
Ilya Dryomov
4690faf00c libceph: don't call ->reencode_message() more than once per message
Reencoding an already reencoded message is a bad idea.  This could
happen on Policy::stateful_server connections (!CEPH_MSG_CONNECT_LOSSY),
such as MDS sessions.

This didn't pop up in testing because currently only OSD requests are
reencoded and OSD sessions are always lossy.

Fixes: 98ad5ebd15 ("libceph: ceph_connection_operations::reencode_message() method")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
2017-08-01 16:46:43 +02:00
Ilya Dryomov
986e89898a libceph: make encode_request_*() work with r_mempool requests
Messages allocated out of ceph_msgpool have a fixed front length
(pool->front_len).  Asserting that the entire front has been filled
while encoding is thus wrong.

Fixes: 8cb441c054 ("libceph: MOSDOp v8 encoding (actual spgid + full hash)")
Reported-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
2017-08-01 16:46:31 +02:00
Linus Torvalds
bc78d646e7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Handle notifier registry failures properly in tun/tap driver, from
    Tonghao Zhang.

 2) Fix bpf verifier handling of subtraction bounds and add a testcase
    for this, from Edward Cree.

 3) Increase reset timeout in ftgmac100 driver, from Ben Herrenschmidt.

 4) Fix use after free in prd_retire_rx_blk_timer_exired() in AF_PACKET,
    from Cong Wang.

 5) Fix SElinux regression due to recent UDP optimizations, from Paolo
    Abeni.

 6) We accidently increment IPSTATS_MIB_FRAGFAILS in the ipv6 code
    paths, fix from Stefano Brivio.

 7) Fix some mem leaks in dccp, from Xin Long.

 8) Adjust MDIO_BUS kconfig deps to avoid build errors, from Arnd
    Bergmann.

 9) Mac address length check and buffer size fixes from Cong Wang.

10) Don't leak sockets in ipv6 udp early demux, from Paolo Abeni.

11) Fix return value when copy_from_user() fails in
    bpf_prog_get_info_by_fd(), from Daniel Borkmann.

12) Handle PHY_HALTED properly in phy library state machine, from
    Florian Fainelli.

13) Fix OOPS in fib_sync_down_dev(), from Ido Schimmel.

14) Fix truesize calculation in virtio_net which led to performance
    regressions, from Michael S Tsirkin.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits)
  samples/bpf: fix bpf tunnel cleanup
  udp6: fix jumbogram reception
  ppp: Fix a scheduling-while-atomic bug in del_chan
  Revert "net: bcmgenet: Remove init parameter from bcmgenet_mii_config"
  virtio_net: fix truesize for mergeable buffers
  mv643xx_eth: fix of_irq_to_resource() error check
  MAINTAINERS: Add more files to the PHY LIBRARY section
  ipv4: fib: Fix NULL pointer deref during fib_sync_down_dev()
  net: phy: Correctly process PHY_HALTED in phy_stop_machine()
  sunhme: fix up GREG_STAT and GREG_IMASK register offsets
  bpf: fix bpf_prog_get_info_by_fd to dump correct xlated_prog_len
  tcp: avoid bogus gcc-7 array-bounds warning
  net: tc35815: fix spelling mistake: "Intterrupt" -> "Interrupt"
  bpf: don't indicate success when copy_from_user fails
  udp6: fix socket leak on early demux
  net: thunderx: Fix BGX transmit stall due to underflow
  Revert "vhost: cache used event for better performance"
  team: use a larger struct for mac address
  net: check dev->addr_len for dev_set_mac_address()
  phy: bcm-ns-usb3: fix MDIO_BUS dependency
  ...
2017-07-31 22:36:42 -07:00
Paolo Abeni
cb891fa6a1 udp6: fix jumbogram reception
Since commit 67a51780ae ("ipv6: udp: leverage scratch area
helpers") udp6_recvmsg() read the skb len from the scratch area,
to avoid a cache miss.
But the UDP6 rx path support RFC 2675 UDPv6 jumbograms, and their
length exceeds the 16 bits available in the scratch area. As a side
effect the length returned by recvmsg() is:
<ingress datagram len> % (1<<16)

This commit addresses the issue allocating one more bit in the
IP6CB flags field and setting it for incoming jumbograms.
Such field is still in the first cacheline, so at recvmsg()
time we can check it and fallback to access skb->len if
required, without a measurable overhead.

Fixes: 67a51780ae ("ipv6: udp: leverage scratch area helpers")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-31 22:01:21 -07:00
Jakub Sitnicki
1f139ed9ec ipv6: Avoid going through ->sk_net to access the netns
There is no need to go through sk->sk_net to access the net namespace
and its sysctl variables because we allocate the sock and initialize
sk_net just a few lines earlier in the same routine.

Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-31 18:01:33 -07:00
Ido Schimmel
71ed7ee35a ipv4: fib: Fix NULL pointer deref during fib_sync_down_dev()
Michał reported a NULL pointer deref during fib_sync_down_dev() when
unregistering a netdevice. The problem is that we don't check for
'in_dev' being NULL, which can happen in very specific cases.

Usually routes are flushed upon NETDEV_DOWN sent in either the netdev or
the inetaddr notification chains. However, if an interface isn't
configured with any IP address, then it's possible for host routes to be
flushed following NETDEV_UNREGISTER, after NULLing dev->ip_ptr in
inetdev_destroy().

To reproduce:
$ ip link add type dummy
$ ip route add local 1.1.1.0/24 dev dummy0
$ ip link del dev dummy0

Fix this by checking for the presence of 'in_dev' before referencing it.

Fixes: 982acb9756 ("ipv4: fib: Notify about nexthop status changes")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Tested-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-31 17:51:11 -07:00
Wei Wang
bb7c19f960 tcp: add related fields into SCM_TIMESTAMPING_OPT_STATS
Add the following stats into SCM_TIMESTAMPING_OPT_STATS control msg:
    TCP_NLA_PACING_RATE
    TCP_NLA_DELIVERY_RATE
    TCP_NLA_SND_CWND
    TCP_NLA_REORDERING
    TCP_NLA_MIN_RTT
    TCP_NLA_RECUR_RETRANS
    TCP_NLA_DELIVERY_RATE_APP_LMT

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-31 17:26:18 -07:00
Wei Wang
0263598c77 tcp: extract the function to compute delivery rate
Refactor the code to extract the function to compute delivery rate.
This function will be used in later commit.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-31 17:26:18 -07:00
Florian Westphal
3282e65558 tcp: remove unused mib counters
was used by tcp prequeue and header prediction.
TCPFORWARDRETRANS use was removed in january.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-31 14:37:50 -07:00
Florian Westphal
573aeb0492 tcp: remove CA_ACK_SLOWPATH
re-indent tcp_ack, and remove CA_ACK_SLOWPATH; it is always set now.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-31 14:37:50 -07:00
Florian Westphal
45f119bf93 tcp: remove header prediction
Like prequeue, I am not sure this is overly useful nowadays.

If we receive a train of packets, GRO will aggregate them if the
headers are the same (HP predates GRO by several years) so we don't
get a per-packet benefit, only a per-aggregated-packet one.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-31 14:37:49 -07:00
Florian Westphal
b6690b1438 tcp: remove low_latency sysctl
Was only checked by the removed prequeue code.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-31 14:37:49 -07:00
Florian Westphal
c13ee2a4f0 tcp: reindent two spots after prequeue removal
These two branches are now always true, remove the conditional.
objdiff shows no changes.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-31 14:37:49 -07:00
Florian Westphal
e7942d0633 tcp: remove prequeue support
prequeue is a tcp receive optimization that moves part of rx processing
from bh to process context.

This only works if the socket being processed belongs to a process that
is blocked in recv on that socket.

In practice, this doesn't happen anymore that often because nowadays
servers tend to use an event driven (epoll) model.

Even normal client applications (web browsers) commonly use many tcp
connections in parallel.

This has measureable impact only in netperf (which uses plain recv and
thus allows prequeue use) from host to locally running vm (~4%), however,
there were no changes when using netperf between two physical hosts with
ixgbe interfaces.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-31 14:37:49 -07:00
Linus Lüssing
54e22f265e batman-adv: fix TT sync flag inconsistencies
This patch fixes an issue in the translation table code potentially
leading to a TT Request + Response storm. The issue may occur for nodes
involving BLA and an inconsistent configuration of the batman-adv AP
isolation feature. However, since the new multicast optimizations, a
single, malformed packet may lead to a mesh-wide, persistent
Denial-of-Service, too.

The issue occurs because nodes are currently OR-ing the TT sync flags of
all originators announcing a specific MAC address via the
translation table. When an intermediate node now receives a TT Request
and wants to answer this on behalf of the destination node, then this
intermediate node now responds with an altered flag field and broken
CRC. The next OGM of the real destination will lead to a CRC mismatch
and triggering a TT Request and Response again.

Furthermore, the OR-ing is currently never undone as long as at least
one originator announcing the according MAC address remains, leading to
the potential persistency of this issue.

This patch fixes this issue by storing the flags used in the CRC
calculation on a a per TT orig entry basis to be able to respond with
the correct, original flags in an intermediate TT Response for one
thing. And to be able to correctly unset sync flags once all nodes
announcing a sync flag vanish for another.

Fixes: e9c00136a4 ("batman-adv: fix tt_global_entries flags update")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Acked-by: Antonio Quartulli <a@unstable.cc>
[sw: typo in commit message]
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-07-31 11:17:38 +02:00
Jamal Hadi Salim
e62e484df0 net sched actions: add time filter for action dumping
This patch adds support for filtering based on time since last used.
When we are dumping a large number of actions it is useful to
have the option of filtering based on when the action was last
used to reduce the amount of data crossing to user space.

With this patch the user space app sets the TCA_ROOT_TIME_DELTA
attribute with the value in milliseconds with "time of interest
since now".  The kernel converts this to jiffies and does the
filtering comparison matching entries that have seen activity
since then and returns them to user space.
Old kernels and old tc continue to work in legacy mode since
they dont specify this attribute.

Some example (we have 400 actions bound to 400 filters); at
installation time. Using updated when tc setting the time of
interest to 120 seconds earlier (we see 400 actions):
prompt$ hackedtc actions ls action gact since 120000| grep index | wc -l
400

go get some coffee and wait for > 120 seconds and try again:

prompt$ hackedtc actions ls action gact since 120000 | grep index | wc -l
0

Lets see a filter bound to one of these actions:
....
filter pref 10 u32
filter pref 10 u32 fh 800: ht divisor 1
filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:10  (rule hit 2 success 1)
  match 7f000002/ffffffff at 12 (success 1 )
    action order 1: gact action pass
     random type none pass val 0
     index 23 ref 2 bind 1 installed 1145 sec used 802 sec
    Action statistics:
    Sent 84 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0
....

that coffee took long, no? It was good.

Now lets ping -c 1 127.0.0.2, then run the actions again:
prompt$ hackedtc actions ls action gact since 120 | grep index | wc -l
1

More details please:
prompt$ hackedtc -s actions ls action gact since 120000

    action order 0: gact action pass
     random type none pass val 0
     index 23 ref 2 bind 1 installed 1270 sec used 30 sec
    Action statistics:
    Sent 168 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

And the filter?

filter pref 10 u32
filter pref 10 u32 fh 800: ht divisor 1
filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:10  (rule hit 4 success 2)
  match 7f000002/ffffffff at 12 (success 2 )
    action order 1: gact action pass
     random type none pass val 0
     index 23 ref 2 bind 1 installed 1324 sec used 84 sec
    Action statistics:
    Sent 168 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-30 19:28:08 -07:00
Jamal Hadi Salim
90825b23a8 net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch
When you dump hundreds of thousands of actions, getting only 32 per
dump batch even when the socket buffer and memory allocations allow
is inefficient.

With this change, the user will get as many as possibly fitting
within the given constraints available to the kernel.

The top level action TLV space is extended. An attribute
TCA_ROOT_FLAGS is used to carry flags; flag TCA_FLAG_LARGE_DUMP_ON
is set by the user indicating the user is capable of processing
these large dumps. Older user space which doesnt set this flag
doesnt get the large (than 32) batches.
The kernel uses the TCA_ROOT_COUNT attribute to tell the user how many
actions are put in a single batch. As such user space app knows how long
to iterate (independent of the type of action being dumped)
instead of hardcoded maximum of 32 thus maintaining backward compat.

Some results dumping 1.5M actions below:
first an unpatched tc which doesnt understand these features...

prompt$ time -p tc actions ls action gact | grep index | wc -l
1500000
real 1388.43
user 2.07
sys 1386.79

Now lets see a patched tc which sets the correct flags when requesting
a dump:

prompt$ time -p updatedtc actions ls action gact | grep index | wc -l
1500000
real 178.13
user 2.02
sys 176.96

That is about 8x performance improvement for tc app which sets its
receive buffer to about 32K.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-30 19:28:08 -07:00
Jamal Hadi Salim
df823b0297 net sched actions: Use proper root attribute table for actions
Bug fix for an issue which has been around for about a decade.
We got away with it because the enumeration was larger than needed.

Fixes: 7ba699c604 ("[NET_SCHED]: Convert actions from rtnetlink to new netlink API")
Suggested-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-30 19:28:08 -07:00
Arnd Bergmann
efe967cdec tcp: avoid bogus gcc-7 array-bounds warning
When using CONFIG_UBSAN_SANITIZE_ALL, the TCP code produces a
false-positive warning:

net/ipv4/tcp_output.c: In function 'tcp_connect':
net/ipv4/tcp_output.c:2207:40: error: array subscript is below array bounds [-Werror=array-bounds]
   tp->chrono_stat[tp->chrono_type - 1] += now - tp->chrono_start;
                                        ^~
net/ipv4/tcp_output.c:2207:40: error: array subscript is below array bounds [-Werror=array-bounds]
   tp->chrono_stat[tp->chrono_type - 1] += now - tp->chrono_start;
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~

I have opened a gcc bug for this, but distros have already shipped
compilers with this problem, and it's not clear yet whether there is
a way for gcc to avoid the warning. As the problem is related to the
bitfield access, this introduces a temporary variable to store the old
enum value.

I did not notice this warning earlier, since UBSAN is disabled when
building with COMPILE_TEST, and that was always turned on in both
allmodconfig and randconfig tests.

Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81601
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-29 23:26:29 -07:00
Vidya Sagar Ravipati
1a5f3da20b net: ethtool: add support for forward error correction modes
Forward Error Correction (FEC) modes i.e Base-R
and Reed-Solomon modes are introduced in 25G/40G/100G standards
for providing good BER at high speeds. Various networking devices
which support 25G/40G/100G provides ability to manage supported FEC
modes and the lack of FEC encoding control and reporting today is a
source for interoperability issues for many vendors.
FEC capability as well as specific FEC mode i.e. Base-R
or RS modes can be requested or advertised through bits D44:47 of
base link codeword.

This patch set intends to provide option under ethtool to manage
and report FEC encoding settings for networking devices as per
IEEE 802.3 bj, bm and by specs.

set-fec/show-fec option(s) are designed to provide control and
report the FEC encoding on the link.

SET FEC option:
root@tor: ethtool --set-fec  swp1 encoding [off | RS | BaseR | auto]

Encoding: Types of encoding
Off    :  Turning off any encoding
RS     :  enforcing RS-FEC encoding on supported speeds
BaseR  :  enforcing Base R encoding on supported speeds
Auto   :  IEEE defaults for the speed/medium combination

Here are a few examples of what we would expect if encoding=auto:
- if autoneg is on, we are  expecting FEC to be negotiated as on or off
  as long as protocol supports it
- if the hardware is capable of detecting the FEC encoding on it's
      receiver it will reconfigure its encoder to match
- in absence of the above, the configuration would be set to IEEE
  defaults.

>From our  understanding , this is essentially what most hardware/driver
combinations are doing today in the absence of a way for users to
control the behavior.

SHOW FEC option:
root@tor: ethtool --show-fec  swp1
FEC parameters for swp1:
Active FEC encodings: RS
Configured FEC encodings:  RS | BaseR

ETHTOOL DEVNAME output modification:

ethtool devname output:
root@tor:~# ethtool swp1
Settings for swp1:
root@hpe-7712-03:~# ethtool swp18
Settings for swp18:
    Supported ports: [ FIBRE ]
    Supported link modes:   40000baseCR4/Full
                            40000baseSR4/Full
                            40000baseLR4/Full
                            100000baseSR4/Full
                            100000baseCR4/Full
                            100000baseLR4_ER4/Full
    Supported pause frame use: No
    Supports auto-negotiation: Yes
    Supported FEC modes: [RS | BaseR | None | Not reported]
    Advertised link modes:  Not reported
    Advertised pause frame use: No
    Advertised auto-negotiation: No
    Advertised FEC modes: [RS | BaseR | None | Not reported]
<<<< One or more FEC modes
    Speed: 100000Mb/s
    Duplex: Full
    Port: FIBRE
    PHYAD: 106
    Transceiver: internal
    Auto-negotiation: off
    Link detected: yes

This patch includes following changes
a) New ETHTOOL_SFECPARAM/SFECPARAM API, handled by
  the new get_fecparam/set_fecparam callbacks, provides support
  for configuration of forward error correction modes.
b) Link mode bits for FEC modes i.e. None (No FEC mode), RS, BaseR/FC
  are defined so that users can configure these fec modes for supported
  and advertising fields as part of link autonegotiation.

Signed-off-by: Vidya Sagar Ravipati <vidya.chowdary@gmail.com>
Signed-off-by: Dustin Byford <dustin@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-29 23:23:44 -07:00
Paolo Abeni
c9f2c1ae12 udp6: fix socket leak on early demux
When an early demuxed packet reaches __udp6_lib_lookup_skb(), the
sk reference is retrieved and used, but the relevant reference
count is leaked and the socket destructor is never called.
Beyond leaking the sk memory, if there are pending UDP packets
in the receive queue, even the related accounted memory is leaked.

In the long run, this will cause persistent forward allocation errors
and no UDP skbs (both ipv4 and ipv6) will be able to reach the
user-space.

Fix this by explicitly accessing the early demux reference before
the lookup, and properly decreasing the socket reference count
after usage.

Also drop the skb_steal_sock() in __udp6_lib_lookup_skb(), and
the now obsoleted comment about "socket cache".

The newly added code is derived from the current ipv4 code for the
similar path.

v1 -> v2:
  fixed the __udp6_lib_rcv() return code for resubmission,
  as suggested by Eric

Reported-by: Sam Edwards <CFSworks@gmail.com>
Reported-by: Marc Haber <mh+netdev@zugschlus.de>
Fixes: 5425077d73 ("net: ipv6: Add early demux handler for UDP unicast")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-29 14:19:03 -07:00
WANG Cong
0254e0c632 net: check dev->addr_len for dev_set_mac_address()
Historically, dev_ifsioc() uses struct sockaddr as mac
address definition, this is why dev_set_mac_address()
accepts a struct sockaddr pointer as input but now we
have various types of mac addresse whose lengths
are up to MAX_ADDR_LEN, longer than struct sockaddr,
and saved in dev->addr_len.

It is too late to fix dev_ifsioc() due to API
compatibility, so just reject those larger than
sizeof(struct sockaddr), otherwise we would read
and use some random bytes from kernel stack.

Fortunately, only a few IPv6 tunnel devices have addr_len
larger than sizeof(struct sockaddr) and they don't support
ndo_set_mac_addr(). But with team driver, in lb mode, they
can still be enslaved to a team master and make its mac addr
length as the same.

Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-29 11:25:05 -07:00
Ursula Braun
10428dd835 net/smc: synchronize buffer usage with device
Usage of send buffer "sndbuf" is synced
(a) before filling sndbuf for cpu access
(b) after filling sndbuf for device access

Usage of receive buffer "RMB" is synced
(a) before reading RMB content for cpu access
(b) after reading RMB content for device access

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-29 11:22:58 -07:00
Ursula Braun
b33982c3a6 net/smc: cleanup function __smc_buf_create()
Split function __smc_buf_create() for better readability.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-29 11:22:58 -07:00
Ursula Braun
3e034725c0 net/smc: common functions for RMBs and send buffers
Creation and deletion of SMC receive and send buffers shares a high
amount of common code . This patch introduces common functions to get
rid of duplicate code.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-29 11:22:58 -07:00
Ursula Braun
9d8fb61734 net/smc: introduce sg-logic for send buffers
SMC send buffers are processed the same way as RMBs. Since RMBs have
been converted to sg-logic, do the same for send buffers.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-29 11:22:58 -07:00
Ursula Braun
d5b361b0dc net/smc: remove Kconfig warning
Now separate memory regions are created and registered for separate
RMBs. The unsafe_global_rkey of the protection domain is no longer
used. Thus the exposing memory warning can be removed.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-29 11:22:58 -07:00
Ursula Braun
652a1e41ec net/smc: register RMB-related memory region
A memory region created for a new RMB must be registered explicitly,
before the peer can make use of it for remote DMA transfer.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-29 11:22:58 -07:00
Ursula Braun
897e1c2457 net/smc: use separate memory regions for RMBs
SMC currently uses the unsafe_global_rkey of the protection domain,
which exposes all memory for remote reads and writes once a connection
is established. This patch introduces separate memory regions with
separate rkeys for every RMB. Now the unsafe_global_rkey of the
protection domain is no longer needed.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-29 11:22:58 -07:00
Ursula Braun
a3fe3d01bd net/smc: introduce sg-logic for RMBs
The follow-on patch makes use of ib_map_mr_sg() when introducing
separate memory regions for RMBs. This function is based on
scatterlists; thus this patch introduces scatterlists for RMBs.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-29 11:22:57 -07:00
Ursula Braun
c45abf31e7 net/smc: shorten local bufsize variables
Initiate the coming rework of SMC buffer handling with this
small code cleanup. No functional changes here.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-29 11:22:57 -07:00
Ursula Braun
977bb32440 net/smc: serialize connection creation in all cases
If a link group for a new server connection exists already, the mutex
serializing the determination of link groups is given up early.
The coming registration of memory regions benefits from the serialization
as well, if the mutex is held till connection creation is finished.
This patch postpones the unlocking of the link group creation mutex.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-29 11:22:57 -07:00
Colin Ian King
f25cbb2a6a batman-adv: fix various spelling mistakes
Trivial fix to spelling mistakes in batadv_dbg debug messages and
also in a comment and ensure comment line is not wider than 80
characters

"ourselve" -> "ourselves"
"surpressed" -> "suppressed"
"troughput" -> "throughput"

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-07-29 09:51:28 +02:00
Sven Eckelmann
e04de4861c batman-adv: Remove variable deprecated by skb_put_data
skb_put_data makes it unnecessary to store the skb_put return value to copy
some data to the packet. The returned pointer of skb_put_data should
therefore not stored by functions which previously only used it to copy
some data.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-07-29 09:51:27 +02:00
Sven Eckelmann
6a04be8d86 batman-adv: Remove too short %pM printk field width
The string representation for a mac address produced by %pM is 17
characters long. Left-aligning the output in a 15 character wide field
width %-15pM is therefore misleading and unnecessary.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-07-29 09:51:26 +02:00
Joe Perches
cd0edf3ada batman-adv: Remove unnecessary length qualifier in %14pM
It's misleading and unnecessary.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-07-29 09:51:26 +02:00
Simon Wunderlich
a311abdffb batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-07-29 09:51:25 +02:00
Julia Lawall
d04916a48a l2tp: constify inet6_protocol structures
The inet6_protocol structure is only passed as the first argument to
inet6_add_protocol or inet6_del_protocol, both of which are declared as
const.  Thus the inet6_protocol structure itself can be const.

Also drop __read_mostly on the newly const structure.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-28 18:52:56 -07:00
Julia Lawall
3a3a4e3054 ipv6: constify inet6_protocol structures
The inet6_protocol structure is only passed as the first argument to
inet6_add_protocol or inet6_del_protocol, both of which are declared as
const.  Thus the inet6_protocol structure itself can be const.

Also drop __read_mostly where present on the newly const structures.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-28 18:52:56 -07:00
Xin Long
e90ce2fc27 dccp: fix a memleak for dccp_feat_init err process
In dccp_feat_init, when ccid_get_builtin_ccids failsto alloc
memory for rx.val, it should free tx.val before returning an
error.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-27 00:01:05 -07:00
Xin Long
b7953d3c0e dccp: fix a memleak that dccp_ipv4 doesn't put reqsk properly
The patch "dccp: fix a memleak that dccp_ipv6 doesn't put reqsk
properly" fixed reqsk refcnt leak for dccp_ipv6. The same issue
exists on dccp_ipv4.

This patch is to fix it for dccp_ipv4.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-27 00:01:05 -07:00
Xin Long
0c2232b0a7 dccp: fix a memleak that dccp_ipv6 doesn't put reqsk properly
In dccp_v6_conn_request, after reqsk gets alloced and hashed into
ehash table, reqsk's refcnt is set 3. one is for req->rsk_timer,
one is for hlist, and the other one is for current using.

The problem is when dccp_v6_conn_request returns and finishes using
reqsk, it doesn't put reqsk. This will cause reqsk refcnt leaks and
reqsk obj never gets freed.

Jianlin found this issue when running dccp_memleak.c in a loop, the
system memory would run out.

dccp_memleak.c:
  int s1 = socket(PF_INET6, 6, IPPROTO_IP);
  bind(s1, &sa1, 0x20);
  listen(s1, 0x9);
  int s2 = socket(PF_INET6, 6, IPPROTO_IP);
  connect(s2, &sa1, 0x20);
  close(s1);
  close(s2);

This patch is to put the reqsk before dccp_v6_conn_request returns,
just as what tcp_conn_request does.

Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-27 00:01:05 -07:00
Matthias Kaehlcke
0c3a8f8b8f netpoll: Fix device name check in netpoll_setup()
Apparently netpoll_setup() assumes that netpoll.dev_name is a pointer
when checking if the device name is set:

if (np->dev_name) {
  ...

However the field is a character array, therefore the condition always
yields true. Check instead whether the first byte of the array has a
non-zero value.

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-26 17:01:43 -07:00
Paolo Abeni
9688f9b020 udp: unbreak build lacking CONFIG_XFRM
We must use pre-processor conditional block or suitable accessors to
manipulate skb->sp elsewhere builds lacking the CONFIG_XFRM will break.

Fixes: dce4551cb2 ("udp: preserve head state for IP_CMSG_PASSSEC")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-26 09:35:29 -07:00
Stefano Brivio
afce615aaa ipv6: Don't increase IPSTATS_MIB_FRAGFAILS twice in ip6_fragment()
RFC 2465 defines ipv6IfStatsOutFragFails as:

	"The number of IPv6 datagrams that have been discarded
	 because they needed to be fragmented at this output
	 interface but could not be."

The existing implementation, instead, would increase the counter
twice in case we fail to allocate room for single fragments:
once for the fragment, once for the datagram.

This didn't look intentional though. In one of the two affected
affected failure paths, the double increase was simply a result
of a new 'goto fail' statement, introduced to avoid a skb leak.
The other path appears to be affected since at least 2.6.12-rc2.

Reported-by: Sabrina Dubroca <sdubroca@redhat.com>
Fixes: 1d325d217c ("ipv6: ip6_fragment: fix headroom tests and skb leak")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-25 21:17:10 -07:00
stephen hemminger
ce7426ca70 6lowpan: fix set not used warning
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Acked-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-25 12:31:37 -07:00
stephen hemminger
614d79c09e socket: fix set not used warning
The variable owned_by_user is always set, but only used
when kernel is configured with LOCKDEP enabled.

Get rid of the warning by moving the code to put the call
to owned_by_user into the the rcu_protected call.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-25 12:31:37 -07:00
stephen hemminger
3754b87a4e netfilter: remove unused variable
warning: ‘recent_old_fops’ defined but not used

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-25 12:31:37 -07:00
Paolo Abeni
dce4551cb2 udp: preserve head state for IP_CMSG_PASSSEC
Paul Moore reported a SELinux/IP_PASSSEC regression
caused by missing skb->sp at recvmsg() time. We need to
preserve the skb head state to process the IP_CMSG_PASSSEC
cmsg.

With this commit we avoid releasing the skb head state in the
BH even if a secpath is attached to the current skb, and stores
the skb status (with/without head states) in the scratch area,
so that we can access it at skb deallocation time, without
incurring in cache-miss penalties.

This also avoids misusing the skb CB for ipv6 packets,
as introduced by the commit 0ddf3fb2c4 ("udp: preserve
skb->dst if required for IP options processing").

Clean a bit the scratch area helpers implementation, to
reduce the code differences between 32 and 64 bits build.

Reported-by: Paul Moore <paul@paul-moore.com>
Fixes: 0a463c78d2 ("udp: avoid a cache miss on dequeue")
Fixes: 0ddf3fb2c4 ("udp: preserve skb->dst if required for IP options processing")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-25 10:00:58 -07:00
Florian Fainelli
9f9e772da2 net: dsa: Initialize ds->cpu_port_mask earlier
The mt7530 driver has its dsa_switch_ops::get_tag_protocol function
check ds->cpu_port_mask to issue a warning in case the configured CPU
port is not capable of supporting tags.

After commit 14be36c2c9 ("net: dsa: Initialize all CPU and enabled
ports masks in dsa_ds_parse()") we slightly re-arranged the
initialization such that this was no longer working. Just make sure that
ds->cpu_port_mask is set prior to the first call to get_tag_protocol,
thus restoring the expected contract. In case of error, the CPU port bit
is cleared.

Fixes: 14be36c2c9 ("net: dsa: Initialize all CPU and enabled ports masks in dsa_ds_parse()")
Reported-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 17:36:27 -07:00
WANG Cong
c800aaf8d8 packet: fix use-after-free in prb_retire_rx_blk_timer_expired()
There are multiple reports showing we have a use-after-free in
the timer prb_retire_rx_blk_timer_expired(), where we use struct
tpacket_kbdq_core::pkbdq, a pg_vec, after it gets freed by
free_pg_vec().

The interesting part is it is not freed via packet_release() but
via packet_setsockopt(), which means we are not closing the socket.
Looking into the big and fat function packet_set_ring(), this could
happen if we satisfy the following conditions:

1. closing == 0, not on packet_release() path
2. req->tp_block_nr == 0, we don't allocate a new pg_vec
3. rx_ring->pg_vec is already set as V3, which means we already called
   packet_set_ring() wtih req->tp_block_nr > 0 previously
4. req->tp_frame_nr == 0, pass sanity check
5. po->mapped == 0, never called mmap()

In this scenario we are clearing the old rx_ring->pg_vec, so we need
to free this pg_vec, but we don't stop the timer on this path because
of closing==0.

The timer has to be stopped as long as we need to free pg_vec, therefore
the check on closing!=0 is wrong, we should check pg_vec!=NULL instead.

Thanks to liujian for testing different fixes.

Reported-by: alexander.levin@verizon.com
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Reported-by: liujian (CE) <liujian56@huawei.com>
Tested-by: liujian (CE) <liujian56@huawei.com>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 17:33:19 -07:00
Matvejchikov Ilya
e42e24c3cc tcp: remove redundant argument from tcp_rcv_established()
The last (4th) argument of tcp_rcv_established() is redundant as it
always equals to skb->len and the skb itself is always passed as 2th
agrument. There is no reason to have it.

Signed-off-by: Ilya V. Matveychikov <matvejchikov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 17:28:12 -07:00
Florian Westphal
a28b1b90de skbuff: re-add check for NULL skb->head in kfree_skb path
A null check is needed after all.  netlink skbs can have skb->head be
backed by vmalloc.  The netlink destructor vfree()s head, then sets it to
NULL.  We then panic in skb_release_data with a NULL dereference.

Re-add such a test.

Alternative would be to switch to kvfree to free skb->head memory
and remove the special handling in netlink destructor.

Reported-by: kernel test robot <fengguang.wu@intel.com>
Fixes: 06dc75ab06 ("net: Revert "net: add function to allocate sk_buff head without data area")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 16:27:08 -07:00
Liping Zhang
69ec932e36 openvswitch: fix potential out of bound access in parse_ct
Before the 'type' is validated, we shouldn't use it to fetch the
ovs_ct_attr_lens's minlen and maxlen, else, out of bound access
may happen.

Fixes: 7f8a436eaa ("openvswitch: Add conntrack action")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 16:25:06 -07:00
David S. Miller
71085745ec RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAWXJGrPSw1s6N8H32AQK/5A/5AcdMLnejfV4r4qQvqKmv/M3w6Lt9P1qY
 sQJyWUhUPATptttj0rdtkHh9n5exOkBpPE/pBrwxoSXe0rm8oa1xO96UWsdDQWn1
 DwILipqyTQ9HHNESoY9XaBpPy7bTNRVUXOcVTXLqVuSozkiZgINic4uq/q8pVonB
 NRUULZPdcxmETUhZyBzloV2afY1pv287Rz5vRm8PUnRZmVK26lHylFi75Eywblju
 nw3N+McPe846Tc5qIFyj3b9VdMtzFA/py7GkrWPeHRmVHdZOviH9rQ++KkiBCUAz
 hQl/YaSKCGbTL9KU/B4E2dz3VnL48p3AVxQusCA5BExOO+HIDiCrenti3JMpEXLN
 gt29rD4AEyxBYbocJHpXNRxARxzDmBAmaw4tRC1Aw57MXomV5uMm/jKH/f646sYe
 S7ohnngaeWRwMa4JfxgNdf+NEenUwm/06tTSYrwYynWpjJDanI0xQDLgBYKR8SYp
 YoYLAv1tduMXcX7JjSWq2lPn6WvDnSZzRWOpJPHeFJcaGEcaYer5Qw8Yr/ZgOVxm
 0xz3wgZtckIfi2d6NcSybEvIPv5jI2BLhxqgpvxxfW95NdDXcsyQCycugX69Jg3o
 Zar4qJSFCBtC86KDOkL008X7fv/I27yb+nm5EcerC8stO6GymqfZVo6f9wEIhb6W
 P+rtLI3zYco=
 =8ZR+
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20170721' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Rearrange headers

Here's a pair of patches that rearrange some of the AF_RXRPC header files
that are outside of the net/rxrpc/ directory:

 (1) The bits userspace need are moved to uapi/linux/rxrpc.h.  [Should this
     be af_rxrpc.h instead, I wonder - but there doesn't seem to be
     precedent for that in the other net UAPI headers.]

 (2) For the most part, the contents of rxrpc/packet.h are no longer used
     outside of the AF_RXRPC module, so move them to net/rxrpc/protocol.h
     with the exception of the standard abort codes which are exposed to
     userspace when an abort occurs and the security index values which are
     needed when constructing keys.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 16:17:10 -07:00
Xin Long
441ae65ae0 sctp: remove the typedef sctp_abort_chunk_t
This patch is to remove the typedef sctp_abort_chunk_t, and
replace with struct sctp_abort_chunk in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 16:01:20 -07:00
Xin Long
38c00f7482 sctp: remove the typedef sctp_heartbeat_chunk_t
This patch is to remove the typedef sctp_heartbeat_chunk_t, and
replace with struct sctp_heartbeat_chunk in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 16:01:20 -07:00
Xin Long
4d2dcdf4e0 sctp: remove the typedef sctp_heartbeathdr_t
This patch is to remove the typedef sctp_heartbeathdr_t, and
replace with struct sctp_heartbeathdr in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 16:01:20 -07:00
Xin Long
d4d6c61489 sctp: remove the typedef sctp_sack_chunk_t
This patch is to remove the typedef sctp_sack_chunk_t, and
replace with struct sctp_sack_chunk in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 16:01:20 -07:00
Xin Long
787310859d sctp: remove the typedef sctp_sackhdr_t
This patch is to remove the typedef sctp_sackhdr_t, and replace
with struct sctp_sackhdr in the places where it's using this
typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 16:01:20 -07:00
Xin Long
afd93b7be6 sctp: remove the typedef sctp_sack_variable_t
This patch is to remove the typedef sctp_sack_variable_t, and
replace with union sctp_sack_variable in the places where it's
using this typedef.

It is also to fix some indents in sctp_acked().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 16:01:20 -07:00
Xin Long
62e6b7e4ee sctp: remove the typedef sctp_unrecognized_param_t
This patch is to remove the typedef sctp_unrecognized_param_t, and
replace with struct sctp_unrecognized_param in the places where it's
using this typedef.

It is also to fix some indents in sctp_sf_do_unexpected_init() and
sctp_sf_do_5_1B_init().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 16:01:19 -07:00
Xin Long
f48ef4c7f7 sctp: remove the typedef sctp_cookie_param_t
This patch is to remove the typedef sctp_cookie_param_t, and
replace with struct sctp_cookie_param in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 16:01:19 -07:00
Xin Long
cb1844c472 sctp: remove the typedef sctp_initack_chunk_t
This patch is to remove the typedef sctp_initack_chunk_t, and
replace with struct sctp_initack_chunk in the places where it's
using this typedef.

It is also to use sizeof(variable) instead of sizeof(type).

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 16:01:19 -07:00
Paolo Abeni
864d966424 net/socket: fix type in assignment and trim long line
The commit ffb07550c7 ("copy_msghdr_from_user(): get rid of
field-by-field copyin") introduce a new sparse warning:

net/socket.c:1919:27: warning: incorrect type in assignment (different address spaces)
net/socket.c:1919:27:    expected void *msg_control
net/socket.c:1919:27:    got void [noderef] <asn:1>*[addressable] msg_control

and a line above 80 chars, let's fix them

Fixes: ffb07550c7 ("copy_msghdr_from_user(): get rid of field-by-field copyin")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 14:17:01 -07:00
Sabrina Dubroca
ae847f40b6 net: call udp_tunnel_get_rx_info when NETIF_F_RX_UDP_TUNNEL_PORT is toggled
NETIF_F_RX_UDP_TUNNEL_PORT is special, in that we need to do more than
just flip the bit in dev->features. When disabling we must also clear
currently offloaded ports from the device, and when enabling we must
tell the device to offload the ports it can.

Because vxlan stores its sockets in a hashtable and they are inserted at
the head of per-bucket lists, switching the feature off and then on can
result in a different set of ports being offloaded (depending on the
HW's limits).

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 13:52:59 -07:00
Sabrina Dubroca
296d8ee37c net: add infrastructure to un-offload UDP tunnel port
This adds a new NETDEV_UDP_TUNNEL_DROP_INFO event, similar to
NETDEV_UDP_TUNNEL_PUSH_INFO, to signal to un-offload ports.

This also adds udp_tunnel_drop_rx_port(), which calls
ndo_udp_tunnel_del.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 13:52:59 -07:00