Commit Graph

1021081 Commits

Author SHA1 Message Date
Nguyen Dinh Phi
be5d1b61a2 tcp: fix tcp_init_transfer() to not reset icsk_ca_initialized
This commit fixes a bug (found by syzkaller) that could cause spurious
double-initializations for congestion control modules, which could cause
memory leaks or other problems for congestion control modules (like CDG)
that allocate memory in their init functions.

The buggy scenario constructed by syzkaller was something like:

(1) create a TCP socket
(2) initiate a TFO connect via sendto()
(3) while socket is in TCP_SYN_SENT, call setsockopt(TCP_CONGESTION),
    which calls:
       tcp_set_congestion_control() ->
         tcp_reinit_congestion_control() ->
           tcp_init_congestion_control()
(4) receive ACK, connection is established, call tcp_init_transfer(),
    set icsk_ca_initialized=0 (without first calling cc->release()),
    call tcp_init_congestion_control() again.

Note that in this sequence tcp_init_congestion_control() is called
twice without a cc->release() call in between. Thus, for CC modules
that allocate memory in their init() function, e.g, CDG, a memory leak
may occur. The syzkaller tool managed to find a reproducer that
triggered such a leak in CDG.

The bug was introduced when that commit 8919a9b31e ("tcp: Only init
congestion control if not initialized already")
introduced icsk_ca_initialized and set icsk_ca_initialized to 0 in
tcp_init_transfer(), missing the possibility for a sequence like the
one above, where a process could call setsockopt(TCP_CONGESTION) in
state TCP_SYN_SENT (i.e. after the connect() or TFO open sendmsg()),
which would call tcp_init_congestion_control(). It did not intend to
reset any initialization that the user had already explicitly made;
it just missed the possibility of that particular sequence (which
syzkaller managed to find).

Fixes: 8919a9b31e ("tcp: Only init congestion control if not initialized already")
Reported-by: syzbot+f1e24a0594d4e3a895d3@syzkaller.appspotmail.com
Signed-off-by: Nguyen Dinh Phi <phind.uet@gmail.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-06 10:32:37 -07:00
Paul Blakey
8550ff8d8c skbuff: Release nfct refcount on napi stolen or re-used skbs
When multiple SKBs are merged to a new skb under napi GRO,
or SKB is re-used by napi, if nfct was set for them in the
driver, it will not be released while freeing their stolen
head state or on re-use.

Release nfct on napi's stolen or re-used SKBs, and
in gro_list_prepare, check conntrack metadata diff.

Fixes: 5c6b946047 ("net/mlx5e: CT: Handle misses after executing CT action")
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-06 10:26:29 -07:00
Pablo Neira Ayuso
d1b5b80da7 netfilter: nft_last: incorrect arithmetics when restoring last used
Subtract the jiffies that have passed by to current jiffies to fix last
used restoration.

Fixes: 836382dc24 ("netfilter: nf_tables: add last expression")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-06 14:15:13 +02:00
Pablo Neira Ayuso
6ac4bac4ce netfilter: nft_last: honor NFTA_LAST_SET on restoration
NFTA_LAST_SET tells us if this expression has ever seen a packet, do not
ignore this attribute when restoring the ruleset.

Fixes: 836382dc24 ("netfilter: nf_tables: add last expression")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-06 14:15:13 +02:00
Manfred Spraul
cf4466ea47 netfilter: conntrack: Mark access for KCSAN
KCSAN detected an data race with ipc/sem.c that is intentional.

As nf_conntrack_lock() uses the same algorithm: Update
nf_conntrack_core as well:

nf_conntrack_lock() contains
  a1) spin_lock()
  a2) smp_load_acquire(nf_conntrack_locks_all).

a1) actually accesses one lock from an array of locks.

nf_conntrack_locks_all() contains
  b1) nf_conntrack_locks_all=true (normal write)
  b2) spin_lock()
  b3) spin_unlock()

b2 and b3 are done for every lock.

This guarantees that nf_conntrack_locks_all() prevents any
concurrent nf_conntrack_lock() owners:
If a thread past a1), then b2) will block until that thread releases
the lock.
If the threat is before a1, then b3)+a1) ensure the write b1) is
visible, thus a2) is guaranteed to see the updated value.

But: This is only the latest time when b1) becomes visible.
It may also happen that b1) is visible an undefined amount of time
before the b3). And thus KCSAN will notice a data race.

In addition, the compiler might be too clever.

Solution: Use WRITE_ONCE().

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-06 14:15:13 +02:00
Ali Abdallah
1da4cd82dd netfilter: conntrack: add new sysctl to disable RST check
This patch adds a new sysctl tcp_ignore_invalid_rst to disable marking
out of segments RSTs as INVALID.

Signed-off-by: Ali Abdallah <aabdallah@suse.de>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-06 14:15:12 +02:00
Ali Abdallah
c4edc3ccbc netfilter: conntrack: improve RST handling when tuple is re-used
If we receive a SYN packet in original direction on an existing
connection tracking entry, we let this SYN through because conntrack
might be out-of-sync.

Conntrack gets back in sync when server responds with SYN/ACK and state
gets updated accordingly.

However, if server replies with RST, this packet might be marked as
INVALID because td_maxack value reflects the *old* conntrack state
and not the state of the originator of the RST.

Avoid td_maxack-based checks if previous packet was a SYN.

Unfortunately that is not be enough: an out of order ACK in original
direction updates last_index, so we still end up marking valid RST.

Thus disable the sequence check when we are not in established state and
the received RST has a sequence of 0.

Because marking RSTs as invalid usually leads to unwanted timeouts,
also skip RST sequence checks if a conntrack entry is already closing.

Such entries can already be evicted via GC in case the table is full.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Ali Abdallah <aabdallah@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-06 14:15:12 +02:00
Gu Shengxian
bc832065b6 bpftool: Properly close va_list 'ap' by va_end() on error
va_list 'ap' was opened but not closed by va_end() in error case. It should
be closed by va_end() before the return.

Fixes: aa52bcbe0e ("tools: bpftool: Fix json dump crash on powerpc")
Signed-off-by: Gu Shengxian <gushengxian@yulong.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/bpf/20210706013543.671114-1-gushengxian507419@gmail.com
2021-07-06 09:19:23 +02:00
Wang Hai
2620e92ae6 bpf, samples: Fix xdpsock with '-M' parameter missing unload process
Execute the following command and exit, then execute it again, the following
error will be reported:

  $ sudo ./samples/bpf/xdpsock -i ens4f2 -M
  ^C
  $ sudo ./samples/bpf/xdpsock -i ens4f2 -M
  libbpf: elf: skipping unrecognized data section(16) .eh_frame
  libbpf: elf: skipping relo section(17) .rel.eh_frame for section(16) .eh_frame
  libbpf: Kernel error message: XDP program already attached
  ERROR: link set xdp fd failed

Commit c9d27c9e8d ("samples: bpf: Do not unload prog within xdpsock") removed
the unloading prog code because of the presence of bpf_link. This is fine if
XDP_SHARED_UMEM is disabled, but if it is enabled, unloading the prog is still
needed.

Fixes: c9d27c9e8d ("samples: bpf: Do not unload prog within xdpsock")
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20210628091815.2373487-1-wanghai38@huawei.com
2021-07-05 23:34:18 +02:00
Toke Høiland-Jørgensen
5a0ae9872d bpf, samples: Add -fno-asynchronous-unwind-tables to BPF Clang invocation
The samples/bpf Makefile currently compiles BPF files in a way that will
produce an .eh_frame section, which will in turn confuse libbpf and produce
errors when loading BPF programs, like:

  libbpf: elf: skipping unrecognized data section(32) .eh_frame
  libbpf: elf: skipping relo section(33) .rel.eh_frame for section(32) .eh_frame

Fix this by instruction Clang not to produce this section, as it's useless
for BPF anyway.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210705103841.180260-1-toke@redhat.com
2021-07-05 21:56:30 +02:00
David S. Miller
c6c205ed44 Merge branch 'stmmac-ptp'
Xiaoliang Yang says:

====================
net: stmmac: re-configure tas basetime after ptp time adjust

If the DWMAC Ethernet device has already set the Qbv EST configuration
before using ptp to synchronize the time adjustment, the Qbv base time
may change to be the past time of the new current time. This is not
allowed by hardware.

This patch calculates and re-configures the Qbv basetime after ptp time
adjustment.

v1->v2:
  Update est mutex lock to protect btr/ctr r/w to be atomic.
  Add btr_reserve to store basetime from qopt and used as origin base
time in Qbv re-configuration.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-05 10:16:18 -07:00
Xiaoliang Yang
e9e3720002 net: stmmac: ptp: update tas basetime after ptp adjust
After adjusting the ptp time, the Qbv base time may be the past time
of the new current time. dwmac5 hardware limited the base time cannot
be set as past time. This patch add a btr_reserve to store the base
time get from qopt, then calculate the base time and reset the Qbv
configuration after ptp time adjust.

Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-05 10:16:17 -07:00
Xiaoliang Yang
b2aae654a4 net: stmmac: add mutex lock to protect est parameters
Add a mutex lock to protect est structure parameters so that the
EST parameters can be updated by other threads.

Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-05 10:16:17 -07:00
Xiaoliang Yang
81c52c42af net: stmmac: separate the tas basetime calculation function
Separate the TAS basetime calculation function so that it can be
called by other functions.

Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-05 10:16:17 -07:00
Yangbo Lu
f6a175cfcc ptp: fix format string mismatch in ptp_sysfs.c
Fix format string mismatch in ptp_sysfs.c. Use %u for unsigned int.

Fixes: 73f37068d5 ("ptp: support ptp physical/virtual clocks conversion")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-05 10:10:41 -07:00
Yangbo Lu
55eac20617 ptp: fix NULL pointer dereference in ptp_clock_register
Fix NULL pointer dereference in ptp_clock_register. The argument
"parent" of ptp_clock_register may be NULL pointer.

Fixes: 73f37068d5 ("ptp: support ptp physical/virtual clocks conversion")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-05 10:09:29 -07:00
Lorenzo Bianconi
6ff63a150b net: marvell: always set skb_shared_info in mvneta_swbm_add_rx_fragment
Always set skb_shared_info data structure in mvneta_swbm_add_rx_fragment
routine even if the fragment contains only the ethernet FCS.

Fixes: 039fbc47f9 ("net: mvneta: alloc skb_shared_info on the mvneta_rx_swbm stack")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-03 13:50:21 -07:00
Paolo Abeni
b43c8909be udp: properly flush normal packet at GRO time
If an UDP packet enters the GRO engine but is not eligible
for aggregation and is not targeting an UDP tunnel,
udp_gro_receive() will not set the flush bit, and packet
could delayed till the next napi flush.

Fix the issue ensuring non GROed packets traverse
skb_gro_flush_final().

Reported-and-tested-by: Matthias Treydte <mt@waldheinz.de>
Fixes: 18f25dc399 ("udp: skip L4 aggregation for UDP tunnel packets")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-02 16:19:14 -07:00
Ronak Doshi
b22580233d vmxnet3: fix cksum offload issues for tunnels with non-default udp ports
Commit dacce2be33 ("vmxnet3: add geneve and vxlan tunnel offload
support") added support for encapsulation offload. However, the inner
offload capability is to be restricted to UDP tunnels with default
Vxlan and Geneve ports.

This patch fixes the issue for tunnels with non-default ports using
features check capability and filtering appropriate features for such
tunnels.

Fixes: dacce2be33 ("vmxnet3: add geneve and vxlan tunnel offload support")
Signed-off-by: Ronak Doshi <doshir@vmware.com>
Acked-by: Guolin Yang <gyang@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-02 13:41:15 -07:00
David S. Miller
99f47ea437 Merge branch 'nfp-ct-fixes'
Simon Horman says:

====================
small tc conntrack fixes

Louis Peens says:

The first patch makes sure that any callbacks registered to
the ct->nf_tables table are cleaned up before freeing.

The second patch removes what was effectively a workaround
in the nfp driver because of the missing cleanup in patch 1.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-02 13:36:35 -07:00
Louis Peens
7cc93d888d nfp: flower-ct: remove callback delete deadlock
The current location of the callback delete can lead to a race
condition where deleting the callback requires a write_lock on
the nf_table, but at the same time another thread from netfilter
could have taken a read lock on the table before trying to offload.
Since the driver is taking a rtnl_lock this can lead into a deadlock
situation, where the netfilter offload will wait for the cls_flower
rtnl_lock to be released, but this cannot happen since this is
waiting for the nf_table read_lock to be released before it can
delete the callback.

Solve this by completely removing the nf_flow_table_offload_del_cb
call, as this will now be cleaned up by act_ct itself when cleaning
up the specific nf_table.

Fixes: 62268e7814 ("nfp: flower-ct: add nft callback stubs")
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-02 13:36:35 -07:00
Louis Peens
77ac5e40c4 net/sched: act_ct: remove and free nf_table callbacks
When cleaning up the nf_table in tcf_ct_flow_table_cleanup_work
there is no guarantee that the callback list, added to by
nf_flow_table_offload_add_cb, is empty. This means that it is
possible that the flow_block_cb memory allocated will be lost.

Fix this by iterating the list and freeing the flow_block_cb entries
before freeing the nf_table entry (via freeing ct_ft).

Fixes: 978703f425 ("netfilter: flowtable: Add API for registering to flow table events")
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-02 13:36:35 -07:00
Wolfgang Bumiller
a019abd802 net: bridge: sync fdb to new unicast-filtering ports
Since commit 2796d0c648 ("bridge: Automatically manage
port promiscuous mode.")
bridges with `vlan_filtering 1` and only 1 auto-port don't
set IFF_PROMISC for unicast-filtering-capable ports.

Normally on port changes `br_manage_promisc` is called to
update the promisc flags and unicast filters if necessary,
but it cannot distinguish between *new* ports and ones
losing their promisc flag, and new ports end up not
receiving the MAC address list.

Fix this by calling `br_fdb_sync_static` in `br_add_if`
after the port promisc flags are updated and the unicast
filter was supposed to have been filled.

Fixes: 2796d0c648 ("bridge: Automatically manage port promiscuous mode.")
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-02 13:34:07 -07:00
Eric Dumazet
81b4a0cc75 sock: fix error in sock_setsockopt()
Some tests are failing, John bisected the issue to a recent commit.

sock_set_timestamp() parameters should be :

1) sk
2) optname
3) valbool

Fixes: 371087aa47 ("sock: expose so_timestamp options for mptcp")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-by: John Sperbeck <jsperbeck@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-02 13:32:55 -07:00
Eric Dumazet
561022acb1 tcp: annotate data races around tp->mtu_info
While tp->mtu_info is read while socket is owned, the write
sides happen from err handlers (tcp_v[46]_mtu_reduced)
which only own the socket spinlock.

Fixes: 563d34d057 ("tcp: dont drop MTU reduction indications")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-02 13:31:48 -07:00
wenxu
8955b90c3c net/sched: act_ct: fix err check for nf_conntrack_confirm
The confirm operation should be checked. If there are any failed,
the packet should be dropped like in ovs and netfilter.

Fixes: b57dc7c13e ("net/sched: Introduce action ct")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-02 12:07:08 -07:00
Bailey Forrest
1bfa4d0cb5 gve: DQO: Remove incorrect prefetch
The prefetch is incorrectly using the dma address instead of the virtual
address.

It's supposed to be:
prefetch((char *)buf_state->page_info.page_address +
	 buf_state->page_info.page_offset)

However, after correcting this mistake, there is no evidence of
performance improvement.

Fixes: 9b8dd5e5ea ("gve: DQO: Add RX path")
Signed-off-by: Bailey Forrest <bcf@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-02 12:06:17 -07:00
Vadim Fedorenko
40fc3054b4 net: ipv6: fix return value of ip6_skb_dst_mtu
Commit 628a5c5618 ("[INET]: Add IP(V6)_PMTUDISC_RPOBE") introduced
ip6_skb_dst_mtu with return value of signed int which is inconsistent
with actually returned values. Also 2 users of this function actually
assign its value to unsigned int variable and only __xfrm6_output
assigns result of this function to signed variable but actually uses
as unsigned in further comparisons and calls. Change this function
to return unsigned int value.

Fixes: 628a5c5618 ("[INET]: Add IP(V6)_PMTUDISC_RPOBE")
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-02 11:57:01 -07:00
Christophe JAILLET
bde3c8ffdd gve: Simplify code and axe the use of a deprecated API
The wrappers in include/linux/pci-dma-compat.h should go away.

Replace 'pci_set_dma_mask/pci_set_consistent_dma_mask' by an equivalent
and less verbose 'dma_set_mask_and_coherent()' call.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-02 11:56:15 -07:00
Jesper Dangaard Brouer
633fa66640 net/sched: sch_taprio: fix typo in comment
I have checked that the IEEE standard 802.1Q-2018 section 8.6.9.4.5
is called AdminGateStates.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-02 11:55:28 -07:00
Vasily Averin
c23a9fd209 netfilter: ctnetlink: suspicious RCU usage in ctnetlink_dump_helpinfo
Two patches listed below removed ctnetlink_dump_helpinfo call from under
rcu_read_lock. Now its rcu_dereference generates following warning:
=============================
WARNING: suspicious RCU usage
5.13.0+ #5 Not tainted
-----------------------------
net/netfilter/nf_conntrack_netlink.c:221 suspicious rcu_dereference_check() usage!

other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
stack backtrace:
CPU: 1 PID: 2251 Comm: conntrack Not tainted 5.13.0+ #5
Call Trace:
 dump_stack+0x7f/0xa1
 ctnetlink_dump_helpinfo+0x134/0x150 [nf_conntrack_netlink]
 ctnetlink_fill_info+0x2c2/0x390 [nf_conntrack_netlink]
 ctnetlink_dump_table+0x13f/0x370 [nf_conntrack_netlink]
 netlink_dump+0x10c/0x370
 __netlink_dump_start+0x1a7/0x260
 ctnetlink_get_conntrack+0x1e5/0x250 [nf_conntrack_netlink]
 nfnetlink_rcv_msg+0x613/0x993 [nfnetlink]
 netlink_rcv_skb+0x50/0x100
 nfnetlink_rcv+0x55/0x120 [nfnetlink]
 netlink_unicast+0x181/0x260
 netlink_sendmsg+0x23f/0x460
 sock_sendmsg+0x5b/0x60
 __sys_sendto+0xf1/0x160
 __x64_sys_sendto+0x24/0x30
 do_syscall_64+0x36/0x70
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: 49ca022bcc ("netfilter: ctnetlink: don't dump ct extensions of unconfirmed conntracks")
Fixes: 0b35f6031a ("netfilter: Remove duplicated rcu_read_lock.")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-02 02:29:20 +02:00
Vasily Averin
a23f89a999 netfilter: conntrack: nf_ct_gre_keymap_flush() removal
nf_ct_gre_keymap_flush() is useless.
It is called from nf_conntrack_cleanup_net_list() only and tries to remove
nf_ct_gre_keymap entries from pernet gre keymap list. Though:
a) at this point the list should already be empty, all its entries were
deleted during the conntracks cleanup, because
nf_conntrack_cleanup_net_list() executes nf_ct_iterate_cleanup(kill_all)
before nf_conntrack_proto_pernet_fini():
 nf_conntrack_cleanup_net_list
  +- nf_ct_iterate_cleanup
  |   nf_ct_put
  |    nf_conntrack_put
  |     nf_conntrack_destroy
  |      destroy_conntrack
  |       destroy_gre_conntrack
  |        nf_ct_gre_keymap_destroy
  `- nf_conntrack_proto_pernet_fini
      nf_ct_gre_keymap_flush

b) Let's say we find that the keymap list is not empty. This means netns
still has a conntrack associated with gre, in which case we should not free
its memory, because this will lead to a double free and related crashes.
However I doubt it could have gone unnoticed for years, obviously
this does not happen in real life. So I think we can remove
both nf_ct_gre_keymap_flush() and nf_conntrack_proto_pernet_fini().

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-02 02:07:01 +02:00
Colin Ian King
4ca041f919 netfilter: nf_tables: Fix dereference of null pointer flow
In the case where chain->flags & NFT_CHAIN_HW_OFFLOAD is false then
nft_flow_rule_create is not called and flow is NULL. The subsequent
error handling execution via label err_destroy_flow_rule will lead
to a null pointer dereference on flow when calling nft_flow_rule_destroy.
Since the error path to err_destroy_flow_rule has to cater for null
and non-null flows, only call nft_flow_rule_destroy if flow is non-null
to fix this issue.

Addresses-Coverity: ("Explicity null dereference")
Fixes: 3c5e446220 ("netfilter: nf_tables: memleak in hw offload abort path")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-02 02:05:59 +02:00
Florian Westphal
e15d4cdf27 netfilter: conntrack: do not renew entry stuck in tcp SYN_SENT state
Consider:
  client -----> conntrack ---> Host

client sends a SYN, but $Host is unreachable/silent.
Client eventually gives up and the conntrack entry will time out.

However, if the client is restarted with same addr/port pair, it
may prevent the conntrack entry from timing out.

This is noticeable when the existing conntrack entry has no NAT
transformation or an outdated one and port reuse happens either
on client or due to a NAT middlebox.

This change prevents refresh of the timeout for SYN retransmits,
so entry is going away after nf_conntrack_tcp_timeout_syn_sent
seconds (default: 60).

Entry will be re-created on next connection attempt, but then
nat rules will be evaluated again.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-02 02:05:59 +02:00
Florian Westphal
37d220b58d selftest: netfilter: add test case for unreplied tcp connections
TCP connections in UNREPLIED state (only SYN seen) can be kept alive
indefinitely, as each SYN re-sets the timeout.

This means that even if a peer has closed its socket the entry
never times out.

This also prevents re-evaluation of configured NAT rules.
Add a test case that sets SYN timeout to 10 seconds, then check
that the nat redirection added later eventually takes effect.

This is based off a repro script from Antonio Ojea.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-02 02:05:59 +02:00
Kees Cook
5140aaa460 s390: iucv: Avoid field over-reading memcpy()
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memcpy(), memmove(), and memset(), avoid
intentionally reading across neighboring array fields.

Add a wrapping struct to serve as the memcpy() source so the compiler
can perform appropriate bounds checking, avoiding this future warning:

In function '__fortify_memcpy',
    inlined from 'iucv_message_pending' at net/iucv/iucv.c:1663:4:
./include/linux/fortify-string.h:246:4: error: call to '__read_overflow2_field' declared with attribute error: detected read beyond size of field (2nd parameter)

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01 15:54:01 -07:00
Christophe JAILLET
6dce38b4b7 gve: Propagate error codes to caller
If 'gve_probe()' fails, we should propagate the error code, instead of
hard coding a -ENXIO value.
Make sure that all error handling paths set a correct value for 'err'.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01 15:45:01 -07:00
Christophe JAILLET
2342ae10d1 gve: Fix an error handling path in 'gve_probe()'
If the 'register_netdev() call fails, we must release the resources
allocated by the previous 'gve_init_priv()' call, as already done in the
remove function.

Add a new label and the missing 'gve_teardown_priv_resources()' in the
error handling path.

Fixes: 893ce44df5 ("gve: Add basic driver framework for Compute Engine Virtual NIC")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01 15:45:01 -07:00
David S. Miller
aa3cf240b0 Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/t
nguy/net-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2021-07-01

This series contains updates to igb, igc, ixgbe, e1000e, fm10k, and iavf
drivers.

Vinicius fixes a use-after-free issue present in igc and igb.

Tom Rix fixes the return value for igc_read_phy_reg() when the
operation is not supported for igc.

Christophe Jaillet fixes unrolling of PCIe error reporting for ixgbe,
igc, igb, fm10k, e10000e, and iavf.

Alex ensures that q_vector array is not accessed beyond its bounds for
igb.

Jedrzej moves ring assignment to occur after bounds have been checked in
igb.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01 14:39:28 -07:00
Mohammad Athari Bin Ismail
6b28a86d6c net: stmmac: Terminate FPE workqueue in suspend
Add stmmac_fpe_stop_wq() in stmmac_suspend() to terminate FPE workqueue
during suspend. So, in suspend mode, there will be no FPE workqueue
available. Without this fix, new additional FPE workqueue will be created
in every suspend->resume cycle.

Fixes: 5a5586112b ("net: stmmac: support FPE link partner hand-shaking procedure")
Signed-off-by: Mohammad Athari Bin Ismail <mohammad.athari.ismail@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01 13:51:46 -07:00
David S. Miller
1c88995dfb Merge branch 'sms911x-dts'
Geert Uytterhoeven says:

====================
sms911x: DTS fixes and DT binding to json-schema conversion

This patch series converts the Smart Mixed-Signal Connectivity (SMSC)
LAN911x/912x Controller Device Tree binding documentation to
json-schema, after fixing a few issues in DTS files.

Changed compared to v1[1]:
  - Dropped applied patches,
  - Add Reviewed-by,
  - Drop bogus double quotes in compatible values,
  - Add comment explaining why "additionalProperties: true" is needed.

[1] [PATCH 0/5] sms911x: DTS fixes and DT binding to json-schema conversion
    https://lore.kernel.org/r/cover.1621518686.git.geert+renesas@glider.be
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01 13:27:11 -07:00
Geert Uytterhoeven
19373d0233 dt-bindings: net: sms911x: Convert to json-schema
Convert the Smart Mixed-Signal Connectivity (SMSC) LAN911x/912x
Controller Device Tree binding documentation to json-schema.

Document missing properties.
Make "phy-mode" not required, as many DTS files do not have it, and the
Linux drivers falls back to PHY_INTERFACE_MODE_NA.
Correct nodename in example.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01 13:27:10 -07:00
Geert Uytterhoeven
b6c8801038 ARM: dts: qcom-apq8060: Correct Ethernet node name and drop bogus irq property
make dtbs_check:

    ethernet-ebi2@2,0: $nodename:0: 'ethernet-ebi2@2,0' does not match '^ethernet(@.*)?$'
    ethernet-ebi2@2,0: 'smsc,irq-active-low' does not match any of the regexes: 'pinctrl-[0-9]+'

There is no "smsc,irq-active-low" property, as active low is the
default.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01 13:27:10 -07:00
Eric Dumazet
18a419bad6 udp: annotate data races around unix_sk(sk)->gso_size
Accesses to unix_sk(sk)->gso_size are lockless.
Add READ_ONCE()/WRITE_ONCE() around them.

BUG: KCSAN: data-race in udp_lib_setsockopt / udpv6_sendmsg

write to 0xffff88812d78f47c of 2 bytes by task 10849 on cpu 1:
 udp_lib_setsockopt+0x3b3/0x710 net/ipv4/udp.c:2696
 udpv6_setsockopt+0x63/0x90 net/ipv6/udp.c:1630
 sock_common_setsockopt+0x5d/0x70 net/core/sock.c:3265
 __sys_setsockopt+0x18f/0x200 net/socket.c:2104
 __do_sys_setsockopt net/socket.c:2115 [inline]
 __se_sys_setsockopt net/socket.c:2112 [inline]
 __x64_sys_setsockopt+0x62/0x70 net/socket.c:2112
 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
 entry_SYSCALL_64_after_hwframe+0x44/0xae

read to 0xffff88812d78f47c of 2 bytes by task 10852 on cpu 0:
 udpv6_sendmsg+0x161/0x16b0 net/ipv6/udp.c:1299
 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:642
 sock_sendmsg_nosec net/socket.c:654 [inline]
 sock_sendmsg net/socket.c:674 [inline]
 ____sys_sendmsg+0x360/0x4d0 net/socket.c:2337
 ___sys_sendmsg net/socket.c:2391 [inline]
 __sys_sendmmsg+0x315/0x4b0 net/socket.c:2477
 __do_sys_sendmmsg net/socket.c:2506 [inline]
 __se_sys_sendmmsg net/socket.c:2503 [inline]
 __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2503
 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
 entry_SYSCALL_64_after_hwframe+0x44/0xae

value changed: 0x0000 -> 0x0005

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 10852 Comm: syz-executor.0 Not tainted 5.13.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: bec1f6f697 ("udp: generate gso with UDP_SEGMENT")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01 13:23:19 -07:00
Paolo Abeni
71158bb1f2 tcp: consistently disable header prediction for mptcp
The MPTCP receive path is hooked only into the TCP slow-path.
The DSS presence allows plain MPTCP traffic to hit that
consistently.

Since commit e1ff9e82e2 ("net: mptcp: improve fallback to TCP"),
when an MPTCP socket falls back to TCP, it can hit the TCP receive
fast-path, and delay or stop triggering the event notification.

Address the issue explicitly disabling the header prediction
for MPTCP sockets.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/200
Fixes: e1ff9e82e2 ("net: mptcp: improve fallback to TCP")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01 13:22:40 -07:00
Christoph Hellwig
ca75bcf0a8 net: remove the caif_hsi driver
The caif_hsi driver relies on a cfhsi_get_ops symbol using symbol_get,
but this symbol is not provided anywhere in the kernel tree.  Remove
this driver given that it is dead code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01 13:19:48 -07:00
Xin Long
09ef17863f Documentation: add more details in tipc.rst
kernel-doc for TIPC is too simple, we need to add more information for it.

This patch is to extend the abstract, and add the Features and Links items.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01 13:18:18 -07:00
Sukadev Bhattiprolu
4f408e1fa6 ibmvnic: retry reset if there are no other resets
Normally, if a reset fails due to failover or other communication error
there is another reset (eg: FAILOVER) in the queue and we would process
that reset. But if we are unable to communicate with PHYP or VIOS after
H_FREE_CRQ, there would be no other resets in the queue and the adapter
would be in an undefined state even though it was in the OPEN state
earlier. While starting the reset we set the carrier to off state so
we won't even get the timeout resets.

If the last queued reset fails, retry it as a hard reset (after the
usual 60 second settling time).

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Reviewed-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01 13:11:12 -07:00
David S. Miller
b2bc814817 Merge branch 'ptp-virtual-clocks-and-timestamping'
Yangbo Lu says:

====================
ptp: support virtual clocks and timestamping

Current PTP driver exposes one PTP device to user which binds network
interface/interfaces to provide timestamping. Actually we have a way
utilizing timecounter/cyclecounter to virtualize any number of PTP
clocks based on a same free running physical clock for using.
The purpose of having multiple PTP virtual clocks is for user space
to directly/easily use them for multiple domains synchronization.

user
space:     ^                                  ^
           | SO_TIMESTAMPING new flag:        | Packets with
           | SOF_TIMESTAMPING_BIND_PHC        | TX/RX HW timestamps
           v                                  v
         +--------------------------------------------+
sock:    |     sock (new member sk_bind_phc)          |
         +--------------------------------------------+
           ^                                  ^
           | ethtool_get_phc_vclocks          | Convert HW timestamps
           |                                  | to sk_bind_phc
           v                                  v
         +--------------+--------------+--------------+
vclock:  | ptp1         | ptp2         | ptpN         |
         +--------------+--------------+--------------+
pclock:  |             ptp0 free running              |
         +--------------------------------------------+

The block diagram may explain how it works. Besides the PTP virtual
clocks, the packet HW timestamp converting to the bound PHC is also
done in sock driver. For user space, PTP virtual clocks can be
created via sysfs, and extended SO_TIMESTAMPING API (new flag
SOF_TIMESTAMPING_BIND_PHC) can be used to bind one PTP virtual clock
for timestamping.

The test tool timestamping.c (together with linuxptp phc_ctl tool) can
be used to verify:

  # echo 4 > /sys/class/ptp/ptp0/n_vclocks
  [  129.399472] ptp ptp0: new virtual clock ptp2
  [  129.404234] ptp ptp0: new virtual clock ptp3
  [  129.409532] ptp ptp0: new virtual clock ptp4
  [  129.413942] ptp ptp0: new virtual clock ptp5
  [  129.418257] ptp ptp0: guarantee physical clock free running
  #
  # phc_ctl /dev/ptp2 set 10000
  # phc_ctl /dev/ptp3 set 20000
  #
  # timestamping eno0 2 SOF_TIMESTAMPING_TX_HARDWARE SOF_TIMESTAMPING_RAW_HARDWARE SOF_TIMESTAMPING_BIND_PHC
  # timestamping eno0 2 SOF_TIMESTAMPING_RX_HARDWARE SOF_TIMESTAMPING_RAW_HARDWARE SOF_TIMESTAMPING_BIND_PHC
  # timestamping eno0 3 SOF_TIMESTAMPING_TX_HARDWARE SOF_TIMESTAMPING_RAW_HARDWARE SOF_TIMESTAMPING_BIND_PHC
  # timestamping eno0 3 SOF_TIMESTAMPING_RX_HARDWARE SOF_TIMESTAMPING_RAW_HARDWARE SOF_TIMESTAMPING_BIND_PHC

Changes for v2:
	- Converted to num_vclocks for creating virtual clocks.
	- Guranteed physical clock free running when using virtual
	  clocks.
	- Fixed build warning.
	- Updated copyright.
Changes for v3:
	- Supported PTP virtual clock in default in PTP driver.
	- Protected concurrency of ptp->num_vclocks accessing.
	- Supported PHC vclocks query via ethtool.
	- Extended SO_TIMESTAMPING API for PHC binding.
	- Converted HW timestamps to PHC bound, instead of previous
	  binding domain value to PHC idea.
	- Other minor fixes.
Changes for v4:
	- Used do_aux_work callback for vclock refreshing instead.
	- Used unsigned int for vclocks number, and max_vclocks
	  for limitiation.
	- Fixed mutex locking.
	- Dynamically allocated memory for vclock index storage.
	- Removed ethtool ioctl command for vclocks getting.
	- Updated doc for ethtool phc vclocks get.
	- Converted to mptcp_setsockopt_sol_socket_timestamping().
	- Passed so_timestamping for sock_set_timestamping.
	- Fixed checkpatch/build.
	- Other minor fixed.
Changes for v5:
	- Fixed checkpatch/build/bug reported by test robot.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01 13:08:19 -07:00
Yangbo Lu
5ce15f2783 MAINTAINERS: add entry for PTP virtual clock driver
Add entry for PTP virtual clock driver.

Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01 13:08:19 -07:00