Commit Graph

77362 Commits

Author SHA1 Message Date
Eric Dumazet
adbe695a97 tcp: move inet_reqsk_alloc() close to inet_reqsk_clone()
inet_reqsk_alloc() does not belong to tcp_input.c,
move it to inet_connection_sock.c instead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 15:18:04 +02:00
Davide Caratti
92f74c1e05 mptcp: refer to 'MPTCP' socket in comments
We used to call it 'master' socket at the early stages of MPTCP
development, but the correct wording is 'MPTCP' socket opposed to 'TCP
subflows': convert the last 3 comments to use a more appropriate term.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 15:13:47 +02:00
Geliang Tang
5cdedad62e mptcp: add mptcp_space_from_win helper
As a wrapper of __tcp_space_from_win(), this patch adds a MPTCP dedicated
space_from_win helper mptcp_space_from_win() in protocol.h to paired with
mptcp_win_from_space().

Use it instead of __tcp_space_from_win() in both mptcp_rcv_space_adjust()
and mptcp_set_rcvlowat().

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 15:13:47 +02:00
Geliang Tang
5f0d0649c8 mptcp: use mptcp_win_from_space helper
The MPTCP dedicated win_from_space helper mptcp_win_from_space() is defined
in protocol.h, use it in mptcp_rcv_space_adjust() instead of using the TCP
one. Here scaling_ratio is the same as msk->scaling_ratio.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 15:13:47 +02:00
Jason Xing
9b6a30febd net: allow rps/rfs related configs to be switched
After John Sperbeck reported a compile error if the CONFIG_RFS_ACCEL
is off, I found that I cannot easily enable/disable the config
because of lack of the prompt when using 'make menuconfig'. Therefore,
I decided to change rps/rfc related configs altogether.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Link: https://lore.kernel.org/r/20240605022932.33703-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 13:18:48 +02:00
Eric Dumazet
98aa546af5 inet: remove (struct uncached_list)->quarantine
This list is used to tranfert dst that are handled by
rt_flush_dev() and rt6_uncached_list_flush_dev() out
of the per-cpu lists.

But quarantine list is not used later.

If we simply use list_del_init(&rt->dst.rt_uncached),
this also removes the dst from per-cpu list.

This patch also makes the future calls to rt_del_uncached_list()
and rt6_uncached_list_del() faster, because no spinlock
acquisition is needed anymore.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240604165150.726382-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 12:33:25 +02:00
Eric Dumazet
b4cb4a1391 net: use unrcu_pointer() helper
Toke mentioned unrcu_pointer() existence, allowing
to remove some of the ugly casts we have when using
xchg() for rcu protected pointers.

Also make inet_rcv_compat const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20240604111603.45871-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 11:52:52 +02:00
Kevin Yang
f086edef71 tcp: add sysctl_tcp_rto_min_us
Adding a sysctl knob to allow user to specify a default
rto_min at socket init time, other than using the hard
coded 200ms default rto_min.

Note that the rto_min route option has the highest precedence
for configuring this setting, followed by the TCP_BPF_RTO_MIN
socket option, followed by the tcp_rto_min_us sysctl.

Signed-off-by: Kevin Yang <yyd@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05 13:42:54 +01:00
Kevin Yang
512bd0f9f9 tcp: derive delack_max with tcp_rto_min helper
Rto_min now has multiple sources, ordered by preprecedence high to
low: ip route option rto_min, icsk->icsk_rto_min.

When derive delack_max from rto_min, we should not only use ip
route option, but should use tcp_rto_min helper to get the correct
rto_min.

Signed-off-by: Kevin Yang <yyd@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05 13:42:54 +01:00
Eric Dumazet
69e0b33a7f tcp: annotate data-races around tw->tw_ts_recent and tw->tw_ts_recent_stamp
These fields can be read and written locklessly, add annotations
around these minor races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05 12:30:09 +01:00
Christophe JAILLET
82dc29b973 devlink: Constify the 'table_ops' parameter of devl_dpipe_table_register()
"struct devlink_dpipe_table_ops" only contains some function pointers.

Update "struct devlink_dpipe_table" and the 'table_ops' parameter of
devl_dpipe_table_register() so that structures in drivers can be
constified.

Constifying these structures will move some data to a read-only section, so
increase overall security.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05 10:24:57 +01:00
Dr. David Alan Gilbert
a23b0034e9 net: ethtool: remove unused struct 'cable_test_tdr_req_info'
'cable_test_tdr_req_info' is unused since the original
commit f2bc8ad31a ("net: ethtool: Allow PHY cable test TDR data to
configured").

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05 10:19:08 +01:00
Dr. David Alan Gilbert
6f49c3fb56 net: caif: remove unused structs
'cfpktq' has been unused since
commit 73d6ac633c ("caif: code cleanup").

'caif_packet_funcs' is declared but never defined.

Remove both of them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05 10:18:06 +01:00
Jason Xing
61e2bbafb0 net: remove NULL-pointer net parameter in ip_metrics_convert
When I was doing some experiments, I found that when using the first
parameter, namely, struct net, in ip_metrics_convert() always triggers NULL
pointer crash. Then I digged into this part, realizing that we can remove
this one due to its uselessness.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05 10:06:00 +01:00
Chen Hanxiao
cdbdb3c62a net: bridge: fix an inconsistent indentation
Smatch complains:
net/bridge/br_netlink_tunnel.c:
   318 br_process_vlan_tunnel_info() warn: inconsistent indenting

Fix it with a proper indenting

Signed-off-by: Chen Hanxiao <chenhx.fnst@fujitsu.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05 10:04:47 +01:00
Breno Leitao
2b438c5774 openvswitch: Remove generic .ndo_get_stats64
Commit 3e2f544dd8 ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to net core, so,
unless the driver is doing some custom stats collection, it does not
need to set .ndo_get_stats64.

Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it
doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64
function pointer.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://lore.kernel.org/r/20240531111552.3209198-2-leitao@debian.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04 16:06:37 +02:00
Breno Leitao
8c3fdff217 openvswitch: Move stats allocation to core
With commit 34d21de99c ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core instead
of this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Move openvswitch driver to leverage the core allocation.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240531111552.3209198-1-leitao@debian.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04 16:06:37 +02:00
Jakub Kicinski
99b8add01f net: skb: add compatibility warnings to skb_shift()
According to current semantics we should never try to shift data
between skbs which differ on decrypted or pp_recycle status.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04 13:23:30 +02:00
Jakub Kicinski
1be68a87ab tcp: add a helper for setting EOR on tail skb
TLS (and hopefully soon PSP will) use EOR to prevent skbs
with different decrypted state from getting merged, without
adding new tests to the skb handling. In both cases once
the connection switches to an "encrypted" state, all subsequent
skbs will be encrypted, so a single "EOR fence" is sufficient
to prevent mixing.

Add a helper for setting the EOR bit, to make this arrangement
more explicit.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04 13:23:30 +02:00
Jakub Kicinski
0711153018 tcp: wrap mptcp and decrypted checks into tcp_skb_can_collapse_rx()
tcp_skb_can_collapse() checks for conditions which don't make
sense on input. Because of this we ended up sprinkling a few
pairs of mptcp_skb_can_collapse() and skb_cmp_decrypted() calls
on the input path. Group them in a new helper. This should make
it less likely that someone will check mptcp and not decrypted
or vice versa when adding new code.

This implicitly adds a decrypted check early in tcp_collapse().
AFAIU this will very slightly increase our ability to collapse
packets under memory pressure, not a real bug.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04 13:23:30 +02:00
Davide Caratti
1d17568e74 net/sched: cls_flower: add support for matching tunnel control flags
extend cls_flower to match TUNNEL_FLAGS_PRESENT bits in tunnel metadata.

Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04 11:16:38 +02:00
Davide Caratti
668b6a2ef8 flow_dissector: add support for tunnel control flags
Dissect [no]csum, [no]dontfrag, [no]oam, [no]crit flags from skb metadata.
This is a prerequisite for matching these control flags using TC flower.

Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04 11:16:38 +02:00
Jakub Kicinski
4fdb6b6063 net: count drops due to missing qdisc as dev->tx_drops
Catching and debugging missing qdiscs is pretty tricky. When qdisc
is deleted we replace it with a noop qdisc, which silently drops
all the packets. Since the noop qdisc has a single static instance
we can't count drops at the qdisc level. Count them as dev->tx_drops.

  ip netns add red
  ip link add type veth peer netns red
  ip            link set dev veth0 up
  ip -netns red link set dev veth0 up
  ip            a a dev veth0 10.0.0.1/24
  ip -netns red a a dev veth0 10.0.0.2/24
  ping -c 2 10.0.0.2
  #  2 packets transmitted, 2 received, 0% packet loss, time 1031ms
  ip -s link show dev veth0
  #  TX:  bytes packets errors dropped carrier collsns
  #        1314      17      0       0       0       0

  tc qdisc replace dev veth0 root handle 1234: mq
  tc qdisc replace dev veth0 parent 1234:1 pfifo
  tc qdisc del dev veth0 parent 1234:1
  ping -c 2 10.0.0.2
  #  2 packets transmitted, 0 received, 100% packet loss, time 1034ms
  ip -s link show dev veth0
  #  TX:  bytes packets errors dropped carrier collsns
  #        1314      17      0       3       0       0

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20240529162527.3688979-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04 10:39:31 +02:00
Guangguan Wang
2f4b101c54 net/smc: change SMCR_RMBE_SIZES from 5 to 15
SMCR_RMBE_SIZES is the upper boundary of SMC-R's snd_buf and rcv_buf.
The maximum bytes of snd_buf and rcv_buf can be calculated by 2^SMCR_
RMBE_SIZES * 16KB. SMCR_RMBE_SIZES = 5 means the upper boundary is 512KB.
TCP's snd_buf and rcv_buf max size is configured by net.ipv4.tcp_w/rmem[2]
whose default value is 4MB or 6MB, is much larger than SMC-R's upper
boundary.

In some scenarios, such as Recommendation System, the communication
pattern is mainly large size send/recv, where the size of snd_buf and
rcv_buf greatly affects performance. Due to the upper boundary
disadvantage, SMC-R performs poor than TCP in those scenarios. So it
is time to enlarge the upper boundary size of SMC-R's snd_buf and rcv_buf,
so that the SMC-R's snd_buf and rcv_buf can be configured to larger size
for performance gain in such scenarios.

The SMC-R rcv_buf's size will be transferred to peer by the field
rmbe_size in clc accept and confirm message. The length of the field
rmbe_size is four bits, which means the maximum value of SMCR_RMBE_SIZES
is 15. In case of frequently adjusting the value of SMCR_RMBE_SIZES
in different scenarios, set the value of SMCR_RMBE_SIZES to the maximum
value 15, which means the upper boundary of SMC-R's snd_buf and rcv_buf
is 512MB. As the real memory usage is determined by the value of
net.smc.w/rmem, not by the upper boundary, set the value of SMCR_RMBE_SIZES
to the maximum value has no side affects.

Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Co-developed-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-03 12:12:42 +01:00
Guangguan Wang
3ac14b9dfb net/smc: set rmb's SG_MAX_SINGLE_ALLOC limitation only when CONFIG_ARCH_NO_SG_CHAIN is defined
SG_MAX_SINGLE_ALLOC is used to limit maximum number of entries that
will be allocated in one piece of scatterlist. When the entries of
scatterlist exceeds SG_MAX_SINGLE_ALLOC, sg chain will be used. From
commit 7c703e54cc ("arch: switch the default on ARCH_HAS_SG_CHAIN"),
we can know that the macro CONFIG_ARCH_NO_SG_CHAIN is used to identify
whether sg chain is supported. So, SMC-R's rmb buffer should be limited
by SG_MAX_SINGLE_ALLOC only when the macro CONFIG_ARCH_NO_SG_CHAIN is
defined.

Fixes: a3fe3d01bd ("net/smc: introduce sg-logic for RMBs")
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Co-developed-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-03 12:12:41 +01:00
Kuniyuki Iwashima
b5c0898807 af_unix: Remove dead code in unix_stream_read_generic().
When splice() support was added in commit 2b514574f7 ("net:
af_unix: implement splice for stream af_unix sockets"), we had
to release unix_sk(sk)->readlock (current iolock) before calling
splice_to_pipe().

Due to the unlock, commit 73ed5d25dc ("af-unix: fix use-after-free
with concurrent readers while splicing") added a safeguard in
unix_stream_read_generic(); we had to bump the skb refcount before
calling ->recv_actor() and then check if the skb was consumed by a
concurrent reader.

However, the pipe side locking was refactored, and since commit
25869262ef ("skb_splice_bits(): get rid of callback"), we can
call splice_to_pipe() without releasing unix_sk(sk)->iolock.

Now, the skb is always alive after the ->recv_actor() callback,
so let's remove the unnecessary drop_skb logic.

This is mostly the revert of commit 73ed5d25dc ("af-unix: fix
use-after-free with concurrent readers while splicing").

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240529144648.68591-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-01 16:28:55 -07:00
Matteo Croce
19249c0724 net: make net.core.{r,w}mem_{default,max} namespaced
The following sysctl are global and can't be read from a netns:

net.core.rmem_default
net.core.rmem_max
net.core.wmem_default
net.core.wmem_max

Make the following sysctl parameters available readonly from within a
network namespace, allowing a container to read them.

Signed-off-by: Matteo Croce <teknoraver@meta.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://lore.kernel.org/r/20240530232722.45255-2-technoboy85@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-01 16:03:21 -07:00
Abhishek Chauhan
73451e9aaa net: validate SO_TXTIME clockid coming from userspace
Currently there are no strict checks while setting SO_TXTIME
from userspace. With the recent development in skb->tstamp_type
clockid with unsupported clocks results in warn_on_once, which causes
unnecessary aborts in some systems which enables panic on warns.

Add validation in setsockopt to support only CLOCK_REALTIME,
CLOCK_MONOTONIC and CLOCK_TAI to be set from userspace.

Link: https://lore.kernel.org/netdev/bc037db4-58bb-4861-ac31-a361a93841d3@linux.dev/
Link: https://lore.kernel.org/lkml/6bdba7b6-fd22-4ea5-a356-12268674def1@quicinc.com/
Fixes: 1693c5db6a ("net: Add additional bit to support clockid_t timestamp type")
Reported-by: syzbot+d7b227731ec589e7f4f0@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d7b227731ec589e7f4f0
Reported-by: syzbot+30a35a2e9c5067cc43fa@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=30a35a2e9c5067cc43fa
Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240529183130.1717083-1-quic_abchauha@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-01 15:47:23 -07:00
Jakub Kicinski
e19de2064f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/ti/icssg/icssg_classifier.c
  abd5576b9c ("net: ti: icssg-prueth: Add support for ICSSG switch firmware")
  56a5cf538c ("net: ti: icssg-prueth: Fix start counter for ft1 filter")
https://lore.kernel.org/all/20240531123822.3bb7eadf@canb.auug.org.au/

No other adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-31 14:10:28 -07:00
Hangbin Liu
a79d8fe2ff ipv6: sr: restruct ifdefines
There are too many ifdef in IPv6 segment routing code that may cause logic
problems. like commit 160e9d2752 ("ipv6: sr: fix invalid unregister error
path"). To avoid this, the init functions are redefined for both cases. The
code could be more clear after all fidefs are removed.

Suggested-by: Simon Horman <horms@kernel.org>
Suggested-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240529040908.3472952-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-30 18:29:38 -07:00
Linus Torvalds
d8ec19857b Including fixes from bpf and netfilter.
Current release - regressions:
 
   - gro: initialize network_offset in network layer
 
   - tcp: reduce accepted window in NEW_SYN_RECV state
 
 Current release - new code bugs:
 
   - eth: mlx5e: do not use ptp structure for tx ts stats when not initialized
 
   - eth: ice: check for unregistering correct number of devlink params
 
 Previous releases - regressions:
 
   - bpf: Allow delete from sockmap/sockhash only if update is allowed
 
   - sched: taprio: extend minimum interval restriction to entire cycle too
 
   - netfilter: ipset: add list flush to cancel_gc
 
   - ipv4: fix address dump when IPv4 is disabled on an interface
 
   - sock_map: avoid race between sock_map_close and sk_psock_put
 
   - eth: mlx5: use mlx5_ipsec_rx_status_destroy to correctly delete status rules
 
 Previous releases - always broken:
 
   - core: fix __dst_negative_advice() race
 
   - bpf:
     - fix multi-uprobe PID filtering logic
     - fix pkt_type override upon netkit pass verdict
 
   - netfilter: tproxy: bail out if IP has been disabled on the device
 
   - af_unix: annotate data-race around unix_sk(sk)->addr
 
   - eth: mlx5e: fix UDP GSO for encapsulated packets
 
   - eth: idpf: don't enable NAPI and interrupts prior to allocating Rx buffers
 
   - eth: i40e: fully suspend and resume IO operations in EEH case
 
   - eth: octeontx2-pf: free send queue buffers incase of leaf to inner
 
   - eth: ipvlan: dont Use skb->sk in ipvlan_process_v{4,6}_outbound
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmZYaP0SHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOk5+QP/3wc2ktY/whZvLyJyM6NsVl1DYohnjua
 H05bveXgUMd4NNxEfQ31IMGCct6d2fe+fAIJrefxdjxbjyY38SY5xd1zpXLQDxqB
 ks6T9vZ4ITgwpqWT5Z1XafIgV/bYlf42+GHUIPuFFlBisoUqkAm7Wzw/T+Ap3rVX
 7Y2p7ulvdh85GyMGsAi5Bz9EkyiSQUsMvbtGOA9a9WopIyqoxTgV5Unk1L/FXlEU
 ZO8L7hrwZKWL1UDlaqnfESD9DBEbNc85WRoagFM4EdHl8vTwxwvTQ6+SDMtLO8jW
 8DSeb9CCin/VagqPhrylj5u72QGz+i7gDUMZIZVU6mHJc8WB13tIflOq0qKLnfNE
 n63/4zu9kWCznb7IKqg99mo1+bDcg1fyZusih+aguCGNYEQ/yrAf5ll2OMfjmZWa
 FFOuaVoLmN0f6XMb4L38Wwd9obvC3EbpnNveco3lmTp+4kRk1H/Ox2UI2jaFbUnG
 Nim4LZD4iGXJh1qnnQ0xkTjrltFAvnY9zUwo2Yv7TUQOi0JAXxsZwXwY6UjsiNrC
 QWdKL5VcdI0N1Y1MrmpQQKpRE9Lu1dTvbIRvFtQHmWgV7gqwTmShoSARBL1IM+lp
 tm+jfZOmznjYTaVnc1xnBCaIqs925gvnkniZpzru53xb5UegenadNXvQtYlaAokJ
 j13QKA6NrZVI
 =xkIZ
 -----END PGP SIGNATURE-----

Merge tag 'net-6.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from bpf and netfilter.

  Current release - regressions:

   - gro: initialize network_offset in network layer

   - tcp: reduce accepted window in NEW_SYN_RECV state

  Current release - new code bugs:

   - eth: mlx5e: do not use ptp structure for tx ts stats when not
     initialized

   - eth: ice: check for unregistering correct number of devlink params

  Previous releases - regressions:

   - bpf: Allow delete from sockmap/sockhash only if update is allowed

   - sched: taprio: extend minimum interval restriction to entire cycle
     too

   - netfilter: ipset: add list flush to cancel_gc

   - ipv4: fix address dump when IPv4 is disabled on an interface

   - sock_map: avoid race between sock_map_close and sk_psock_put

   - eth: mlx5: use mlx5_ipsec_rx_status_destroy to correctly delete
     status rules

  Previous releases - always broken:

   - core: fix __dst_negative_advice() race

   - bpf:
       - fix multi-uprobe PID filtering logic
       - fix pkt_type override upon netkit pass verdict

   - netfilter: tproxy: bail out if IP has been disabled on the device

   - af_unix: annotate data-race around unix_sk(sk)->addr

   - eth: mlx5e: fix UDP GSO for encapsulated packets

   - eth: idpf: don't enable NAPI and interrupts prior to allocating Rx
     buffers

   - eth: i40e: fully suspend and resume IO operations in EEH case

   - eth: octeontx2-pf: free send queue buffers incase of leaf to inner

   - eth: ipvlan: dont Use skb->sk in ipvlan_process_v{4,6}_outbound"

* tag 'net-6.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
  netdev: add qstat for csum complete
  ipvlan: Dont Use skb->sk in ipvlan_process_v{4,6}_outbound
  net: ena: Fix redundant device NUMA node override
  ice: check for unregistering correct number of devlink params
  ice: fix 200G PHY types to link speed mapping
  i40e: Fully suspend and resume IO operations in EEH case
  i40e: factoring out i40e_suspend/i40e_resume
  e1000e: move force SMBUS near the end of enable_ulp function
  net: dsa: microchip: fix RGMII error in KSZ DSA driver
  ipv4: correctly iterate over the target netns in inet_dump_ifaddr()
  net: fix __dst_negative_advice() race
  nfc/nci: Add the inconsistency check between the input data length and count
  MAINTAINERS: dwmac: starfive: update Maintainer
  net/sched: taprio: extend minimum interval restriction to entire cycle too
  net/sched: taprio: make q->picos_per_byte available to fill_sched_entry()
  netfilter: nft_fib: allow from forward/input without iif selector
  netfilter: tproxy: bail out if IP has been disabled on the device
  netfilter: nft_payload: skbuff vlan metadata mangle support
  net: ti: icssg-prueth: Fix start counter for ft1 filter
  sock_map: avoid race between sock_map_close and sk_psock_put
  ...
2024-05-30 08:33:04 -07:00
Paolo Abeni
e889eb17f4 netfilter pull request 24-05-29
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmZWXboACgkQ1V2XiooU
 IOQW/BAAld3NcyYFzwhUmiAcig7jvYkIZwGE7wBVPz82G+gC4n8k+cIt50XlhTQt
 lctjKHI7y96hWqlAPbu96lRn/KFgFCCRF3sACpJ/rR5rLn+hLBgIiXKZohLoAkkA
 zoSfpHXQo8u8tGmdNJiweKNZbJWXj4Wmaj3SLqbf/0mMyIQUigZ+4EHDHWGP8JvQ
 6ilcEndEy9IBD3M5jIE3ubPwbrXwBQjaGRgT+e7yLgNDP8FEg3IeWcPW5hMtPEW3
 mizkxamzdNzhaBH2U33hKvoQv0q9QmtcBa8angvx9OCEDRa9c/RpfIYwj8SsdQW1
 kdzSAMNpYho+lRmq708G02vYQya5+wfof9NlJHrRqPj4BwFpj53sGm11ksfrYkId
 WSJj4WAAcv4zNavS3jQD8SeGrA7bxyvyfO6JUSTEThyMYWvN0s8erGNZYa3Mk1AD
 PH3WqMp8WV6EtG4+jQymvLcJ6HrzncdNeBdmbvVhGa9xPGaDIQFs3c4WRRjRu9jd
 0cPPITOtb0cOGbt/w9jK1ot7wcm00uwQ1xcFlJO88I7dkgqHxYEsQ8gUC+cJoqL6
 s2oPVKMvYzJnyhkXumP1/w3hqkk9EJq5pZZiu47UHbtfT3HhQRBPNydwTvB3Mj30
 X8rczN2sFFcDp1KkMQF6C8tQZSXIEEAf5XTUm+q69MmuiT0lrn8=
 =cexP
 -----END PGP SIGNATURE-----

Merge tag 'nf-24-05-29' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patch #1 syzbot reports that nf_reinject() could be called without
         rcu_read_lock() when flushing pending packets at nfnetlink
         queue removal, from Eric Dumazet.

Patch #2 flushes ipset list:set when canceling garbage collection to
         reference to other lists to fix a race, from Jozsef Kadlecsik.

Patch #3 restores q-in-q matching with nft_payload by reverting
         f6ae9f120d ("netfilter: nft_payload: add C-VLAN support").

Patch #4 fixes vlan mangling in skbuff when vlan offload is present
         in skbuff, without this patch nft_payload corrupts packets
         in this case.

Patch #5 fixes possible nul-deref in tproxy no IP address is found in
         netdevice, reported by syzbot and patch from Florian Westphal.

Patch #6 removes a superfluous restriction which prevents loose fib
         lookups from input and forward hooks, from Eric Garver.

My assessment is that patches #1, #2 and #5 address possible kernel
crash, anything else in this batch fixes broken features.

netfilter pull request 24-05-29

* tag 'nf-24-05-29' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_fib: allow from forward/input without iif selector
  netfilter: tproxy: bail out if IP has been disabled on the device
  netfilter: nft_payload: skbuff vlan metadata mangle support
  netfilter: nft_payload: restore vlan q-in-q match support
  netfilter: ipset: Add list flush to cancel_gc
  netfilter: nfnetlink_queue: acquire rcu_read_lock() in instance_destroy_rcu()
====================

Link: https://lore.kernel.org/r/20240528225519.1155786-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-30 10:14:56 +02:00
Alexander Mikhalitsyn
b8c8abefc0 ipv4: correctly iterate over the target netns in inet_dump_ifaddr()
A recent change to inet_dump_ifaddr had the function incorrectly iterate
over net rather than tgt_net, resulting in the data coming for the
incorrect network namespace.

Fixes: cdb2f80f1c ("inet: use xa_array iterator to implement inet_dump_ifaddr()")
Reported-by: Stéphane Graber <stgraber@stgraber.org>
Closes: https://github.com/lxc/incus/issues/892
Bisected-by: Stéphane Graber <stgraber@stgraber.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Tested-by: Stéphane Graber <stgraber@stgraber.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240528203030.10839-1-aleksandr.mikhalitsyn@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-29 18:43:42 -07:00
Russell King (Oracle)
bbb31b7ae1 net: dsa: remove mac_prepare()/mac_finish() shims
No DSA driver makes use of the mac_prepare()/mac_finish() shimmed
operations anymore, so we can remove these.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/E1sByNx-00ELW1-Vp@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-29 18:41:15 -07:00
Eric Dumazet
92f1655aa2 net: fix __dst_negative_advice() race
__dst_negative_advice() does not enforce proper RCU rules when
sk->dst_cache must be cleared, leading to possible UAF.

RCU rules are that we must first clear sk->sk_dst_cache,
then call dst_release(old_dst).

Note that sk_dst_reset(sk) is implementing this protocol correctly,
while __dst_negative_advice() uses the wrong order.

Given that ip6_negative_advice() has special logic
against RTF_CACHE, this means each of the three ->negative_advice()
existing methods must perform the sk_dst_reset() themselves.

Note the check against NULL dst is centralized in
__dst_negative_advice(), there is no need to duplicate
it in various callbacks.

Many thanks to Clement Lecigne for tracking this issue.

This old bug became visible after the blamed commit, using UDP sockets.

Fixes: a87cb3e48e ("net: Facility to report route quality of connected sockets")
Reported-by: Clement Lecigne <clecigne@google.com>
Diagnosed-by: Clement Lecigne <clecigne@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240528114353.1794151-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-29 17:34:49 -07:00
Eric Dumazet
fde6f897f2 tcp: fix races in tcp_v[46]_err()
These functions have races when they:

1) Write sk->sk_err
2) call sk_error_report(sk)
3) call tcp_done(sk)

As described in prior patches in this series:

An smp_wmb() is missing.
We should call tcp_done() before sk_error_report(sk)
to have consistent tcp_poll() results on SMP hosts.

Use tcp_done_with_error() where we centralized the
correct sequence.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20240528125253.1966136-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-29 17:21:36 -07:00
Eric Dumazet
5ce4645c23 tcp: fix races in tcp_abort()
tcp_abort() has the same issue than the one fixed in the prior patch
in tcp_write_err().

In order to get consistent results from tcp_poll(), we must call
sk_error_report() after tcp_done().

We can use tcp_done_with_error() to centralize this logic.

Fixes: c1e64e298b ("net: diag: Support destroying TCP sockets.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20240528125253.1966136-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-29 17:21:35 -07:00
Eric Dumazet
853c3bd7b7 tcp: fix race in tcp_write_err()
I noticed flakes in a packetdrill test, expecting an epoll_wait()
to return EPOLLERR | EPOLLHUP on a failed connect() attempt,
after multiple SYN retransmits. It sometimes return EPOLLERR only.

The issue is that tcp_write_err():
 1) writes an error in sk->sk_err,
 2) calls sk_error_report(),
 3) then calls tcp_done().

tcp_done() is writing SHUTDOWN_MASK into sk->sk_shutdown,
among other things.

Problem is that the awaken user thread (from 2) sk_error_report())
might call tcp_poll() before tcp_done() has written sk->sk_shutdown.

tcp_poll() only sees a non zero sk->sk_err and returns EPOLLERR.

This patch fixes the issue by making sure to call sk_error_report()
after tcp_done().

tcp_write_err() also lacks an smp_wmb().

We can reuse tcp_done_with_error() to factor out the details,
as Neal suggested.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20240528125253.1966136-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-29 17:21:35 -07:00
Eric Dumazet
5e514f1cba tcp: add tcp_done_with_error() helper
tcp_reset() ends with a sequence that is carefuly ordered.

We need to fix [e]poll bugs in the following patches,
it makes sense to use a common helper.

Suggested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20240528125253.1966136-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-29 17:21:35 -07:00
Linus Torvalds
397a83ab97 Two fixes headed to stable trees:
- some trace event was dumping uninitialized values
 - a missing lock somewhere that was thought to have exclusive access,
 and it turned out not to
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmZWgYUACgkQq06b7GqY
 5nD+QQ/+JODfSn9l9JyHgXco0mIpQeldFUYoHPv3UwHr14MF8nux5HjzsupviAik
 vHBV5C2v6nOgAZWHpX4Rz+EaMNgjIwL2f0wLZMYh1Ho+lLr6+G0fN5iN3vHWmE4C
 w90qstKKhWf493pW+65IzzFp55vG7PPG8S81ZqbdxpdgMoBVpdtXDjedPOf9uzFi
 hkfGYWlmbrqkJ8pW4cvnlBkcraKgDDQndTRG4AQLtiLctpDk8/n95KeJpYZvgxX8
 30Vu09QjgFzTGur/QFdB8UC0ZEaDALtSKfBDjVwTZBvxA1uM6S1v2Ll6wiufvJ2H
 gTPtSwZ7CP601NDFdNmtDIsrJSp617d9xjBzFPIwJmX8tJplzy6sKYuVB0xe+gic
 4u3xK2I60H5D1Fw0dpWhW4MdgHkyKcEOb+EJ2zj3SmosgmOvLb7hDZ81Vc6FH4SX
 oLmMIj99Ks8U+TTZvY2lt51wxCXYaHF93feIOKnDEa7dF8gYy3/+C/0ztWbE+csF
 xqy7iIB2HWhN8/jtIOruiQlcx4JBJr2eZd+Vw/mCWhVvLXA5zbdAqKXEEBFNSyNk
 RXnk5KnlpgSoNec/z4lv1RJRidwic7TBkA6Q3/cgUuP39SoF4AnS0qcuUbivjKhc
 8RTInCO/iaruMJEZdlksFUCRo9iJQL/DdM4F9elX+DHL15TpZ+E=
 =7DGy
 -----END PGP SIGNATURE-----

Merge tag '9p-for-6.10-rc2' of https://github.com/martinetd/linux

Pull 9p fixes from Dominique Martinet:
 "Two fixes headed to stable trees:

   - a trace event was dumping uninitialized values

   - a missing lock that was thought to have exclusive access, and it
     turned out not to"

* tag '9p-for-6.10-rc2' of https://github.com/martinetd/linux:
  9p: add missing locking around taking dentry fid list
  net/9p: fix uninit-value in p9_client_rpc()
2024-05-29 09:25:15 -07:00
Thomas Weißschuh
0a9f788fdd ipvs: constify ctl_table arguments of utility functions
The sysctl core is preparing to only expose instances of
struct ctl_table as "const".
This will also affect the ctl_table argument of sysctl handlers.

As the function prototype of all sysctl handlers throughout the tree
needs to stay consistent that change will be done in one commit.

To reduce the size of that final commit, switch utility functions which
are not bound by "typedef proc_handler" to "const struct ctl_table".

No functional change.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20240527-sysctl-const-handler-net-v1-5-16523767d0b2@weissschuh.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-28 19:49:47 -07:00
Thomas Weißschuh
7a20cd1e71 net/ipv6/ndisc: constify ctl_table arguments of utility function
The sysctl core is preparing to only expose instances of
struct ctl_table as "const".
This will also affect the ctl_table argument of sysctl handlers.

As the function prototype of all sysctl handlers throughout the tree
needs to stay consistent that change will be done in one commit.

To reduce the size of that final commit, switch utility functions which
are not bound by "typedef proc_handler" to "const struct ctl_table".

No functional change.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20240527-sysctl-const-handler-net-v1-4-16523767d0b2@weissschuh.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-28 19:49:47 -07:00
Thomas Weißschuh
c55eb03765 net/ipv6/addrconf: constify ctl_table arguments of utility functions
The sysctl core is preparing to only expose instances of
struct ctl_table as "const".
This will also affect the ctl_table argument of sysctl handlers.

As the function prototype of all sysctl handlers throughout the tree
needs to stay consistent that change will be done in one commit.

To reduce the size of that final commit, switch utility functions which
are not bound by "typedef proc_handler" to "const struct ctl_table".

No functional change.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20240527-sysctl-const-handler-net-v1-3-16523767d0b2@weissschuh.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-28 19:49:47 -07:00
Thomas Weißschuh
551814313f net/ipv4/sysctl: constify ctl_table arguments of utility functions
The sysctl core is preparing to only expose instances of
struct ctl_table as "const".
This will also affect the ctl_table argument of sysctl handlers.

As the function prototype of all sysctl handlers throughout the tree
needs to stay consistent that change will be done in one commit.

To reduce the size of that final commit, switch utility functions which
are not bound by "typedef proc_handler" to "const struct ctl_table".

No functional change.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20240527-sysctl-const-handler-net-v1-2-16523767d0b2@weissschuh.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-28 19:49:47 -07:00
Thomas Weißschuh
874aa96d78 net/neighbour: constify ctl_table arguments of utility function
The sysctl core is preparing to only expose instances of
struct ctl_table as "const".
This will also affect the ctl_table argument of sysctl handlers.

As the function prototype of all sysctl handlers throughout the tree
needs to stay consistent that change will be done in one commit.

To reduce the size of that final commit, switch utility functions which
are not bound by "typedef proc_handler" to "const struct ctl_table".

No functional change.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20240527-sysctl-const-handler-net-v1-1-16523767d0b2@weissschuh.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-28 19:49:47 -07:00
Vladimir Oltean
fb66df20a7 net/sched: taprio: extend minimum interval restriction to entire cycle too
It is possible for syzbot to side-step the restriction imposed by the
blamed commit in the Fixes: tag, because the taprio UAPI permits a
cycle-time different from (and potentially shorter than) the sum of
entry intervals.

We need one more restriction, which is that the cycle time itself must
be larger than N * ETH_ZLEN bit times, where N is the number of schedule
entries. This restriction needs to apply regardless of whether the cycle
time came from the user or was the implicit, auto-calculated value, so
we move the existing "cycle == 0" check outside the "if "(!new->cycle_time)"
branch. This way covers both conditions and scenarios.

Add a selftest which illustrates the issue triggered by syzbot.

Fixes: b5b73b26b3 ("taprio: Fix allowing too small intervals")
Reported-by: syzbot+a7d2b1d5d1af83035567@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/0000000000007d66bc06196e7c66@google.com/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20240527153955.553333-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-28 19:46:41 -07:00
Vladimir Oltean
e634134180 net/sched: taprio: make q->picos_per_byte available to fill_sched_entry()
In commit b5b73b26b3 ("taprio: Fix allowing too small intervals"), a
comparison of user input against length_to_duration(q, ETH_ZLEN) was
introduced, to avoid RCU stalls due to frequent hrtimers.

The implementation of length_to_duration() depends on q->picos_per_byte
being set for the link speed. The blamed commit in the Fixes: tag has
moved this too late, so the checks introduced above are ineffective.
The q->picos_per_byte is zero at parse_taprio_schedule() ->
parse_sched_list() -> parse_sched_entry() -> fill_sched_entry() time.

Move the taprio_set_picos_per_byte() call as one of the first things in
taprio_change(), before the bulk of the netlink attribute parsing is
done. That's because it is needed there.

Add a selftest to make sure the issue doesn't get reintroduced.

Fixes: 09dbdf28f9 ("net/sched: taprio: fix calculation of maximum gate durations")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240527153955.553333-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-28 19:46:41 -07:00
Eric Garver
e8ded22ef0 netfilter: nft_fib: allow from forward/input without iif selector
This removes the restriction of needing iif selector in the
forward/input hooks for fib lookups when requested result is
oif/oifname.

Removing this restriction allows "loose" lookups from the forward hooks.

Fixes: be8be04e5d ("netfilter: nft_fib: reverse path filter for policy-based routing on iif")
Signed-off-by: Eric Garver <eric@garver.life>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-29 00:37:51 +02:00
Florian Westphal
21a673bddc netfilter: tproxy: bail out if IP has been disabled on the device
syzbot reports:
general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
[..]
RIP: 0010:nf_tproxy_laddr4+0xb7/0x340 net/ipv4/netfilter/nf_tproxy_ipv4.c:62
Call Trace:
 nft_tproxy_eval_v4 net/netfilter/nft_tproxy.c:56 [inline]
 nft_tproxy_eval+0xa9a/0x1a00 net/netfilter/nft_tproxy.c:168

__in_dev_get_rcu() can return NULL, so check for this.

Reported-and-tested-by: syzbot+b94a6818504ea90d7661@syzkaller.appspotmail.com
Fixes: cc6eb43385 ("tproxy: use the interface primary IP address as a default value for --on-ip")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-29 00:37:51 +02:00
Pablo Neira Ayuso
33c563ebf8 netfilter: nft_payload: skbuff vlan metadata mangle support
Userspace assumes vlan header is present at a given offset, but vlan
offload allows to store this in metadata fields of the skbuff. Hence
mangling vlan results in a garbled packet. Handle this transparently by
adding a parser to the kernel.

If vlan metadata is present and payload offset is over 12 bytes (source
and destination mac address fields), then subtract vlan header present
in vlan metadata, otherwise mangle vlan metadata based on offset and
length, extracting data from the source register.

This is similar to:

  8cfd23e674 ("netfilter: nft_payload: work around vlan header stripping")

to deal with vlan payload mangling.

Fixes: 7ec3f7b47b ("netfilter: nft_payload: add packet mangling support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-29 00:35:32 +02:00