linux

Author	SHA1	Message	Date
Li Zhong	05cd823863	ethtool: tunnels: check the return value of nla_nest_start() Check the return value of nla_nest_start(). When starting the entry level nested attributes, if the tailroom of socket buffer is insufficient to store the attribute header and payload, the return value will be NULL. There is, however, no real bug here since if the skb is full nla_put_be16() will fail as well and we'll error out. Signed-off-by: Li Zhong <floridsleeves@gmail.com> Link: https://lore.kernel.org/r/20220921181716.1629541-1-floridsleeves@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-22 19:28:10 -07:00
Zhengchao Shao	e046fa895c	net/sched: use tc_qdisc_stats_dump() in qdisc use tc_qdisc_stats_dump() in qdisc. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-22 17:34:10 -07:00
Vladimir Oltean	a2c2a4ddc2	net/sched: taprio: remove unnecessary taprio_list_lock The 3 functions that want access to the taprio_list: taprio_dev_notifier(), taprio_destroy() and taprio_init() are all called with the rtnl_mutex held, therefore implicitly serialized with respect to each other. A spin lock serves no purpose. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Link: https://lore.kernel.org/r/20220921095632.1379251-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-22 17:28:51 -07:00
Gal Pressman	56e5a6d3aa	net/tls: Support 256 bit keys with TX device offload Add the missing clause for 256 bit keys in tls_set_device_offload(), and the needed adjustments in tls_device_fallback.c. Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-22 17:27:42 -07:00
Gal Pressman	ea7a9d88ba	net/tls: Use cipher sizes structs Use the newly introduced cipher sizes structs instead of the repeated switch cases churn. Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-22 17:27:42 -07:00
Tariq Toukan	2d2c5ea242	net/tls: Describe ciphers sizes by const structs Introduce a cipher sizes descriptor. It helps reduce code duplication and the repeated switch/case statements that assign the proper sizes according to the cipher type. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-22 17:27:41 -07:00
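The descriptor idea in the commit above can be sketched as a const table indexed by cipher type. This is a minimal illustration under stated assumptions, not the kernel's code: the enum, struct, and function names are hypothetical, and the byte sizes are the usual AES-GCM record parameters.

```c
#include <stddef.h>

/* Hypothetical cipher identifiers, for illustration only. */
enum cipher_type {
    CIPHER_AES_GCM_128,
    CIPHER_AES_GCM_256,
    CIPHER_TYPE_COUNT
};

/* All per-cipher sizes (in bytes) live in one descriptor. */
struct cipher_size_desc {
    unsigned int iv;
    unsigned int key;
    unsigned int salt;
    unsigned int tag;
    unsigned int rec_seq;
};

/* One table entry per cipher replaces a switch/case per size. */
static const struct cipher_size_desc cipher_sizes[CIPHER_TYPE_COUNT] = {
    [CIPHER_AES_GCM_128] = { .iv = 8, .key = 16, .salt = 4, .tag = 16, .rec_seq = 8 },
    [CIPHER_AES_GCM_256] = { .iv = 8, .key = 32, .salt = 4, .tag = 16, .rec_seq = 8 },
};

/* Returns the descriptor, or NULL for an unknown cipher type. */
static const struct cipher_size_desc *cipher_size_lookup(enum cipher_type type)
{
    if ((unsigned int)type >= CIPHER_TYPE_COUNT)
        return NULL;
    return &cipher_sizes[type];
}
```

With a table like this, adding a cipher means adding one entry rather than touching every switch statement that needs a size.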
Jakub Kicinski	0140a7168f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net drivers/net/ethernet/freescale/fec.h `7b15515fc1` ("Revert "fec: Restart PPS after link state change"") `40c79ce13b` ("net: fec: add stop mode support for imx8 platform") https://lore.kernel.org/all/20220921105337.62b41047@canb.auug.org.au/ drivers/pinctrl/pinctrl-ocelot.c `c297561bc9` ("pinctrl: ocelot: Fix interrupt controller") `181f604b33` ("pinctrl: ocelot: add ability to be used in a non-mmio configuration") https://lore.kernel.org/all/20220921110032.7cd28114@canb.auug.org.au/ tools/testing/selftests/drivers/net/bonding/Makefile `bbb774d921` ("net: Add tests for bonding and team address list management") `152e8ec776` ("selftests/bonding: add a test for bonding lladdr target") https://lore.kernel.org/all/20220921110437.5b7dbd82@canb.auug.org.au/ drivers/net/can/usb/gs_usb.c `5440428b3d` ("can: gs_usb: gs_can_open(): fix race dev->can.state condition") `45dfa45f52` ("can: gs_usb: add RX and TX hardware timestamp support") https://lore.kernel.org/all/84f45a7d-92b6-4dc5-d7a1-072152fab6ff@tessares.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-22 13:02:10 -07:00
Linus Torvalds	504c25cb76	Merge tag 'net-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from wifi, netfilter and can. A handful of awaited fixes here - revert of the FEC changes, bluetooth fix, fixes for iwlwifi spew. We added a warning in PHY/MDIO code which is triggering on a couple of platforms in a false-positive-ish way. If we can't iron that out over the week we'll drop it and re-add for 6.1. I've added a new "follow up fixes" section for fixes to fixes in 6.0-rcs but it may actually give the false impression that those are problematic or that more testing time would have caught them. So likely a one time thing. Follow up fixes: - nf_tables_addchain: fix nft_counters_enabled underflow - ebtables: fix memory leak when blob is malformed - nf_ct_ftp: fix deadlock when nat rewrite is needed Current release - regressions: - Revert "fec: Restart PPS after link state change" and the related "net: fec: Use a spinlock to guard `fep->ptp_clk_on`" - Bluetooth: fix HCIGETDEVINFO regression - wifi: mt76: fix 5 GHz connection regression on mt76x0/mt76x2 - mptcp: fix fwd memory accounting on coalesce - rwlock removal fall out: - ipmr: always call ip{,6}_mr_forward() from RCU read-side critical section - ipv6: fix crash when IPv6 is administratively disabled - tcp: read multiple skbs in tcp_read_skb() - mdio_bus_phy_resume state warning fallout: - eth: ravb: fix PHY state warning splat during system resume - eth: sh_eth: fix PHY state warning splat during system resume Current release - new code bugs: - wifi: iwlwifi: don't spam logs with NSS>2 messages - eth: mtk_eth_soc: enable XDP support just for MT7986 SoC Previous releases - regressions: - bonding: fix NULL deref in bond_rr_gen_slave_id - wifi: iwlwifi: mark IWLMEI as broken Previous releases - always broken: - nf_conntrack helpers: - irc: tighten matching on DCC message - sip: fix ct_sip_walk_headers - osf: fix possible bogus match in nf_osf_find() - ipvlan: fix out-of-bound bugs caused by unset skb->mac_header - core: fix flow symmetric hash - bonding, team: unsync device addresses on ndo_stop - phy: micrel: fix shared interrupt on LAN8814" * tag 'net-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits) selftests: forwarding: add shebang for sch_red.sh bnxt: prevent skb UAF after handing over to PTP worker net: marvell: Fix refcounting bugs in prestera_port_sfp_bind() net: sched: fix possible refcount leak in tc_new_tfilter() net: sunhme: Fix packet reception for len < RX_COPY_THRESHOLD udp: Use WARN_ON_ONCE() in udp_read_skb() selftests: bonding: cause oops in bond_rr_gen_slave_id bonding: fix NULL deref in bond_rr_gen_slave_id net: phy: micrel: fix shared interrupt on LAN8814 net/smc: Stop the CLC flow if no link to map buffers on ice: Fix ice_xdp_xmit() when XDP TX queue number is not sufficient net: atlantic: fix potential memory leak in aq_ndev_close() can: gs_usb: gs_usb_set_phys_id(): return with error if identify is not supported can: gs_usb: gs_can_open(): fix race dev->can.state condition can: flexcan: flexcan_mailbox_read() fix return value for drop = true net: sh_eth: Fix PHY state warning splat during system resume net: ravb: Fix PHY state warning splat during system resume netfilter: nf_ct_ftp: fix deadlock when nat rewrite is needed netfilter: ebtables: fix memory leak when blob is malformed netfilter: nf_tables: fix percpu memory leak at nf_tables_addchain() ...	2022-09-22 10:58:13 -07:00
Hangyu Hua	c2e1cfefca	net: sched: fix possible refcount leak in tc_new_tfilter() tfilter_put() needs to be called to release the refcount taken by tp->ops->get, to avoid a possible refcount leak when chain->tmplt_ops != NULL and chain->tmplt_ops != tp->ops. Fixes: `7d5509fa0d` ("net: sched: extend proto ops with 'put' callback") Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Link: https://lore.kernel.org/r/20220921092734.31700-1-hbh25y@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-22 07:04:47 -07:00
Peilin Ye	db39dfdc1c	udp: Use WARN_ON_ONCE() in udp_read_skb() Prevent udp_read_skb() from flooding the syslog. Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Link: https://lore.kernel.org/r/20220921005915.2697-1-yepeilin.cs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-22 06:42:57 -07:00
Tony Lu	0227f058aa	net/smc: Unbind r/w buffer size from clcsock and make them tunable Currently, SMC uses smc->sk.sk_{rcv|snd}buf to size the send buffer and the RMB, and those values come from tcp_{w|r}mem in clcsock. The buffer size from the TCP socket doesn't fit SMC well. SMC-R/-D buffers are usually larger than TCP buffers to get higher performance, because they use different underlying devices and paths. So this patch unbinds the buffer size from TCP and introduces two sysctl knobs to tune them independently. These knobs are per net namespace and work for containers. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-09-22 12:58:21 +02:00
Wen Gu	77eee32514	net/smc: Introduce a specific sysctl for TEST_LINK time SMC-R tests the viability of a link by sending TEST_LINK LLC messages over the RoCE fabric when connections on the link have been idle for longer than the keepalive interval (testlink time). Using tcp_keepalive_time as the testlink time is not quite suitable, because its default is no less than two hours[1], which is too long for a single link to detect a dead peer. The active host will keep sending messages over the peer-dead link (QP) and won't find out until it gets IB_WC_RETRY_EXC_ERR error CQEs, which normally takes longer than the TEST_LINK timeout (SMC_LLC_WAIT_TIME). So this patch introduces an independent sysctl for SMC-R to set the link keepalive time, in order to detect link down in time. The default value is 30 seconds. [1] https://www.rfc-editor.org/rfc/rfc1122#page-101 Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-09-22 12:58:21 +02:00
Wen Gu	e738455b2c	net/smc: Stop the CLC flow if no link to map buffers on There might be a potential race between SMC-R buffer map and link group termination, with smc_smcr_terminate_all() running concurrently with smc_connect_rdma(): smc_connect_rdma() calls smc_conn_create(); smc_smcr_terminate_all() then schedules all links in smcibdev down; smc_buf_create() calls smcr_buf_map_usable_links(), which finds no usable links (rmb->mr = NULL); finally smc_clc_send_confirm() accesses conn->rmb_desc->mr[]->rkey and panics. During reboot and IB device module removal, all links will be set down and no usable links remain in link groups. In such a situation smcr_buf_map_usable_links() should return an error and stop the CLC flow from accessing the uninitialized mr. Fixes: `b9247544c1` ("net/smc: convert static link ID instances to support multiple links") Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Link: https://lore.kernel.org/r/1663656189-32090-1-git-send-email-guwen@linux.alibaba.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-09-22 12:53:53 +02:00
Jakub Kicinski	7a5d48c446	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter patches for net-next Remove GPL license copypastry in uapi files, those have SPDX tags. From Christophe Jaillet. Remove unused variable in rpfilter, from Guillaume Nault. Rework gc resched delay computation in conntrack, from Antoine Tenart. * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: rpfilter: Remove unused variable 'ret'. headers: Remove some left-over license text in include/uapi/linux/netfilter/ netfilter: conntrack: revisit the gc initial rescheduling bias netfilter: conntrack: fix the gc rescheduling delay ==================== Link: https://lore.kernel.org/r/20220921095000.29569-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-21 18:42:54 -07:00
Qingqing Yang	9f87eb4246	flow_dissector: Do not count vlan tags inside tunnel payload When there is a vlan tag inside a GRE encapsulation, the match on num_of_vlans fails, because the vlan tag inside the GRE payload is counted into num_of_vlans, which is not expected. One example packet is like this: Ethernet II, Src: Broadcom_68:56:07 (00:10:18:68:56:07) Dst: Broadcom_68:56:08 (00:10:18:68:56:08) 802.1Q Virtual LAN, PRI: 0, DEI: 0, ID: 100 Internet Protocol Version 4, Src: 192.168.1.4, Dst: 192.168.1.200 Generic Routing Encapsulation (Transparent Ethernet bridging) Ethernet II, Src: Broadcom_68:58:07 (00:10:18:68:58:07) Dst: Broadcom_68:58:08 (00:10:18:68:58:08) 802.1Q Virtual LAN, PRI: 0, DEI: 0, ID: 200 ... It should match the (num_of_vlans 1) rule, but it matches the (num_of_vlans 2) rule. The vlan tags inside the GRE or other tunnel encapsulated payload should not be counted in num_of_vlans. The fix is to stop counting vlan tags once the encapsulation bit is set. Fixes: `34951fcf26` ("flow_dissector: Add number of vlan tags dissector") Signed-off-by: Qingqing Yang <qingqing.yang@broadcom.com> Reviewed-by: Boris Sukholitko <boris.sukholitko@broadcom.com> Link: https://lore.kernel.org/r/20220919074808.136640-1-qingqing.yang@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-21 18:36:57 -07:00
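The counting rule described in the commit above can be sketched in isolation. This is only an illustration, assuming a packet already parsed into a list of header kinds, outermost first; the enum and function names are hypothetical and do not correspond to the flow dissector's real data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical header kinds for a parsed packet, outermost first. */
enum hdr_kind { HDR_ETH, HDR_VLAN, HDR_IPV4, HDR_GRE };

/*
 * Count VLAN tags, ignoring any that appear after a tunnel header
 * (the "encapsulation bit" in the commit message).
 */
static unsigned int count_outer_vlans(const enum hdr_kind *hdrs, size_t n)
{
    bool encapsulated = false;
    unsigned int num_of_vlans = 0;

    for (size_t i = 0; i < n; i++) {
        if (hdrs[i] == HDR_GRE)
            encapsulated = true;          /* everything after this is payload */
        else if (hdrs[i] == HDR_VLAN && !encapsulated)
            num_of_vlans++;               /* only outer tags are counted */
    }
    return num_of_vlans;
}
```

For the example packet (Ethernet, VLAN 100, IPv4, GRE, Ethernet, VLAN 200) this counts one tag, so the (num_of_vlans 1) rule is the one that matches.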
Jamal Hadi Salim	1d14b30b5a	net: sched: remove unused tcf_result extension Added by: commit `e5cf1baf92` ("act_mirred: use TC_ACT_REINSERT when possible") but no longer useful. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20220919130627.3551233-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-21 18:32:33 -07:00
William Dean	2801f30e2c	net: sched: simplify code in mall_reoffload such expression: if (err) return err; return 0; can simplify to: return err; Signed-off-by: William Dean <williamsukatube@163.com> Link: https://lore.kernel.org/r/20220917063556.2673-1-williamsukatube@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-21 18:22:04 -07:00
Ziyang Xuan	63b7c2ebcc	net/af_packet: registration process optimization in packet_init() Now, register_pernet_subsys() and register_netdevice_notifier() are both called after sock_register(). A PF_PACKET socket can be created and used as soon as sock_register() succeeds, so a PF_PACKET socket may be created before register_pernet_subsys() and register_netdevice_notifier() have run. In that case net->packet.sklist_lock and net->packet.sklist are accessed before packet_net_init() has initialized them, although this is a low-probability scenario. Move register_pernet_subsys() and register_netdevice_notifier() to the front of packet_init(). Correspondingly, adjust the unregister order in packet_exit(). Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-09-21 12:59:22 +01:00
Jinpeng Cui	2a566f0148	net: sched: act_ct: remove redundant variable err Return value directly from pskb_trim_rcsum() instead of getting value from redundant variable err. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Jinpeng Cui <cui.jinpeng2@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-09-21 12:49:32 +01:00
Guillaume Nault	72f5c89804	netfilter: rpfilter: Remove unused variable 'ret'. Commit `91a178258a` ("netfilter: rpfilter: Convert rpfilter_lookup_reverse to new dev helper") removed the need for the 'ret' variable. This went unnoticed because of the __maybe_unused annotation. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2022-09-21 10:44:56 +02:00
Antoine Tenart	2aa1927570	netfilter: conntrack: revisit the gc initial rescheduling bias The previous commit changed the way the rescheduling delay is computed, which has a side effect: the bias now carries the same weight as any other entry in the rescheduling delay, so the logic only kicks in with very large sets, as the initial interval is very large (INT_MAX). Revisit the GC initial bias to allow more frequent GC for smaller sets while still avoiding wakeups when a machine is mostly idle. We're moving from a large initial value to pretending we have 100 entries expiring at the upper bound. This way a few entries with a small timeout won't impact the rescheduling delay much, and non-idle machines will have enough entries to lower the delay when needed. This also improves readability, as the initial bias is now linked to what is computed instead of being an arbitrary large value. Fixes: `2cfadb761d` ("netfilter: conntrack: revisit gc autotuning") Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2022-09-21 10:44:56 +02:00
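The effect of seeding the average with 100 pretend entries can be sketched as follows. This is a rough model, not the conntrack code: the constant and function names are hypothetical, values are in milliseconds, and the 60 s upper bound is an assumption taken from the "within 1-60s" range quoted in the related commit.

```c
#include <stddef.h>

/* Hypothetical constants: a 60 s upper bound and a 100-entry bias. */
#define GC_INTERVAL_MAX_MS 60000L
#define GC_INITIAL_COUNT   100L

/*
 * Running mean of expiry values, seeded as if GC_INITIAL_COUNT entries
 * had already been seen, all expiring at the upper bound.
 */
static long biased_next_run(const long *expires_ms, size_t n)
{
    long next_run = GC_INTERVAL_MAX_MS;
    long count = GC_INITIAL_COUNT;

    for (size_t i = 0; i < n; i++)
        next_run += (expires_ms[i] - next_run) / ++count;
    return next_run;
}
```

With only a handful of short-lived entries the result stays close to the upper bound, while a set much larger than the bias pulls the delay down toward the real average.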
Antoine Tenart	95eabdd207	netfilter: conntrack: fix the gc rescheduling delay Commit `2cfadb761d` ("netfilter: conntrack: revisit gc autotuning") changed the eviction rescheduling to use the average expiry of scanned entries (within 1-60s) by doing: for (...) { expires = clamp(nf_ct_expires(tmp), ...); next_run += expires; next_run /= 2; } The issue is that the above makes the average ('next_run' here) more dependent on the last expiration values than on the first ones (for sets > 2). Depending on the expiration values used to compute the average, the result can be quite different from what's expected. To fix this we can do the following: for (...) { expires = clamp(nf_ct_expires(tmp), ...); next_run += (expires - next_run) / ++count; } Fixes: `2cfadb761d` ("netfilter: conntrack: revisit gc autotuning") Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2022-09-21 10:44:56 +02:00
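The difference between the two loops quoted in the commit above is easy to see outside the kernel. The sketch below is an assumption-laden simplification: the function names are hypothetical, the clamp is omitted, and both averages start from zero.

```c
#include <stddef.h>

/* Old scheme: add the new value, then halve. Recent values dominate. */
static long avg_halving(const long *expires, size_t n)
{
    long next_run = 0;

    for (size_t i = 0; i < n; i++) {
        next_run += expires[i];
        next_run /= 2;
    }
    return next_run;
}

/* Fixed scheme: incremental mean, every value has equal weight. */
static long avg_incremental(const long *expires, size_t n)
{
    long next_run = 0;
    long count = 0;

    for (size_t i = 0; i < n; i++)
        next_run += (expires[i] - next_run) / ++count;
    return next_run;
}
```

For the expiry values {60, 60, 60, 1} the halving loop yields 26 while the incremental mean yields 46 (the exact mean is 45.25). Reordering the same values to {1, 60, 60, 60} gives 52 and 45 respectively, which shows how strongly the old scheme depends on the last entries scanned.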
Zhengchao Shao	5508ff7cf3	net/sched: use tc_cls_stats_dump() in filter use tc_cls_stats_dump() in filter. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 15:54:13 -07:00
Florian Westphal	d250889322	netfilter: nf_ct_ftp: fix deadlock when nat rewrite is needed We can't use ct->lock, this is already used by the seqadj internals. When using ftp helper + nat, seqadj will attempt to acquire ct->lock again. Revert back to a global lock for now. Fixes: `c783a29c7e` ("netfilter: nf_ct_ftp: prefer skb_linearize") Reported-by: Bruno de Paula Larini <bruno.larini@riosoft.com.br> Signed-off-by: Florian Westphal <fw@strlen.de>	2022-09-20 23:50:03 +02:00
Florian Westphal	62ce44c4ff	netfilter: ebtables: fix memory leak when blob is malformed The bug fix was incomplete, it "replaced" crash with a memory leak. The old code had an assignment to "ret" embedded into the conditional, restore this. Fixes: `7997eff828` ("netfilter: ebtables: reject blobs that don't provide all entry points") Reported-and-tested-by: syzbot+a24c5252f3e3ab733464@syzkaller.appspotmail.com Signed-off-by: Florian Westphal <fw@strlen.de>	2022-09-20 23:50:03 +02:00
Tetsuo Handa	9a4d6dd554	netfilter: nf_tables: fix percpu memory leak at nf_tables_addchain() It seems to me that percpu memory for chain stats started leaking since commit `3bc158f8d0` ("netfilter: nf_tables: map basechain priority to hardware priority") when nft_chain_offload_priority() returned an error. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Fixes: `3bc158f8d0` ("netfilter: nf_tables: map basechain priority to hardware priority") Signed-off-by: Florian Westphal <fw@strlen.de>	2022-09-20 23:50:03 +02:00
Tetsuo Handa	921ebde3c0	netfilter: nf_tables: fix nft_counters_enabled underflow at nf_tables_addchain() syzbot is reporting underflow of nft_counters_enabled counter at nf_tables_addchain() [1], for commit `43eb8949cf` ("netfilter: nf_tables: do not leave chain stats enabled on error") missed that nf_tables_chain_destroy() after nft_basechain_init() in the error path of nf_tables_addchain() decrements the counter because nft_basechain_init() makes nft_is_base_chain() return true by setting NFT_CHAIN_BASE flag. Increment the counter immediately after returning from nft_basechain_init(). Link: https://syzkaller.appspot.com/bug?extid=b5d82a651b71cd8a75ab [1] Reported-by: syzbot <syzbot+b5d82a651b71cd8a75ab@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: syzbot <syzbot+b5d82a651b71cd8a75ab@syzkaller.appspotmail.com> Fixes: `43eb8949cf` ("netfilter: nf_tables: do not leave chain stats enabled on error") Signed-off-by: Florian Westphal <fw@strlen.de>	2022-09-20 23:50:03 +02:00
Lukas Bulwahn	caddb4e0d6	net: make NET_(DEV|NS)_REFCNT_TRACKER depend on NET It makes little sense to ask if networking namespace or net device refcount tracking shall be enabled for debug kernel builds without network support. This is similar to the commit `eb0b39efb7` ("net: CONFIG_DEBUG_NET depends on CONFIG_NET"). Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Link: https://lore.kernel.org/r/20220915124256.32512-1-lukas.bulwahn@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 14:23:56 -07:00
Vladimir Oltean	2c08a4f898	net/sched: taprio: replace safety precautions with comments The WARN_ON_ONCE() checks introduced in commit `13511704f8` ("net: taprio offload: enforce qdisc to netdev queue mapping") take a small toll on performance, but otherwise, the conditions are never expected to happen. Replace them with comments, such that the information is still conveyed to developers. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 13:53:34 -07:00
Vladimir Oltean	026de64d7b	net/sched: taprio: add extack messages in taprio_init Stop contributing to the proverbial user unfriendliness of tc, and tell the user what is wrong wherever possible. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 13:53:34 -07:00
Vladimir Oltean	25becba629	net/sched: taprio: stop going through private ops for dequeue and peek Since commit `13511704f8` ("net: taprio offload: enforce qdisc to netdev queue mapping"), taprio_dequeue_soft() and taprio_peek_soft() are de facto the only implementations for Qdisc_ops :: dequeue and Qdisc_ops :: peek that taprio provides. This is because in full offload mode, __dev_queue_xmit() will select a txq->qdisc which is never root taprio qdisc. So if nothing is enqueued in the root qdisc, it will never be run and nothing will get dequeued from it. Therefore, we can remove the private indirection from taprio, and always point Qdisc_ops :: dequeue to taprio_dequeue_soft (now simply named taprio_dequeue) and Qdisc_ops :: peek to taprio_peek_soft (now simply named taprio_peek). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 13:53:34 -07:00
Vladimir Oltean	fa65edde5e	net/sched: taprio: remove redundant FULL_OFFLOAD_IS_ENABLED check in taprio_enqueue Since commit `13511704f8` ("net: taprio offload: enforce qdisc to netdev queue mapping"), __dev_queue_xmit() will select a txq->qdisc for the full offload case of taprio which isn't the root taprio qdisc, so qdisc enqueues will never pass through taprio_enqueue(). That commit already introduced one safety precaution check for FULL_OFFLOAD_IS_ENABLED(); a second one is really not needed, so simplify the conditional for entering into the GSO segmentation logic. Also reword the comment a little, to appear more natural after the code change. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 13:53:34 -07:00
Vladimir Oltean	9af23657b3	net/sched: taprio: use rtnl_dereference for oper and admin sched in taprio_destroy() Sparse complains that taprio_destroy() dereferences q->oper_sched and q->admin_sched without rcu_dereference(), since they are marked as __rcu in the taprio private structure. 1671:28: warning: incorrect type in argument 1 (different address spaces) 1671:28: expected struct callback_head head 1671:28: got struct callback_head [noderef] __rcu 1674:28: warning: incorrect type in argument 1 (different address spaces) 1674:28: expected struct callback_head head 1674:28: got struct callback_head [noderef] __rcu To silence that build warning, do actually use rtnl_dereference(), since we know the rtnl_mutex is held at the time of q->destroy(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 13:53:33 -07:00
Vladimir Oltean	18cdd2f099	net/sched: taprio: taprio_dump and taprio_change are protected by rtnl_mutex Since the writer-side lock is taken here, we do not need to open an RCU read-side critical section, instead we can use rtnl_dereference() to tell lockdep we are serialized with concurrent writes. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 13:53:33 -07:00
Vladimir Oltean	c8cbe123be	net/sched: taprio: taprio_offload_config_changed() is protected by rtnl_mutex The locking in taprio_offload_config_changed() is wrong (but also inconsequentially so). The current_entry_lock does not serialize changes to the admin and oper schedules, only to the current entry. In fact, the rtnl_mutex does that, and that is taken at the time when taprio_change() is called. Replace the rcu_dereference_protected() method with the proper RCU annotation, and drop the unnecessary spin lock. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 13:53:33 -07:00
Vladimir Oltean	1461d212ab	net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs taprio can only operate as root qdisc, and to that end, there exists the following check in taprio_init(), just as in mqprio: if (sch->parent != TC_H_ROOT) return -EOPNOTSUPP; And indeed, when we try to attach taprio to an mqprio child, it fails as expected: $ tc qdisc add dev swp0 root handle 1: mqprio num_tc 8 \ map 0 1 2 3 4 5 6 7 \ queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0 $ tc qdisc replace dev swp0 parent 1:2 taprio num_tc 8 \ map 0 1 2 3 4 5 6 7 \ queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \ base-time 0 sched-entry S 0x7f 990000 sched-entry S 0x80 100000 \ flags 0x0 clockid CLOCK_TAI Error: sch_taprio: Can only be attached as root qdisc. (extack message added by me) But when we try to attach a taprio child to a taprio root qdisc, surprisingly it doesn't fail: $ tc qdisc replace dev swp0 root handle 1: taprio num_tc 8 \ map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \ base-time 0 sched-entry S 0x7f 990000 sched-entry S 0x80 100000 \ flags 0x0 clockid CLOCK_TAI $ tc qdisc replace dev swp0 parent 1:2 taprio num_tc 8 \ map 0 1 2 3 4 5 6 7 \ queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \ base-time 0 sched-entry S 0x7f 990000 sched-entry S 0x80 100000 \ flags 0x0 clockid CLOCK_TAI This is because tc_modify_qdisc() behaves differently when mqprio is root, vs when taprio is root. In the mqprio case, it finds the parent qdisc through p = qdisc_lookup(dev, TC_H_MAJ(clid)), and then the child qdisc through q = qdisc_leaf(p, clid). This leaf qdisc q has handle 0, so it is ignored according to the comment right below ("It may be default qdisc, ignore it"). As a result, tc_modify_qdisc() goes through the qdisc_create() code path, and this gives taprio_init() a chance to check for sch_parent != TC_H_ROOT and error out. Whereas in the taprio case, the returned q = qdisc_leaf(p, clid) is different. 
It is not the default qdisc created for each netdev queue (both taprio and mqprio call qdisc_create_dflt() and keep them in a private q->qdiscs[], or priv->qdiscs[], respectively). Instead, taprio makes qdisc_leaf() return the _root_ qdisc, aka itself. When taprio does that, tc_modify_qdisc() goes through the qdisc_change() code path, because the qdisc layer never finds out about the child qdisc of the root. And through the ->change() ops, taprio has no reason to check whether its parent is root or not, just through ->init(), which is not called. The problem is the taprio_leaf() implementation. Even though code wise, it does the exact same thing as mqprio_leaf() which it is copied from, it works with different input data. This is because mqprio does not attach itself (the root) to each device TX queue, but one of the default qdiscs from its private array. In fact, since commit `13511704f8` ("net: taprio offload: enforce qdisc to netdev queue mapping"), taprio does this too, but just for the full offload case. So if we tried to attach a taprio child to a fully offloaded taprio root qdisc, it would properly fail too; just not to a software root taprio. To fix the problem, stop looking at the Qdisc that's attached to the TX queue, and instead, always return the default qdiscs that we've allocated (and to which we privately enqueue and dequeue, in software scheduling mode). Since Qdisc_class_ops :: leaf is only called from tc_modify_qdisc(), the risk of unforeseen side effects introduced by this change is minimal. Fixes: `5a781ccbd1` ("tc: Add support for configuring the taprio scheduler") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 11:41:14 -07:00
Vladimir Oltean	db46e3a88a	net/sched: taprio: avoid disabling offload when it was never enabled In an incredibly strange API design decision, qdisc->destroy() gets called even if qdisc->init() never succeeded, not exclusively since commit `87b60cfacf` ("net_sched: fix error recovery at qdisc creation"), but apparently also earlier (in the case of qdisc_create_dflt()). The taprio qdisc does not fully acknowledge this when it attempts full offload, because it starts off with q->flags = TAPRIO_FLAGS_INVALID in taprio_init(), then it replaces q->flags with TCA_TAPRIO_ATTR_FLAGS parsed from netlink (in taprio_change(), tail called from taprio_init()). But in taprio_destroy(), we call taprio_disable_offload(), and this determines what to do based on FULL_OFFLOAD_IS_ENABLED(q->flags). But looking at the implementation of FULL_OFFLOAD_IS_ENABLED() (a bitwise check of bit 1 in q->flags), it is invalid to call this macro on q->flags when it contains TAPRIO_FLAGS_INVALID, because that is set to U32_MAX, and therefore FULL_OFFLOAD_IS_ENABLED() will return true on an invalid set of flags. As a result, it is possible to crash the kernel if user space forces an error between setting q->flags = TAPRIO_FLAGS_INVALID, and the calling of taprio_enable_offload(). This is because drivers do not expect the offload to be disabled when it was never enabled. 
The error that we force here is to attach taprio as a non-root qdisc, but instead as child of an mqprio root qdisc: $ tc qdisc add dev swp0 root handle 1: \ mqprio num_tc 8 map 0 1 2 3 4 5 6 7 \ queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0 $ tc qdisc replace dev swp0 parent 1:1 \ taprio num_tc 8 map 0 1 2 3 4 5 6 7 \ queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 base-time 0 \ sched-entry S 0x7f 990000 sched-entry S 0x80 100000 \ flags 0x0 clockid CLOCK_TAI Unable to handle kernel paging request at virtual address fffffffffffffff8 [fffffffffffffff8] pgd=0000000000000000, p4d=0000000000000000 Internal error: Oops: 96000004 [#1] PREEMPT SMP Call trace: taprio_dump+0x27c/0x310 vsc9959_port_setup_tc+0x1f4/0x460 felix_port_setup_tc+0x24/0x3c dsa_slave_setup_tc+0x54/0x27c taprio_disable_offload.isra.0+0x58/0xe0 taprio_destroy+0x80/0x104 qdisc_create+0x240/0x470 tc_modify_qdisc+0x1fc/0x6b0 rtnetlink_rcv_msg+0x12c/0x390 netlink_rcv_skb+0x5c/0x130 rtnetlink_rcv+0x1c/0x2c Fix this by keeping track of the operations we made, and undo the offload only if we actually did it. I've added "bool offloaded" inside a 4 byte hole between "int clockid" and "atomic64_t picos_per_byte". 
Now the first cache line looks like below: $ pahole -C taprio_sched net/sched/sch_taprio.o struct taprio_sched { struct Qdisc * * qdiscs; /* 0 8 */ struct Qdisc * root; /* 8 8 */ u32 flags; /* 16 4 */ enum tk_offsets tk_offset; /* 20 4 */ int clockid; /* 24 4 */ bool offloaded; /* 28 1 */ /* XXX 3 bytes hole, try to pack */ atomic64_t picos_per_byte; /* 32 0 */ /* XXX 8 bytes hole, try to pack */ spinlock_t current_entry_lock; /* 40 0 */ /* XXX 8 bytes hole, try to pack */ struct sched_entry * current_entry; /* 48 8 */ struct sched_gate_list * oper_sched; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ Fixes: `9c66d15646` ("taprio: Add support for hardware offloading") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 11:41:14 -07:00
Ido Schimmel	76dd072813	ipv6: Fix crash when IPv6 is administratively disabled The global 'raw_v6_hashinfo' variable can be accessed even when IPv6 is administratively disabled via the 'ipv6.disable=1' kernel command line option, leading to a crash [1]. Fix by restoring the original behavior and always initializing the variable, regardless of IPv6 support being administratively disabled or not. [1] BUG: unable to handle page fault for address: ffffffffffffffc8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 173e18067 P4D 173e18067 PUD 173e1a067 PMD 0 Oops: 0000 [#1] PREEMPT SMP KASAN CPU: 3 PID: 271 Comm: ss Not tainted 6.0.0-rc4-custom-00136-g0727a9a5fbc1 #1396 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014 RIP: 0010:raw_diag_dump+0x310/0x7f0 [...] Call Trace: <TASK> __inet_diag_dump+0x10f/0x2e0 netlink_dump+0x575/0xfd0 __netlink_dump_start+0x67b/0x940 inet_diag_handler_cmd+0x273/0x2d0 sock_diag_rcv_msg+0x317/0x440 netlink_rcv_skb+0x15e/0x430 sock_diag_rcv+0x2b/0x40 netlink_unicast+0x53b/0x800 netlink_sendmsg+0x945/0xe60 ____sys_sendmsg+0x747/0x960 ___sys_sendmsg+0x13a/0x1e0 __sys_sendmsg+0x118/0x1e0 do_syscall_64+0x34/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Fixes: `0daf07e527` ("raw: convert raw sockets to RCU") Reported-by: Roberto Ricci <rroberto2r@gmail.com> Tested-by: Roberto Ricci <rroberto2r@gmail.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220916084821.229287-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 11:27:32 -07:00
Kuniyuki Iwashima	d1e5e6408b	tcp: Introduce optional per-netns ehash. The more sockets we have in the hash table, the longer we spend looking up the socket. While running a number of small workloads on the same host, they penalise each other and cause performance degradation. The root cause might be a single workload that consumes much more resources than the others. It often happens on a cloud service where different workloads share the same computing resource. On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash entries), after running iperf3 in different netns, creating 24Mi sockets without data transfer in the root netns causes about 10% performance regression for the iperf3's connection. thash_entries sockets length Gbps 524288 1 1 50.7 24Mi 48 45.1 It is basically related to the length of the list of each hash bucket. For testing purposes to see how performance drops along the length, I set 131072 (1Mi / 8) to thash_entries, and here's the result. thash_entries sockets length Gbps 131072 1 1 50.7 1Mi 8 49.9 2Mi 16 48.9 4Mi 32 47.3 8Mi 64 44.6 16Mi 128 40.6 24Mi 192 36.3 32Mi 256 32.5 40Mi 320 27.0 48Mi 384 25.0 To resolve the socket lookup degradation, we introduce an optional per-netns hash table for TCP, but it's just ehash, and we still share the global bhash, bhash2 and lhash2. With a smaller ehash, we can look up non-listener sockets faster and isolate such noisy neighbours. In addition, we can reduce lock contention. We can control the ehash size by a new sysctl knob. However, depending on workloads, it will require very sensitive tuning, so we disable the feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover, we can fall back to using the global ehash in case we fail to allocate enough memory for a new ehash. The maximum size is 16Mi, which is large enough that even if we have 48Mi sockets, the average list length is 3, and regression would be less than 1%. 
We can check the current ehash size by another read-only sysctl knob, net.ipv4.tcp_ehash_entries. A negative value means the netns shares the global ehash (per-netns ehash is disabled or failed to allocate memory). # dmesg | cut -d ' ' -f 5- | grep "established hash" TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage) # sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries # sysctl net.ipv4.tcp_child_ehash_entries net.ipv4.tcp_child_ehash_entries = 0 # disabled by default # ip netns add test1 # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = -524288 # share the global ehash # sysctl -w net.ipv4.tcp_child_ehash_entries=100 net.ipv4.tcp_child_ehash_entries = 100 # ip netns add test2 # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = 128 # own a per-netns ehash with 2^n buckets When more than two processes in the same netns create per-netns ehash concurrently with different sizes, we need to guarantee the size in one of the following ways: 1) Share the global ehash and create per-netns ehash First, unshare() with tcp_child_ehash_entries==0. It creates dedicated netns sysctl knobs where we can safely change tcp_child_ehash_entries and clone()/unshare() to create a per-netns ehash. 2) Control write on sysctl by BPF We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on sysctl knobs. Note that the global ehash allocated at the boot time is spread over available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate pages for each per-netns ehash depending on the current process's NUMA policy. By default, the allocation is done in the local node only, so the per-netns hash table could fully reside on a random node. 
Thus, depending on the NUMA policy the netns is created with and the CPU the current thread is running on, we could see some performance differences for highly optimised networking applications. Note also that the default values of two sysctl knobs depend on the ehash size and should be tuned carefully: tcp_max_tw_buckets : tcp_child_ehash_entries / 2 tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128) As a bonus, we can dismantle netns faster. Currently, while destroying netns, we call inet_twsk_purge(), which walks through the global ehash. It can be potentially big because it can have many sockets other than TIME_WAIT in all netns. Splitting ehash changes that situation, where it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets in each netns. With regard to this, we do not free the per-netns ehash in inet_twsk_kill() to avoid UAF while iterating the per-netns ehash in inet_twsk_purge(). Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to keep it protocol-family-independent. In the future, we could optimise ehash lookup/iteration further by removing netns comparison for the per-netns ehash. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 10:21:50 -07:00
Kuniyuki Iwashima	edc12f032a	tcp: Save unnecessary inet_twsk_purge() calls. While destroying netns, we call inet_twsk_purge() in tcp_sk_exit_batch() and tcpv6_net_exit_batch() for AF_INET and AF_INET6. These commands trigger the kernel to walk through the potentially big ehash twice even though the netns has no TIME_WAIT sockets. # ip netns add test # ip netns del test or # unshare -n /bin/true >/dev/null When tw_refcount is 1, we need not call inet_twsk_purge() at least for the net. We can save such unneeded iterations if all netns in net_exit_list have no TIME_WAIT sockets. This change eliminates the tax by the additional unshare() described in the next patch to guarantee the per-netns ehash size. Tested: # mount -t debugfs none /sys/kernel/debug/ # echo cleanup_net > /sys/kernel/debug/tracing/set_ftrace_filter # echo inet_twsk_purge >> /sys/kernel/debug/tracing/set_ftrace_filter # echo function > /sys/kernel/debug/tracing/current_tracer # cat ./add_del_unshare.sh for i in `seq 1 40` do (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) & done wait; # ./add_del_unshare.sh Before the patch: # cat /sys/kernel/debug/tracing/trace_pipe kworker/u128:0-8 [031] ...1. 174.162765: cleanup_net <-process_one_work kworker/u128:0-8 [031] ...1. 174.240796: inet_twsk_purge <-cleanup_net kworker/u128:0-8 [032] ...1. 174.244759: inet_twsk_purge <-tcp_sk_exit_batch kworker/u128:0-8 [034] ...1. 174.290861: cleanup_net <-process_one_work kworker/u128:0-8 [039] ...1. 175.245027: inet_twsk_purge <-cleanup_net kworker/u128:0-8 [046] ...1. 175.290541: inet_twsk_purge <-tcp_sk_exit_batch kworker/u128:0-8 [037] ...1. 175.321046: cleanup_net <-process_one_work kworker/u128:0-8 [024] ...1. 175.941633: inet_twsk_purge <-cleanup_net kworker/u128:0-8 [025] ...1. 176.242539: inet_twsk_purge <-tcp_sk_exit_batch After: # cat /sys/kernel/debug/tracing/trace_pipe kworker/u128:0-8 [038] ...1. 428.116174: cleanup_net <-process_one_work kworker/u128:0-8 [038] ...1. 
428.262532: cleanup_net <-process_one_work kworker/u128:0-8 [030] ...1. 429.292645: cleanup_net <-process_one_work Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 10:21:50 -07:00
Kuniyuki Iwashima	4461568aa4	tcp: Access &tcp_hashinfo via net. We will soon introduce an optional per-netns ehash. This means we cannot use tcp_hashinfo directly in most places. Instead, access it via net->ipv4.tcp_death_row.hashinfo. The access will be valid only while initialising tcp_hashinfo itself and creating/destroying each netns. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 10:21:49 -07:00
Kuniyuki Iwashima	429e42c1c5	tcp: Set NULL to sk->sk_prot->h.hashinfo. We will soon introduce an optional per-netns ehash. This means we cannot use the global sk->sk_prot->h.hashinfo to fetch a TCP hashinfo. Instead, set NULL to sk->sk_prot->h.hashinfo for TCP and get a proper hashinfo from net->ipv4.tcp_death_row.hashinfo. Note that we need not use sk->sk_prot->h.hashinfo if DCCP is disabled. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 10:21:49 -07:00
Kuniyuki Iwashima	e9bd0cca09	tcp: Don't allocate tcp_death_row outside of struct netns_ipv4. We will soon introduce an optional per-netns ehash and access hash tables via net->ipv4.tcp_death_row->hashinfo instead of &tcp_hashinfo in most places. It could harm the fast path because dereferences of two fields in net and tcp_death_row might incur two extra cache line misses. To save one dereference, let's place tcp_death_row back in netns_ipv4 and fetch hashinfo via net->ipv4.tcp_death_row"."hashinfo. Note tcp_death_row was initially placed in netns_ipv4, and commit `fbb8295248` ("tcp: allocate tcp_death_row outside of struct netns_ipv4") changed it to a pointer so that we can fire TIME_WAIT timers after freeing net. However, we don't do so after commit `04c494e68a` ("Revert "tcp/dccp: get rid of inet_twsk_purge()""), so we need not define tcp_death_row as a pointer. Also, we move refcount_dec_and_test(&tw_refcount) from tcp_sk_exit() to tcp_sk_exit_batch() as a debug check. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 10:21:49 -07:00
Kuniyuki Iwashima	08eaef9040	tcp: Clean up some functions. This patch adds no functional change and cleans up some functions that the following patches touch around so that we make them tidy and easy to review/revert. The changes are - Keep reverse christmas tree order - Remove unnecessary init of port in inet_csk_find_open_port() - Use req_to_sk() once in reqsk_queue_unlink() - Use sock_net(sk) once in tcp_time_wait() and tcp_v[46]_connect() Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 10:21:49 -07:00
Phil Sutter	a4abfa627c	net: rtnetlink: Enslave device before bringing it up Unlike with bridges, one can't add an interface to a bond and set it up at the same time: | # ip link set dummy0 down | # ip link set dummy0 master bond0 up | Error: Device can not be enslaved while up. Of all drivers with ndo_add_slave callback, bond and team decline if IFF_UP flag is set, vrf cycles the interface (i.e., sets it down and immediately up again) and the others just don't care. Support the common notion of setting the interface up after enslaving it by sorting the operations accordingly. Signed-off-by: Phil Sutter <phil@nwl.cc> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220914150623.24152-1-phil@nwl.cc Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 08:37:44 -07:00
Tetsuo Handa	d547c1b717	net: clear msg_get_inq in __get_compat_msghdr() syzbot is still complaining uninit-value in tcp_recvmsg(), for commit `1228b34c8d` ("net: clear msg_get_inq in __sys_recvfrom() and __copy_msghdr_from_user()") missed that __get_compat_msghdr() is called instead of copy_msghdr_from_user() when MSG_CMSG_COMPAT is specified. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Fixes: `1228b34c8d` ("net: clear msg_get_inq in __sys_recvfrom() and __copy_msghdr_from_user()") Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/d06d0f7f-696c-83b4-b2d5-70b5f2730a37@I-love.SAKURA.ne.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 08:23:20 -07:00
Ido Schimmel	b07a9b26e2	ipmr: Always call ip{,6}_mr_forward() from RCU read-side critical section These functions expect to be called from RCU read-side critical section, but this only happens when invoked from the data path via ip{,6}_mr_input(). They can also be invoked from process context in response to user space adding a multicast route which resolves a cache entry with queued packets [1][2]. Fix by adding missing rcu_read_lock() / rcu_read_unlock() in these call paths. [1] WARNING: suspicious RCU usage 6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387 Not tainted ----------------------------- net/ipv4/ipmr.c:84 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by smcrouted/246: #0: ffffffff862389b0 (rtnl_mutex){+.+.}-{3:3}, at: ip_mroute_setsockopt+0x11c/0x1420 stack backtrace: CPU: 0 PID: 246 Comm: smcrouted Not tainted 6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x91/0xb9 vif_dev_read+0xbf/0xd0 ipmr_queue_xmit+0x135/0x1ab0 ip_mr_forward+0xe7b/0x13d0 ipmr_mfc_add+0x1a06/0x2ad0 ip_mroute_setsockopt+0x5c1/0x1420 do_ip_setsockopt+0x23d/0x37f0 ip_setsockopt+0x56/0x80 raw_setsockopt+0x219/0x290 __sys_setsockopt+0x236/0x4d0 __x64_sys_setsockopt+0xbe/0x160 do_syscall_64+0x34/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd [2] WARNING: suspicious RCU usage 6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387 Not tainted ----------------------------- net/ipv6/ip6mr.c:69 suspicious rcu_dereference_check() usage! 
other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by smcrouted/246: #0: ffffffff862389b0 (rtnl_mutex){+.+.}-{3:3}, at: ip6_mroute_setsockopt+0x6b9/0x2630 stack backtrace: CPU: 1 PID: 246 Comm: smcrouted Not tainted 6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x91/0xb9 vif_dev_read+0xbf/0xd0 ip6mr_forward2.isra.0+0xc9/0x1160 ip6_mr_forward+0xef0/0x13f0 ip6mr_mfc_add+0x1ff2/0x31f0 ip6_mroute_setsockopt+0x1825/0x2630 do_ipv6_setsockopt+0x462/0x4440 ipv6_setsockopt+0x105/0x140 rawv6_setsockopt+0xd8/0x690 __sys_setsockopt+0x236/0x4d0 __x64_sys_setsockopt+0xbe/0x160 do_syscall_64+0x34/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: `ebc3197963` ("ipmr: add rcu protection over (struct vif_device)->dev") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-09-20 08:22:15 -07:00
Cong Wang	db4192a754	tcp: read multiple skbs in tcp_read_skb() Before we switched to ->read_skb(), ->read_sock() was passed with desc.count=1, which technically indicates we only read one skb per ->sk_data_ready() call. However, for TCP, this is not true. TCP at least has sk_rcvlowat which intentionally holds skb's in receive queue until this watermark is reached. This means when ->sk_data_ready() is invoked there could be multiple skb's in the queue, therefore we have to read multiple skbs in tcp_read_skb() instead of one. Fixes: `965b57b469` ("net: Introduce a new proto_ops ->read_skb()") Reported-by: Peilin Ye <peilin.ye@bytedance.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Link: https://lore.kernel.org/r/20220912173553.235838-1-xiyou.wangcong@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-09-20 14:47:21 +02:00
Andrea Mayer	848f3c0d47	seg6: add NEXT-C-SID support for SRv6 End behavior The NEXT-C-SID mechanism described in [1] offers the possibility of encoding several SRv6 segments within a single 128 bit SID address. Such a SID address is called a Compressed SID (C-SID) container. In this way, the length of the SID List can be drastically reduced. A SID instantiated with the NEXT-C-SID flavor considers an IPv6 address logically structured in three main blocks: i) Locator-Block; ii) Locator-Node Function; iii) Argument. C-SID container +------------------------------------------------------------------+ | Locator-Block |Loc-Node| Argument | | |Function| | +------------------------------------------------------------------+ <--------- B -----------> <- NF -> <------------- A ---------------> (i) The Locator-Block can be any IPv6 prefix available to the provider; (ii) The Locator-Node Function represents the node and the function to be triggered when a packet is received on the node; (iii) The Argument carries the remaining C-SIDs in the current C-SID container. The NEXT-C-SID mechanism relies on the "flavors" framework defined in [2]. The flavors represent additional operations that can modify or extend a subset of the existing behaviors. This patch introduces the support for flavors in SRv6 End behavior implementing the NEXT-C-SID one. An SRv6 End behavior with NEXT-C-SID flavor works as an End behavior but it is capable of processing the compressed SID List encoded in C-SID containers. An SRv6 End behavior with NEXT-C-SID flavor can be configured to support user-provided Locator-Block and Locator-Node Function lengths. In this implementation, such lengths must be evenly divisible by 8 (i.e. must be byte-aligned), otherwise the kernel informs the user about invalid values with a meaningful error code and message through netlink_ext_ack. 
If Locator-Block and/or Locator-Node Function lengths are not provided by the user during configuration of an SRv6 End behavior instance with NEXT-C-SID flavor, the kernel will choose their default values i.e., 32-bit Locator-Block and 16-bit Locator-Node Function. [1] - https://datatracker.ietf.org/doc/html/draft-ietf-spring-srv6-srh-compression [2] - https://datatracker.ietf.org/doc/html/rfc8986 Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-09-20 12:33:22 +02:00
Andrea Mayer	e2a8ecc451	seg6: add netlink_ext_ack support in parsing SRv6 behavior attributes An SRv6 behavior instance can be set up using mandatory and/or optional attributes. In the setup phase, each supplied attribute is parsed and processed. If the parsing operation fails, the creation of the behavior instance stops and an error number/code is reported to the user. In many cases, it is challenging for the user to figure out exactly what happened by relying only on the error code. For this reason, we add the support for netlink_ext_ack in parsing SRv6 behavior attributes. In this way, when an SRv6 behavior attribute is parsed and an error occurs, the kernel can send a message to the userspace describing the error through a meaningful text message in addition to the classic error code. Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-09-20 12:33:22 +02:00

70682 Commits