The mptcp socket and its subflow sockets in accept queue can't be
released after the process exit.
While the release of a mptcp socket in listening state, the
corresponding tcp socket will be released too. Meanwhile, the tcp
socket in the unaccept queue will be released too. However, only init
subflow is in the unaccept queue, and the joined subflow is not in the
unaccept queue, which makes the joined subflow won't be released, and
therefore the corresponding unaccepted mptcp socket will not be released
to.
This can be reproduced easily with following steps:
1. create 2 namespace and veth:
$ ip netns add mptcp-client
$ ip netns add mptcp-server
$ sysctl -w net.ipv4.conf.all.rp_filter=0
$ ip netns exec mptcp-client sysctl -w net.mptcp.enabled=1
$ ip netns exec mptcp-server sysctl -w net.mptcp.enabled=1
$ ip link add red-client netns mptcp-client type veth peer red-server \
netns mptcp-server
$ ip -n mptcp-server address add 10.0.0.1/24 dev red-server
$ ip -n mptcp-server address add 192.168.0.1/24 dev red-server
$ ip -n mptcp-client address add 10.0.0.2/24 dev red-client
$ ip -n mptcp-client address add 192.168.0.2/24 dev red-client
$ ip -n mptcp-server link set red-server up
$ ip -n mptcp-client link set red-client up
2. configure the endpoint and limit for client and server:
$ ip -n mptcp-server mptcp endpoint flush
$ ip -n mptcp-server mptcp limits set subflow 2 add_addr_accepted 2
$ ip -n mptcp-client mptcp endpoint flush
$ ip -n mptcp-client mptcp limits set subflow 2 add_addr_accepted 2
$ ip -n mptcp-client mptcp endpoint add 192.168.0.2 dev red-client id \
1 subflow
3. listen and accept on a port, such as 9999. The nc command we used
here is modified, which makes it use mptcp protocol by default.
$ ip netns exec mptcp-server nc -l -k -p 9999
4. open another *two* terminal and use each of them to connect to the
server with the following command:
$ ip netns exec mptcp-client nc 10.0.0.1 9999
Input something after connect to trigger the connection of the second
subflow. So that there are two established mptcp connections, with the
second one still unaccepted.
5. exit all the nc command, and check the tcp socket in server namespace.
And you will find that there is one tcp socket in CLOSE_WAIT state
and can't release forever.
Fix this by closing all of the unaccepted mptcp socket in
mptcp_subflow_queue_clean() with __mptcp_close().
Now, we can ensure that all unaccepted mptcp sockets will be cleaned by
__mptcp_close() before they are released, so mptcp_sock_destruct(), which
is used to clean the unaccepted mptcp socket, is not needed anymore.
The selftests for mptcp is ran for this commit, and no new failures.
Fixes: f296234c98 ("mptcp: Add handling of incoming MP_JOIN requests")
Fixes: 6aeed90450 ("mptcp: fix race on unaccepted mptcp sockets")
Cc: stable@vger.kernel.org
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Factor out __mptcp_close() from mptcp_close(). The caller of
__mptcp_close() should hold the socket lock, and cancel mptcp work when
__mptcp_close() returns true.
This function will be used in the next commit.
Fixes: f296234c98 ("mptcp: Add handling of incoming MP_JOIN requests")
Fixes: 6aeed90450 ("mptcp: fix race on unaccepted mptcp sockets")
Cc: stable@vger.kernel.org
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 1d0403d20f.
Anatoly Pugachev reported that the commit 1d0403d20f ("net: set proper
memcg for net_init hooks allocations") is somehow causing the sparc64
VMs failed to boot and the VMs boot fine with that patch reverted. So,
revert the patch for now and later we can debug the issue.
Link: https://lore.kernel.org/all/20220918092849.GA10314@u164.east.ru/
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Vasily Averin <vvs@openvz.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Tested-by: Anatoly Pugachev <matorola@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Fixes: 1d0403d20f ("net: set proper memcg for net_init hooks allocations")
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 6911458dc4 ("wifi: mac80211: mlme: refactor assoc success
handling") moved the per-link setup out of ieee80211_assoc_success() into a
new function ieee80211_assoc_config_link() but missed to remove the unlock
of 'sta_mtx' in case of HE capability/operation missing on HE AP, which
leads to a double unlock:
ieee80211_assoc_success() {
...
ieee80211_assoc_config_link() {
...
if (!(link->u.mgd.conn_flags & IEEE80211_CONN_DISABLE_HE) &&
(!elems->he_cap || !elems->he_operation)) {
mutex_unlock(&sdata->local->sta_mtx);
...
}
...
}
...
mutex_unlock(&sdata->local->sta_mtx);
...
}
Fixes: 6911458dc4 ("wifi: mac80211: mlme: refactor assoc success handling")
Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Link: https://lore.kernel.org/r/20220925143420.784975-1-rafaelmendsr@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Commit 98b0b46746 ("wifi: mac80211: mlme: use correct link_sta")
switched to link station instead of deflink and added some checks to do
that, which are done with the 'sta_mtx' mutex held. However, the error
path of these checks does not unlock 'sta_mtx' before returning.
Fixes: 98b0b46746 ("wifi: mac80211: mlme: use correct link_sta")
Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Link: https://lore.kernel.org/r/20220924184042.778676-1-rafaelmendsr@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
During our testing of WFM200 module over SDIO on i.MX6Q-based platform,
we discovered a memory corruption on the system, tracing back to the wfx
driver. Using kfence, it was possible to trace it back to the root
cause, which is hw->max_rates set to 8 in wfx_init_common,
while the maximum defined by IEEE80211_TX_TABLE_SIZE is 4.
This causes array out-of-bounds writes during updates of the rate table,
as seen below:
BUG: KFENCE: memory corruption in kfree_rcu_work+0x320/0x36c
Corrupted memory at 0xe0a4ffe0 [ 0x03 0x03 0x03 0x03 0x01 0x00 0x00
0x02 0x02 0x02 0x09 0x00 0x21 0xbb 0xbb 0xbb ] (in kfence-#81):
kfree_rcu_work+0x320/0x36c
process_one_work+0x3ec/0x920
worker_thread+0x60/0x7a4
kthread+0x174/0x1b4
ret_from_fork+0x14/0x2c
0x0
kfence-#81: 0xe0a4ffc0-0xe0a4ffdf, size=32, cache=kmalloc-64
allocated by task 297 on cpu 0 at 631.039555s:
minstrel_ht_update_rates+0x38/0x2b0 [mac80211]
rate_control_tx_status+0xb4/0x148 [mac80211]
ieee80211_tx_status_ext+0x364/0x1030 [mac80211]
ieee80211_tx_status+0xe0/0x118 [mac80211]
ieee80211_tasklet_handler+0xb0/0xe0 [mac80211]
tasklet_action_common.constprop.0+0x11c/0x148
__do_softirq+0x1a4/0x61c
irq_exit+0xcc/0x104
call_with_stack+0x18/0x20
__irq_svc+0x80/0xb0
wq_worker_sleeping+0x10/0x100
wq_worker_sleeping+0x10/0x100
schedule+0x50/0xe0
schedule_timeout+0x2e0/0x474
wait_for_completion+0xdc/0x1ec
mmc_wait_for_req_done+0xc4/0xf8
mmc_io_rw_extended+0x3b4/0x4ec
sdio_io_rw_ext_helper+0x290/0x384
sdio_memcpy_toio+0x30/0x38
wfx_sdio_copy_to_io+0x88/0x108 [wfx]
wfx_data_write+0x88/0x1f0 [wfx]
bh_work+0x1c8/0xcc0 [wfx]
process_one_work+0x3ec/0x920
worker_thread+0x60/0x7a4
kthread+0x174/0x1b4
ret_from_fork+0x14/0x2c 0x0
After discussion on the wireless mailing list it was clarified
that the issue has been introduced by:
commit ee0e16ab75 ("mac80211: minstrel_ht: fill all requested rates")
and fix shall be in minstrel_ht_update_rates in rc80211_minstrel_ht.c.
Fixes: ee0e16ab75 ("mac80211: minstrel_ht: fill all requested rates")
Link: https://lore.kernel.org/all/12e5adcd-8aed-f0f7-70cc-4fb7b656b829@camlingroup.com/
Link: https://lore.kernel.org/linux-wireless/20220915131445.30600-1-lech.perczak@camlingroup.com/
Cc: Jérôme Pouiller <jerome.pouiller@silabs.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Peter Seiderer <ps.report@gmx.net>
Cc: Kalle Valo <kvalo@kernel.org>
Cc: Krzysztof Drobiński <krzysztof.drobinski@camlingroup.com>,
Signed-off-by: Paweł Lenkow <pawel.lenkow@camlingroup.com>
Signed-off-by: Lech Perczak <lech.perczak@camlingroup.com>
Reviewed-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Jérôme Pouiller <jerome.pouiller@silabs.com>
Acked-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Commit 10cb8e6175 ("mac80211: enable QoS support for nl80211 ctrl port")
changed ieee80211_tx_control_port() to aways call
__ieee80211_select_queue() without checking local->hw.queues.
__ieee80211_select_queue() returns a queue-id between 0 and 3, which means
that now ieee80211_tx_control_port() may end up setting the queue-mapping
for a skb to a value higher then local->hw.queues if local->hw.queues
is less then 4.
Specifically this is a problem for ralink rt2500-pci cards where
local->hw.queues is 2. There this causes rt2x00queue_get_tx_queue() to
return NULL and the following error to be logged: "ieee80211 phy0:
rt2x00mac_tx: Error - Attempt to send packet over invalid queue 2",
after which association with the AP fails.
Other callers of __ieee80211_select_queue() skip calling it when
local->hw.queues < IEEE80211_NUM_ACS, add the same check to
ieee80211_tx_control_port(). This fixes ralink rt2500-pci and
similar cards when less then 4 tx-queues no longer working.
Fixes: 10cb8e6175 ("mac80211: enable QoS support for nl80211 ctrl port")
Cc: Markus Theil <markus.theil@tu-ilmenau.de>
Suggested-by: Stanislaw Gruszka <stf_xl@wp.pl>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20220918192052.443529-1-hdegoede@redhat.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Make sure local->queue_stop_reasons and vif.txqs_stopped stay in sync.
When a new vif is created the queues may end up in an inconsistent state
and be inoperable:
Communication not using iTXQ will work, allowing to e.g. complete the
association. But the 4-way handshake will time out. The sta will not
send out any skbs queued in iTXQs.
All normal attempts to start the queues will fail when reaching this
state.
local->queue_stop_reasons will have marked all queues as operational but
vif.txqs_stopped will still be set, creating an inconsistent internal
state.
In reality this seems to be race between the mac80211 function
ieee80211_do_open() setting SDATA_STATE_RUNNING and the wake_txqs_tasklet:
Depending on the driver and the timing the queues may end up to be
operational or not.
Cc: stable@vger.kernel.org
Fixes: f856373e2f ("wifi: mac80211: do not wake queues on a vif that is being stopped")
Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
Acked-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20220915130946.302803-1-alexander@wetzel-home.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ieee80211_txq_purge() calls fq_tin_reset() and
ieee80211_purge_tx_queue(); Both are then calling
ieee80211_free_txskb(). Which can decide to TX the skb again.
There are at least two ways to get a deadlock:
1) When we have a TDLS teardown packet queued in either tin or frags
ieee80211_tdls_td_tx_handle() will call ieee80211_subif_start_xmit()
while we still hold fq->lock. ieee80211_txq_enqueue() will thus
deadlock.
2) A variant of the above happens if aggregation is up and running:
In that case ieee80211_iface_work() will deadlock with the original
task: The original tasks already holds fq->lock and tries to get
sta->lock after kicking off ieee80211_iface_work(). But the worker
can get sta->lock prior to the original task and will then spin for
fq->lock.
Avoid these deadlocks by not sending out any skbs when called via
ieee80211_free_txskb().
Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
Link: https://lore.kernel.org/r/20220915124120.301918-1-alexander@wetzel-home.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The Bitrate for HE/EHT MCS6 is calculated wrongly due to the
incorrect MCS divisor value for mcs6. Fix it with the proper
value.
previous mcs_divisor value = (11769/6144) = 1.915527
fixed mcs_divisor value = (11377/6144) = 1.851725
Fixes: 9c97c88d2f ("cfg80211: Add support to calculate and report 4096-QAM HE rates")
Signed-off-by: Tamizh Chelvam Raja <quic_tamizhr@quicinc.com>
Link: https://lore.kernel.org/r/20220908181034.9936-1-quic_tamizhr@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
nf_ct_put need to be called to put the refcount got by tcf_ct_fill_params
to avoid possible refcount leak when tcf_ct_flow_table_get fails.
Fixes: c34b961a24 ("net/sched: act_ct: Create nf flow table per zone")
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Link: https://lore.kernel.org/r/20220923020046.8021-1-hbh25y@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A handful of awaited fixes here - revert of the FEC changes,
bluetooth fix, fixes for iwlwifi spew.
We added a warning in PHY/MDIO code which is triggering on
a couple of platforms in a false-positive-ish way. If we can't
iron that out over the week we'll drop it and re-add for 6.1.
I've added a new "follow up fixes" section for fixes to fixes
in 6.0-rcs but it may actually give the false impression that
those are problematic or that more testing time would have
caught them. So likely a one time thing.
Follow up fixes:
- nf_tables_addchain: fix nft_counters_enabled underflow
- ebtables: fix memory leak when blob is malformed
- nf_ct_ftp: fix deadlock when nat rewrite is needed
Current release - regressions:
- Revert "fec: Restart PPS after link state change"
- Revert "net: fec: Use a spinlock to guard `fep->ptp_clk_on`"
- Bluetooth: fix HCIGETDEVINFO regression
- wifi: mt76: fix 5 GHz connection regression on mt76x0/mt76x2
- mptcp: fix fwd memory accounting on coalesce
- rwlock removal fall out:
- ipmr: always call ip{,6}_mr_forward() from RCU read-side
critical section
- ipv6: fix crash when IPv6 is administratively disabled
- tcp: read multiple skbs in tcp_read_skb()
- mdio_bus_phy_resume state warning fallout:
- eth: ravb: fix PHY state warning splat during system resume
- eth: sh_eth: fix PHY state warning splat during system resume
Current release - new code bugs:
- wifi: iwlwifi: don't spam logs with NSS>2 messages
- eth: mtk_eth_soc: enable XDP support just for MT7986 SoC
Previous releases - regressions:
- bonding: fix NULL deref in bond_rr_gen_slave_id
- wifi: iwlwifi: mark IWLMEI as broken
Previous releases - always broken:
- nf_conntrack helpers:
- irc: tighten matching on DCC message
- sip: fix ct_sip_walk_headers
- osf: fix possible bogus match in nf_osf_find()
- ipvlan: fix out-of-bound bugs caused by unset skb->mac_header
- core: fix flow symmetric hash
- bonding, team: unsync device addresses on ndo_stop
- phy: micrel: fix shared interrupt on LAN8814
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmMsj3EACgkQMUZtbf5S
IrsUgQ//eXxuUZeGTg7cgJKPFJelrZ3iL16B1+s2qX94GPIqXRAShgC78iM7IbSe
y3vR/7YVE7sKXm88wnLefMQVXPp0cE2p0+8++E/j4zcRZsM5sHb2+d3gW6nos2ed
U8Ldm7LzWUNt/o1ZHDqZWBSoreFkmbFyHO6FVPCuH11tFUJqxJ/SP860mwo6tbuT
HOoVphKis41IMEXCgybs2V0DAQewba0gejzAmySDy8epNhOj2F4Vo6aadnUCI68U
HrIFYe2wiEi6MZDsB9zpRXc9seb6ZBKbBjgQnTK7MwfBEQCzxtR2lkNobJM1WbdL
nYwHBOJ16yX0BnlSpUEepv6iJYY5Q7FS35Wk3Rq5Mik6DaEir6vVSBdRxHpYOkO2
KPIyyMMAA5E8mAtqH3PcpnwDK+9c3KlZYYKXxIp2IjQm87DpOZJFynwsC3Crmbzo
C7UTMav2nkHljoapMLUwzqyw2ip+Qo14XA043FDPUru1sXY9CY6q50XZa5GmrNKh
xyaBdp4Ckj1kOuXUR9jz3Rq8skOZ8lNGHtiCdgPZitWhNKW1YORJihC7/9zdieCR
1gOE7Dpz/MhVmFn2e8S5O3TkU5lXfALfPDJi4QiML5VLHXd/nCE5sHPiOBWcoo4w
2djKbIGpLRnO6qMs4NkWNmPbG+/ouvpM+lewqn+xU4TGyn/NTbI=
=wrep
-----END PGP SIGNATURE-----
Merge tag 'net-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from wifi, netfilter and can.
A handful of awaited fixes here - revert of the FEC changes, bluetooth
fix, fixes for iwlwifi spew.
We added a warning in PHY/MDIO code which is triggering on a couple of
platforms in a false-positive-ish way. If we can't iron that out over
the week we'll drop it and re-add for 6.1.
I've added a new "follow up fixes" section for fixes to fixes in
6.0-rcs but it may actually give the false impression that those are
problematic or that more testing time would have caught them. So
likely a one time thing.
Follow up fixes:
- nf_tables_addchain: fix nft_counters_enabled underflow
- ebtables: fix memory leak when blob is malformed
- nf_ct_ftp: fix deadlock when nat rewrite is needed
Current release - regressions:
- Revert "fec: Restart PPS after link state change" and the related
"net: fec: Use a spinlock to guard `fep->ptp_clk_on`"
- Bluetooth: fix HCIGETDEVINFO regression
- wifi: mt76: fix 5 GHz connection regression on mt76x0/mt76x2
- mptcp: fix fwd memory accounting on coalesce
- rwlock removal fall out:
- ipmr: always call ip{,6}_mr_forward() from RCU read-side
critical section
- ipv6: fix crash when IPv6 is administratively disabled
- tcp: read multiple skbs in tcp_read_skb()
- mdio_bus_phy_resume state warning fallout:
- eth: ravb: fix PHY state warning splat during system resume
- eth: sh_eth: fix PHY state warning splat during system resume
Current release - new code bugs:
- wifi: iwlwifi: don't spam logs with NSS>2 messages
- eth: mtk_eth_soc: enable XDP support just for MT7986 SoC
Previous releases - regressions:
- bonding: fix NULL deref in bond_rr_gen_slave_id
- wifi: iwlwifi: mark IWLMEI as broken
Previous releases - always broken:
- nf_conntrack helpers:
- irc: tighten matching on DCC message
- sip: fix ct_sip_walk_headers
- osf: fix possible bogus match in nf_osf_find()
- ipvlan: fix out-of-bound bugs caused by unset skb->mac_header
- core: fix flow symmetric hash
- bonding, team: unsync device addresses on ndo_stop
- phy: micrel: fix shared interrupt on LAN8814"
* tag 'net-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
selftests: forwarding: add shebang for sch_red.sh
bnxt: prevent skb UAF after handing over to PTP worker
net: marvell: Fix refcounting bugs in prestera_port_sfp_bind()
net: sched: fix possible refcount leak in tc_new_tfilter()
net: sunhme: Fix packet reception for len < RX_COPY_THRESHOLD
udp: Use WARN_ON_ONCE() in udp_read_skb()
selftests: bonding: cause oops in bond_rr_gen_slave_id
bonding: fix NULL deref in bond_rr_gen_slave_id
net: phy: micrel: fix shared interrupt on LAN8814
net/smc: Stop the CLC flow if no link to map buffers on
ice: Fix ice_xdp_xmit() when XDP TX queue number is not sufficient
net: atlantic: fix potential memory leak in aq_ndev_close()
can: gs_usb: gs_usb_set_phys_id(): return with error if identify is not supported
can: gs_usb: gs_can_open(): fix race dev->can.state condition
can: flexcan: flexcan_mailbox_read() fix return value for drop = true
net: sh_eth: Fix PHY state warning splat during system resume
net: ravb: Fix PHY state warning splat during system resume
netfilter: nf_ct_ftp: fix deadlock when nat rewrite is needed
netfilter: ebtables: fix memory leak when blob is malformed
netfilter: nf_tables: fix percpu memory leak at nf_tables_addchain()
...
tfilter_put need to be called to put the refount got by tp->ops->get to
avoid possible refcount leak when chain->tmplt_ops != NULL and
chain->tmplt_ops != tp->ops.
Fixes: 7d5509fa0d ("net: sched: extend proto ops with 'put' callback")
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Link: https://lore.kernel.org/r/20220921092734.31700-1-hbh25y@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There might be a potential race between SMC-R buffer map and
link group termination.
smc_smcr_terminate_all() | smc_connect_rdma()
--------------------------------------------------------------
| smc_conn_create()
for links in smcibdev |
schedule links down |
| smc_buf_create()
| \- smcr_buf_map_usable_links()
| \- no usable links found,
| (rmb->mr = NULL)
|
| smc_clc_send_confirm()
| \- access conn->rmb_desc->mr[]->rkey
| (panic)
During reboot and IB device module remove, all links will be set
down and no usable links remain in link groups. In such situation
smcr_buf_map_usable_links() should return an error and stop the
CLC flow accessing to uninitialized mr.
Fixes: b9247544c1 ("net/smc: convert static link ID instances to support multiple links")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Link: https://lore.kernel.org/r/1663656189-32090-1-git-send-email-guwen@linux.alibaba.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We can't use ct->lock, this is already used by the seqadj internals.
When using ftp helper + nat, seqadj will attempt to acquire ct->lock
again.
Revert back to a global lock for now.
Fixes: c783a29c7e ("netfilter: nf_ct_ftp: prefer skb_linearize")
Reported-by: Bruno de Paula Larini <bruno.larini@riosoft.com.br>
Signed-off-by: Florian Westphal <fw@strlen.de>
The bug fix was incomplete, it "replaced" crash with a memory leak.
The old code had an assignment to "ret" embedded into the conditional,
restore this.
Fixes: 7997eff828 ("netfilter: ebtables: reject blobs that don't provide all entry points")
Reported-and-tested-by: syzbot+a24c5252f3e3ab733464@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
It seems to me that percpu memory for chain stats started leaking since
commit 3bc158f8d0 ("netfilter: nf_tables: map basechain priority to
hardware priority") when nft_chain_offload_priority() returned an error.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: 3bc158f8d0 ("netfilter: nf_tables: map basechain priority to hardware priority")
Signed-off-by: Florian Westphal <fw@strlen.de>
syzbot is reporting underflow of nft_counters_enabled counter at
nf_tables_addchain() [1], for commit 43eb8949cf ("netfilter:
nf_tables: do not leave chain stats enabled on error") missed that
nf_tables_chain_destroy() after nft_basechain_init() in the error path of
nf_tables_addchain() decrements the counter because nft_basechain_init()
makes nft_is_base_chain() return true by setting NFT_CHAIN_BASE flag.
Increment the counter immediately after returning from
nft_basechain_init().
Link: https://syzkaller.appspot.com/bug?extid=b5d82a651b71cd8a75ab [1]
Reported-by: syzbot <syzbot+b5d82a651b71cd8a75ab@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+b5d82a651b71cd8a75ab@syzkaller.appspotmail.com>
Fixes: 43eb8949cf ("netfilter: nf_tables: do not leave chain stats enabled on error")
Signed-off-by: Florian Westphal <fw@strlen.de>
taprio can only operate as root qdisc, and to that end, there exists the
following check in taprio_init(), just as in mqprio:
if (sch->parent != TC_H_ROOT)
return -EOPNOTSUPP;
And indeed, when we try to attach taprio to an mqprio child, it fails as
expected:
$ tc qdisc add dev swp0 root handle 1: mqprio num_tc 8 \
map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
$ tc qdisc replace dev swp0 parent 1:2 taprio num_tc 8 \
map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
base-time 0 sched-entry S 0x7f 990000 sched-entry S 0x80 100000 \
flags 0x0 clockid CLOCK_TAI
Error: sch_taprio: Can only be attached as root qdisc.
(extack message added by me)
But when we try to attach a taprio child to a taprio root qdisc,
surprisingly it doesn't fail:
$ tc qdisc replace dev swp0 root handle 1: taprio num_tc 8 \
map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
base-time 0 sched-entry S 0x7f 990000 sched-entry S 0x80 100000 \
flags 0x0 clockid CLOCK_TAI
$ tc qdisc replace dev swp0 parent 1:2 taprio num_tc 8 \
map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
base-time 0 sched-entry S 0x7f 990000 sched-entry S 0x80 100000 \
flags 0x0 clockid CLOCK_TAI
This is because tc_modify_qdisc() behaves differently when mqprio is
root, vs when taprio is root.
In the mqprio case, it finds the parent qdisc through
p = qdisc_lookup(dev, TC_H_MAJ(clid)), and then the child qdisc through
q = qdisc_leaf(p, clid). This leaf qdisc q has handle 0, so it is
ignored according to the comment right below ("It may be default qdisc,
ignore it"). As a result, tc_modify_qdisc() goes through the
qdisc_create() code path, and this gives taprio_init() a chance to check
for sch_parent != TC_H_ROOT and error out.
Whereas in the taprio case, the returned q = qdisc_leaf(p, clid) is
different. It is not the default qdisc created for each netdev queue
(both taprio and mqprio call qdisc_create_dflt() and keep them in
a private q->qdiscs[], or priv->qdiscs[], respectively). Instead, taprio
makes qdisc_leaf() return the _root_ qdisc, aka itself.
When taprio does that, tc_modify_qdisc() goes through the qdisc_change()
code path, because the qdisc layer never finds out about the child qdisc
of the root. And through the ->change() ops, taprio has no reason to
check whether its parent is root or not, just through ->init(), which is
not called.
The problem is the taprio_leaf() implementation. Even though code wise,
it does the exact same thing as mqprio_leaf() which it is copied from,
it works with different input data. This is because mqprio does not
attach itself (the root) to each device TX queue, but one of the default
qdiscs from its private array.
In fact, since commit 13511704f8 ("net: taprio offload: enforce qdisc
to netdev queue mapping"), taprio does this too, but just for the full
offload case. So if we tried to attach a taprio child to a fully
offloaded taprio root qdisc, it would properly fail too; just not to a
software root taprio.
To fix the problem, stop looking at the Qdisc that's attached to the TX
queue, and instead, always return the default qdiscs that we've
allocated (and to which we privately enqueue and dequeue, in software
scheduling mode).
Since Qdisc_class_ops :: leaf is only called from tc_modify_qdisc(),
the risk of unforeseen side effects introduced by this change is
minimal.
Fixes: 5a781ccbd1 ("tc: Add support for configuring the taprio scheduler")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In an incredibly strange API design decision, qdisc->destroy() gets
called even if qdisc->init() never succeeded, not exclusively since
commit 87b60cfacf ("net_sched: fix error recovery at qdisc creation"),
but apparently also earlier (in the case of qdisc_create_dflt()).
The taprio qdisc does not fully acknowledge this when it attempts full
offload, because it starts off with q->flags = TAPRIO_FLAGS_INVALID in
taprio_init(), then it replaces q->flags with TCA_TAPRIO_ATTR_FLAGS
parsed from netlink (in taprio_change(), tail called from taprio_init()).
But in taprio_destroy(), we call taprio_disable_offload(), and this
determines what to do based on FULL_OFFLOAD_IS_ENABLED(q->flags).
But looking at the implementation of FULL_OFFLOAD_IS_ENABLED()
(a bitwise check of bit 1 in q->flags), it is invalid to call this macro
on q->flags when it contains TAPRIO_FLAGS_INVALID, because that is set
to U32_MAX, and therefore FULL_OFFLOAD_IS_ENABLED() will return true on
an invalid set of flags.
As a result, it is possible to crash the kernel if user space forces an
error between setting q->flags = TAPRIO_FLAGS_INVALID, and the calling
of taprio_enable_offload(). This is because drivers do not expect the
offload to be disabled when it was never enabled.
The error that we force here is to attach taprio as a non-root qdisc,
but instead as child of an mqprio root qdisc:
$ tc qdisc add dev swp0 root handle 1: \
mqprio num_tc 8 map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
$ tc qdisc replace dev swp0 parent 1:1 \
taprio num_tc 8 map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 base-time 0 \
sched-entry S 0x7f 990000 sched-entry S 0x80 100000 \
flags 0x0 clockid CLOCK_TAI
Unable to handle kernel paging request at virtual address fffffffffffffff8
[fffffffffffffff8] pgd=0000000000000000, p4d=0000000000000000
Internal error: Oops: 96000004 [#1] PREEMPT SMP
Call trace:
taprio_dump+0x27c/0x310
vsc9959_port_setup_tc+0x1f4/0x460
felix_port_setup_tc+0x24/0x3c
dsa_slave_setup_tc+0x54/0x27c
taprio_disable_offload.isra.0+0x58/0xe0
taprio_destroy+0x80/0x104
qdisc_create+0x240/0x470
tc_modify_qdisc+0x1fc/0x6b0
rtnetlink_rcv_msg+0x12c/0x390
netlink_rcv_skb+0x5c/0x130
rtnetlink_rcv+0x1c/0x2c
Fix this by keeping track of the operations we made, and undo the
offload only if we actually did it.
I've added "bool offloaded" inside a 4 byte hole between "int clockid"
and "atomic64_t picos_per_byte". Now the first cache line looks like
below:
$ pahole -C taprio_sched net/sched/sch_taprio.o
struct taprio_sched {
struct Qdisc * * qdiscs; /* 0 8 */
struct Qdisc * root; /* 8 8 */
u32 flags; /* 16 4 */
enum tk_offsets tk_offset; /* 20 4 */
int clockid; /* 24 4 */
bool offloaded; /* 28 1 */
/* XXX 3 bytes hole, try to pack */
atomic64_t picos_per_byte; /* 32 0 */
/* XXX 8 bytes hole, try to pack */
spinlock_t current_entry_lock; /* 40 0 */
/* XXX 8 bytes hole, try to pack */
struct sched_entry * current_entry; /* 48 8 */
struct sched_gate_list * oper_sched; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
Fixes: 9c66d15646 ("taprio: Add support for hardware offloading")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The global 'raw_v6_hashinfo' variable can be accessed even when IPv6 is
administratively disabled via the 'ipv6.disable=1' kernel command line
option, leading to a crash [1].
Fix by restoring the original behavior and always initializing the
variable, regardless of IPv6 support being administratively disabled or
not.
[1]
BUG: unable to handle page fault for address: ffffffffffffffc8
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 173e18067 P4D 173e18067 PUD 173e1a067 PMD 0
Oops: 0000 [#1] PREEMPT SMP KASAN
CPU: 3 PID: 271 Comm: ss Not tainted 6.0.0-rc4-custom-00136-g0727a9a5fbc1 #1396
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
RIP: 0010:raw_diag_dump+0x310/0x7f0
[...]
Call Trace:
<TASK>
__inet_diag_dump+0x10f/0x2e0
netlink_dump+0x575/0xfd0
__netlink_dump_start+0x67b/0x940
inet_diag_handler_cmd+0x273/0x2d0
sock_diag_rcv_msg+0x317/0x440
netlink_rcv_skb+0x15e/0x430
sock_diag_rcv+0x2b/0x40
netlink_unicast+0x53b/0x800
netlink_sendmsg+0x945/0xe60
____sys_sendmsg+0x747/0x960
___sys_sendmsg+0x13a/0x1e0
__sys_sendmsg+0x118/0x1e0
do_syscall_64+0x34/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Fixes: 0daf07e527 ("raw: convert raw sockets to RCU")
Reported-by: Roberto Ricci <rroberto2r@gmail.com>
Tested-by: Roberto Ricci <rroberto2r@gmail.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220916084821.229287-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzbot is still complaining uninit-value in tcp_recvmsg(), for
commit 1228b34c8d ("net: clear msg_get_inq in __sys_recvfrom() and
__copy_msghdr_from_user()") missed that __get_compat_msghdr() is called
instead of copy_msghdr_from_user() when MSG_CMSG_COMPAT is specified.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: 1228b34c8d ("net: clear msg_get_inq in __sys_recvfrom() and __copy_msghdr_from_user()")
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/d06d0f7f-696c-83b4-b2d5-70b5f2730a37@I-love.SAKURA.ne.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These functions expect to be called from RCU read-side critical section,
but this only happens when invoked from the data path via
ip{,6}_mr_input(). They can also be invoked from process context in
response to user space adding a multicast route which resolves a cache
entry with queued packets [1][2].
Fix by adding missing rcu_read_lock() / rcu_read_unlock() in these call
paths.
[1]
WARNING: suspicious RCU usage
6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387 Not tainted
-----------------------------
net/ipv4/ipmr.c:84 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by smcrouted/246:
#0: ffffffff862389b0 (rtnl_mutex){+.+.}-{3:3}, at: ip_mroute_setsockopt+0x11c/0x1420
stack backtrace:
CPU: 0 PID: 246 Comm: smcrouted Not tainted 6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x91/0xb9
vif_dev_read+0xbf/0xd0
ipmr_queue_xmit+0x135/0x1ab0
ip_mr_forward+0xe7b/0x13d0
ipmr_mfc_add+0x1a06/0x2ad0
ip_mroute_setsockopt+0x5c1/0x1420
do_ip_setsockopt+0x23d/0x37f0
ip_setsockopt+0x56/0x80
raw_setsockopt+0x219/0x290
__sys_setsockopt+0x236/0x4d0
__x64_sys_setsockopt+0xbe/0x160
do_syscall_64+0x34/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
[2]
WARNING: suspicious RCU usage
6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387 Not tainted
-----------------------------
net/ipv6/ip6mr.c:69 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by smcrouted/246:
#0: ffffffff862389b0 (rtnl_mutex){+.+.}-{3:3}, at: ip6_mroute_setsockopt+0x6b9/0x2630
stack backtrace:
CPU: 1 PID: 246 Comm: smcrouted Not tainted 6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x91/0xb9
vif_dev_read+0xbf/0xd0
ip6mr_forward2.isra.0+0xc9/0x1160
ip6_mr_forward+0xef0/0x13f0
ip6mr_mfc_add+0x1ff2/0x31f0
ip6_mroute_setsockopt+0x1825/0x2630
do_ipv6_setsockopt+0x462/0x4440
ipv6_setsockopt+0x105/0x140
rawv6_setsockopt+0xd8/0x690
__sys_setsockopt+0x236/0x4d0
__x64_sys_setsockopt+0xbe/0x160
do_syscall_64+0x34/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: ebc3197963 ("ipmr: add rcu protection over (struct vif_device)->dev")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Before we switched to ->read_skb(), ->read_sock() was passed with
desc.count=1, which technically indicates we only read one skb per
->sk_data_ready() call. However, for TCP, this is not true.
TCP at least has sk_rcvlowat which intentionally holds skb's in
receive queue until this watermark is reached. This means when
->sk_data_ready() is invoked there could be multiple skb's in the
queue, therefore we have to read multiple skbs in tcp_read_skb()
instead of one.
Fixes: 965b57b469 ("net: Introduce a new proto_ops ->read_skb()")
Reported-by: Peilin Ye <peilin.ye@bytedance.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/r/20220912173553.235838-1-xiyou.wangcong@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
- Fix hang up with small MTU hard-interface, by Shigeru Yoshida
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAmMknmcWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoVFaEACJbgSC/QcTD1D3Ga1jGPdeb6SK
Ik4RGnOIgEWWnsyxigLxWPDHi8mSFEwTGjqiGdmFbBSt0QCOYW6Y1HXxdW6rjqTm
bY1nHLcE+iFdUbZNREux9j/vpEnFNO0hjqfAWIT7W6Q/cdvIUoLIKjtoAdXXLq9j
lq0ScOfW+rT9ibmrWajoGw88LD+tGgZMHSZpzWmtpnqSNmxBcQw/GMGaXkeQwncO
HTXJneGBqs9QQPrCEVanqkg3LPR2lnCic/Y2TpCPGPjGUvxsCg/OBqTt1XjbeB70
xkc2nXoUiGtl0oWkP4xETSqv0l5Mrgs9ZJH1UhLEfnCj4Y0wUHN4XcweYMaE0TaR
gyISkFBZUokZMXs495G+vwgFgMEfydMkPBqSIKEKbUKt/XUKhWqEGchMz0HbrWvc
S/cAmqmzLxNsHkl7jbE6KRW/jRTXPkXJTlaf4Pit6CrN0fam2RvJd7B27APMt6qg
FQpqRhlY21LqNADBGl46NMDnLrewi5OL0NoigcKRmXEWG/VF8duLbBklWGe1zEov
zmpeYupXF/vHiVl6ZKWzF4U8ZBzGosBrcej2tuTuHhyDIAFgSGTqCP5bl5CErRf/
buOH9/NBo8xGyIAZkD33wUwAHf7L74cpTgHdAcnhwnhV/YewL/S6IML1azC50yFy
7tZhKTzTIxpqT+cf5Q==
=w5ul
-----END PGP SIGNATURE-----
Merge tag 'batadv-net-pullrequest-20220916' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here is a batman-adv bugfix:
- Fix hang up with small MTU hard-interface, by Shigeru Yoshida
* tag 'batadv-net-pullrequest-20220916' of git://git.open-mesh.org/linux-merge:
batman-adv: Fix hang up with small MTU hard-interface
====================
Link: https://lore.kernel.org/r/20220916160931.1412407-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Prevent tcp_read_skb() from flooding the syslog.
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is uninit value bug in dgram_sendmsg function in
net/ieee802154/socket.c when the length of valid data pointed by the
msg->msg_name isn't verified.
We introducing a helper function ieee802154_sockaddr_check_size to
check namelen. First we check there is addr_type in ieee802154_addr_sa.
Then, we check namelen according to addr_type.
Also fixed in raw_bind, dgram_bind, dgram_connect.
Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__flow_hash_consistentify() wrongly swaps ipv4 addresses in few cases.
This function is indirectly used by __skb_get_hash_symmetric(), which is
used to fanout packets in AF_PACKET.
Intrusion detection systems may be impacted by this issue.
__flow_hash_consistentify() computes the addresses difference then swaps
them if the difference is negative. In few cases src - dst and dst - src
are both negative.
The following snippet mimics __flow_hash_consistentify():
```
#include <stdio.h>
#include <stdint.h>
int main(int argc, char** argv) {
int diffs_d, diffd_s;
uint32_t dst = 0xb225a8c0; /* 178.37.168.192 --> 192.168.37.178 */
uint32_t src = 0x3225a8c0; /* 50.37.168.192 --> 192.168.37.50 */
uint32_t dst2 = 0x3325a8c0; /* 51.37.168.192 --> 192.168.37.51 */
diffs_d = src - dst;
diffd_s = dst - src;
printf("src:%08x dst:%08x, diff(s-d)=%d(0x%x) diff(d-s)=%d(0x%x)\n",
src, dst, diffs_d, diffs_d, diffd_s, diffd_s);
diffs_d = src - dst2;
diffd_s = dst2 - src;
printf("src:%08x dst:%08x, diff(s-d)=%d(0x%x) diff(d-s)=%d(0x%x)\n",
src, dst2, diffs_d, diffs_d, diffd_s, diffd_s);
return 0;
}
```
Results:
src:3225a8c0 dst:b225a8c0, \
diff(s-d)=-2147483648(0x80000000) \
diff(d-s)=-2147483648(0x80000000)
src:3225a8c0 dst:3325a8c0, \
diff(s-d)=-16777216(0xff000000) \
diff(d-s)=16777216(0x1000000)
In the first case the addresses differences are always < 0, therefore
__flow_hash_consistentify() always swaps, thus dst->src and src->dst
packets have differents hashes.
Fixes: c3f8324188 ("net: Add full IPv6 addresses to flow_keys")
Signed-off-by: Ludovic Cintrat <ludovic.cintrat@gatewatcher.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westhal says:
====================
netfilter: bugfixes for net
The following set contains four netfilter patches for your *net* tree.
When there are multiple Contact headers in a SIP message its possible
the next headers won't be found because the SIP helper confuses relative
and absolute offsets in the message. From Igor Ryzhov.
Make the nft_concat_range self-test support socat, this makes the
selftest pass on my test VM, from myself.
nf_conntrack_irc helper can be tricked into opening a local port forward
that the client never requested by embedding a DCC message in a PING
request sent to the client. Fix from David Leadbeater.
Both have been broken since the kernel 2.6.x days.
The 'osf' match might indicate success while it could not find
anything, broken since 5.2 . Fix from Pablo Neira.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit efe57fd58e.
The assumption that it is impossible to return an ERR pointer from
rpc_run_task() no longer holds due to commit 25cf32ad5d ("SUNRPC:
Handle allocation failure in rpc_new_task()").
Fixes: 25cf32ad5d ('SUNRPC: Handle allocation failure in rpc_new_task()')
Fixes: efe57fd58e ('SUNRPC: Remove unreachable error condition')
Signed-off-by: Dan Aloni <dan.aloni@vastdata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cong Wang noticed that the previous fix for sch_sfb accessing the queued
skb after enqueueing it to a child qdisc was incomplete: the SFB enqueue
function was also calling qdisc_qstats_backlog_inc() after enqueue, which
reads the pkt len from the skb cb field. Fix this by also storing the skb
len, and using the stored value to increment the backlog after enqueueing.
Fixes: 9efd23297c ("sch_sfb: Don't assume the skb is still around after enqueueing to child")
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/r/20220905192137.965549-1-toke@toke.dk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
As Eric reported, the 'reason' field is not presented when trace the
kfree_skb event by perf:
$ perf record -e skb:kfree_skb -a sleep 10
$ perf script
ip_defrag 14605 [021] 221.614303: skb:kfree_skb:
skbaddr=0xffff9d2851242700 protocol=34525 location=0xffffffffa39346b1
reason:
The cause seems to be passing kernel address directly to TP_printk(),
which is not right. As the enum 'skb_drop_reason' is not exported to
user space through TRACE_DEFINE_ENUM(), perf can't get the drop reason
string from the 'reason' field, which is a number.
Therefore, we introduce the macro DEFINE_DROP_REASON(), which is used
to define the trace enum by TRACE_DEFINE_ENUM(). With the help of
DEFINE_DROP_REASON(), now we can remove the auto-generate that we
introduced in the commit ec43908dd5
("net: skb: use auto-generation to convert skb drop reason to string"),
and define the string array 'drop_reasons'.
Hmmmm...now we come back to the situation that have to maintain drop
reasons in both enum skb_drop_reason and DEFINE_DROP_REASON. But they
are both in dropreason.h, which makes it easier.
After this commit, now the format of kfree_skb is like this:
$ cat /tracing/events/skb/kfree_skb/format
name: kfree_skb
ID: 1524
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:void * skbaddr; offset:8; size:8; signed:0;
field:void * location; offset:16; size:8; signed:0;
field:unsigned short protocol; offset:24; size:2; signed:0;
field:enum skb_drop_reason reason; offset:28; size:4; signed:0;
print fmt: "skbaddr=%p protocol=%u location=%p reason: %s", REC->skbaddr, REC->protocol, REC->location, __print_symbolic(REC->reason, { 1, "NOT_SPECIFIED" }, { 2, "NO_SOCKET" } ......
Fixes: ec43908dd5 ("net: skb: use auto-generation to convert skb drop reason to string")
Link: https://lore.kernel.org/netdev/CANn89i+bx0ybvE55iMYf5GJM48WwV1HNpdm9Q6t-HaEstqpCSA@mail.gmail.com/
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nf_osf_find() incorrectly returns true on mismatch, this leads to
copying uninitialized memory area in nft_osf which can be used to leak
stale kernel stack data to userspace.
Fixes: 22c7652cda ("netfilter: nft_osf: Add version option support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
CTCP messages should only be at the start of an IRC message, not
anywhere within it.
While the helper only decodes packes in the ORIGINAL direction, its
possible to make a client send a CTCP message back by empedding one into
a PING request. As-is, thats enough to make the helper believe that it
saw a CTCP message.
Fixes: 869f37d8e4 ("[NETFILTER]: nf_conntrack/nf_nat: add IRC helper port")
Signed-off-by: David Leadbeater <dgl@dgl.cx>
Signed-off-by: Florian Westphal <fw@strlen.de>
ct_sip_next_header and ct_sip_get_header return an absolute
value of matchoff, not a shift from current dataoff.
So dataoff should be assigned matchoff, not incremented by it.
This issue can be seen in the scenario when there are multiple
Contact headers and the first one is using a hostname and other headers
use IP addresses. In this case, ct_sip_walk_headers will work as follows:
The first ct_sip_get_header call to will find the first Contact header
but will return -1 as the header uses a hostname. But matchoff will
be changed to the offset of this header. After that, dataoff should be
set to matchoff, so that the next ct_sip_get_header call find the next
Contact header. But instead of assigning dataoff to matchoff, it is
incremented by it, which is not correct, as matchoff is an absolute
value of the offset. So on the next call to the ct_sip_get_header,
dataoff will be incorrect, and the next Contact header may not be
found at all.
Fixes: 05e3ced297 ("[NETFILTER]: nf_conntrack_sip: introduce SIP-URI parsing helper")
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Fix a bug reported and analyzed by Nagaraj Arankal, where the handling
of a spurious non-SACK RTO could cause a connection to fail to clear
retrans_stamp, causing a later RTO to very prematurely time out the
connection with ETIMEDOUT.
Here is the buggy scenario, expanding upon Nagaraj Arankal's excellent
report:
(*1) Send one data packet on a non-SACK connection
(*2) Because no ACK packet is received, the packet is retransmitted
and we enter CA_Loss; but this retransmission is spurious.
(*3) The ACK for the original data is received. The transmitted packet
is acknowledged. The TCP timestamp is before the retrans_stamp,
so tcp_may_undo() returns true, and tcp_try_undo_loss() returns
true without changing state to Open (because tcp_is_sack() is
false), and tcp_process_loss() returns without calling
tcp_try_undo_recovery(). Normally after undoing a CA_Loss
episode, tcp_fastretrans_alert() would see that the connection
has returned to CA_Open and fall through and call
tcp_try_to_open(), which would set retrans_stamp to 0. However,
for non-SACK connections we hold the connection in CA_Loss, so do
not fall through to call tcp_try_to_open() and do not set
retrans_stamp to 0. So retrans_stamp is (erroneously) still
non-zero.
At this point the first "retransmission event" has passed and
been recovered from. Any future retransmission is a completely
new "event". However, retrans_stamp is erroneously still
set. (And we are still in CA_Loss, which is correct.)
(*4) After 16 minutes (to correspond with tcp_retries2=15), a new data
packet is sent. Note: No data is transmitted between (*3) and
(*4) and we disabled keep alives.
The socket's timeout SHOULD be calculated from this point in
time, but instead it's calculated from the prior "event" 16
minutes ago (step (*2)).
(*5) Because no ACK packet is received, the packet is retransmitted.
(*6) At the time of the 2nd retransmission, the socket returns
ETIMEDOUT, prematurely, because retrans_stamp is (erroneously)
too far in the past (set at the time of (*2)).
This commit fixes this bug by ensuring that we reuse in
tcp_try_undo_loss() the same careful logic for non-SACK connections
that we have in tcp_try_undo_recovery(). To avoid duplicating logic,
we factor out that logic into a new
tcp_is_non_sack_preventing_reopen() helper and call that helper from
both undo functions.
Fixes: da34ac7626 ("tcp: only undo on partial ACKs in CA_Loss")
Reported-by: Nagaraj Arankal <nagaraj.p.arankal@hpe.com>
Link: https://lore.kernel.org/all/SJ0PR84MB1847BE6C24D274C46A1B9B0EB27A9@SJ0PR84MB1847.NAMPRD84.PROD.OUTLOOK.COM/
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220903121023.866900-1-ncardwell.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The SRv6 layer allows defining HMAC data that can later be used to sign IPv6
Segment Routing Headers. This configuration is realised via netlink through
four attributes: SEG6_ATTR_HMACKEYID, SEG6_ATTR_SECRET, SEG6_ATTR_SECRETLEN and
SEG6_ATTR_ALGID. Because the SECRETLEN attribute is decoupled from the actual
length of the SECRET attribute, it is possible to provide invalid combinations
(e.g., secret = "", secretlen = 64). This case is not checked in the code and
with an appropriately crafted netlink message, an out-of-bounds read of up
to 64 bytes (max secret length) can occur past the skb end pointer and into
skb_shared_info:
Breakpoint 1, seg6_genl_sethmac (skb=<optimized out>, info=<optimized out>) at net/ipv6/seg6.c:208
208 memcpy(hinfo->secret, secret, slen);
(gdb) bt
#0 seg6_genl_sethmac (skb=<optimized out>, info=<optimized out>) at net/ipv6/seg6.c:208
#1 0xffffffff81e012e9 in genl_family_rcv_msg_doit (skb=skb@entry=0xffff88800b1f9f00, nlh=nlh@entry=0xffff88800b1b7600,
extack=extack@entry=0xffffc90000ba7af0, ops=ops@entry=0xffffc90000ba7a80, hdrlen=4, net=0xffffffff84237580 <init_net>, family=<optimized out>,
family=<optimized out>) at net/netlink/genetlink.c:731
#2 0xffffffff81e01435 in genl_family_rcv_msg (extack=0xffffc90000ba7af0, nlh=0xffff88800b1b7600, skb=0xffff88800b1f9f00,
family=0xffffffff82fef6c0 <seg6_genl_family>) at net/netlink/genetlink.c:775
#3 genl_rcv_msg (skb=0xffff88800b1f9f00, nlh=0xffff88800b1b7600, extack=0xffffc90000ba7af0) at net/netlink/genetlink.c:792
#4 0xffffffff81dfffc3 in netlink_rcv_skb (skb=skb@entry=0xffff88800b1f9f00, cb=cb@entry=0xffffffff81e01350 <genl_rcv_msg>)
at net/netlink/af_netlink.c:2501
#5 0xffffffff81e00919 in genl_rcv (skb=0xffff88800b1f9f00) at net/netlink/genetlink.c:803
#6 0xffffffff81dff6ae in netlink_unicast_kernel (ssk=0xffff888010eec800, skb=0xffff88800b1f9f00, sk=0xffff888004aed000)
at net/netlink/af_netlink.c:1319
#7 netlink_unicast (ssk=ssk@entry=0xffff888010eec800, skb=skb@entry=0xffff88800b1f9f00, portid=portid@entry=0, nonblock=<optimized out>)
at net/netlink/af_netlink.c:1345
#8 0xffffffff81dff9a4 in netlink_sendmsg (sock=<optimized out>, msg=0xffffc90000ba7e48, len=<optimized out>) at net/netlink/af_netlink.c:1921
...
(gdb) p/x ((struct sk_buff *)0xffff88800b1f9f00)->head + ((struct sk_buff *)0xffff88800b1f9f00)->end
$1 = 0xffff88800b1b76c0
(gdb) p/x secret
$2 = 0xffff88800b1b76c0
(gdb) p slen
$3 = 64 '@'
The OOB data can then be read back from userspace by dumping HMAC state. This
commit fixes this by ensuring SECRETLEN cannot exceed the actual length of
SECRET.
Reported-by: Lucas Leong <wmliang.tw@gmail.com>
Tested: verified that EINVAL is correctly returned when secretlen > len(secret)
Fixes: 4f4853dc1c ("ipv6: sr: implement API to control SR HMAC structure")
Signed-off-by: David Lebrun <dlebrun@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a link is enslave to bond, it need to set the interface down first.
This makes the slave remove mac multicast address 33:33:00:00:00:01(The
IPv6 multicast address ff02::1 is kept even when the interface down). When
bond set the slave up, ipv6_mc_up() was not called due to commit c2edacf80e
("bonding / ipv6: no addrconf for slaves separately from master").
This is not an issue before we adding the lladdr target feature for bonding,
as the mac multicast address will be added back when bond interface up and
join group ff02::1.
But after adding lladdr target feature for bonding. When user set a lladdr
target, the unsolicited NA message with all-nodes multicast dest will be
dropped as the slave interface never add 33:33:00:00:00:01 back.
Fix this by calling ipv6_mc_up() to add 33:33:00:00:00:01 back when
the slave interface up.
Reported-by: LiLiang <liali@redhat.com>
Fixes: 5e1eeef69c ("bonding: NS target should accept link local address")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- fix DMA from stack in wilc1000 driver
- fix crash on chip reset failure in mt7921e
- fix for the reported warning on aggregation timer expiry
- check packet lengths in hwsim virtio paths
- fix compiler warnings/errors with AAD construction by
using struct_group
- fix Intel 4965 driver rate scale operation
- release channel contexts correctly in mac80211 mlme code
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmMTaioACgkQB8qZga/f
l8SazBAAk3xa7jhX3SxkD8hV9hH7exVGZjwK6v5qfHBF6I5XT9WpOLzCUqAoBjF3
8uAD6oQqhh9eccctaMTtjIA9IiJTdcy+tBa3WUpHh5ZKyqm1dVQEX2HEao6T9p1A
UYRboiorAXth1VybNSfofPWLKUuqOPJXwDsbdgVkDw4/YV1cJ/oNvmQqL1sw/TWY
S3vlMBE7IYFRjzD1z00EAjJRsWAprahS9wDU6Iz3eATK7Ec+QmW8EhHvRSbDGaG3
2jFj3H3JUWjzgjBzmuaq4aDvY3Y0wywCZ/4aMZj0TIqKaTZiXv0jFrYQG+NWsPX2
vQdCMLqTRQoZfY7Gbj4trL0VlallM5kcMLG1LcvTZsF0psnIqras77KecSnpa7HB
8MAd5cMfMhLZsU8duWy19WQ3vrSM4Y+5lbVUWClRtn8yruyYdXTvbvuNmLcnSVe/
2HAvIXK8PdGNBEIRoGj+h3AVHSssmVUOA53sM0uRjCshjZvjXgAlYbUkXBQ05Z+t
mbx4bFKrICLgDcnNqfygYL3Q5c2njmpSvFjdLYX8NdlwK0ASUaXF1YxvHNQgDPu9
soKj6++d7/Hu4bDb8YxFD8CUDHIj2LCoIsWR814gHnTksDpypdBM3K+mzj4jnq4i
NW1CqPR3Yhprthn4AxkU7Dq+Hz+YCFWYgMGw7K52lNH7z8Vzn+4=
=GyC3
-----END PGP SIGNATURE-----
Merge tag 'wireless-2022-09-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes berg says:
====================
We have a handful of fixes:
- fix DMA from stack in wilc1000 driver
- fix crash on chip reset failure in mt7921e
- fix for the reported warning on aggregation timer expiry
- check packet lengths in hwsim virtio paths
- fix compiler warnings/errors with AAD construction by
using struct_group
- fix Intel 4965 driver rate scale operation
- release channel contexts correctly in mac80211 mlme code
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We sometimes copy all the addresses from the 802.11 header
for the AAD, which may cause complaints from fortify checks.
Use struct_group() to avoid the compiler warnings/errors.
Change-Id: Ic3ea389105e7813b22095b295079eecdabde5045
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If we hit an authentication or association timeout, we only
release the chanctx for the deflink, and the other link(s)
are released later by ieee80211_vif_set_links(), but we're
not locking this correctly.
Fix the locking here while releasing the channels and links.
Change-Id: I9e08c1a5434592bdc75253c1abfa6c788f9f39b1
Fixes: 81151ce462 ("wifi: mac80211: support MLO authentication/association with one link")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In the prep_channel error case we didn't release the deflink
channel leaving it to be left around. Fix that.
Change-Id: If0dfd748125ec46a31fc6045a480dc28e03723d2
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The rx data link pointer isn't set from the RX aggregation timer,
resulting in a later warning. Fix that by setting it to the first
valid link for now, with a FIXME to worry about statistics later,
it's not very important since it's just the timeout case.
Reported-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/498d714c-76be-9d04-26db-a1206878de5e@redhat.com
Fixes: 56057da456 ("wifi: mac80211: rx: track link in RX data")
Signed-off-by: Mukesh Sisodiya <mukesh.sisodiya@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>