In idtentry_vc(), vc_switch_off_ist() determines a safe stack to
switch to, off of the IST stack. Annotate the new stack switch with
ENCODE_FRAME_POINTER in case UNWINDER_FRAME_POINTER is used.
A stack walk before looks like this:
CPU: 0 PID: 0 Comm: swapper Not tainted 5.18.0-rc7+ #2
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
<TASK>
dump_stack_lvl
dump_stack
kernel_exc_vmm_communication
asm_exc_vmm_communication
? native_read_msr
? __x2apic_disable.part.0
? x2apic_setup
? cpu_init
? trap_init
? start_kernel
? x86_64_start_reservations
? x86_64_start_kernel
? secondary_startup_64_no_verify
</TASK>
and with the fix, the stack dump is exact:
CPU: 0 PID: 0 Comm: swapper Not tainted 5.18.0-rc7+ #3
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
<TASK>
dump_stack_lvl
dump_stack
kernel_exc_vmm_communication
asm_exc_vmm_communication
RIP: 0010:native_read_msr
Code: ...
< snipped regs >
? __x2apic_disable.part.0
x2apic_setup
cpu_init
trap_init
start_kernel
x86_64_start_reservations
x86_64_start_kernel
secondary_startup_64_no_verify
</TASK>
[ bp: Test in a SEV-ES guest and rewrite the commit message to
explain what exactly this does. ]
Fixes: a13644f3a5 ("x86/entry/64: Add entry code for #VC handler")
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220316041612.71357-1-jiangshanlai@gmail.com
Add support for using longer timeouts during controller initialization
and letting the controller come up with namespaces that are not ready
for I/O yet. We skip these not ready namespaces during scanning and
only bring them online once anoter scan is kicked off by the AEN that
is set when the NRDY bit gets set in the I/O Command Set Independent
Identify Namespace Data Structure. This asynchronous probing avoids
blocking the kernel boot when controllers take a very long time to
recover after unclean shutdowns (up to minutes).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Descriptor table is a shared resource; two fget() on the same descriptor
may return different struct file references. get_tap_ptr_ring() is
called after we'd found (and pinned) the socket we'll be using and it
tries to find the private tun/tap data structures associated with it.
Redoing the lookup by the same file descriptor we'd used to get the
socket is racy - we need to same struct file.
Thanks to Jason for spotting a braino in the original variant of patch -
I'd missed the use of fd == -1 for disabling backend, and in that case
we can end up with sock == NULL and sock != oldsock.
Cc: stable@kernel.org
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The current code evaluates RQT size based on the configured number of
virtqueues. This can raise an issue in the following scenario:
Assume MQ was negotiated.
1. mlx5_vdpa_set_map() gets called.
2. handle_ctrl_mq() is called setting cur_num_vqs to some value, lower
than the configured max VQs.
3. A second set_map gets called, but now a smaller number of VQs is used
to evaluate the size of the RQT.
4. handle_ctrl_mq() is called with a value larger than what the RQT can
hold. This will emit errors and the driver state is compromised.
To fix this, we use a new field in struct mlx5_vdpa_net to hold the
required number of entries in the RQT. This value is evaluated in
mlx5_vdpa_set_driver_features() where we have the negotiated features
all set up.
In addition to that, we take into consideration the max capability of RQT
entries early when the device is added so we don't need to take consider
it when creating the RQT.
Last, we remove the use of mlx5_vdpa_max_qps() which just returns the
max_vas / 2 and make the code clearer.
Fixes: 52893733f2 ("vdpa/mlx5: Add multiqueue support")
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
A collection of last-minute HD- an USB-audio quirks in addition to
a fix for the legacy ISA wavefront driver.
All look small and easy.
-----BEGIN PGP SIGNATURE-----
iQJCBAABCAAsFiEEIXTw5fNLNI7mMiVaLtJE4w1nLE8FAmKDXUMOHHRpd2FpQHN1
c2UuZGUACgkQLtJE4w1nLE+PRw/+Pmfan+rymr86zTZ7kVmCzeVMB3r/1W6dPsMk
rBYXOpdBrok7ovpgCvjx0WDLCCics/gSZCXg7RVzY/KypKrthPBpiUshe6on82zR
Bmv9UMDKELuMeUPBHSCXxFDlTD0nzWk5Oz+WQN3hujM8HKBhT6v2jyuiGLMzBYkC
pjnauvEVf9uNZq33oHyKtTaI43OIkrReG4q7GCpDpqMS4rKlt6hI5BGtUXpTHQsC
CaR1d6XQcfjGr1peCcz0YZddxqhVOwujDCSyNSAted39HolsQUcIXVITsKYB4WXl
rrLvUy+zxf8bKYO4bGfS04sXP9t7x0BOWUWeX0ApO97cQ1v95j4d363YFMDw3VE1
/PIQxt/F+cq5cnhhl0goDTTN6pVA7eWcIck3XKmpqbduUGmWhhTm4WSuTeceK/7s
Q0gWrX6vecLAFJjj2SBU4Se5rehDwoHbQ1xA4Pp2C1FgQf/47zg09K30UYEC3+iO
Io8ohJZonQEf3+Y+NjFy72gaJ8mxMTA/kWCUIAm+DTr/4zGo/uC7I/5i0PbmeqVb
OW26pfMYFcQ0e8MrqolQ9B0EBF+7/JEv/89w7F9CU/aF1VklWqUu0fWbXpHd9W0v
JzhDzByc5NuUrdlfvwZl40Xd8Mf+LsnTO0nbdB+KqYQFUErmWYZTdzf+8zaC0Vka
/6hpbMY=
=9TJZ
-----END PGP SIGNATURE-----
Merge tag 'sound-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of last-minute HD- an USB-audio quirks in addition to a
fix for the legacy ISA wavefront driver.
All look small and easy"
* tag 'sound-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: usb-audio: Restore Rane SL-1 quirk
ALSA: hda/realtek: fix right sounds and mute/micmute LEDs for HP machine
ALSA: hda/realtek: Add quirk for TongFang devices with pop noise
ALSA: hda/realtek: Add quirk for the Framework Laptop
ALSA: wavefront: Proper check of get_user() error
ALSA: hda/realtek: Add quirk for Dell Latitude 7520
ALSA: hda - fix unused Realtek function when PM is not enabled
ALSA: usb-audio: Don't get sample rate for MCT Trigger 5 USB-to-HDMI
Either userspace or kernelspace need to pre-fetch keys inconditionally
before comparisons for this to work. Otherwise, register tracking data
is misleading and it might result in reducing expressions which are not
yet registers.
First expression is also guaranteed to be evaluated always, however,
certain expressions break before writing data to registers, before
comparing the data, leaving the register in undetermined state.
This patch disables this infrastructure by now.
Fixes: b2d306542f ("netfilter: nf_tables: do not reduce read-only expressions")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes sporadic IPv6 packet loss when flow offloading is enabled.
IPv6 route GC and flowtable GC are not synchronized.
When dst_cache becomes stale and a packet passes through the flow before
the flowtable GC teardowns it, the packet can be dropped.
So, it is necessary to check dst every time in packet path.
Fixes: 227e1e4d0d ("netfilter: nf_flowtable: skip device lookup from interface index")
Signed-off-by: Ritaro Takenaka <ritarot634@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch addresses three possible problems:
1. ct gc may race to undo the timeout adjustment of the packet path, leaving
the conntrack entry in place with the internal offload timeout (one day).
2. ct gc removes the ct because the IPS_OFFLOAD_BIT is not set and the CLOSE
timeout is reached before the flow offload del.
3. tcp ct is always set to ESTABLISHED with a very long timeout
in flow offload teardown/delete even though the state might be already
CLOSED. Also as a remark we cannot assume that the FIN or RST packet
is hitting flow table teardown as the packet might get bumped to the
slow path in nftables.
This patch resets IPS_OFFLOAD_BIT from flow_offload_teardown(), so
conntrack handles the tcp rst/fin packet which triggers the CLOSE/FIN
state transition.
Moreover, teturn the connection's ownership to conntrack upon teardown
by clearing the offload flag and fixing the established timeout value.
The flow table GC thread will asynchonrnously free the flow table and
hardware offload entries.
Before this patch, the IPS_OFFLOAD_BIT remained set for expired flows on
which is also misleading since the flow is back to classic conntrack
path.
If nf_ct_delete() removes the entry from the conntrack table, then it
calls nf_ct_put() which decrements the refcnt. This is not a problem
because the flowtable holds a reference to the conntrack object from
flow_offload_alloc() path which is released via flow_offload_free().
This patch also updates nft_flow_offload to skip packets in SYN_RECV
state. Since we might miss or bump packets to slow path, we do not know
what will happen there while we are still in SYN_RECV, this patch
postpones offload up to the next packet which also aligns to the
existing behaviour in tc-ct.
flow_offload_teardown() does not reset the existing tcp state from
flow_offload_fixup_tcp() to ESTABLISHED anymore, packets bump to slow
path might have already update the state to CLOSE/FIN.
Joint work with Oz and Sven.
Fixes: 1e5b2471bc ("netfilter: nf_flow_table: teardown flow timeout race")
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If CONFIG_ARM64_SVE is not set:
arch/arm64/kernel/fpsimd.c:294:13: warning: ‘sve_free’ defined but not used [-Wunused-function]
Fix this by moving sve_free() and __sve_free() into the existing section
protected by "#ifdef CONFIG_ARM64_SVE", now the last user outside that
section has been removed.
Fixes: a1259dd807 ("arm64/sve: Delay freeing memory in fpsimd_flush_thread()")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/cd633284683c24cb9469f8ff429915aedf67f868.1652798894.git.geert+renesas@glider.be
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
The convention for indentation seems to be a single tab. Help text is
further indented by an additional two whitespaces. Fix the lines that
violate these rules.
While add it, add trailing comments to endif and endmenu statements for
better readability.
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Link: https://lore.kernel.org/r/20220517141648.331976-2-juergh@canonical.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
The AST2600 when using the i210 NIC over NC-SI has been observed to
produce incorrect checksum results with specific MTU values. This was
first observed when sending data across a long distance set of networks.
On a local network, the following test was performed using a 1MB file of
random data.
On the receiver run this script:
#!/bin/bash
while [ 1 ]; do
# Zero the stats
nstat -r > /dev/null
nc -l 9899 > test-file
# Check for checksum errors
TcpInCsumErrors=$(nstat | grep TcpInCsumErrors)
if [ -z "$TcpInCsumErrors" ]; then
echo No TcpInCsumErrors
else
echo TcpInCsumErrors = $TcpInCsumErrors
fi
done
On an AST2600 system:
# nc <IP of receiver host> 9899 < test-file
The test was repeated with various MTU values:
# ip link set mtu 1410 dev eth0
The observed results:
1500 - good
1434 - bad
1400 - good
1410 - bad
1420 - good
The test was repeated after disabling tx checksumming:
# ethtool -K eth0 tx-checksumming off
And all MTU values tested resulted in transfers without error.
An issue with the driver cannot be ruled out, however there has been no
bug discovered so far.
David has done the work to take the original bug report of slow data
transfer between long distance connections and triaged it down to this
test case.
The vendor suspects this this is a hardware issue when using NC-SI. The
fixes line refers to the patch that introduced AST2600 support.
Reported-by: David Wilder <wilder@us.ibm.com>
Reviewed-by: Dylan Hung <dylan_hung@aspeedtech.com>
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
igb_read_phy_reg() will silently return, leaving phy_data untouched, if
hw->ops.read_reg isn't set. Depending on the uninitialized value of
phy_data, this led to the phy status check either succeeding immediately
or looping continuously for 2 seconds before emitting a noisy err-level
timeout. This message went out to the console even though there was no
actual problem.
Instead, first check if there is read_reg function pointer. If not,
proceed without trying to check the phy status register.
Fixes: b72f3f7200 ("igb: When GbE link up, wait for Remote receiver status condition")
Signed-off-by: Kevin Mitchell <kevmitch@arista.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When removing the pn533 device (i2c or USB), there is a logic error. The
original code first cancels the worker (flush_delayed_work) and then
destroys the workqueue (destroy_workqueue), leaving the timer the last
one to be deleted (del_timer). This result in a possible race condition
in a multi-core preempt-able kernel. That is, if the cleanup
(pn53x_common_clean) is concurrently run with the timer handler
(pn533_listen_mode_timer), the timer can queue the poll_work to the
already destroyed workqueue, causing use-after-free.
This patch reorder the cleanup: it uses the del_timer_sync to make sure
the handler is finished before the routine will destroy the workqueue.
Note that the timer cannot be activated by the worker again.
static void pn533_wq_poll(struct work_struct *work)
...
rc = pn533_send_poll_frame(dev);
if (rc)
return;
if (cur_mod->len == 0 && dev->poll_mod_count > 1)
mod_timer(&dev->listen_timer, ...);
That is, the mod_timer can be called only when pn533_send_poll_frame()
returns no error, which is impossible because the device is detaching
and the lower driver should return ENODEV code.
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
- tighten the PCI presence check (Stefan Roese):
- fix a potential NULL pointer dereference in an error path
(Kyle Miller Smith)
- fix interpretation of the DMRSL field (Tom Yan)
- relax the data transfer alignment (Keith Busch)
- verbose error logging improvements (Max Gurtovoy, Chaitanya Kulkarni)
- misc cleanups (Chaitanya Kulkarni, me)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmKEle4LHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYMUfw/8CyIcKdN5uUtMAekDhFEnskivxthtWj1PcRXSyA3M
fkfc+Jc5lOALTZ2d4zVHoDEB+8IoojfSwNuCtftCVPTv0dEnPlZ8pQ8qKe9Mj245
I/TswqlV2euiuwkP8RorLcUjPJWRGAOinCgyfgdCTBjjQIO40OsgPFk/COYpVBvJ
aWa5PXPk7JT1b+6VR27tr2lmGnqppS8Uz03fMouqvcSCGPjN0IKEGuYrpvjvJVs0
YOdIEPc/GnMUmlg3W6H1YkC8bCiCS2/YtPqXHBLlmQqDswgqgqWT8ycrV1PK3Z7K
RWwumuXiX0bgSiv6DWz3j8lVgOp79B59nAFyvnRFAWaziIE1ByCp/iVQQrZsxJcs
MjeJrUmtu+jlKN80ZBZJHDBCnT0XePrF2u7YL9j4jJC6ORlQsjKOI3BYt9dIWiYQ
PAj58bmsCu7I0w89aqKd3YB3hBRLRT70v0XIK/nbdbAQTFYQ73yi42Kds7oHw2xO
Vg7115yh1K3fiawOI7l2DyAQ4yQ0n42NOt2HkMfmRRQyWMeLLuv3GK/WAs5qpEuB
dRijsEvLFUU4uPRSR6W6upgA8Oqgx+71wHVRpYxg7k3TDn+HWOQ3YRiRtZJ0xR7W
HCtHTrDTn5AuMBrKMtTpG0VqTo5CIOS8jFAz3yvbL/6EBBWmAAj6+Jw7QD12dtLc
t50=
=w9PF
-----END PGP SIGNATURE-----
Merge tag 'nvme-5.19-2022-05-18' of git://git.infradead.org/nvme into for-5.19/drivers
Pull NVMe updates from Christoph:
"nvme updates for Linux 5.19
- tighten the PCI presence check (Stefan Roese):
- fix a potential NULL pointer dereference in an error path
(Kyle Miller Smith)
- fix interpretation of the DMRSL field (Tom Yan)
- relax the data transfer alignment (Keith Busch)
- verbose error logging improvements (Max Gurtovoy, Chaitanya Kulkarni)
- misc cleanups (Chaitanya Kulkarni, me)"
* tag 'nvme-5.19-2022-05-18' of git://git.infradead.org/nvme:
nvme: split the enum used for various register constants
nvme-fabrics: add a request timeout helper
nvme-pci: harden drive presence detect in nvme_dev_disable()
nvme-pci: fix a NULL pointer dereference in nvme_alloc_admin_tags
nvme: mark internal passthru request RQF_QUIET
nvme: remove unneeded include from constants file
nvme: add missing status values to verbose logging
nvme: set dma alignment to dword
nvme: fix interpretation of DMRSL
POLL* are unannotated values for the userspace ABI, while everything
in-kernel should use EPOLL* and the __poll_t type.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_file_get_normal isn't marked inline, so don't claim it as such in the
forward declaration.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ERR_PTR abuses the high bits of a pointer to transport error information.
This is only safe for kernel pointers and not user pointers. Fix
io_buffer_select and its helpers to just return NULL for failure and get
rid of this abuse.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Provided buffers allow an application to supply io_uring with buffers
that can then be grabbed for a read/receive request, when the data
source is ready to deliver data. The existing scheme relies on using
IORING_OP_PROVIDE_BUFFERS to do that, but it can be difficult to use
in real world applications. It's pretty efficient if the application
is able to supply back batches of provided buffers when they have been
consumed and the application is ready to recycle them, but if
fragmentation occurs in the buffer space, it can become difficult to
supply enough buffers at the time. This hurts efficiency.
Add a register op, IORING_REGISTER_PBUF_RING, which allows an application
to setup a shared queue for each buffer group of provided buffers. The
application can then supply buffers simply by adding them to this ring,
and the kernel can consume then just as easily. The ring shares the head
with the application, the tail remains private in the kernel.
Provided buffers setup with IORING_REGISTER_PBUF_RING cannot use
IORING_OP_{PROVIDE,REMOVE}_BUFFERS for adding or removing entries to the
ring, they must use the mapped ring. Mapped provided buffer rings can
co-exist with normal provided buffers, just not within the same group ID.
To gauge overhead of the existing scheme and evaluate the mapped ring
approach, a simple NOP benchmark was written. It uses a ring of 128
entries, and submits/completes 32 at the time. 'Replenish' is how
many buffers are provided back at the time after they have been
consumed:
Test Replenish NOPs/sec
================================================================
No provided buffers NA ~30M
Provided buffers 32 ~16M
Provided buffers 1 ~10M
Ring buffers 32 ~27M
Ring buffers 1 ~27M
The ring mapped buffers perform almost as well as not using provided
buffers at all, and they don't care if you provided 1 or more back at
the same time. This means application can just replenish as they go,
rather than need to batch and compact, further reducing overhead in the
application. The NOP benchmark above doesn't need to do any compaction,
so that overhead isn't even reflected in the above test.
Co-developed-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Abstract this out from io_sqe_buffer_register() so we can use it
elsewhere too without duplicating this code.
No intended functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Obviously not really useful since it's not transferring data, but it
is helpful in benchmarking overhead of provided buffers.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_provided_buffer_select() must drop the submit lock, if needed, even
in the error handling case. Failure to do so will leave us with the
ctx->uring_lock held, causing spew like:
====================================
WARNING: iou-wrk-366/368 still has locks held!
5.18.0-rc6-00294-gdf8dc7004331 #994 Not tainted
------------------------------------
1 lock held by iou-wrk-366/368:
#0: ffff0000c72598a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_ring_submit_lock+0x20/0x48
stack backtrace:
CPU: 4 PID: 368 Comm: iou-wrk-366 Not tainted 5.18.0-rc6-00294-gdf8dc7004331 #994
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace.part.0+0xa4/0xd4
show_stack+0x14/0x5c
dump_stack_lvl+0x88/0xb0
dump_stack+0x14/0x2c
debug_check_no_locks_held+0x84/0x90
try_to_freeze.isra.0+0x18/0x44
get_signal+0x94/0x6ec
io_wqe_worker+0x1d8/0x2b4
ret_from_fork+0x10/0x20
and triggering later hangs off get_signal() because we attempt to
re-grab the lock.
Reported-by: syzbot+987d7bb19195ae45208c@syzkaller.appspotmail.com
Fixes: 149c69b04a ("io_uring: abstract out provided buffer list selection")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mat Martineau says:
====================
mptcp: Fix checksum byte order on little-endian
These patches address a bug in the byte ordering of MPTCP checksums on
little-endian architectures. The __sum16 type is always big endian, but
was being cast to u16 and then byte-swapped (on little-endian archs)
when reading/writing the checksum field in MPTCP option headers.
MPTCP checksums are off by default, but are enabled if one or both peers
request it in the SYN/SYNACK handshake.
The corrected code is verified to interoperate between big-endian and
little-endian machines.
Patch 1 fixes the checksum byte order, patch 2 partially mitigates
interoperation with peers sending bad checksums by falling back to TCP
instead of resetting the connection.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 8684 section 3.7 describes several opportunities for a MPTCP
connection to "fall back" to regular TCP early in the connection
process, before it has been confirmed that MPTCP options can be
successfully propagated on all SYN, SYN/ACK, and data packets. If a peer
acknowledges the first received data packet with a regular TCP header
(no MPTCP options), fallback is allowed.
If the recipient of that first data packet finds a MPTCP DSS checksum
error, this provides an opportunity to fail gracefully with a TCP
fallback rather than resetting the connection (as might happen if a
checksum failure were detected later).
This commit modifies the checksum failure code to attempt fallback on
the initial subflow of a MPTCP connection, only if it's a failure in the
first data mapping. In cases where the peer initiates the connection,
requests checksums, is the first to send data, and the peer is sending
incorrect checksums (see
https://github.com/multipath-tcp/mptcp_net-next/issues/275), this allows
the connection to proceed as TCP rather than reset.
Fixes: dd8bcd1768 ("mptcp: validate the data checksum")
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MPTCP code typecasts the checksum value to u16 and
then converts it to big endian while storing the value into
the MPTCP option.
As a result, the wire encoding for little endian host is
wrong, and that causes interoperabilty interoperability
issues with other implementation or host with different endianness.
Address the issue writing in the packet the unmodified __sum16 value.
MPTCP checksum is disabled by default, interoperating with systems
with bad mptcp-level csum encoding should cause fallback to TCP.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/275
Fixes: c5b39e26d0 ("mptcp: send out checksum for DSS")
Fixes: 390b95a5fb ("mptcp: receive checksum for DSS")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2022-05-17
This series contains updates to ice driver only.
Arkadiusz prevents writing of timestamps when rings are being
configured to resolve null pointer dereference.
Paul changes a delayed call to baseline statistics to occur immediately
which was causing misreporting of statistics due to the delay.
Michal fixes incorrect restoration of interrupt moderation settings.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2022-05-18
1) Fix "disable_policy" flag use when arriving from different devices.
From Eyal Birger.
2) Fix error handling of pfkey_broadcast in function pfkey_process.
From Jiasheng Jiang.
3) Check the encryption module availability consistency in pfkey.
From Thomas Bartschies.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Get rid of old 31 bit leftovers within ipl code:
- convert everything to pc relative code
- use 64 bit addressing mode as early as possible
- use 64 bit arithmetics wherever possible
This way the code doesn't look as odd as before anymore.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
In Thumb2, 'b . + 4' produces a branch instruction that uses a narrow
encoding, and so it does not jump to the following instruction as
expected. So use W(b) instead.
Fixes: 6c7cb60bff ("ARM: fix Thumb2 regression with Spectre BHB")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
The Spectre-BHB mitigations were inadvertently left disabled for
Cortex-A15, due to the fact that cpu_v7_bugs_init() is not called in
that case. So fix that.
Fixes: b9baf5c8c5 ("ARM: Spectre-BHB workaround")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmKEjE0ACgkQSD+KveBX
+j4alQgAuKCGEmc7blX5Xwg2qDaRc7/WlLLXyrg5IBI3Pmqi/GrKqGfT7iNnX6xD
ADR3k64mSoA941Il18xyPOztdRzp9N0JQ2r1sTcKLx+4DrMSpnQERnDb8pHjLsqw
7l2BvkH/PjEAdwByi4rpt3X3BBahkPOW33QhEj3yIOvBY6EQnAtGDqVY4ql4vG7+
4dgcycNN7MA7ylxgVlwn66EumHGZTKnuugYwmK4eFDxgyaQYBbHfe/yOj4N8rdUF
pV0Z0sAdIvg+UXP0HKFVv08dINXQLFU5aCYVK/09SHrInP+ctW3uD+3ykFSxhQGZ
JOhyJJ6dZk0vmXGikeve36WMOCcRVQ==
=rT0G
-----END PGP SIGNATURE-----
Merge tag 'mlx5-fixes-2022-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2022-05-17
This series provides bug fixes to mlx5 driver.
Please pull and let me know if there is any problem.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the recent introduction supporting the SM3 and SM4 hash algos for IPsec, the kernel
produces invalid pfkey acquire messages, when these encryption modules are disabled. This
happens because the availability of the algos wasn't checked in all necessary functions.
This patch adds these checks.
Signed-off-by: Thomas Bartschies <thomas.bartschies@cvk.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
If skb_clone() returns null pointer, pfkey_broadcast() will
return error.
Therefore, it should be better to check the return value of
pfkey_broadcast() and return error if fails.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
That way percpu_ref_exit() is safe after failing percpu_ref_init().
At least one user (cgroup_create()) had a double-free that way;
there might be other similar bugs. Easier to fix in percpu_ref_init(),
rather than playing whack-a-mole in sloppy users...
Usual symptoms look like a messed refcounting in one of subsystems
that use percpu allocations (might be percpu-refcount, might be
something else). Having refcounts for two different objects share
memory is Not Nice(tm)...
Reported-by: syzbot+5b1e53987f858500ec00@syzkaller.appspotmail.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In case fw sync reset is called in parallel to device removal, device
might stuck in the following deadlock:
CPU 0 CPU 1
----- -----
remove_one
uninit_one (locks intf_state_mutex)
mlx5_sync_reset_now_event()
work in fw_reset->wq.
mlx5_enter_error_state()
mutex_lock (intf_state_mutex)
cleanup_once
fw_reset_cleanup()
destroy_workqueue(fw_reset->wq)
Drain the fw_reset WQ, and make sure no new work is being queued, before
entering uninit_one().
The Drain is done before devlink_unregister() since fw_reset, in some
flows, is using devlink API devlink_remote_reload_actions_performed().
Fixes: 38b9f903f2 ("net/mlx5: Handle sync reset request event")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Cited patch sets flow_source to ANY overriding the provided spec
flow_source, avoiding the optimization done by commit c9c079b4de
("net/mlx5: CT: Set flow source hint from provided tuple device").
To fix the above, set the dr_rule flow_source from provided flow spec.
Fixes: 3ee61ebb0d ("net/mlx5: CT: Add software steering ct flow steering provider")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
cited commit removed support for GRE tuples when software steering was enabled.
To bring back support for GRE tuples, add GRE ipv4/ipv6 matchers.
Fixes: 3ee61ebb0d ("net/mlx5: CT: Add software steering ct flow steering provider")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
We got reports of certain HW-GRO flows causing kernel call traces, which
might be related to firmware. To be on the safe side, disable the
feature for now and re-enable it once a driver/firmware fix is found.
Fixes: 83439f3c37 ("net/mlx5e: Add HW-GRO offload")
Signed-off-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
HW GRO is incompatible and mutually exclusive with XDP and XSK. However,
the needed checks are only made when enabling XDP. If HW GRO is enabled
when XDP is already active, the command will succeed, and XDP will be
skipped in the data path, although still enabled.
This commit fixes the bug by checking the XDP and XSK status in
mlx5e_fix_features and disabling HW GRO if XDP is enabled.
Fixes: 83439f3c37 ("net/mlx5e: Add HW-GRO offload")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
LRO is incompatible and mutually exclusive with XDP. However, the needed
checks are only made when enabling XDP. If LRO is enabled when XDP is
already active, the command will succeed, and XDP will be skipped in the
data path, although still enabled.
This commit fixes the bug by checking the XDP status in
mlx5e_fix_features and disabling LRO if XDP is enabled.
Fixes: 86994156c7 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When the driver is in switchdev mode and rx-gro-hw is set, the RQ needs
special CQE handling. Till then, block setting of rx-gro-hw feature in
switchdev mode, to avoid failure while setting the feature due to
failure while opening the RQ.
Fixes: f97d5c2a45 ("net/mlx5e: Add handle SHAMPO cqe support")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The body of mlx5e_napi_poll is wrapped into rcu_read_lock to be able to
read the XDP program pointer using rcu_dereference. However, the trap RQ
NAPI doesn't use rcu_read_lock, because the trap RQ works only in the
non-linear mode, and mlx5e_skb_from_cqe_nonlinear, until recently,
didn't support XDP and didn't call rcu_dereference.
Starting from the cited commit, mlx5e_skb_from_cqe_nonlinear supports
XDP and calls rcu_dereference, but mlx5e_trap_napi_poll doesn't wrap it
into rcu_read_lock. It leads to RCU-lockdep warnings like this:
WARNING: suspicious RCU usage
This commit fixes the issue by adding an rcu_read_lock to
mlx5e_trap_napi_poll, similarly to mlx5e_napi_poll.
Fixes: ea5d49bdae ("net/mlx5e: Add XDP multi buffer support to the non-linear legacy RQ")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When modifying TTL, packet's csum has to be recalculated.
Due to HW issue in ConnectX-5, csum recalculation for modify
TTL on RX is supported through a work-around that is specifically
enabled by configuration.
If the work-around isn't enabled, rather than adding an unsupported
action the modify TTL action on RX should be ignored.
Ignoring modify TTL action might result in zero actions, so in such
cases we will not convert the match STE to modify STE, as it is done
by FW in DMFS.
This patch fixes an issue where modify TTL action was ignored both
on RX and TX instead of only on RX.
Fixes: 4ff725e1d4 ("net/mlx5: DR, Ignore modify TTL if device doesn't support it")
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In order to support multiple destination FTEs with SW steering
FW table is created with single FTE with multiple actions and
SW steering rule forward to it. When creating this table, flow
source isn't set according to the original FTE.
Fix this by passing the original FTE flow source to the created
FW table.
Fixes: 34583beea4 ("net/mlx5: DR, Create multi-destination table for SW-steering use")
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In commit d72d827f2f, I used 'cpumask_t' incorrectly:
void iscsit_thread_get_cpumask(struct iscsi_conn *conn)
{
int ord, cpu;
cpumask_t conn_allowed_cpumask;
......
}
static ssize_t lio_target_wwn_cpus_allowed_list_store(
struct config_item *item, const char *page, size_t count)
{
int ret;
char *orig;
cpumask_t new_allowed_cpumask;
......
}
The correct pattern should be as follows:
cpumask_var_t mask;
if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
return -ENOMEM;
... use 'mask' here ...
free_cpumask_var(mask);
Link: https://lore.kernel.org/r/20220516054721.1548-1-mingzhe.zou@easystack.cn
Fixes: d72d827f2f ("scsi: target: Add iscsi/cpus_allowed_list in configfs")
Reported-by: Test Bot <zgrieee@gmail.com>
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>