In commit f84af32cbc ("net: ip_queue_rcv_skb() helper")
I dropped the skb dst in tcp_data_queue().
This only dealt with so-called TCP input slow path.
When fast path is taken, tcp_rcv_established() calls
tcp_queue_rcv() while skb still has a dst.
This was mostly fine, because most dsts at this point
are not refcounted (thanks to early demux)
However, TCP packets sent over loopback have refcounted dst.
Then commit 68822bdf76 ("net: generalize skb freeing
deferral to per-cpu lists") came and had the effect
of delaying skb freeing for an arbitrary time.
If during this time the involved netns is dismantled, cleanup_net()
frees the struct net with embedded net->ipv6.ip6_dst_ops.
Then when eventually dst_destroy_rcu() is called,
if (dst->ops->destroy) ... triggers an use-after-free.
It is not clear if ip6_route_net_exit() lacks a rcu_barrier()
as syzbot reported similar issues before the blamed commit.
( https://groups.google.com/g/syzkaller-bugs/c/CofzW4eeA9A/m/009WjumTAAAJ )
Fixes: 68822bdf76 ("net: generalize skb freeing deferral to per-cpu lists")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Walle says:
====================
net: lan966x: remove PHY reset support
Remove the unneeded PHY reset node as well as the driver support for it.
This was already discussed [1] and I expect Microchip to Ack on this
removal. Since there is no user, no breakage is expected.
I'm not sure it this should go through net or net-next and if the patches
should have a Fixes: tag or not. In upstream linux there was never any user
of it, so there is no bug to be fixed. But OTOH if the schema fix isn't
backported, then there might be an older schema version still containing
the reset node. Thoughts?
The patches needed for the GPIO part are just waiting to be picked up by
Linus [2,3]. This patch and the GPIO parts are the last pieces of the
puzzle to get ethernet working on the LAN9668 on upstream linux.
[1] https://lore.kernel.org/netdev/20220330110210.3374165-1-michael@walle.cc/
[2] https://lore.kernel.org/linux-gpio/CACRpkdbxmN+SWt95aGHjA2ZGnN61aWaA7c5S4PaG+WePAj=htg@mail.gmail.com/
[3] https://lore.kernel.org/linux-gpio/20220420191926.3411830-1-michael@walle.cc/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The PHY subsystem as well as the MIIM mdio driver (in case of the
integrated PHYs) will take care of the resets. A separate reset driver
isn't needed. There is no in-tree user of this feature. Remove the
support.
Signed-off-by: Michael Walle <michael@walle.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
The PHY reset was intended to be a phandle for a special PHY reset
driver for the integrated PHYs as well as any external PHYs. It turns
out, that the culprit is how the reset of the switch device is done.
In particular, the switch reset also affects other subsystems like
the GPIO and the SGPIO block and it happens to be the case that the
reset lines of the external PHYs are connected to a common GPIO line.
Thus as soon as the switch issues a reset during probe time, all the
external PHYs will go into reset because all the GPIO lines will
switch to input and the pull-down on that signal will take effect.
So even if there was a special PHY reset driver, it (1) won't fix
the root cause of the problem and (2) it won't fix all the other
consumers of GPIO lines which will also be reset.
It turns out, the Ocelot SoC has the same weird behavior (or the
lack of a dedicated switch reset) and there the problem is already
solved and all the bits and pieces are already there and this PHY
reset property isn't not needed at all.
There are no users of this binding. Just remove it.
Signed-off-by: Michael Walle <michael@walle.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pavel Begunkov says:
====================
generic net and ipv6 minor optimisations
1-3 inline simple functions that only reshuffle arguments possibly adding
extra zero args, and call another function. It was benchmarked before with
a bunch of extra patches, see for details
https://lore.kernel.org/netdev/cover.1648981570.git.asml.silence@gmail.com/
It may increase the binary size, but it's the right thing to do and at least
without modules it actually sheds some bytes for some standard-ish config.
text data bss dec hex filename
9627200 0 0 9627200 92e640 ./arch/x86_64/boot/bzImage
text data bss dec hex filename
9627104 0 0 9627104 92e5e0 ./arch/x86_64/boot/bzImage
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Throw neigh checks in ip6_finish_output2() under a single slow path if,
so we don't have the overhead in the hot path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two callers of __ip6_finish_output(), both are in
ip6_finish_output(). We can combine the call sites into one and handle
return code after, that will inline __ip6_finish_output().
Note, error handling under NET_XMIT_CN will only return 0 if
__ip6_finish_output() succeded, and in this case it return 0.
Considering that NET_XMIT_SUCCESS is 0, it'll be returning exactly the
same result for it as before.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Inline dev_queue_xmit() and dev_queue_xmit_accel(), they both are small
proxy functions doing nothing but redirecting the control flow to
__dev_queue_xmit().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_zerocopy_iter_dgram() is a small proxy function, inline it. For
that, move __zerocopy_sg_from_iter into linux/skbuff.h
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_alloc_send_skb() is simple and just proxying to another function,
so we can inline it and cut associated overhead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For now just use a CQE flag for this, with big CQE support we could
return the actual number of bytes left.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge net branch with the required patch for supporting the io_uring
feature that passes back whether we had more data in the socket or not.
* 'tcp-pass-back-data-left-in-socket-after-receive' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux:
tcp: pass back data left in socket after receive
* for-5.19/io_uring-socket: (73 commits)
io_uring: use the text representation of ops in trace
io_uring: rename op -> opcode
io_uring: add io_uring_get_opcode
io_uring: add type to op enum
io_uring: add socket(2) support
net: add __sys_socket_file()
io_uring: fix trace for reduced sqe padding
io_uring: add fgetxattr and getxattr support
io_uring: add fsetxattr and setxattr support
fs: split off do_getxattr from getxattr
fs: split off setxattr_copy and do_setxattr function from setxattr
io_uring: return an error when cqe is dropped
io_uring: use constants for cq_overflow bitfield
io_uring: rework io_uring_enter to simplify return value
io_uring: trace cqe overflows
io_uring: add trace support for CQE overflow
io_uring: allow re-poll if we made progress
io_uring: support MSG_WAITALL for IORING_OP_SEND(MSG)
io_uring: add support for IORING_ASYNC_CANCEL_ANY
io_uring: allow IORING_OP_ASYNC_CANCEL with 'fd' key
...
This is currently done for CMSG_INQ, add an ability to do so via struct
msghdr as well and have CMSG_INQ use that too. If the caller sets
msghdr->msg_get_inq, then we'll pass back the hint in msghdr->msg_inq.
Rearrange struct msghdr a bit so we can add this member while shrinking
it at the same time. On a 64-bit build, it was 96 bytes before this
change and 88 bytes afterwards.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/650c22ca-cffc-0255-9a05-2413a1e20826@kernel.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hangbin Liu says:
====================
selftests: net: add missing tests to Makefile
When generating the selftests to another folder, the fixed tests are
missing as they are not in Makefile. The missing tests are generated
by command:
$ for f in $(ls *.sh); do grep -q $f Makefile || echo $f; done
====================
Link: https://lore.kernel.org/r/20220428044511.227416-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When generating the selftests to another folder, the fixed tests are
missing as they are not in Makefile, e.g.
make -C tools/testing/selftests/ install \
TARGETS="net/forwarding" INSTALL_PATH=/tmp/kselftests
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When generating the selftests to another folder, the fixed tests are
missing as they are not in Makefile, e.g.
make -C tools/testing/selftests/ install \
TARGETS="net" INSTALL_PATH=/tmp/kselftests
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 7073ea8799.
We must not try to connect the socket while the transport is under
construction, because the mechanisms to safely tear it down are not in
place. As the code stands, we end up leaking the sockets on a connection
error.
Reported-by: wanghai (M) <wanghai38@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Mat Martineau says:
====================
mptcp: Path manager mode selection
MPTCP already has an in-kernel path manager (PM) to add and remove TCP
subflows associated with a given MPTCP connection. This in-kernel PM has
been designed to handle typical server-side use cases, but is not very
flexible or configurable for client devices that may have more
complicated policies to implement.
This patch series from the MPTCP tree is the first step toward adding a
generic-netlink-based API for MPTCP path management, which a privileged
userspace daemon will be able to use to control subflow
establishment. These patches add a per-namespace sysctl to select the
default PM type (in-kernel or userspace) for new MPTCP sockets. New
self-tests confirm expected behavior when userspace PM is selected but
there is no daemon available to handle existing MPTCP PM events.
Subsequent patch series (already staged in the MPTCP tree) will add the
generic netlink path management API.
====================
Link: https://lore.kernel.org/r/20220427225002.231996-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The new net.mptcp.pm_type sysctl determines which path manager will be
used by each newly-created MPTCP socket.
v2: Handle builds without CONFIG_SYSCTL
v3: Clarify logic for type-specific PM init (Geliang Tang and Paolo Abeni)
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Userspace-managed sockets should not have their subflows or
advertisements changed by the kernel path manager.
v3: Use helper function for PM mode (Paolo Abeni)
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a MPTCP connection is managed by a userspace PM, bypass the kernel
PM for incoming advertisements and subflow events. Netlink events are
still sent to userspace.
v2: Remove unneeded check in mptcp_pm_rm_addr_received() (Kishen Maloor)
v3: Add and use helper function for PM mode (Paolo Abeni)
Acked-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Kishen Maloor <kishen.maloor@intel.com>
Signed-off-by: Kishen Maloor <kishen.maloor@intel.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When adding support for netlink path management commands, the kernel
needs to know whether paths are being controlled by the in-kernel path
manager or a userspace PM.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A few members of the mptcp_pm_data struct were assigned to hard-coded
values in mptcp_pm_data_reset(), and then immediately changed in
mptcp_pm_nl_data_init().
Instead, flatten all the assignments in to mptcp_pm_data_reset().
v2: Resolve conflicts due to rename of mptcp_pm_data_reset()
v4: Resolve conflict in mptcp_pm_data_reset()
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The LAN8814 has a coma mode pin which puts the PHY into isolate and
power-dowm mode. Unfortunately, the mode cannot be disabled by a
register. Usually, the input pin has a pull-up and connected to a GPIO
which can then be used to disable the mode. Try to get the GPIO and
deassert it.
Signed-off-by: Michael Walle <michael@walle.cc>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Both lan8814_ptp_init() and lan8814_ptp_probe_once() are only used if
PTP and PHY timestamping is enabed. Up until now the probe function just
returns early, if they are not needed. But we need the
phy_package_init_once() functionality for the coma mode GPIO setup. Move
the check into the functions itself.
Signed-off-by: Michael Walle <michael@walle.cc>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The LAN8814 has a coma mode pin which is used to put the PHY into
isolate and power-down mode. Usually strapped to be asserted by default.
A GPIO is then used to take the PHY out of this mode.
Signed-off-by: Michael Walle <michael@walle.cc>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull ARM SoC fixes from Arnd Bergmann:
- A fix for a regression caused by the previous set of bugfixes
changing tegra and at91 pinctrl properties.
More work is needed to figure out what this should actually be, but a
revert makes it work for the moment.
- Defconfig regression fixes for tegra after renamed symbols
- Build-time warning and static checker fixes for imx, op-tee, sunxi,
meson, at91, and omap
- More at91 DT fixes for audio, regulator and spi nodes
- A regression fix for Renesas Hyperflash memory probe
- A stability fix for amlogic boards, modifying the allowed cpufreq
states
- Multiple fixes for system suspend on omap2+
- DT fixes for various i.MX bugs
- A probe error fix for imx6ull-colibri MMC
- A MAINTAINERS file entry for samsung bug reports
* tag 'soc-fixes-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (42 commits)
Revert "arm: dts: at91: Fix boolean properties with values"
bus: sunxi-rsb: Fix the return value of sunxi_rsb_device_create()
Revert "arm64: dts: tegra: Fix boolean properties with values"
arm64: dts: imx8mn-ddr4-evk: Describe the 32.768 kHz PMIC clock
ARM: dts: imx6ull-colibri: fix vqmmc regulator
MAINTAINERS: add Bug entry for Samsung and memory controller drivers
memory: renesas-rpc-if: Fix HF/OSPI data transfer in Manual Mode
ARM: dts: logicpd-som-lv: Fix wrong pinmuxing on OMAP35
ARM: dts: am3517-evm: Fix misc pinmuxing
ARM: dts: am33xx-l4: Add missing touchscreen clock properties
ARM: dts: Fix mmc order for omap3-gta04
ARM: dts: at91: fix pinctrl phandles
ARM: dts: at91: sama5d4_xplained: fix pinctrl phandle name
ARM: dts: at91: Describe regulators on at91sam9g20ek
ARM: dts: at91: Map MCLK for wm8731 on at91sam9g20ek
ARM: dts: at91: Fix boolean properties with values
ARM: dts: at91: use generic node name for dataflash
ARM: dts: at91: align SPI NOR node name with dtschema
ARM: dts: at91: sama7g5ek: Align the impedance of the QSPI0's HSIO and PCB lines
ARM: dts: at91: sama7g5ek: enable pull-up on flexcom3 console lines
...
Pull clk fixes from Stephen Boyd:
"A semi-large pile of clk driver fixes this time around.
Nothing is touching the core so these fixes are fairly well contained
to specific devices that use these clk drivers.
- Some Allwinner SoC fixes to gracefully handle errors and mark an
RTC clk as critical so that the RTC keeps ticking.
- Fix AXI bus clks and RTC clk design for Microchip PolarFire SoC
driver introduced this cycle. This has some devicetree bits acked
by riscv maintainers. We're fixing it now so that the prior
bindings aren't released in a major kernel version.
- Remove a reset on Microchip PolarFire SoCs that broke when enabling
CONFIG_PM.
- Set a min/max for the Qualcomm graphics clk. This got broken by the
clk rate range patches introduced this cycle"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: sunxi: sun9i-mmc: check return value after calling platform_get_resource()
clk: sunxi-ng: sun6i-rtc: Mark rtc-32k as critical
riscv: dts: microchip: reparent mpfs clocks
clk: microchip: mpfs: add RTCREF clock control
clk: microchip: mpfs: re-parent the configurable clocks
dt-bindings: rtc: add refclk to mpfs-rtc
dt-bindings: clk: mpfs: add defines for two new clocks
dt-bindings: clk: mpfs document msspll dri registers
riscv: dts: microchip: fix usage of fic clocks on mpfs
clk: microchip: mpfs: mark CLK_ATHENA as critical
clk: microchip: mpfs: fix parents for FIC clocks
clk: qcom: clk-rcg2: fix gfx3d frequency calculation
clk: microchip: mpfs: don't reset disabled peripherals
clk: sunxi-ng: fix not NULL terminated coccicheck error
Pull block fixes from Jens Axboe:
- Revert of a patch that caused timestamp issues (Tejun)
- iocost warning fix (Tejun)
- bfq warning fix (Jan)
* tag 'block-5.18-2022-04-29' of git://git.kernel.dk/linux-block:
bfq: Fix warning in bfqq_request_over_limit()
Revert "block: inherit request start time from bio for BLK_CGROUP"
iocost: don't reset the inuse weight of under-weighted debtors
Pull io_uring fixes from Jens Axboe:
"Pretty boring:
- three patches just adding reserved field checks (me, Eugene)
- Fixing a potential regression with IOPOLL caused by a block change
(Joseph)"
Boring is good.
* tag 'io_uring-5.18-2022-04-29' of git://git.kernel.dk/linux-block:
io_uring: check that data field is 0 in ringfd unregister
io_uring: fix uninitialized field in rw io_kiocb
io_uring: check reserved fields for recv/recvmsg
io_uring: check reserved fields for send/sendmsg
Pull random number generator fixes from Jason Donenfeld:
- Eric noticed that the memmove() in crng_fast_key_erasure() was bogus,
so this has been changed to a memcpy() and the confusing situation
clarified with a detailed comment.
- [Half]SipHash documentation updates from Bagas and Eric, after Eric
pointed out that the use of HalfSipHash in random.c made a bit of the
text potentially misleading.
* tag 'random-5.18-rc5-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
Documentation: siphash: disambiguate HalfSipHash algorithm from hsiphash functions
Documentation: siphash: enclose HalfSipHash usage example in the literal block
Documentation: siphash: convert danger note to warning for HalfSipHash
random: document crng_fast_key_erasure() destination possibility
sbi->s_firstinodezone is initialized to 2 and sbi->s_firstdatazone is read
from sbd. There's no guarantee that sbi->s_firstdatazone must bigger than
sbi->s_firstinodezone. If sbi->s_firstdatazone less than 2, the
filesystem can still be mounted unexpetly. At this point, sbi->s_ninodes
flip to very large value and this filesystem is broken. We can observe
this by executing 'df' command. When we execute, we will get an error
message:
"sysv_count_free_inodes: unable to read inode table"
Link: https://lkml.kernel.org/r/20220330104215.530223-1-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The task exit struct needs some crucial information to be able to provide
an enhanced version of process and thread accounting. This change
provides:
1. ac_tgid in additon to ac_pid
2. thread group execution walltime in ac_tgetime
3. flag AGROUP in ac_flag to indicate the last task
in a thread group / process
4. device ID and inode of task's /proc/self/exe in
ac_exe_dev and ac_exe_inode
5. tools/accounting/procacct as demonstrator
When a task exits, taskstats are reported to userspace including the
task's pid and ppid, but without the id of the thread group this task is
part of. Without the tgid, the stats of single tasks cannot be correlated
to each other as a thread group (process).
The taskstats documentation suggests that on process exit a data set
consisting of accumulated stats for the whole group is produced. But such
an additional set of stats is only produced for actually multithreaded
processes, not groups that had only one thread, and also those stats only
contain data about delay accounting and not the more basic information
about CPU and memory resource usage. Adding the AGROUP flag to be set
when the last task of a group exited enables determination of process end
also for single-threaded processes.
My applicaton basically does enhanced process accounting with summed
cputime, biggest maxrss, tasks per process. The data is not available
with the traditional BSD process accounting (which is not designed to be
extensible) and the taskstats interface allows more efficient on-the-fly
grouping and summing of the stats, anyway, without intermediate disk
writes.
Furthermore, I do carry statistics on which exact program binary is used
how often with associated resources, getting a picture on how important
which parts of a collection of installed scientific software in different
versions are, and how well they put load on the machine. This is enabled
by providing information on /proc/self/exe for each task. I assume the
two 64-bit fields for device ID and inode are more appropriate than the
possibly large resolved path to keep the data volume down.
Add the tgid to the stats to complete task identification, the flag AGROUP
to mark the last task of a group, the group wallclock time, and
inode-based identification of the associated executable file.
Add tools/accounting/procacct.c as a simplified fork of getdelays.c to
demonstrate process and thread accounting.
[thomas.orgis@uni-hamburg.de: fix version number in comment]
Link: https://lkml.kernel.org/r/20220405003601.7a5f6008@plasteblaster
Link: https://lkml.kernel.org/r/20220331004106.64e5616b@plasteblaster
Signed-off-by: Dr. Thomas Orgis <thomas.orgis@uni-hamburg.de>
Reviewed-by: Ismael Luceno <ismael@iodev.co.uk>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>