tcp_minisocks.c hasn't use any macro or function declared in mm.h, module.h,
slab.h, sysctl.h, workqueue.h, static_key.h and inet_common.h. Thus, these
files can be removed from tcp_minisocks.c safely without affecting the
compilation of the net module.
Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_fastopen.c hasn't use any macro or function declared in crypto.h, err.h,
init.h, list.h, rculist.h and inetpeer.h. Thus, these files can be removed
from tcp_fastopen.c safely without affecting the compilation of the net module.
Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
route.c hasn't use any macro or function declared in uaccess.h, types.h,
string.h, sockios.h, times.h, protocol.h, arp.h and l3mdev.h. Thus, these
files can be removed from route.c safely without affecting the compilation
of the net module.
Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code for handling active queue changes is identical
between mq and mqprio, reuse it.
Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It no need barrier when assigning a NULL value to an RCU protected
pointer. So use RCU_INIT_POINTER() instead for more fast.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
this patch fixes below Errors reported by checkpatch
ERROR: do not initialise globals to 0
+int cipso_v4_rbm_optfmt = 0;
Signed-off-by: wangzhitong <wangzhitong@uniontech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The key_domain member in struct net only exists if we define CONFIG_KEYS.
So we should add the define when we used key_domain.
Fixes: 9b24261051 ("keys: Network namespace domain tag")
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
This retrieves the address pairs of all subflows currently
active for a given mptcp connection.
It re-uses the same meta-header as for MPTCP_TCPINFO.
A new structure is provided to hold the subflow
address data:
struct mptcp_subflow_addrs {
union {
__kernel_sa_family_t sa_family;
struct sockaddr sa_local;
struct sockaddr_in sin_local;
struct sockaddr_in6 sin6_local;
struct sockaddr_storage ss_local;
};
union {
struct sockaddr sa_remote;
struct sockaddr_in sin_remote;
struct sockaddr_in6 sin6_remote;
struct sockaddr_storage ss_remote;
};
};
Usage of the new getsockopt is very similar to
MPTCP_TCPINFO one.
Userspace allocates a
'struct mptcp_subflow_data', followed by one or
more 'struct mptcp_subflow_addrs', then inits the
mptcp_subflow_data structure as follows:
struct mptcp_subflow_addrs *sf_addr;
struct mptcp_subflow_data *addr;
socklen_t olen = sizeof(*addr) + (8 * sizeof(*sf_addr));
addr = malloc(olen);
addr->size_subflow_data = sizeof(*addr);
addr->num_subflows = 0;
addr->size_kernel = 0;
addr->size_user = sizeof(struct mptcp_subflow_addrs);
sf_addr = (struct mptcp_subflow_addrs *)(addr + 1);
and then retrieves the endpoint addresses via:
ret = getsockopt(fd, SOL_MPTCP, MPTCP_SUBFLOW_ADDRS,
addr, &olen);
If the call succeeds, kernel will have added up to 8
endpoint addresses after the 'mptcp_subflow_data' header.
Userspace needs to re-check 'olen' value to detect how
many bytes have been filled in by the kernel.
Userspace can check addr->num_subflows to discover when
there were more subflows that available data space.
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow users to retrieve TCP_INFO data of all subflows.
Users need to pre-initialize a meta header that has to be
prepended to the data buffer that will be filled with the tcp info data.
The meta header looks like this:
struct mptcp_subflow_data {
__u32 size_subflow_data;/* size of this structure in userspace */
__u32 num_subflows; /* must be 0, set by kernel */
__u32 size_kernel; /* must be 0, set by kernel */
__u32 size_user; /* size of one element in data[] */
} __attribute__((aligned(8)));
size_subflow_data has to be set to 'sizeof(struct mptcp_subflow_data)'.
This allows to extend mptcp_subflow_data structure later on without
breaking backwards compatibility.
If the structure is extended later on, kernel knows where the
userspace-provided meta header ends, even if userspace uses an older
(smaller) version of the structure.
num_subflows must be set to 0. If the getsockopt request succeeds (return
value is 0), it will be updated to contain the number of active subflows
for the given logical connection.
size_kernel must be set to 0. If the getsockopt request is successful,
it will contain the size of the 'struct tcp_info' as known by the kernel.
This is informational only.
size_user must be set to 'sizeof(struct tcp_info)'.
This allows the kernel to only fill in the space reserved/expected by
userspace.
Example:
struct my_tcp_info {
struct mptcp_subflow_data d;
struct tcp_info ti[2];
};
struct my_tcp_info ti;
socklen_t olen;
memset(&ti, 0, sizeof(ti));
ti.d.size_subflow_data = sizeof(struct mptcp_subflow_data);
ti.d.size_user = sizeof(struct tcp_info);
olen = sizeof(ti);
ret = getsockopt(fd, SOL_MPTCP, MPTCP_TCPINFO, &ti, &olen);
if (ret < 0)
die_perror("getsockopt MPTCP_TCPINFO");
mptcp_subflow_data.num_subflows is populated with the number of
subflows that exist on the kernel side for the logical mptcp connection.
This allows userspace to re-try with a larger tcp_info array if the number
of subflows was larger than the available space in the ti[] array.
olen has to be set to the number of bytes that userspace has allocated to
receive the kernel data. It will be updated to contain the real number
bytes that have been copied to by the kernel.
In the above example, if the number if subflows was 1, olen is equal to
'sizeof(struct mptcp_subflow_data) + sizeof(struct tcp_info).
For 2 or more subflows olen is equal to 'sizeof(struct my_tcp_info)'.
If there was more data that could not be copied due to lack of space
in the option buffer, userspace can detect this by checking
mptcp_subflow_data->num_subflows.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Its not compatible with multipath-tcp.org kernel one.
1. The out-of-tree implementation defines a different 'struct mptcp_info',
with embedded __user addresses for additional data such as
endpoint addresses.
2. Mat Martineau points out that embedded __user addresses doesn't work
with BPF_CGROUP_RUN_PROG_GETSOCKOPT() which assumes that copying in
optsize bytes from optval provides all data that got copied to userspace.
This provides mptcp_info data for the given mptcp socket.
Userspace sets optlen to the size of the structure it expects.
The kernel updates it to contain the number of bytes that it copied.
This allows to append more information to the structure later.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Will be re-used from getsockopt path.
Since diag can be a module, we can't export the helper from diag, it
needs to be moved to core.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-09-17
We've added 63 non-merge commits during the last 12 day(s) which contain
a total of 65 files changed, 2653 insertions(+), 751 deletions(-).
The main changes are:
1) Streamline internal BPF program sections handling and
bpf_program__set_attach_target() in libbpf, from Andrii.
2) Add support for new btf kind BTF_KIND_TAG, from Yonghong.
3) Introduce bpf_get_branch_snapshot() to capture LBR, from Song.
4) IMUL optimization for x86-64 JIT, from Jie.
5) xsk selftest improvements, from Magnus.
6) Introduce legacy kprobe events support in libbpf, from Rafael.
7) Access hw timestamp through BPF's __sk_buff, from Vadim.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (63 commits)
selftests/bpf: Fix a few compiler warnings
libbpf: Constify all high-level program attach APIs
libbpf: Schedule open_opts.attach_prog_fd deprecation since v0.7
selftests/bpf: Switch fexit_bpf2bpf selftest to set_attach_target() API
libbpf: Allow skipping attach_func_name in bpf_program__set_attach_target()
libbpf: Deprecated bpf_object_open_opts.relaxed_core_relocs
selftests/bpf: Stop using relaxed_core_relocs which has no effect
libbpf: Use pre-setup sec_def in libbpf_find_attach_btf_id()
bpf: Update bpf_get_smp_processor_id() documentation
libbpf: Add sphinx code documentation comments
selftests/bpf: Skip btf_tag test if btf_tag attribute not supported
docs/bpf: Add documentation for BTF_KIND_TAG
selftests/bpf: Add a test with a bpf program with btf_tag attributes
selftests/bpf: Test BTF_KIND_TAG for deduplication
selftests/bpf: Add BTF_KIND_TAG unit tests
selftests/bpf: Change NAME_NTH/IS_NAME_NTH for BTF_KIND_TAG format
selftests/bpf: Test libbpf API function btf__add_tag()
bpftool: Add support for BTF_KIND_TAG
libbpf: Add support for BTF_KIND_TAG
libbpf: Rename btf_{hash,equal}_int to btf_{hash,equal}_int_tag
...
====================
Link: https://lore.kernel.org/r/20210917173738.3397064-1-ast@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Devlink core exported generously the functions calls that were used
by netdevsim tests or not used at all.
Delete such APIs with one exception - devlink_alloc_ns(). That function
should be spared from deleting because it is a special form of devlink_alloc()
needed for the netdevsim.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The RFC8998 specification defines the use of the ShangMi algorithm
cipher suites in TLS 1.3, and also supports the GCM/CCM mode using
the SM4 algorithm.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sometimes when unbinding the mv88e6xxx driver on Turris MOX, these error
messages appear:
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 1 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 0 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 100 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 1 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 0 from fdb: -2
(and similarly for other ports)
What happens is that DSA has a policy "even if there are bugs, let's at
least not leak memory" and dsa_port_teardown() clears the dp->fdbs and
dp->mdbs lists, which are supposed to be empty.
But deleting that cleanup code, the warnings go away.
=> the FDB and MDB lists (used for refcounting on shared ports, aka CPU
and DSA ports) will eventually be empty, but are not empty by the time
we tear down those ports. Aka we are deleting them too soon.
The addresses that DSA complains about are host-trapped addresses: the
local addresses of the ports, and the MAC address of the bridge device.
The problem is that offloading those entries happens from a deferred
work item scheduled by the SWITCHDEV_FDB_DEL_TO_DEVICE handler, and this
races with the teardown of the CPU and DSA ports where the refcounting
is kept.
In fact, not only it races, but fundamentally speaking, if we iterate
through the port list linearly, we might end up tearing down the shared
ports even before we delete a DSA user port which has a bridge upper.
So as it turns out, we need to first tear down the user ports (and the
unused ones, for no better place of doing that), then the shared ports
(the CPU and DSA ports). In between, we need to ensure that all work
items scheduled by our switchdev handlers (which only run for user
ports, hence the reason why we tear them down first) have finished.
Fixes: 161ca59d39 ("net: dsa: reference count the MDB entries at the cross-chip notifier level")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210914134726.2305133-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
DSA supports connecting to a phy-handle, and has a fallback to a non-OF
based method of connecting to an internal PHY on the switch's own MDIO
bus, if no phy-handle and no fixed-link nodes were present.
The -ENODEV error code from the first attempt (phylink_of_phy_connect)
is what triggers the second attempt (phylink_connect_phy).
However, when the first attempt returns a different error code than
-ENODEV, this results in an unbalance of calls to phylink_create and
phylink_destroy by the time we exit the function. The phylink instance
has leaked.
There are many other error codes that can be returned by
phylink_of_phy_connect. For example, phylink_validate returns -EINVAL.
So this is a practical issue too.
Fixes: aab9c4067d ("net: dsa: Plug in PHYLINK support")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20210914134331.2303380-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is no need in specific devlink_param_*publish(), because same
output can be achieved by using devlink_params_*publish() in correct
places.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
mq / mqprio make the default child qdiscs visible. They only do
so for the qdiscs which are within real_num_tx_queues when the
device is registered. Depending on order of calls in the driver,
or if user space changes config via ethtool -L the number of
qdiscs visible under tc qdisc show will differ from the number
of queues. This is confusing to users and potentially to system
configuration scripts which try to make sure qdiscs have the
right parameters.
Add a new Qdisc_ops callback and make relevant qdiscs TTRT.
Note that this uncovers the "shortcut" created by
commit 1f27cde313 ("net: sched: use pfifo_fast for non real queues")
The default child qdiscs beyond initial real_num_tx are always
pfifo_fast, no matter what the sysfs setting is. Fixing this
gets a little tricky because we'd need to keep a reference
on whatever the default qdisc was at the time of creation.
In practice this is likely an non-issue the qdiscs likely have
to be configured to non-default settings, so whatever user space
is doing such configuration can replace the pfifos... now that
it will see them.
Reported-by: Matthew Massey <matthewmassey@fb.com>
Reviewed-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This drops the code setting bit 9 on egress frames on the
Realtek "type A" (RTL8366RB) frames.
This bit was set on ingress frames for unknown reason,
and was set on egress frames as the format of ingress
and egress frames was believed to be the same. As that
assumption turned out to be false, and since this bit
seems to have zero effect on the behaviour of the switch
let's drop this bit entirely.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210913143156.1264570-1-linus.walleij@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The __alloc_frag_align() is short, and only called by two functions,
so inline page_frag_alloc_align() for reduce the overhead of calls.
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
It shouldn't happen, but can happen that readable eeprom size is smaller
than announced. Then we would be stuck in an endless loop here because
after reaching the actual end reads return eeprom.len = 0. I faced this
issue when making a mistake in driver development. Detect this scenario
and return an error.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit d7807a9adf.
As mentioned in https://lkml.org/lkml/2021/9/13/1819
5 years old commit 919483096b ("ipv4: fix memory leaks in ip_cmsg_send() callers")
was a correct fix.
ip_cmsg_send() can loop over multiple cmsghdr()
If IP_RETOPTS has been successful, but following cmsghdr generates an error,
we do not free ipc.ok
If IP_RETOPTS is not successful, we have freed the allocated temporary space,
not the one currently in ipc.opt.
Sure, code could be refactored, but let's not bring back old bugs.
Fixes: d7807a9adf ("Revert "ipv4: fix memory leaks in ip_cmsg_send() callers"")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 10d3be5692 ("tcp-tso: do not split TSO packets at retransmit
time") may directly retrans a multiple segments TSO/GSO packet without
split, Since this commit, we can no longer assume that a retransmitted
packet is a single segment.
This patch fixes the tp->undo_retrans accounting in tcp_sacktag_one()
that use the actual segments(pcount) of the retransmitted packet.
Before that commit (10d3be5692), the assumption underlying the
tp->undo_retrans-- seems correct.
Fixes: 10d3be5692 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: zhenggy <zhenggy@chinatelecom.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2021-09-14
The following pull-request contains BPF updates for your *net* tree.
We've added 7 non-merge commits during the last 13 day(s) which contain
a total of 18 files changed, 334 insertions(+), 193 deletions(-).
The main changes are:
1) Fix mmap_lock lockdep splat in BPF stack map's build_id lookup, from Yonghong Song.
2) Fix BPF cgroup v2 program bypass upon net_cls/prio activation, from Daniel Borkmann.
3) Fix kvcalloc() BTF line info splat on oversized allocation attempts, from Bixuan Cui.
4) Fix BPF selftest build of task_pt_regs test for arm64/s390, from Jean-Philippe Brucker.
5) Fix BPF's disasm.{c,h} to dual-license so that it is aligned with bpftool given the former
is a build dependency for the latter, from Daniel Borkmann with ACKs from contributors.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
With SMC-Dv2 users can configure if the static system EID should be used
during CLC handshake, or if only user EIDs are allowed.
Add generic netlink support to enable and disable the system EID, and
to retrieve the system EID and its current enabled state.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The system EID is retrieved using an registered ISM device each time
when needed. This adds some unnecessary complexity at all places where
the system EID is needed, but no ISM device is at hand.
Simplify the code and save the system EID in a static variable in
smc_ism.c.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SMC-Dv2 allows users to define EIDs which allows to create separate
name spaces enabling users to cluster their SMC-Dv2 connections.
Add support for user defined EIDs and extent the generic netlink
interface so users can add, remove and dump EIDs.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
Back in the days, commit bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
embedded per-socket cgroup information into sock->sk_cgrp_data and in order
to save 8 bytes in struct sock made both mutually exclusive, that is, when
cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).
The assumption made was "there is no reason to mix the two and this is in line
with how legacy and v2 compatibility is handled" as stated in bd1060a1d6.
However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
this assumption no longer holds, and the possibility of the v1/v2 mixed mode
with the v2 root fallback being hit becomes a real security issue.
Many of the cgroup v2 BPF programs are also used for policy enforcement, just
to pick _one_ example, that is, to programmatically deny socket related system
calls like connect(2) or bind(2). A v2 root fallback would implicitly cause
a policy bypass for the affected Pods.
In production environments, we have recently seen this case due to various
circumstances: i) a different 3rd party agent and/or ii) a container runtime
such as [0] in the user's environment configuring legacy cgroup v1 net_cls
tags, which triggered implicitly mentioned root fallback. Another case is
Kubernetes projects like kind [1] which create Kubernetes nodes in a container
and also add cgroup namespaces to the mix, meaning programs which are attached
to the cgroup v2 root of the cgroup namespace get attached to a non-root
cgroup v2 path from init namespace point of view. And the latter's root is
out of reach for agents on a kind Kubernetes node to configure. Meaning, any
entity on the node setting cgroup v1 net_cls tag will trigger the bypass
despite cgroup v2 BPF programs attached to the namespace root.
Generally, this mutual exclusiveness does not hold anymore in today's user
environments and makes cgroup v2 usage from BPF side fragile and unreliable.
This fix adds proper struct cgroup pointer for the cgroup v2 case to struct
sock_cgroup_data in order to address these issues; this implicitly also fixes
the tradeoffs being made back then with regards to races and refcount leaks
as stated in bd1060a1d6, and removes the fallback, so that cgroup v2 BPF
programs always operate as expected.
[0] https://github.com/nestybox/sysbox/
[1] https://kind.sigs.k8s.io/
Fixes: bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net
Remove unneeded line break between pr_debug and arguments.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
only increase fib6_sernum in net namespace after add fib6_info
successfully.
Signed-off-by: zhang kai <zhangkaiheb@126.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In tipc_sk_enqueue() we use hardcoded 2 jiffies to extract
socket buffer from generic queue to particular socket.
The 2 jiffies is too short in case there are other high priority
tasks get CPU cycles for multiple jiffies update. As result, no
buffer could be enqueued to particular socket.
To solve this, we switch to use constant timeout 20msecs.
Then, the function will be expired between 2 jiffies (CONFIG_100HZ)
and 20 jiffies (CONFIG_1000HZ).
Fixes: c637c10355 ("tipc: resolve race problem at unicast message reception")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Turn udp_tunnel_nic work-queue to an ordered work-queue. This queue
holds the UDP-tunnel configuration commands of the different netdevs.
When the netdevs are functions of the same NIC the order of
execution may be crucial.
Problem example:
NIC with 2 PFs, both PFs declare offload quota of up to 3 UDP-ports.
$ifconfig eth2 1.1.1.1/16 up
$ip link add eth2_19503 type vxlan id 5049 remote 1.1.1.2 dev eth2 dstport 19053
$ip link set dev eth2_19503 up
$ip link add eth2_19504 type vxlan id 5049 remote 1.1.1.3 dev eth2 dstport 19054
$ip link set dev eth2_19504 up
$ip link add eth2_19505 type vxlan id 5049 remote 1.1.1.4 dev eth2 dstport 19055
$ip link set dev eth2_19505 up
$ip link add eth2_19506 type vxlan id 5049 remote 1.1.1.5 dev eth2 dstport 19056
$ip link set dev eth2_19506 up
NIC RX port offload infrastructure offloads the first 3 UDP-ports (on
all devices which sets NETIF_F_RX_UDP_TUNNEL_PORT feature) and not
UDP-port 19056. So both PFs gets this offload configuration.
$ip link set dev eth2_19504 down
This triggers udp-tunnel-core to remove the UDP-port 19504 from
offload-ports-list and offload UDP-port 19056 instead.
In this scenario it is important that the UDP-port of 19504 will be
removed from both PFs before trying to add UDP-port 19056. The NIC can
stop offloading a UDP-port only when all references are removed.
Otherwise the NIC may report exceeding of the offload quota.
Fixes: cc4e3835ef ("udp_tunnel: add central NIC RX port offload infrastructure")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 919483096b.
There is only when ip_options_get() return zero need to free.
It already called kfree() when return error.
Fixes: 919483096b ("ipv4: fix memory leaks in ip_cmsg_send() callers")
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
vduse driver supporting blk
virtio-vsock support for end of record with SEQPACKET
vdpa: mac and mq support for ifcvf and mlx5
vdpa: management netlink for ifcvf
virtio-i2c, gpio dt bindings
misc fixes, cleanups
NB: when merging this with
b542e383d8 ("eventfd: Make signal recursion protection a task bit")
from Linus' tree, replace eventfd_signal_count with
eventfd_signal_allowed, and drop the export of eventfd_wake_count from
("eventfd: Export eventfd_wake_count to modules").
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmE1+awPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRpt6EIAJy0qrc62lktNA0IiIVJSLbUbTMmFj8MzkGR
8UxZdhpjWqBPJPyaOuNeksAqTGm/UAPEYx3C2c95Jhej7anFpy7dbCtIXcPHLJME
DjcJg+EDrlNCj8m0FcsHpHWsFzPMERJpyEZNxgB5WazQbv+yWhGrg2FN5DCnF0Ro
ZFYeKSVty148pQ0nHl8X0JM2XMtqit+O+LvKN2HQZ+fubh7BCzMxzkHY0QLHIzUS
UeZqd3Qm8YcbqnlX38P5D6k+NPiTEgknmxaBLkPxg6H3XxDAmaIRFb8Ldd1rsgy1
zTLGDiSGpVDIpawRnuEAzqJThV3Y5/MVJ1WD+mDYQ96tmhfp+KY=
=DBH/
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
- vduse driver ("vDPA Device in Userspace") supporting emulated virtio
block devices
- virtio-vsock support for end of record with SEQPACKET
- vdpa: mac and mq support for ifcvf and mlx5
- vdpa: management netlink for ifcvf
- virtio-i2c, gpio dt bindings
- misc fixes and cleanups
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (39 commits)
Documentation: Add documentation for VDUSE
vduse: Introduce VDUSE - vDPA Device in Userspace
vduse: Implement an MMU-based software IOTLB
vdpa: Support transferring virtual addressing during DMA mapping
vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
vhost-iotlb: Add an opaque pointer for vhost IOTLB
vhost-vdpa: Handle the failure of vdpa_reset()
vdpa: Add reset callback in vdpa_config_ops
vdpa: Fix some coding style issues
file: Export receive_fd() to modules
eventfd: Export eventfd_wake_count to modules
iova: Export alloc_iova_fast() and free_iova_fast()
virtio-blk: remove unneeded "likely" statements
virtio-balloon: Use virtio_find_vqs() helper
vdpa: Make use of PFN_PHYS/PFN_UP/PFN_DOWN helper macro
vsock_test: update message bounds test for MSG_EOR
af_vsock: rename variables in receive loop
virtio/vsock: support MSG_EOR bit processing
vhost/vsock: support MSG_EOR bit processing
...
BPF programs may want to know hardware timestamps if NIC supports
such timestamping.
Expose this data as hwtstamp field of __sk_buff the same way as
gso_segs/gso_size. This field could be accessed from the same
programs as tstamp field, but it's read-only field. Explicit test
to deny access to padding data is added to bpf_skb_is_valid_access.
Also update BPF_PROG_TEST_RUN tests of the feature.
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210909220409.8804-2-vfedorenko@novek.ru
Add pr_fmt macro to spell out the source of messages in prefix.
Before this patch:
packet size is too long (1543 > 1518)
With this patch:
af_packet: packet size is too long (1543 > 1518)
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
The reference count leak issue may take place in an error handling
path. If both conditions of tunnel->version == L2TP_HDR_VER_3 and the
return value of l2tp_v3_ensure_opt_in_linear is nonzero, the function
would directly jump to label invalid, without decrementing the reference
count of the l2tp_session object session increased earlier by
l2tp_tunnel_get_session(). This may result in refcount leaks.
Fix this issue by decrease the reference count before jumping to the
label invalid.
Fixes: 4522a70db7 ("l2tp: fix reading optional fields of L2TPv3")
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Xiong <xiongx18@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot reported another data-race in af_unix [1]
Lets change __skb_insert() to use WRITE_ONCE() when changing
skb head qlen.
Also, change unix_dgram_poll() to use lockless version
of unix_recvq_full()
It is verry possible we can switch all/most unix_recvq_full()
to the lockless version, this will be done in a future kernel version.
[1] HEAD commit: 8596e589b7
BUG: KCSAN: data-race in skb_queue_tail / unix_dgram_poll
write to 0xffff88814eeb24e0 of 4 bytes by task 25815 on cpu 0:
__skb_insert include/linux/skbuff.h:1938 [inline]
__skb_queue_before include/linux/skbuff.h:2043 [inline]
__skb_queue_tail include/linux/skbuff.h:2076 [inline]
skb_queue_tail+0x80/0xa0 net/core/skbuff.c:3264
unix_dgram_sendmsg+0xff2/0x1600 net/unix/af_unix.c:1850
sock_sendmsg_nosec net/socket.c:703 [inline]
sock_sendmsg net/socket.c:723 [inline]
____sys_sendmsg+0x360/0x4d0 net/socket.c:2392
___sys_sendmsg net/socket.c:2446 [inline]
__sys_sendmmsg+0x315/0x4b0 net/socket.c:2532
__do_sys_sendmmsg net/socket.c:2561 [inline]
__se_sys_sendmmsg net/socket.c:2558 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2558
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff88814eeb24e0 of 4 bytes by task 25834 on cpu 1:
skb_queue_len include/linux/skbuff.h:1869 [inline]
unix_recvq_full net/unix/af_unix.c:194 [inline]
unix_dgram_poll+0x2bc/0x3e0 net/unix/af_unix.c:2777
sock_poll+0x23e/0x260 net/socket.c:1288
vfs_poll include/linux/poll.h:90 [inline]
ep_item_poll fs/eventpoll.c:846 [inline]
ep_send_events fs/eventpoll.c:1683 [inline]
ep_poll fs/eventpoll.c:1798 [inline]
do_epoll_wait+0x6ad/0xf00 fs/eventpoll.c:2226
__do_sys_epoll_wait fs/eventpoll.c:2238 [inline]
__se_sys_epoll_wait fs/eventpoll.c:2233 [inline]
__x64_sys_epoll_wait+0xf6/0x120 fs/eventpoll.c:2233
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x0000001b -> 0x00000001
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 25834 Comm: syz-executor.1 Tainted: G W 5.14.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: 86b18aaa2b ("skbuff: fix a data race in skb_queue_len()")
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a couple of harmless fixes, increase max tcp msize (64KB -> 1MB),
and increase default msize (8KB -> 128KB)
The default increase has been discussed with Christian
for the qemu side of things but makes sense for all supported
transports
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmE4wt4ACgkQq06b7GqY
5nC22RAAhujCsrvvwzRelIEycB5IOiBe0xcZItdyPNOAleWfL6tZ+U8/HC/8hb8z
jQIG7D6DS0y+MDFFuCXorU9WChF+Wv2Rjj9AJvpBj0gugkbUUxRD4uKRjJgKopJ3
rONXnXUnaPvxwRTBFRdzecfIxeQUDw8YJo4WmUKZsB4rCOD8wYVNg+DJHl+CoJ3t
E/D0/ztiKdQL5pGKT2fl8+MbFMBmWor7aiB5/ms8UaiN8ZaW0cUBI3JLcMJjPEbO
ip0NXVfbR1UCs8sK8If2afJ/tUnwYTje42ll3fRJZqPZM9jPjVMgXqsP8b7sn5yi
5+/SpAa3Uszi8A9RxEnCsaEx4UWhbGe+54RFGnYSEcj109ZpRDeOo8V8VVg8tb2p
y4f/xN6BdOUJekCxcF1/7e6RkXPCauCzQkN3yX6CL4Giu6jy6764hqO2plO8tlWZ
zrL7RZDc2Rx4oborDdJL5pSpCYYfs9yuQz0b1JH+NoBfohDFWN3KFNFiSNxg51Eu
hunPQK5gojEKsDD2SjD0hy4QfLt5pRaJILznwoEcu9GX9oMSj862IC+uCWExqZbE
WFroQfi2OJmbtFJB/fFEYE/mIFdIeC6++ZxEGbY5MNun8W/hMQKJpK+Y9TBS1N1j
dV5JJbTGMQLVAZkphC24L6n2iCtz9SoB5j5gbUXQZsd6LR3NL9c=
=PLhf
-----END PGP SIGNATURE-----
Merge tag '9p-for-5.15-rc1' of git://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"A couple of harmless fixes, increase max tcp msize (64KB -> 1MB), and
increase default msize (8KB -> 128KB)
The default increase has been discussed with Christian for the qemu
side of things but makes sense for all supported transports"
* tag '9p-for-5.15-rc1' of git://github.com/martinetd/linux:
net/9p: increase default msize to 128k
net/9p: use macro to define default msize
net/9p: increase tcp max msize to 1MB
9p/xen: Fix end of loop tests for list_for_each_entry
9p/trans_virtio: Remove sysfs file on probe failure
The kernel test robot reports:
[ 843.509974][ T345] =============================
[ 843.524220][ T345] WARNING: suspicious RCU usage
[ 843.538791][ T345] 5.14.0-rc2-00606-g889b7da23abf #1 Not tainted
[ 843.553617][ T345] -----------------------------
[ 843.567412][ T345] net/mctp/route.c:310 RCU-list traversed in non-reader section!!
- we're missing the rcu read lock acquire around the destruction path.
This change adds the acquire/release - the path is already atomic, and
we're using the _rcu list iterators.
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 2677d20677 ("dccp: don't free ccid2_hc_tx_sock ...") fixed
a UAF but reintroduced CVE-2017-6074.
When the sock is cloned, two dccps_hc_tx_ccid will reference to the
same ccid. So one can free the ccid object twice from two socks after
cloning.
This issue was found by "Hadar Manor" as well and assigned with
CVE-2020-16119, which was fixed in Ubuntu's kernel. So here I port
the patch from Ubuntu to fix it.
The patch prevents cloned socks from referencing the same ccid.
Fixes: 2677d20677 ("dccp: don't free ccid2_hc_tx_sock ...")
Signed-off-by: Zhenpeng Lin <zplin@psu.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
bpf_prog_test_run_xattr takes a struct __sk_buff, but did not permit
that __skbuff to include an nonzero ingress_ifindex.
This patch updates to allow ingress_ifindex, convert the __sk_buff field to
sk_buff (skb_iif) and back, and tests that the value is present from on BPF
program side. The test sets an unlikely distinct value for ingress_ifindex
(11) from ifindex (1), which is in line with the rest of the synthetic field
tests.
Adding this support allows testing BPF that operates differently on
incoming and outgoing skbs by discriminating on this field.
Signed-off-by: Neil Spring <ntspring@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210831033356.1459316-1-ntspring@fb.com
wireless and can.
Current release - regressions:
- qrtr: revert check in qrtr_endpoint_post(), fixes audio and wifi
- ip_gre: validate csum_start only on pull
- bnxt_en: fix 64-bit doorbell operation on 32-bit kernels
- ionic: fix double use of queue-lock, fix a sleeping in atomic
- can: c_can: fix null-ptr-deref on ioctl()
- cs89x0: disable compile testing on powerpc
Current release - new code bugs:
- bridge: mcast: fix vlan port router deadlock, consistently disable BH
Previous releases - regressions:
- dsa: tag_rtl4_a: fix egress tags, only port 0 was working
- mptcp: fix possible divide by zero
- netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex
- netfilter: socket: icmp6: fix use-after-scope
- stmmac: fix MAC not working when system resume back with WoL active
Previous releases - always broken:
- ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL
address
- seg6: set fc_nlinfo in nh_create_ipv4, nh_create_ipv6
- mptcp: only send extra TCP acks in eligible socket states
- dsa: lantiq_gswip: fix maximum frame length
- stmmac: fix overall budget calculation for rxtx_napi
- bnxt_en: fix firmware version reporting via devlink
- renesas: sh_eth: add missing barrier to fix freeing wrong tx descriptor
Stragglers:
- netfilter: conntrack: switch to siphash
- netfilter: refuse insertion if chain has grown too large
- ncsi: add get MAC address command to get Intel i210 MAC address
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmE3uicACgkQMUZtbf5S
IrtJVA//XdE8qAmw1JukjyYC87JH2ale20eoZ6ERn7/09e4tdv3M6dOTI4YfrM6+
CMNP5MP2qit3IzY+lN0+yt9AAFH7k85z3MA8zLxsXN4z63OJcZvFv/G/OWy4Wp/0
vOo/DH+rF3LR+fZZvjJI+8Xi9/orsRpD12cwGmjGRxybh+XcnHKI/GvK2RgE6oBR
015RfBbbQBpzFQvESLnSwDzabN1XFEL1x/bz7N8ek3okfO/tab+f3E1tb6eYtTy+
jyDyOWpayd4xDttKNMUuxwS1q+/oAWOAq8PzkaF/ZG2sBH1Z4yZN9ZtsLNZmPG8N
5L1FEem/Nmgr54T9v/FhfiryhhGGysVfVgtQcCBkKRmVn1Kk2L6dFvtuanPtFFd3
llbi5PvCDJy3rbMmxKmyoM3T4jpMwWxQRZKsosw+k/WQfb8/SUOjgpY713V1Wx/P
S+2uadU4l9Ql9sF6X0IqZABnnt+j/BuDo6C6vVq7vyj0iQ9hEX9YxC0ybrAHOYpH
suHWKndodRfTxxVOg8xRNYwXyRLNbm1AP6LMDNKBlFUjwNSZ362qFX7W7DuXoRup
Rrnb8V1QFvM+pyFb2a0qNtBS68IXbjCdVQX5e8a5ELaAUnDPefNrfPN+/rrTLEtV
LnusmBF+02llVSYdr88t1e+LmzqS/aqXFy2ry4y6owjq20ld2O0=
=Zvuz
-----END PGP SIGNATURE-----
Merge tag 'net-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes and stragglers from Jakub Kicinski:
"Networking stragglers and fixes, including changes from netfilter,
wireless and can.
Current release - regressions:
- qrtr: revert check in qrtr_endpoint_post(), fixes audio and wifi
- ip_gre: validate csum_start only on pull
- bnxt_en: fix 64-bit doorbell operation on 32-bit kernels
- ionic: fix double use of queue-lock, fix a sleeping in atomic
- can: c_can: fix null-ptr-deref on ioctl()
- cs89x0: disable compile testing on powerpc
Current release - new code bugs:
- bridge: mcast: fix vlan port router deadlock, consistently disable
BH
Previous releases - regressions:
- dsa: tag_rtl4_a: fix egress tags, only port 0 was working
- mptcp: fix possible divide by zero
- netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex
- netfilter: socket: icmp6: fix use-after-scope
- stmmac: fix MAC not working when system resume back with WoL active
Previous releases - always broken:
- ip/ip6_gre: use the same logic as SIT interfaces when computing
v6LL address
- seg6: set fc_nlinfo in nh_create_ipv4, nh_create_ipv6
- mptcp: only send extra TCP acks in eligible socket states
- dsa: lantiq_gswip: fix maximum frame length
- stmmac: fix overall budget calculation for rxtx_napi
- bnxt_en: fix firmware version reporting via devlink
- renesas: sh_eth: add missing barrier to fix freeing wrong tx
descriptor
Stragglers:
- netfilter: conntrack: switch to siphash
- netfilter: refuse insertion if chain has grown too large
- ncsi: add get MAC address command to get Intel i210 MAC address"
* tag 'net-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
ieee802154: Remove redundant initialization of variable ret
net: stmmac: fix MAC not working when system resume back with WoL active
net: phylink: add suspend/resume support
net: renesas: sh_eth: Fix freeing wrong tx descriptor
bonding: 3ad: pass parameter bond_params by reference
cxgb3: fix oops on module removal
can: c_can: fix null-ptr-deref on ioctl()
can: rcar_canfd: add __maybe_unused annotation to silence warning
net: wwan: iosm: Unify IO accessors used in the driver
net: wwan: iosm: Replace io.*64_lo_hi() with regular accessors
net: qcom/emac: Replace strlcpy with strscpy
ip6_gre: Revert "ip6_gre: add validation for csum_start"
net: hns3: make hclgevf_cmd_caps_bit_map0 and hclge_cmd_caps_bit_map0 static
selftests/bpf: Test XDP bonding nest and unwind
bonding: Fix negative jump label count on nested bonding
MAINTAINERS: add VM SOCKETS (AF_VSOCK) entry
stmmac: dwmac-loongson:Fix missing return value
iwlwifi: fix printk format warnings in uefi.c
net: create netdev->dev_addr assignment helpers
bnxt_en: Fix possible unintended driver initiated error recovery
...
The variable ret is being initialized with a value that is never read, it
is being updated later on. The assignment is redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 9cf448c200.
This commit was added for equivalence with a similar fix to ip_gre.
That fix proved to have a bug. Upon closer inspection, ip6_gre is not
susceptible to the original bug.
So revert the unnecessary extra check.
In short, ipgre_xmit calls skb_pull to remove ipv4 headers previously
inserted by dev_hard_header. ip6gre_tunnel_xmit does not.
Link: https://lore.kernel.org/netdev/CA+FuTSe+vJgTVLc9SojGuN-f9YQ+xWLPKE_S4f=f+w+_P2hgUg@mail.gmail.com/#t
Fixes: 9cf448c200 ("ip6_gre: add validation for csum_start")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>