The previous patch removed the last usage of the functions
devlink_rate_leaf_create() and devlink_rate_nodes_destroy(). Thus,
remove these functions from the devlink API.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The previous patch removed the last usage of the function
devlink_rate_nodes_destroy(). Thus, remove this function from the
devlink API.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If we set an XFRM security policy by calling setsockopt with option
IPV6_XFRM_POLICY, the policy will be stored in 'sock_policy' in the 'sock'
struct. However, tcp_v6_send_response doesn't look up the dst_entry with the
actual socket but with the tcp control socket. This may cause a problem
where a RST packet is sent without ESP encryption and the peer's TCP
socket can't receive it.
This patch makes the function look up the dst_entry with the actual socket
if the socket has an XFRM policy (sock_policy), so that the TCP response
packet sent via this function can be encrypted and aligned with the
encrypted TCP socket.
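A sketch of the change in tcp_v6_send_response() (an assumption about its
exact shape, not the upstream diff; ctl_sk and fl6 are the variables that
function already uses for the route lookup):

#ifdef CONFIG_XFRM
        /* Assumed shape of the fix: use the real socket for the route
         * lookup when it carries an XFRM policy, so the generated RST/ACK
         * goes through the same ESP transformation as the flow itself.
         */
        if (sk && (rcu_access_pointer(sk->sk_policy[0]) ||
                   rcu_access_pointer(sk->sk_policy[1])))
                dst = ip6_dst_lookup_flow(sock_net(ctl_sk), sk, &fl6, NULL);
        else
#endif
                dst = ip6_dst_lookup_flow(sock_net(ctl_sk), ctl_sk, &fl6, NULL);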
Tested: We encountered this problem when a TCP socket which is encrypted
with ESP transport mode encryption receives a challenge ACK in SYN_SENT
state. After receiving the challenge ACK, TCP needs to send a RST to
establish the socket on the next SYN attempt. But the RST was not encrypted
and the peer TCP socket still remained in ESTABLISHED state.
So we verified this with the test steps below.
[Test step]
1. Making a TCP state mismatch between client (IDLE) & server (ESTABLISHED).
2. Client tries a new connection on the same TCP ports (src & dst).
3. Server will return challenge ACK instead of SYN,ACK.
4. Client will send RST to server to clear the SOCKET.
5. Client will retransmit SYN to server on the same TCP ports.
[Expected result]
The TCP connection should be established.
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Sehee Lee <seheele@google.com>
Signed-off-by: Sewook Seo <sewookseo@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-07-09
We've added 94 non-merge commits during the last 19 day(s) which contain
a total of 125 files changed, 5141 insertions(+), 6701 deletions(-).
The main changes are:
1) Add new way for performing BTF type queries to BPF, from Daniel Müller.
2) Add inlining of calls to bpf_loop() helper when its function callback is
statically known, from Eduard Zingerman.
3) Implement BPF TCP CC framework usability improvements, from Jörn-Thorben Hinz.
4) Add LSM flavor for attaching per-cgroup BPF programs to existing LSM
hooks, from Stanislav Fomichev.
5) Remove all deprecated libbpf APIs in prep for 1.0 release, from Andrii Nakryiko.
6) Add benchmarks around local_storage to BPF selftests, from Dave Marchevsky.
7) AF_XDP sample removal (given move to libxdp) and various improvements around AF_XDP
selftests, from Magnus Karlsson & Maciej Fijalkowski.
8) Add bpftool improvements for memcg probing and bash completion, from Quentin Monnet.
9) Add arm64 JIT support for BPF-2-BPF coupled with tail calls, from Jakub Sitnicki.
10) Sockmap optimizations around throughput of UDP transmissions which have been
improved by 61%, from Cong Wang.
11) Rework perf's BPF prologue code to remove deprecated functions, from Jiri Olsa.
12) Fix sockmap teardown path to avoid sleepable sk_psock_stop, from John Fastabend.
13) Fix libbpf's cleanup around legacy kprobe/uprobe on error case, from Chuang Wang.
14) Fix libbpf's bpf_helpers.h to work with gcc for the case of its sec/pragma
macro, from James Hilliard.
15) Fix libbpf's pt_regs macros for riscv to use a0 for RC register, from Yixun Lan.
16) Fix bpftool to show the name of type BPF_OBJ_LINK, from Yafang Shao.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (94 commits)
selftests/bpf: Fix xdp_synproxy build failure if CONFIG_NF_CONNTRACK=m/n
bpf: Correctly propagate errors up from bpf_core_composites_match
libbpf: Disable SEC pragma macro on GCC
bpf: Check attach_func_proto more carefully in check_return_code
selftests/bpf: Add test involving restrict type qualifier
bpftool: Add support for KIND_RESTRICT to gen min_core_btf command
MAINTAINERS: Add entry for AF_XDP selftests files
selftests, xsk: Rename AF_XDP testing app
bpf, docs: Remove deprecated xsk libbpf APIs description
selftests/bpf: Add benchmark for local_storage RCU Tasks Trace usage
libbpf, riscv: Use a0 for RC register
libbpf: Remove unnecessary usdt_rel_ip assignments
selftests/bpf: Fix few more compiler warnings
selftests/bpf: Fix bogus uninitialized variable warning
bpftool: Remove zlib feature test from Makefile
libbpf: Cleanup the legacy uprobe_event on failed add/attach_event()
libbpf: Fix wrong variable used in perf_event_uprobe_open_legacy()
libbpf: Cleanup the legacy kprobe_event on failed add/attach_event()
selftests/bpf: Add type match test against kernel's task_struct
selftests/bpf: Add nested type to type based tests
...
====================
Link: https://lore.kernel.org/r/20220708233145.32365-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move the macro MPTCPOPT_HMAC_LEN definition from net/mptcp/protocol.h to
include/net/mptcp.h.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
include/net/tls.h is getting a little long, and is probably hard
for driver authors to navigate. Split out the internals into a
header which will live under net/tls/. While at it move some
static inlines with a single user into the source files, add
a few tls_ prefixes and fix spelling of 'proccess'.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The AAD size is either 5 or 13. There's really no point complicating
the code for the 8B of difference. This will also let us
turn the chunked-up buffer into a sane struct.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sk_skb_cb lives within skb->cb[]. skb->cb[] straddles
2 cache lines, each containing 24B of data.
The first cache line does not contain much interesting
information for users of strparser, so pad things a little.
Previously strp_msg->full_len would live in the first cache
line and strp_msg->offset in the second.
We need to reorder the 8-byte temp_reg with struct tls_msg
to prevent a 4B hole which would push the struct over 48B.
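The resulting layout, roughly (the padding size and exact field names here
are illustrative assumptions):

struct sk_skb_cb {
        u8 pad[24];             /* first skb->cb[] cache line: nothing
                                 * strparser users need lives here
                                 */
        struct strp_msg strp;   /* full_len and offset now share a line */
        u64 temp_reg;           /* placed before tls to avoid a 4B hole
                                 * that would push the struct past 48B
                                 */
        struct tls_msg tls;
};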
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since optimistic decrypt may add extra load in case of retries,
require the socket owner to explicitly opt in.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Offloading police with action TC_ACT_UNSPEC was erroneously disabled even
though it was supported by the mlx5 matchall offload implementation, which
didn't verify the action type but instead assumed that any single police
action attached to a matchall classifier is a 'continue' action. The lack
of an action type check made it non-obvious what the mlx5 matchall
implementation actually supports and caused implementers and reviewers of
the referenced commits to disallow it as part of improved validation code.
Fixes: b8cd5831c6 ("net: flow_offload: add tc police action parameters")
Fixes: b50e462bc2 ("net/sched: act_police: Add extack messages for offload failure")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All callers of taprio_offload_get() and taprio_offload_free() prior to
the blamed commit are conditionally compiled based on CONFIG_NET_SCH_TAPRIO.
felix_vsc9959.c is different; it provides vsc9959_qos_port_tas_set()
even when taprio is compiled out.
Provide shim definitions for the functions exported by taprio so that
felix_vsc9959.c is able to compile. vsc9959_qos_port_tas_set() in that
case is dead code anyway, and ocelot_port->taprio remains NULL, which is
fine for the rest of the logic.
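The shims can look roughly like this (a sketch, assuming they sit next to
the existing declarations, e.g. in include/net/pkt_sched.h):

#if IS_ENABLED(CONFIG_NET_SCH_TAPRIO)

struct tc_taprio_qopt_offload *
taprio_offload_get(struct tc_taprio_qopt_offload *offload);
void taprio_offload_free(struct tc_taprio_qopt_offload *offload);

#else

/* vsc9959_qos_port_tas_set() is dead code without taprio; empty
 * definitions are enough for felix_vsc9959.c to compile.
 */
static inline struct tc_taprio_qopt_offload *
taprio_offload_get(struct tc_taprio_qopt_offload *offload)
{
        return NULL;
}

static inline void taprio_offload_free(struct tc_taprio_qopt_offload *offload)
{
}

#endif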
Fixes: 1c9017e44a ("net: dsa: felix: keep reference on entire tc-taprio config")
Reported-by: Colin Foster <colin.foster@in-advantage.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Colin Foster <colin.foster@in-advantage.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Link: https://lore.kernel.org/r/20220704190241.1288847-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The Microchip LAN937X switches have a tagging protocol which is
very similar to KSZ tagging, so the implementation is added to
tag_ksz.c and reuses the common APIs.
Signed-off-by: Prasanna Vengateshan <prasanna.vengateshan@microchip.com>
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are no more users for the mentioned macros; just
drop them.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c
9c5de246c1 ("net: sparx5: mdb add/del handle non-sparx5 devices")
fbb89d02e3 ("net: sparx5: Allow mdb entries to both CPU and ports")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit ed6cd6a178 ("net, neigh: Set lower cap for neigh_managed_work rearming")
fixed a case where, with DELAY_PROBE_TIME configured to 0, processing of the
system work queue would hog the CPU to 100%. Furthermore, we should introduce
a new option to be used by the periodic probe.
Signed-off-by: Yuwei Wang <wangyuweihx@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When the verdict is NF_STOLEN, the skb might have been freed.
When tracing is enabled, this can result in a use-after-free:
1. access to skb->nf_trace
2. access to skb->mark
3. computation of the trace id
4. dump of the packet payload
To avoid 1, keep a cached copy of skb->nf_trace in the
trace state struct.
Refresh this copy whenever the verdict is != NF_STOLEN.
Avoid 2 by skipping the skb->mark access if the verdict is NF_STOLEN.
3 is avoided by precomputing the trace id.
Only dump the packet when the verdict is not NF_STOLEN.
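For the first point, a sketch of the idea (field and function names are
illustrative assumptions, not the exact upstream code):

struct nft_traceinfo {
        bool nf_trace;          /* cached copy of skb->nf_trace */
        u32 skbid;              /* trace id, precomputed once */
        bool packet_dumped;
};

static inline void nft_trace_refresh(struct nft_traceinfo *info,
                                     const struct sk_buff *skb,
                                     unsigned int verdict)
{
        /* never touch the skb after it may have been stolen */
        if (verdict != NF_STOLEN)
                info->nf_trace = skb->nf_trace;
}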
Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The switch that is present on the Renesas RZ/N1 SoC uses a specific
VLAN value followed by 6 bytes which contain the forwarding configuration.
Signed-off-by: Clément Léger <clement.leger@bootlin.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to allow dsa drivers to specify the .get_rmon_stats()
operation.
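A sketch of the new dsa_switch_ops member (the exact prototype is an
assumption, mirroring ethtool_ops::get_rmon_stats):

        void    (*get_rmon_stats)(struct dsa_switch *ds, int port,
                                  struct ethtool_rmon_stats *rmon_stats,
                                  const struct ethtool_rmon_hist_range **ranges);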
Signed-off-by: Clément Léger <clement.leger@bootlin.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the manipulation of the len fields in the skbs to a helper function.
There is a comment specifically requesting this, and there are several
other areas in the code displaying the same pattern which can be
refactored.
This improves code readability.
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Link: https://lore.kernel.org/r/20220622160853.GA6478@debian
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add per-port priority support for bonding active slave re-selection during
failover. A higher number means higher priority in selection. The primary
slave still has the highest priority. This option also follows the
primary_reselect rules.
This option can only be configured via netlink.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, bond_opt_value is mostly used for bonding option settings. If
we want to set a value for a slave, we need to re-alloc a string to store
both the slave name and value, like bond_option_queue_id_set() does, which
is complex and dumb.
As Jon suggested, let's add a union field slave_dev to bond_opt_value,
which will benefit future slave option setting. In function
__bond_opt_init(), we will always check the extra field and set it
if it's not NULL.
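Roughly (a simplified sketch; the extra-buffer name and size are
assumptions):

struct bond_opt_value {
        char *string;
        u64 value;
        u32 flags;
        union {
                char extra[BOND_OPT_EXTRA_MAXLEN];      /* "<slave>:<value>" strings */
                struct net_device *slave_dev;           /* new: direct slave target */
        };
};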
Suggested-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 8a59f9d1e3 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
moved the inet_csk_has_ulp(sk) check from sk_psock_init() to
the new tcp_bpf_update_proto() function. I'm guessing that this
was done to allow creating psocks for non-inet sockets.
Unfortunately the destruction path for a psock includes the ULP
unwind, so we need to fail in sk_psock_init() itself.
Otherwise, if a ULP is already present, we'll notice that later
and call tcp_update_ulp() with the sk_proto of the ULP
itself, which will most likely result in the ULP looping
its callbacks.
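The assumed shape of the fix, early in sk_psock_init(), before any psock
state is set up:

        /* A ULP is already installed; psock destruction would unwind
         * into the ULP's callbacks, so fail the creation up front.
         */
        if (inet_csk_has_ulp(sk))
                return ERR_PTR(-EINVAL);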
Fixes: 8a59f9d1e3 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20220620191353.1184629-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
unix_table_locks protect the global hash table, unix_socket_table.
The previous commit removed the table, so let's clean up the now
unnecessary locks.
Here is a test result on EC2 c5.9xlarge where 10 processes run concurrently
in different netns, each binding 100,000 sockets.
without this series : 1m 38s
with this series : 11s
It is ~10x faster because the global hash table is split into 10 netns in
this case.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit replaces the global hash table with a per-netns one and removes
the global one.
We now link a socket in each netns's hash table so we can save some netns
comparisons when iterating through a hash bucket.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit adds a per-netns hash table for AF_UNIX, whose size is fixed
at UNIX_HASH_SIZE for now.
The first implementation defines a per-netns hash table as a single array
of lock and list:
struct unix_hashbucket {
        spinlock_t              lock;
        struct hlist_head       head;
};

struct netns_unix {
        struct unix_hashbucket  *hash;
        ...
};
But Eric pointed out the memory cost: the structure has holes because of
sizeof(spinlock_t), which is 4 (or more if LOCKDEP is enabled). [0] It
could be expensive on a host with thousands of netns and few AF_UNIX
sockets. For this reason, the per-netns hash table uses two dense arrays.
struct unix_table {
        spinlock_t              *locks;
        struct hlist_head       *buckets;
};

struct netns_unix {
        struct unix_table       table;
        ...
};
Note that the length of the list has a more significant impact than lock
contention, so having shared locks can be an option. But per-netns
locks and lists still perform better than global locks with per-netns
lists. [1]
Also, this patch adds a change so that struct netns_unix disappears from
struct net if CONFIG_UNIX is disabled.
[0]: https://lore.kernel.org/netdev/CANn89iLVxO5aqx16azNU7p7Z-nz5NrnM5QTqOzueVxEnkVTxyg@mail.gmail.com/
[1]: https://lore.kernel.org/netdev/20220617175215.1769-1-kuniyu@amazon.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the size of the AF_UNIX hash table is UNIX_HASH_SIZE * 2;
the first half is for bind()ed sockets and the second half for unbound
ones. UNIX_HASH_SIZE * 2 is used to define the table and iterate
over it.
In some places, we use ARRAY_SIZE(unix_socket_table) instead of
UNIX_HASH_SIZE * 2. However, we cannot use it anymore because we
will allocate the hash table dynamically. Then, we would have to
add UNIX_HASH_SIZE * 2 in many places, which would be troublesome.
This patch adapts the UNIX_HASH_SIZE definition to include bound
and unbound sockets and defines a new UNIX_HASH_MOD macro to ease
calculations.
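Concretely, a sketch matching the description (256 being the historical
table size):

/* Bound sockets hash into [0, UNIX_HASH_MOD]; unbound sockets take
 * the second half of the table.
 */
#define UNIX_HASH_MOD   (256 - 1)
#define UNIX_HASH_SIZE  (256 * 2)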
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
raw_diag_dump() can use rcu_read_lock() instead of read_lock().
Now that the hashinfo lock is only used from process context,
in write mode only, we can convert it to a spinlock,
and we no longer need to block BH.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220620100509.3493504-1-eric.dumazet@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently both splice() and sockmap use ->read_sock() to
read skbs from the receive queue, but sockmap only reads
one entire skb at a time, so ->read_sock() is too conservative
to use there. Introduce a new proto_ops ->read_skb() which supports
this semantic; with this we can finally pass ownership of the
skb to the recv actors.
For non-TCP protocols, all ->read_sock() implementations can be simply
converted to ->read_skb().
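Sketched, the new op looks like this (the actor typedef name is an
assumption):

typedef int (*skb_read_actor_t)(struct sock *sk, struct sk_buff *skb);

/* new member of struct proto_ops */
        int     (*read_skb)(struct sock *sk, skb_read_actor_t recv_actor);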
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220615162014.89193-3-xiyou.wangcong@gmail.com
This patch introduces tcp_read_skb() based on tcp_read_sock(),
as a preparation for the next patch which actually introduces
a new sock op.
TCP is special here because it has tcp_read_sock(), which is
mainly used by splice(). tcp_read_sock() supports partial reads
and arbitrary offsets, neither of which is needed for sockmap.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220615162014.89193-2-xiyou.wangcong@gmail.com
Using rwlocks in networking code is extremely risky:
writers can starve if enough readers are constantly
grabbing the rwlock.
I thought rwlocks were at fault and sent this patch:
https://lkml.org/lkml/2022/6/17/272
But Peter and Linus essentially told me rwlocks had to be unfair.
We need to get rid of rwlocks in networking code.
Without this fix, the following script triggers soft lockups:
for i in {1..48}
do
ping -f -n -q 127.0.0.1 &
sleep 0.1
done
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to prepare for the following patch,
I change the raw v4 & v6 code to use more conventional
iterators.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-06-17
We've added 72 non-merge commits during the last 15 day(s) which contain
a total of 92 files changed, 4582 insertions(+), 834 deletions(-).
The main changes are:
1) Add 64 bit enum value support to BTF, from Yonghong Song.
2) Implement support for sleepable BPF uprobe programs, from Delyan Kratunov.
3) Add new BPF helpers to issue and check TCP SYN cookies without binding to a
socket especially useful in synproxy scenarios, from Maxim Mikityanskiy.
4) Fix libbpf's internal USDT address translation logic for shared libraries as
well as uprobe's symbol file offset calculation, from Andrii Nakryiko.
5) Extend libbpf to provide an API for textual representation of the various
map/prog/attach/link types and use it in bpftool, from Daniel Müller.
6) Provide BTF line info for RV64 and RV32 JITs, and fix a put_user bug in the
core seen in 32 bit when storing BPF function addresses, from Pu Lehui.
7) Fix libbpf's BTF pointer size guessing by adding a list of various aliases
for 'long' types, from Douglas Raillard.
8) Fix bpftool to readd setting rlimit since probing for memcg-based accounting
has been unreliable and caused a regression on COS, from Quentin Monnet.
9) Fix UAF in BPF cgroup's effective program computation triggered upon BPF link
detachment, from Tadeusz Struk.
10) Fix bpftool build bootstrapping during cross compilation which was pointing
to the wrong AR process, from Shahab Vahedi.
11) Fix logic bug in libbpf's is_pow_of_2 implementation, from Yuze Chi.
12) BPF hash map optimization to avoid grabbing spinlocks of all CPUs when there
is no free element. Also add a benchmark as reproducer, from Feng Zhou.
13) Fix bpftool's codegen to bail out when there's no BTF, from Michael Mullin.
14) Various minor cleanup and improvements all over the place.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (72 commits)
bpf: Fix bpf_skc_lookup comment wrt. return type
bpf: Fix non-static bpf_func_proto struct definitions
selftests/bpf: Don't force lld on non-x86 architectures
selftests/bpf: Add selftests for raw syncookie helpers in TC mode
bpf: Allow the new syncookie helpers to work with SKBs
selftests/bpf: Add selftests for raw syncookie helpers
bpf: Add helpers to issue and check SYN cookies in XDP
bpf: Allow helpers to accept pointers with a fixed size
bpf: Fix documentation of th_len in bpf_tcp_{gen,check}_syncookie
selftests/bpf: add tests for sleepable (uk)probes
libbpf: add support for sleepable uprobe programs
bpf: allow sleepable uprobe programs to attach
bpf: implement sleepable uprobes by chaining gps
bpf: move bpf_prog to bpf.h
libbpf: Fix internal USDT address translation logic for shared libraries
samples/bpf: Check detach prog exist or not in xdp_fwd
selftests/bpf: Avoid skipping certain subtests
selftests/bpf: Fix test_varlen verification failure with latest llvm
bpftool: Do not check return value from libbpf_set_strict_mode()
Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"
...
====================
Link: https://lore.kernel.org/r/20220617220836.7373-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The new helpers bpf_tcp_raw_{gen,check}_syncookie_ipv{4,6} allow an XDP
program to generate SYN cookies in response to TCP SYN packets and to
check those cookies upon receiving the first ACK packet (the final
packet of the TCP handshake).
Unlike bpf_tcp_{gen,check}_syncookie, these new helpers don't need a
listening socket on the local machine, which allows using them together
with synproxy to accelerate SYN cookie generation.
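A rough usage sketch for the IPv4 case (the helper prototypes are
assumptions inferred from the names; a real program must parse and
bounds-check the headers first, as the synproxy selftest does):

#include <linux/bpf.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>

/* iph/th must already point at verified, in-bounds headers */
static __always_inline int handle_syn(struct iphdr *iph, struct tcphdr *th,
                                      __u32 th_len)
{
        __s64 cookie = bpf_tcp_raw_gen_syncookie_ipv4(iph, th, th_len);

        if (cookie < 0)
                return XDP_DROP;
        /* ...embed cookie into a crafted SYN-ACK and XDP_TX it... */
        return XDP_TX;
}

static __always_inline int handle_ack(struct iphdr *iph, struct tcphdr *th)
{
        /* returns 0 when the ACK carries a valid SYN cookie */
        if (bpf_tcp_raw_check_syncookie_ipv4(iph, th))
                return XDP_DROP;
        return XDP_PASS;
}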
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220615134847.3753567-4-maximmi@nvidia.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
sk_memory_allocated_add() has three callers, and returns
@memory_allocated to them.
sk_forced_mem_schedule() is one of them, and ignores
the returned value.
Change sk_memory_allocated_add() to return void.
Change sock_reserve_memory() and __sk_mem_raise_allocated()
to call sk_memory_allocated().
This removes one cache line miss [1] for RPC workloads,
as the first skbs in the TCP write queue and receive queue go through
sk_forced_mem_schedule().
[1] Cache line holding tcp_memory_allocated.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, tcp_memory_allocated can hit the tcp_mem[] limits quite fast.
Each TCP socket can forward-allocate up to 2 MB of memory, even after
the flow has become less active.
10,000 sockets can have reserved 20 GB of memory,
and we have no shrinker in place to reclaim that.
Instead of trying to reclaim the extra allocations in some places,
just keep sk->sk_forward_alloc values as small as possible.
This should not impact performance too much now that we have per-cpu
reserves: changes to tcp_memory_allocated should not be too frequent.
For sockets not using SO_RESERVE_MEM:
- idle sockets (no packets in tx/rx queues) have zero forward alloc.
- non idle sockets have a forward alloc smaller than one page.
Note:
- Removal of SK_RECLAIM_CHUNK and SK_RECLAIM_THRESHOLD
is left to MPTCP maintainers as a follow up.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If sk->sk_forward_alloc is 150000 and we need to schedule 150001 bytes,
we want to allocate 1 byte more (rounded up to one page),
instead of 150001 :/
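A simplified rendering of the change, on sk_wmem_schedule():

static inline bool sk_wmem_schedule(struct sock *sk, int size)
{
        int delta;

        if (!sk_has_account(sk))
                return true;
        /* e.g. forward_alloc = 150000, size = 150001: schedule only the
         * 1-byte delta (rounded up to a page), not all 150001 bytes.
         */
        delta = size - sk->sk_forward_alloc;
        return delta <= 0 || __sk_mem_schedule(sk, delta, SK_MEM_SEND);
}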
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We plan on keeping sk->sk_forward_alloc as small as possible
in future patches.
This means we are going to call sk_memory_allocated_add()
and sk_memory_allocated_sub() more often.
Implement a per-cpu cache of +1/-1 MB to reduce the number
of changes to sk->sk_prot->memory_allocated, which
would otherwise be a cause of false sharing.
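Sketched (a simplification of the described scheme; the threshold and the
per-cpu field naming are assumptions):

/* ~1 MB worth of pages batched per cpu */
#define SK_MEMORY_PCPU_RESERVE  (1 << (20 - PAGE_SHIFT))

static void sk_memory_allocated_add(struct sock *sk, int amt)
{
        int local_reserve;

        preempt_disable();
        local_reserve = __this_cpu_add_return(*sk->sk_prot->per_cpu_fw_alloc, amt);
        if (local_reserve >= SK_MEMORY_PCPU_RESERVE) {
                /* Fold the batched delta into the shared atomic only when
                 * the local drift exceeds the reserve, avoiding constant
                 * false sharing on ->memory_allocated.
                 */
                __this_cpu_sub(*sk->sk_prot->per_cpu_fw_alloc, local_reserve);
                atomic_long_add(local_reserve, sk->sk_prot->memory_allocated);
        }
        preempt_enable();
}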
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Each protocol having a ->memory_allocated pointer gets a corresponding
per-cpu reserve, which following patches will use.
Instead of having reserved bytes per socket,
we want to have per-cpu reserves.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Due to the memcg interface, SK_MEM_QUANTUM is effectively PAGE_SIZE.
This might change in the future, but it seems better to avoid the
confusion.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit bd68a2a854.
That change broke memcg on arches with PAGE_SIZE != 4096.
Later, commit 2bb2f5fb21 ("net: add new socket option SO_RESERVE_MEM")
also assumed PAGE_SIZE == SK_MEM_QUANTUM.
Following patches in the series will greatly reduce the
over-allocation problem.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'wireless-next-2022-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
wireless-next patches for v5.20
Here's a first set of patches for v5.20. This is just a
queue flush, before we get things back from net-next that
are causing conflicts, and then can start merging a lot
of MLO (multi-link operation, part of 802.11be) code.
Lots of cleanups all over.
The only notable change is perhaps wilc1000 being the
first driver to disable WEP (while enabling WPA3).
* tag 'wireless-next-2022-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (29 commits)
wifi: mac80211_hwsim: Directly use ida_alloc()/free()
wifi: mac80211: refactor some key code
wifi: mac80211: remove cipher scheme support
wifi: nl80211: fix typo in comment
wifi: virt_wifi: fix typo in comment
rtw89: add new state to CFO state machine for UL-OFDMA
rtw89: 8852c: add trigger frame counter
ieee80211: add trigger frame definition
wifi: wfx: Remove redundant NULL check before release_firmware() call
wifi: rtw89: support MULTI_BSSID and correct BSSID mask of H2C
wifi: ray_cs: Drop useless status variable in parse_addr()
wifi: ray_cs: Utilize strnlen() in parse_addr()
wifi: rtw88: use %*ph to print small buffer
wifi: wilc1000: add IGTK support
wifi: wilc1000: add WPA3 SAE support
wifi: wilc1000: remove WEP security support
wifi: wilc1000: use correct sequence of RESET for chip Power-UP/Down
wifi: rtlwifi: fix error codes in rtl_debugfs_set_write_h2c()
wifi: rtw88: Fix Sparse warning for rtw8821c_hw_spec
wifi: rtw88: Fix Sparse warning for rtw8723d_hw_spec
...
====================
Link: https://lore.kernel.org/r/20220610142838.330862-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The only driver using this was iwlwifi, where we just removed
the support because it was never really used. Remove the code
from mac80211 as well.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>