-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmZA578ACgkQ1V2XiooU
IOS18g//Zyuv+23GcUM7+FEXrlMN658xJyWiYvKjOaZZx5ZiV0QdZc4cbfPFD44p
qZBMmVC/WVoC89SLdwH1W47KoJU0xsK3/OdqGHHeNJ69111wIQMpOLfAetS2K0mb
F+Ue2vyWg1GQDICGsCdenHX7ihtVvnJJkomxc+3ObxtLCNsb2Dsr6JM5hMVP5Bil
4UZnPsrgfWy3A8O92burlPVE1sTWDFfFUGIf8geJc4QadwkgufkzxMhXNO7xHlpG
EZ99s8FPyD3R6tRPjf4gwdjr7JjinrdrYjZDuS4d3Uv8pKlUqcx8PgXG51/unr/y
qlynLXtEc1QU6SO2jENosHAG2/LQG2zsYEiiLFCP+a1JOtOxevZQKx8MyAFW8xDX
+RQhcBpTBocIyJ/tCDoM9lp69iYTR196Ct48v6pSGMNhZcddT4K4BkUL47GEs8T3
IA5x8h5gV2Q9ECMgqSaycdUsfLNgE/6fWx0ROs/wo3tMsgWrXCSJi8RFtN1sNbIO
rfuNnQiETIFBkQxBi7um8jadxdfIHm65cjgZBCyVbNNml3JwjYvXLxCXt2G7LxC4
Sg4nZvIbqWIifoMc1aQKypvFZjzzsWtFmYCuEUVLrnpj2SFTyh5CNzNo3MlHf7LG
sRb/XubdY6e0spLzd5VDjwH5qOT3poWAccatRr5BVUarxXCD5Vs=
=RD9p
-----END PGP SIGNATURE-----
Merge tag 'nf-next-24-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
Patch #1 skips transaction if object type provides no .update interface.
Patch #2 skips NETDEV_CHANGENAME which is unused.
Patch #3 enables conntrack to handle Multicast Router Advertisements and
Multicast Router Solicitations from the Multicast Router Discovery
protocol (RFC4286) as untracked opposed to invalid packets.
From Linus Luessing.
Patch #4 updates DCCP conntracker to mark invalid as invalid, instead of
dropping them, from Jason Xing.
Patch #5 uses NF_DROP instead of -NF_DROP since NF_DROP is 0,
also from Jason.
Patch #6 removes reference in netfilter's sysctl documentation on pickup
entries which were already removed by Florian Westphal.
Patch #7 removes check for IPS_OFFLOAD flag to disable early drop which
allows to evict entries from the conntrack table,
also from Florian.
Patches #8 to #16 updates nf_tables pipapo set backend to allocate
the datastructure copy on-demand from preparation phase,
to better deal with OOM situations where .commit step is too late
to fail. Series from Florian Westphal.
Patch #17 adds a selftest with packetdrill to cover conntrack TCP state
transitions, also from Florian.
Patch #18 use GFP_KERNEL to clone elements from control plane to avoid
quick atomic reserves exhaustion with large sets, reporter refers
to million entries magnitude.
* tag 'nf-next-24-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: allow clone callbacks to sleep
selftests: netfilter: add packetdrill based conntrack tests
netfilter: nft_set_pipapo: remove dirty flag
netfilter: nft_set_pipapo: move cloning of match info to insert/removal path
netfilter: nft_set_pipapo: prepare pipapo_get helper for on-demand clone
netfilter: nft_set_pipapo: merge deactivate helper into caller
netfilter: nft_set_pipapo: prepare walk function for on-demand clone
netfilter: nft_set_pipapo: prepare destroy function for on-demand clone
netfilter: nft_set_pipapo: make pipapo_clone helper return NULL
netfilter: nft_set_pipapo: move prove_locking helper around
netfilter: conntrack: remove flowtable early-drop test
netfilter: conntrack: documentation: remove reference to non-existent sysctl
netfilter: use NF_DROP instead of -NF_DROP
netfilter: conntrack: dccp: try not to drop skb in conntrack
netfilter: conntrack: fix ct-state for ICMPv6 Multicast Router Discovery
netfilter: nf_tables: remove NETDEV_CHANGENAME from netdev chain event handler
netfilter: nf_tables: skip transaction if update object is not implemented
====================
Link: https://lore.kernel.org/r/20240512161436.168973-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sven Auhagen reports transaction failures with following error:
./main.nft:13:1-26: Error: Could not process rule: Cannot allocate memory
percpu: allocation failed, size=16 align=8 atomic=1, atomic alloc failed, no space left
This points to failing pcpu allocation with GFP_ATOMIC flag.
However, transactions happen from user context and are allowed to sleep.
One case where we can call into percpu allocator with GFP_ATOMIC is
nft_counter expression.
Normally this happens from control plane, so this could use GFP_KERNEL
instead. But one use case, element insertion from packet path,
needs to use GFP_ATOMIC allocations (nft_dynset expression).
At this time, .clone callbacks always use GFP_ATOMIC for this reason.
Add gfp_t argument to the .clone function and pass GFP_KERNEL or
GFP_ATOMIC flag depending on context, this allows all clone memory
allocations to sleep for the normal (transaction) case.
Cc: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This set type keeps two copies of the sets' content,
priv->match (live version, used to match from packet path)
priv->clone (work-in-progress version of the 'future' priv->match).
All additions and removals are done on priv->clone. When transaction
completes, priv->clone becomes priv->match and a new clone is allocated
for use by next transaction.
Problem is that the cloning requires GFP_KERNEL allocations but we
cannot fail at either commit or abort time.
This patch defers the clone until we get an insertion or removal
request. This allows us to handle OOM situations correctly.
This also allows to remove ->dirty in a followup change:
If ->clone exists, ->dirty is always true
If ->clone is NULL, ->dirty is always false, no elements were added
or removed (except catchall elements which are external to the specific
set backend).
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The helper uses priv->clone unconditionally which will fail once we do
the clone conditionally on first insert or removal.
'nft get element' from userspace needs to use priv->match since this
runs from rcu read side lock section.
Prepare for this by passing the match backend data as argument.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Let's make all IPVS sysctls writtable even when
network namespace is owned by non-initial user namespace.
Let's make a few sysctls to be read-only for non-privileged users:
- sync_qlen_max
- sync_sock_size
- run_estimation
- est_cpulist
- est_nice
I'm trying to be conservative with this to prevent
introducing any security issues in there. Maybe,
we can allow more sysctls to be writable, but let's
do this on-demand and when we see real use-case.
This patch is motivated by user request in the LXC
project [1]. Having this can help with running some
Kubernetes [2] or Docker Swarm [3] workloads inside the system
containers.
Link: https://github.com/lxc/lxc/issues/4278 [1]
Link: b722d017a3/pkg/proxy/ipvs/proxier.go (L103) [2]
Link: 3797618f9a/osl/namespace_linux.go (L682) [3]
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Simon Horman <horms@verge.net.au>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Its the only remaining call site so there is no need for this to
be separated anymore.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The existing code uses iter->type to figure out what data is needed, the
live copy (READ) or clone (UPDATE).
Without pending updates, priv->clone and priv->match will point to
different memory locations, but they have identical content.
Future patch will make priv->clone == NULL if there are no pending changes,
in this case we must copy the live data for the UPDATE case.
Currently this would require GFP_ATOMIC allocation. Split the walk
function in two parts: one that does the walk and one that decides which
data is needed.
In the UPDATE case, callers hold the transaction mutex so we do not need
the rcu read lock. This allows to use GFP_KERNEL allocation while
cloning.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Once priv->clone can be NULL in case no insertions/removals occurred
in the last transaction we need to drop set elements from priv->match
if priv->clone is NULL.
While at it, condense this function by reusing the pipapo_free_match
helper instead of open-coded version.
The rcu_barrier() is removed, its not needed: old call_rcu instances
for pipapo_reclaim_match do not access struct nft_set.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently it returns an error pointer, but the only possible failure
is ENOMEM.
After a followup patch, we'd need to discard the errno code, i.e.
x = pipapo_clone()
if (IS_ERR(x))
return NULL
or make more changes to fix up callers to expect IS_ERR() code
from set->ops->deactivate().
So simplify this and make it return ptr-or-null.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Preparation patch, the helper will soon get called from insert
function too.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Not sure why this special case exists. Early drop logic
(which kicks in when conntrack table is full) should be independent
of flowtable offload and only consider assured bit (i.e., two-way
traffic was seen).
flowtable entries hold a reference to the conntrack entry (struct
nf_conn) that has been offloaded. The conntrack use count is not
decremented until after the entry is free'd.
This change therefore will not result in exceeding the conntrack table
limit. It does allow early-drop of tcp flows even when they've been
offloaded, but only if they have been offloaded before syn-ack was
received or after at least one peer has sent a fin.
Currently 'fin' packet reception already stops offloading, so this
should not impact offloading either.
Cc: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
At the beginning in 2009 one patch [1] introduced collecting drop
counter in nf_conntrack_in() by returning -NF_DROP. Later, another
patch [2] changed the return value of tcp_packet() which now is
renamed to nf_conntrack_tcp_packet() from -NF_DROP to NF_DROP. As
we can see, that -NF_DROP should be corrected.
Similarly, there are other two points where the -NF_DROP is used.
Well, as NF_DROP is equal to 0, inverting NF_DROP makes no sense
as patch [2] said many years ago.
[1]
commit 7d1e04598e ("netfilter: nf_conntrack: account packets drop by tcp_packet()")
[2]
commit ec8d540969 ("netfilter: conntrack: fix dropping packet after l4proto->packet()")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It would be better not to drop skb in conntrack unless we have good
alternatives. So we can treat the result of testing skb's header
pointer as nf_conntrack_tcp_packet() does.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
So far Multicast Router Advertisements and Multicast Router
Solicitations from the Multicast Router Discovery protocol (RFC4286)
would be marked as INVALID for IPv6, even if they are in fact intact
and adhering to RFC4286.
This broke MRA reception and by that multicast reception on
IPv6 multicast routers in a Proxmox managed setup, where Proxmox
would install a rule like "-m conntrack --ctstate INVALID -j DROP"
at the top of the FORWARD chain with br-nf-call-ip6tables enabled
by default.
Similar to as it's done for MLDv1, MLDv2 and IPv6 Neighbor Discovery
already, fix this issue by excluding MRD from connection tracking
handling as MRD always uses predefined multicast destinations
for its messages, too. This changes the ct-state for ICMPv6 MRD messages
from INVALID to UNTRACKED.
This issue was found and fixed with the help of the mrdisc tool
(https://github.com/troglobit/mrdisc).
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Originally, device name used to be stored in the basechain, but it is
not the case anymore. Remove check for NETDEV_CHANGENAME.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Turn update into noop as a follow up for:
9fedd894b4 ("netfilter: nf_tables: fix unexpected EOPNOTSUPP error")
instead of adding a transaction object which is simply discarded at a
later stage of the commit protocol.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which will
reduce the overall build time size of the kernel and run time memory
bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)
* Remove sentinel elements from ctl_table structs
* Remove instances where an array element is zeroed out to make it look
like a sentinel. This is not longer needed and is safe after commit
c899710fe7 ("networking: Update to register_net_sysctl_sz") added
the array size to the ctl_table registration
* Remove the need for having __NF_SYSCTL_CT_LAST_SYSCTL as the
sysctl array size is now in NF_SYSCTL_CT_LAST_SYSCTL
* Remove extra element in ctl_table arrays declarations
Acked-by: Kees Cook <keescook@chromium.org> # loadpin & yama
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I added dst_rt6_info() in commit
e8dfd42c17 ("ipv6: introduce dst_rt6_info() helper")
This patch does a similar change for IPv4.
Instead of (struct rtable *)dst casts, we can use :
#define dst_rtable(_ptr) \
container_of_const(_ptr, struct rtable, dst)
Patch is smaller than IPv6 one, because IPv4 has skb_rtable() helper.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/20240429133009.1227754-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of (struct rt6_info *)dst casts, we can use :
#define dst_rt6_info(_ptr) \
container_of_const(_ptr, struct rt6_info, dst)
Some places needed missing const qualifiers :
ip6_confirm_neigh(), ipv6_anycast_destination(),
ipv6_unicast_destination(), has_gateway()
v2: added missing parts (David Ahern)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was observed in the wild that pairs of consecutive packets would leave
the IPVS with the same wrong checksum, and the issue only went away when
disabling GSO.
IPVS needs to avoid computing the SCTP checksum when using GSO.
Fixes: 90017accff ("sctp: Add GSO support")
Co-developed-by: Firo Yang <firo.yang@suse.com>
Signed-off-by: Ismael Luceno <iluceno@suse.de>
Tested-by: Andreas Taschner <andreas.taschner@suse.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The NLM_F_ACK flag is ignored for nfnetlink batch begin and end
messages. This is a problem for ynl which wants to receive an ack for
every message it sends, not just the commands in between the begin/end
messages.
Add processing for ACKs for begin/end messages and provide responses
when requested.
I have checked that iproute2, pyroute2 and systemd are unaffected by
this change since none of them use NLM_F_ACK for batch begin/end.
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240418104737.77914-5-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To be able to constify instances of struct ctl_tables it is necessary to
remove ways through which non-const versions are exposed from the
sysctl core.
One of these is the ctl_table_arg member of struct ctl_table_header.
Constify this reference as a prerequisite for the full constification of
struct ctl_table instances.
No functional change.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
From abort path, nft_mapelem_activate() needs to restore refcounters to
the original state. Currently, it uses the set->ops->walk() to iterate
over these set elements. The existing set iterator skips inactive
elements in the next generation, this does not work from the abort path
to restore the original state since it has to skip active elements
instead (not inactive ones).
This patch moves the check for inactive elements to the set iterator
callback, then it reverses the logic for the .activate case which
needs to skip active elements.
Toggle next generation bit for elements when delete set command is
invoked and call nft_clear() from .activate (abort) path to restore the
next generation bit.
The splat below shows an object in mappings memleak:
[43929.457523] ------------[ cut here ]------------
[43929.457532] WARNING: CPU: 0 PID: 1139 at include/net/netfilter/nf_tables.h:1237 nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables]
[...]
[43929.458014] RIP: 0010:nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables]
[43929.458076] Code: 83 f8 01 77 ab 49 8d 7c 24 08 e8 37 5e d0 de 49 8b 6c 24 08 48 8d 7d 50 e8 e9 5c d0 de 8b 45 50 8d 50 ff 89 55 50 85 c0 75 86 <0f> 0b eb 82 0f 0b eb b3 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90
[43929.458081] RSP: 0018:ffff888140f9f4b0 EFLAGS: 00010246
[43929.458086] RAX: 0000000000000000 RBX: ffff8881434f5288 RCX: dffffc0000000000
[43929.458090] RDX: 00000000ffffffff RSI: ffffffffa26d28a7 RDI: ffff88810ecc9550
[43929.458093] RBP: ffff88810ecc9500 R08: 0000000000000001 R09: ffffed10281f3e8f
[43929.458096] R10: 0000000000000003 R11: ffff0000ffff0000 R12: ffff8881434f52a0
[43929.458100] R13: ffff888140f9f5f4 R14: ffff888151c7a800 R15: 0000000000000002
[43929.458103] FS: 00007f0c687c4740(0000) GS:ffff888390800000(0000) knlGS:0000000000000000
[43929.458107] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[43929.458111] CR2: 00007f58dbe5b008 CR3: 0000000123602005 CR4: 00000000001706f0
[43929.458114] Call Trace:
[43929.458118] <TASK>
[43929.458121] ? __warn+0x9f/0x1a0
[43929.458127] ? nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables]
[43929.458188] ? report_bug+0x1b1/0x1e0
[43929.458196] ? handle_bug+0x3c/0x70
[43929.458200] ? exc_invalid_op+0x17/0x40
[43929.458211] ? nft_setelem_data_deactivate+0xd7/0xf0 [nf_tables]
[43929.458271] ? nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables]
[43929.458332] nft_mapelem_deactivate+0x24/0x30 [nf_tables]
[43929.458392] nft_rhash_walk+0xdd/0x180 [nf_tables]
[43929.458453] ? __pfx_nft_rhash_walk+0x10/0x10 [nf_tables]
[43929.458512] ? rb_insert_color+0x2e/0x280
[43929.458520] nft_map_deactivate+0xdc/0x1e0 [nf_tables]
[43929.458582] ? __pfx_nft_map_deactivate+0x10/0x10 [nf_tables]
[43929.458642] ? __pfx_nft_mapelem_deactivate+0x10/0x10 [nf_tables]
[43929.458701] ? __rcu_read_unlock+0x46/0x70
[43929.458709] nft_delset+0xff/0x110 [nf_tables]
[43929.458769] nft_flush_table+0x16f/0x460 [nf_tables]
[43929.458830] nf_tables_deltable+0x501/0x580 [nf_tables]
Fixes: 628bd3e49c ("netfilter: nf_tables: drop map element references from preparation phase")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add missing decorator type to lookup expression and tighten WARN_ON_ONCE
check in pipapo to spot earlier that this is unset.
Fixes: 29b359cf6d ("netfilter: nft_set_pipapo: walk over current view on netlink dump")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
pppoe traffic reaching ingress path does not match the flowtable entry
because the pppoe header is expected to be at the network header offset.
This bug causes a mismatch in the flow table lookup, so pppoe packets
enter the classical forwarding path.
Fixes: 72efd585f7 ("netfilter: flowtable: add pppoe support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Ensure there is sufficient room to access the protocol field of the
PPPoe header. Validate it once before the flowtable lookup, then use a
helper function to access protocol field.
Reported-by: syzbot+b6f07e1c07ef40199081@syzkaller.appspotmail.com
Fixes: 72efd585f7 ("netfilter: flowtable: add pppoe support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo reports a crash with large batches of elements with a
back-to-back add/remove pattern. Quoting Pablo:
add_elem("00000000") timeout 100 ms
...
add_elem("0000000X") timeout 100 ms
del_elem("0000000X") <---------------- delete one that was just added
...
add_elem("00005000") timeout 100 ms
1) nft_pipapo_remove() removes element 0000000X
Then, KASAN shows a splat.
Looking at the remove function there is a chance that we will drop a
rule that maps to a non-deactivated element.
Removal happens in two steps, first we do a lookup for key k and return the
to-be-removed element and mark it as inactive in the next generation.
Then, in a second step, the element gets removed from the set/map.
The _remove function does not work correctly if we have more than one
element that share the same key.
This can happen if we insert an element into a set when the set already
holds an element with same key, but the element mapping to the existing
key has timed out or is not active in the next generation.
In such case its possible that removal will unmap the wrong element.
If this happens, we will leak the non-deactivated element, it becomes
unreachable.
The element that got deactivated (and will be freed later) will
remain reachable in the set data structure, this can result in
a crash when such an element is retrieved during lookup (stale
pointer).
Add a check that the fully matching key does in fact map to the element
that we have marked as inactive in the deactivation step.
If not, we need to continue searching.
Add a bug/warn trap at the end of the function as well, the remove
function must not ever be called with an invisible/unreachable/non-existent
element.
v2: avoid uneeded temporary variable (Stefano)
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The generation mask can be updated while netlink dump is in progress.
The pipapo set backend walk iterator cannot rely on it to infer what
view of the datastructure is to be used. Add notation to specify if user
wants to read/update the set.
Based on patch from Florian Westphal.
Fixes: 2b84e215f8 ("netfilter: nft_set_pipapo: .walk does not deal with generations")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nft_unregister_obj() can concurrent with __nft_obj_type_get(),
and there is not any protection when iterate over nf_tables_objects
list in __nft_obj_type_get(). Therefore, there is potential data-race
of nf_tables_objects list entry.
Use list_for_each_entry_rcu() to iterate over nf_tables_objects
list in __nft_obj_type_get(), and use rcu_read_lock() in the caller
nft_obj_type_get() to protect the entire type query process.
Fixes: e50092404c ("netfilter: nf_tables: add stateful objects")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nft_unregister_expr() can concurrent with __nft_expr_type_get(),
and there is not any protection when iterate over nf_tables_expressions
list in __nft_expr_type_get(). Therefore, there is potential data-race
of nf_tables_expressions list entry.
Use list_for_each_entry_rcu() to iterate over nf_tables_expressions
list in __nft_expr_type_get(), and use rcu_read_lock() in the caller
nft_expr_type_get() to protect the entire type query process.
Fixes: ef1f7df917 ("netfilter: nf_tables: expression ops overloading")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Hook unregistration is deferred to the commit phase, same occurs with
hook updates triggered by the table dormant flag. When both commands are
combined, this results in deleting a basechain while leaving its hook
still registered in the core.
Fixes: 179d9ba555 ("netfilter: nf_tables: fix table flag updates")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nft_unregister_flowtable_type() within nf_flow_inet_module_exit() can
concurrent with __nft_flowtable_type_get() within nf_tables_newflowtable().
And thhere is not any protection when iterate over nf_tables_flowtables
list in __nft_flowtable_type_get(). Therefore, there is pertential
data-race of nf_tables_flowtables list entry.
Use list_for_each_entry_rcu() to iterate over nf_tables_flowtables list
in __nft_flowtable_type_get(), and use rcu_read_lock() in the caller
nft_flowtable_type_get() to protect the entire type query process.
Fixes: 3b49e2e94e ("netfilter: nf_tables: add flow table netlink frontend")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When dormant flag is toggled, hooks are disabled in the commit phase by
iterating over current chains in table (existing and new).
The following configuration allows for an inconsistent state:
add table x
add chain x y { type filter hook input priority 0; }
add table x { flags dormant; }
add chain x w { type filter hook input priority 1; }
which triggers the following warning when trying to unregister chain w
which is already unregistered.
[ 127.322252] WARNING: CPU: 7 PID: 1211 at net/netfilter/core.c:50 1 __nf_unregister_net_hook+0x21a/0x260
[...]
[ 127.322519] Call Trace:
[ 127.322521] <TASK>
[ 127.322524] ? __warn+0x9f/0x1a0
[ 127.322531] ? __nf_unregister_net_hook+0x21a/0x260
[ 127.322537] ? report_bug+0x1b1/0x1e0
[ 127.322545] ? handle_bug+0x3c/0x70
[ 127.322552] ? exc_invalid_op+0x17/0x40
[ 127.322556] ? asm_exc_invalid_op+0x1a/0x20
[ 127.322563] ? kasan_save_free_info+0x3b/0x60
[ 127.322570] ? __nf_unregister_net_hook+0x6a/0x260
[ 127.322577] ? __nf_unregister_net_hook+0x21a/0x260
[ 127.322583] ? __nf_unregister_net_hook+0x6a/0x260
[ 127.322590] ? __nf_tables_unregister_hook+0x8a/0xe0 [nf_tables]
[ 127.322655] nft_table_disable+0x75/0xf0 [nf_tables]
[ 127.322717] nf_tables_commit+0x2571/0x2620 [nf_tables]
Fixes: 179d9ba555 ("netfilter: nf_tables: fix table flag updates")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The commit mutex should not be released during the critical section
between nft_gc_seq_begin() and nft_gc_seq_end(), otherwise, async GC
worker could collect expired objects and get the released commit lock
within the same GC sequence.
nf_tables_module_autoload() temporarily releases the mutex to load
module dependencies, then it goes back to replay the transaction again.
Move it at the end of the abort phase after nft_gc_seq_end() is called.
Cc: stable@vger.kernel.org
Fixes: 720344340f ("netfilter: nf_tables: GC transaction race with abort path")
Reported-by: Kuan-Ting Chen <hexrabbit@devco.re>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Unlike early commit path stage which triggers a call to abort, an
explicit release of the batch is required on abort, otherwise mutex is
released and commit_list remains in place.
Add WARN_ON_ONCE to ensure commit_list is empty from the abort path
before releasing the mutex.
After this patch, commit_list is always assumed to be empty before
grabbing the mutex, therefore
03c1f1ef15 ("netfilter: Cleanup nft_net->module_list from nf_tables_exit_net()")
only needs to release the pending modules for registration.
Cc: stable@vger.kernel.org
Fixes: c0391b6ab8 ("netfilter: nf_tables: missing validation from the abort path")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Historically, tunnel flags like TUNNEL_CSUM or TUNNEL_ERSPAN_OPT
have been defined as __be16. Now all of those 16 bits are occupied
and there's no more free space for new flags.
It can't be simply switched to a bigger container with no
adjustments to the values, since it's an explicit Endian storage,
and on LE systems (__be16)0x0001 equals to
(__be64)0x0001000000000000.
We could probably define new 64-bit flags depending on the
Endianness, i.e. (__be64)0x0001 on BE and (__be64)0x00010000... on
LE, but that would introduce an Endianness dependency and spawn a
ton of Sparse warnings. To mitigate them, all of those places which
were adjusted with this change would be touched anyway, so why not
define stuff properly if there's no choice.
Define IP_TUNNEL_*_BIT counterparts as a bit number instead of the
value already coded and a fistful of <16 <-> bitmap> converters and
helpers. The two flags which have a different bit position are
SIT_ISATAP_BIT and VTI_ISVTI_BIT, as they were defined not as
__cpu_to_be16(), but as (__force __be16), i.e. had different
positions on LE and BE. Now they both have strongly defined places.
Change all __be16 fields which were used to store those flags, to
IP_TUNNEL_DECLARE_FLAGS() -> DECLARE_BITMAP(__IP_TUNNEL_FLAG_NUM) ->
unsigned long[1] for now, and replace all TUNNEL_* occurrences to
their bitmap counterparts. Use the converters in the places which talk
to the userspace, hardware (NFP) or other hosts (GRE header). The rest
must explicitly use the new flags only. This must be done at once,
otherwise there will be too many conversions throughout the code in
the intermediate commits.
Finally, disable the old __be16 flags for use in the kernel code
(except for the two 'irregular' flags mentioned above), to prevent
any accidental (mis)use of them. For the userspace, nothing is
changed, only additions were made.
Most noticeable bloat-o-meter difference (.text):
vmlinux: 307/-1 (306)
gre.ko: 62/0 (62)
ip_gre.ko: 941/-217 (724) [*]
ip_tunnel.ko: 390/-900 (-510) [**]
ip_vti.ko: 138/0 (138)
ip6_gre.ko: 534/-18 (516) [*]
ip6_tunnel.ko: 118/-10 (108)
[*] gre_flags_to_tnl_flags() grew, but still is inlined
[**] ip_tunnel_find() got uninlined, hence such decrease
The average code size increase in non-extreme case is 100-200 bytes
per module, mostly due to sizeof(long) > sizeof(__be16), as
%__IP_TUNNEL_FLAG_NUM is less than %BITS_PER_LONG and the compilers
are able to expand the majority of bitmap_*() calls here into direct
operations on scalars.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Skip hook unregistration when adding or deleting devices from an
existing netdev basechain. Otherwise, commit/abort path try to
unregister hooks which not enabled.
Fixes: b9703ed44f ("netfilter: nf_tables: support for adding new devices to an existing netdev chain")
Fixes: 7d937b1071 ("netfilter: nf_tables: support for deleting devices in an existing netdev chain")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netdev basechain updates are stored in the transaction object hook list.
When setting on the table dormant flag, it iterates over the existing
hooks in the basechain. Thus, skipping the hooks that are being
added/deleted in this transaction, which leaves hook registration in
inconsistent state.
Reject table flag updates in combination with netdev basechain updates
in the same batch:
- Update table flags and add/delete basechain: Check from basechain update
path if there are pending flag updates for this table.
- add/delete basechain and update table flags: Iterate over the transaction
list to search for basechain updates from the table update path.
In both cases, the batch is rejected. Based on suggestion from Florian Westphal.
Fixes: b9703ed44f ("netfilter: nf_tables: support for adding new devices to an existing netdev chain")
Fixes: 7d937b1071 ("netfilter: nf_tables: support for deleting devices in an existing netdev chain")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Report EOPNOTSUPP if NFT_MSG_DESTROYCHAIN is used to delete hooks in an
existing netdev basechain, thus, only NFT_MSG_DELCHAIN is allowed.
Fixes: 7d937b1071 ("netfilter: nf_tables: support for deleting devices in an existing netdev chain")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If nft_netdev_register_hooks() fails, the memory associated with
nft_stats is not freed, causing a memory leak.
This patch fixes it by moving nft_stats_alloc() down after
nft_netdev_register_hooks() succeeds.
Fixes: b9703ed44f ("netfilter: nf_tables: support for adding new devices to an existing netdev chain")
Signed-off-by: Quan Tian <tianquan23@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Restore skipping transaction if table update does not modify flags.
Fixes: 179d9ba555 ("netfilter: nf_tables: fix table flag updates")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Clone already always provides a current view of the lookup table, use it
to destroy the set, otherwise it is possible to destroy elements twice.
This fix requires:
212ed75dc5 ("netfilter: nf_tables: integrate pipapo into commit protocol")
which came after:
9827a0e6e2 ("netfilter: nft_set_pipapo: release elements in clone from abort path").
Fixes: 9827a0e6e2 ("netfilter: nft_set_pipapo: release elements in clone from abort path")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>