This patch adds a blob layout per chain to represent the ruleset in the
packet datapath.

    size (unsigned long)
    struct nft_rule_dp
      struct nft_expr
      ...
    struct nft_rule_dp
      struct nft_expr
      ...
    struct nft_rule_dp (is_last=1)

The new structure nft_rule_dp represents the rule in a more compact way
(smaller memory footprint) compared to the control-plane nft_rule
structure.
The ruleset blob is a read-only data structure. The first field contains
the blob size, followed by the rules and their expressions. A trailing
rule, used by the tracing infrastructure, is equivalent to the NULL rule
marker in the previous representation. The blob size field does not
include the size of this trailing rule marker.
The ruleset blob is generated from the commit path.
This patch reuses the infrastructure available since 0cbc06b3fa
("netfilter: nf_tables: remove synchronize_rcu in commit phase") to
build the array of rules per chain.
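The layout described above can be illustrated with a small user-space sketch. The types and helpers here (struct ex_rule_dp, struct ex_rule_blob, ex_blob_build(), ex_blob_count_rules()) are hypothetical simplifications for illustration, not the kernel definitions; expressions are modelled as opaque bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the datapath rule header. */
struct ex_rule_dp {
	uint32_t is_last;	/* 1 only on the trailing marker rule */
	uint32_t dlen;		/* bytes of expression data that follow */
};

/* Blob: a size field, then the rules packed back to back. */
struct ex_rule_blob {
	unsigned long size;	/* bytes of rules, trailing marker excluded */
	unsigned char data[];
};

/* Build a blob with n rules, each carrying expr_len zeroed bytes. */
struct ex_rule_blob *ex_blob_build(unsigned int n, uint32_t expr_len)
{
	size_t rules = (size_t)n * (sizeof(struct ex_rule_dp) + expr_len);
	struct ex_rule_blob *b;
	struct ex_rule_dp hdr;
	unsigned char *p;

	b = calloc(1, sizeof(*b) + rules + sizeof(struct ex_rule_dp));
	if (!b)
		return NULL;

	b->size = rules;
	p = b->data;
	for (unsigned int i = 0; i < n; i++) {
		hdr.is_last = 0;
		hdr.dlen = expr_len;
		memcpy(p, &hdr, sizeof(hdr));
		p += sizeof(hdr) + expr_len;
	}

	/* Trailing marker rule, not accounted for in b->size. */
	hdr.is_last = 1;
	hdr.dlen = 0;
	memcpy(p, &hdr, sizeof(hdr));
	return b;
}

/* Walk the rules until the trailing marker is reached. */
unsigned int ex_blob_count_rules(const struct ex_rule_blob *b)
{
	const unsigned char *p = b->data;
	struct ex_rule_dp hdr;
	unsigned int n = 0;

	for (;;) {
		memcpy(&hdr, p, sizeof(hdr));
		if (hdr.is_last)
			break;
		n++;
		p += sizeof(hdr) + hdr.dlen;
	}
	/* The marker sits exactly at offset "size". */
	assert((size_t)(p - b->data) == b->size);
	return n;
}
```

The walker never needs a NULL pointer check: it stops on the marker, which is why the marker is kept outside the size accounting.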
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
include/linux/netfilter_netdev.h:97 suspicious rcu_dereference_check() usage!
2 locks held by sd-resolve/1100:
0: ..(rcu_read_lock_bh){1:3}, at: ip_finish_output2
1: ..(rcu_read_lock_bh){1:3}, at: __dev_queue_xmit
__dev_queue_xmit+0 ..
The helper has two callers: one uses rcu_read_lock(), the other
rcu_read_lock_bh(). Annotate the dereference to reflect this.
Fixes: 42df6e1d22 ("netfilter: Introduce egress hook")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It's the same as nf_conntrack_put(), but without the
need for an indirect call. The downside is a module dependency on
nf_conntrack, but all of these already depend on conntrack anyway.
Cc: Paul Blakey <paulb@mellanox.com>
Cc: dev@openvswitch.org
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_ct_put() results in a useless indirection:
nf_ct_put -> nf_conntrack_put -> nf_conntrack_destroy -> rcu readlock +
indirect call of ct_hooks->destroy().
There are two _put helpers:
nf_ct_put and nf_conntrack_put. The latter is what should be used in
code that MUST NOT cause a linker dependency on the conntrack module
(e.g. calls from core network stack).
Everyone else should call nf_ct_put() instead.
A followup patch will convert a few nf_conntrack_put() calls to
nf_ct_put(), in particular from modules that already have a conntrack
dependency such as act_ct or even nf_conntrack itself.
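The difference between the two helpers can be sketched in user space as follows. This is only a model under stated assumptions: struct ex_conn, ex_hook_ptr, ex_core_put() and ex_module_put() are hypothetical names, and the plain integer counter stands in for the real reference count.

```c
#include <assert.h>
#include <stddef.h>

struct ex_conn {
	int use;	/* reference count (plain int in this sketch) */
	int destroyed;	/* set once the destructor has run */
};

/* Destructor; lives in the "conntrack module" in this model. */
void ex_conn_destroy(struct ex_conn *ct)
{
	ct->destroyed = 1;
}

/* Hook table consulted by core code; NULL while the module is absent. */
struct ex_ct_hook {
	void (*destroy)(struct ex_conn *ct);
};
const struct ex_ct_hook *ex_hook_ptr;

/* Core-stack variant: no link-time dependency, but an indirect call. */
void ex_core_put(struct ex_conn *ct)
{
	if (!ct)
		return;
	assert(ct->use > 0);
	if (--ct->use == 0) {
		const struct ex_ct_hook *h = ex_hook_ptr;

		if (h)
			h->destroy(ct);
	}
}

/* Module-side variant: calls the destructor directly. */
void ex_module_put(struct ex_conn *ct)
{
	if (!ct)
		return;
	assert(ct->use > 0);
	if (--ct->use == 0)
		ex_conn_destroy(ct);
}
```

Code that already links against the module loses nothing by using the direct variant and saves the pointer load and indirect call.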
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No functional changes, these structures should be const.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ip_ct_attach predates struct nf_ct_hook, we can place it there and
remove the exported symbol.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Convert nf_conn reference counting from atomic_t to refcount_t based api.
refcount_t api provides more runtime sanity checks and will warn on
certain constructs, e.g. refcount_inc() on a zero reference count, which
usually indicates use-after-free.
For this reason template allocation is changed to init the refcount to
1, and the subsequent add operations are removed.
Likewise, init_conntrack() is changed to set the initial refcount to 1
instead of calling refcount_inc().
This is safe because the new entry is not (yet) visible to other cpus.
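A user-space sketch of why the initial count is set rather than incremented. The names (struct ex_ref, ex_ref_set(), ex_ref_inc_checked(), ex_ref_dec_and_test()) are hypothetical; the mock simply refuses an increment from zero, whereas the kernel helper emits a warning in that situation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct ex_ref {
	atomic_uint refs;
};

/* Set the initial count; only valid before the object is published. */
void ex_ref_set(struct ex_ref *r, unsigned int n)
{
	atomic_store(&r->refs, n);
}

/* Take a reference; returns false instead of reviving a zero count. */
bool ex_ref_inc_checked(struct ex_ref *r)
{
	unsigned int old = atomic_load(&r->refs);

	do {
		if (old == 0)
			return false;	/* likely use-after-free */
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));

	return true;
}

/* Drop a reference; returns true when the last one is gone. */
bool ex_ref_dec_and_test(struct ex_ref *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);
	return old == 1;
}
```

A freshly allocated object starts at zero, so an increment would be flagged; setting the count to 1 directly avoids that.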
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The pointer lt is being assigned a value and then later
updated but that value is never read. The pointer is
redundant and can be removed.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If destination port is above 32k and source port below 16k
assume this might cause 'port shadowing' where a 'new' inbound
connection matches an existing one, e.g.
inbound X:41234 -> Y:53 matches existing conntrack entry
Z:53 -> X:4123, where Z got natted to X.
In this case, new packet is natted to Z:53 which is likely
unwanted.
We avoid the rewrite for connections that originate from local host:
port-shadowing is only possible with forwarded connections.
Also adjust test case.
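The heuristic described above, as a standalone sketch. The function name ex_force_port_remap() and the constants are illustrative; the thresholds follow the "above 32k" and "below 16k" wording, and the exact comparison operators are an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Thresholds from the description; exact boundaries are assumed. */
#define EX_DPORT_MIN	32768u	/* "above 32k" */
#define EX_SPORT_MAX	16384u	/* "below 16k" */

static_assert(EX_SPORT_MAX < EX_DPORT_MIN, "port ranges must not overlap");

/*
 * Decide whether the source port should be rewritten even though the
 * original tuple is still free. Locally originated flows are exempt,
 * since shadowing needs a forwarded connection.
 */
bool ex_force_port_remap(uint16_t sport, uint16_t dport, bool local_origin)
{
	if (local_origin)
		return false;

	return sport < EX_SPORT_MAX && dport >= EX_DPORT_MIN;
}
```

With the example from the text, a forwarded Z:53 -> X:41234 flow would be remapped, while the same flow originating from the local host would not.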
v3: no need to call tuple_force_port_remap if already in random mode (Phil)
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Phil Sutter <phil@nwl.cc>
Acked-by: Eric Garver <eric@garver.life>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This allows identifying flows that originate from the local machine
in a followup patch.
It would be possible to make this a ->status bit instead.
I have not done that because I don't have a use-case for
exposing this info to userspace.
If one comes up the toggle can be replaced with a status bit.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Chain stats are updated from the Netfilter hook path, which already runs
under an RCU read-side lock section.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If compiled with CONFIG_NET_NS_REFCNT_TRACKER=y,
using put_net_track() in iterate_cleanup_work()
and netns_tracker_alloc() in nf_nat_masq_schedule()
might help us find netns refcount imbalances.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If compiled with CONFIG_NET_NS_REFCNT_TRACKER=y,
using put_net_track() in nfulnl_instance_free_rcu()
and get_net_track() in instance_create()
might help us find netns refcount imbalances.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add explicit support for user templates. At the moment, the maximum
TCAM rule size is used for all rules, regardless of which and how
many flower matches the user provides. This means that some TCAM
space is wasted, which limits the number of filters that can be
offloaded.
Introducing the template allows more filters to be offloaded to HW,
because the user specifies the template explicitly.
Example:
tc qd add dev PORT clsact
tc chain add dev PORT ingress protocol ip \
flower dst_ip 0.0.0.0/16
tc filter add dev PORT ingress protocol ip \
flower skip_sw dst_ip 1.2.3.4/16 action drop
NOTE: chain 0 is the default chain id for "tc chain" & "tc filter"
command, so it is omitted in the example above.
This patch adds template support only for default chain 0, which is
supported by the prestera driver at this moment. Chains are not
supported yet and will be added later.
Signed-off-by: Volodymyr Mytnyk <vmytnyk@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent net-next fails to initialize ports with:
realtek-smi switch: phy mode gmii is unsupported on port 0
realtek-smi switch lan5 (uninitialized): validation of gmii with
support 0000000,00000000,000062ef and advertisement
0000000,00000000,000062ef failed: -22
realtek-smi switch lan5 (uninitialized): failed to connect to PHY:
-EINVAL
realtek-smi switch lan5 (uninitialized): error -22 setting up PHY
for tree 1, switch 0, port 0
The current net branch (3dd7d40b43) is not
affected.
I also noticed the same issue before with older versions but using
a MDIO interface driver, not realtek-smi.
Tested-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jeroen de Borst says:
====================
gve improvements
This patchset consists of unrelated changes:
A bug fix for an issue that disabled jumbo-frame support, a few code
improvements and minor functional changes, and 3 new features:
Supporting tx|rx-coalesce-usec for DQO
Suspend/resume/shutdown
Optional metadata descriptors
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding ethtool support for changing rx-coalesce-usec and tx-coalesce-usec
when using the DQO queue format.
Signed-off-by: Tao Liu <xliutaox@google.com>
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Being able to see how many descriptors are in-use is helpful
when diagnosing certain issues.
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
Signed-off-by: Jordan Kim <jrkim@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for suspend, resume and shutdown.
Signed-off-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David Awogbemila <awogbemila@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow drivers to pass metadata along with packet data to the device.
Introduce a new metadata descriptor type
* GVE_TXD_MTD
This descriptor is optional. If present it immediately follows the
packet descriptor and precedes the segment descriptor.
This descriptor may be repeated. Multiple metadata descriptors may
follow. There are no immediate uses for this; it is for future
proofing. At present devices allow only 1 MTD descriptor.
The lower four bits of the type_flags field encode GVE_TXD_MTD.
The upper four bits of the type_flags field encode a *sub*type.
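A minimal sketch of packing two four-bit fields into one byte, following the wording above literally (type in the low nibble, subtype in the high nibble). The constant values and helper names are hypothetical; the driver's descriptor header is authoritative for the real values and bit positions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values, for illustration only. */
#define EX_TXD_MTD		0x3u
#define EX_MTD_SUBTYPE_PATH	0x1u

/* Combine a 4-bit type (low nibble) and 4-bit subtype (high nibble). */
uint8_t ex_mtd_type_flags(uint8_t type, uint8_t subtype)
{
	assert(type <= 0xf && subtype <= 0xf);
	return (uint8_t)((subtype << 4) | type);
}

/* Extract the descriptor type from the low nibble. */
uint8_t ex_mtd_type(uint8_t type_flags)
{
	return type_flags & 0x0f;
}

/* Extract the subtype from the high nibble. */
uint8_t ex_mtd_subtype(uint8_t type_flags)
{
	return type_flags >> 4;
}
```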
Introduce one such metadata descriptor subtype
* GVE_MTD_SUBTYPE_PATH
This shares path information with the device for network failure
discovery and robust response:
Linux derives ipv6 flowlabel and ECMP multipath from sk->sk_txhash,
and updates this field on error with sk_rethink_txhash. Allow the host
stack to do the same. Pass the tx_hash value if set. Also communicate
whether the path hash is set, or more exactly, what its type is. Define
two common types
GVE_MTD_PATH_HASH_NONE
GVE_MTD_PATH_HASH_L4
Concrete examples of error conditions that are resolved are
mentioned in the commits that add sk_rethink_txhash calls, such as
commit 7788174e87 ("tcp: change IPv6 flow-label upon receiving
spurious retransmission").
Experimental results mirror what the theory suggests: where IPv6
FlowLabel is included in path selection (e.g., LAG/ECMP), flowlabel
rotation on TCP timeout avoids the vast majority of TCP disconnects
that would otherwise have occurred during link failures in long-haul
backbones, when an alternative path is available.
Rotation can be applied to various bad connection signals, such as
timeouts and spurious retransmissions. In aggregate, such flow level
signals can help locate network issues. Define initial common states:
GVE_MTD_PATH_STATE_DEFAULT
GVE_MTD_PATH_STATE_TIMEOUT
GVE_MTD_PATH_STATE_CONGESTION
GVE_MTD_PATH_STATE_RETRANSMIT
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David Awogbemila <awogbemila@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No longer needed after we introduced the barrier in gve_napi_poll.
Signed-off-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The id field should be a u32 not a signed int.
Signed-off-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Giving the device access to other kernel structs is not ideal.
Move the indexes into their own array and just keep pointers to
them in the ntfy block struct.
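The split can be pictured with the following sketch, in which the device-visible data lives in its own array and the host-private blocks only point into it. All names (struct ex_irq_db, struct ex_notify_block, ex_notify_blocks_init()) and the napi_budget field are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Device-visible slot: holds the index and nothing else. */
struct ex_irq_db {
	uint32_t index;
};

/* Host-private per-vector state; never mapped for the device. */
struct ex_notify_block {
	struct ex_irq_db *irq_db_index;	/* points into the shared array */
	unsigned int napi_budget;	/* stand-in for host-only state */
};

/* Wire each block to its slot in the separately allocated array. */
void ex_notify_blocks_init(struct ex_notify_block *blocks,
			   struct ex_irq_db *shared, size_t n)
{
	assert(blocks && shared);

	for (size_t i = 0; i < n; i++) {
		shared[i].index = 0;
		blocks[i].irq_db_index = &shared[i];
		blocks[i].napi_budget = 64;
	}
}
```

Only the shared array would need to be exposed to the device; the rest of each block stays out of its reach.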
Signed-off-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David Awogbemila <awogbemila@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The legacy raw addressing device option was processed before the
new RDA queue format option. This caused the supported features mask,
which is provided only on the RDA queue format option, not to be set.
This disabled jumbo-frame support when using raw addressing.
Fixes: 255489f5b3 ("gve: Add a jumbo-frame device option")
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King says:
====================
net: phylink: add PCS validation
This series allows phylink to include the PCS in its validation step.
There are two reasons to make this change:
1. Some of the network drivers that are making use of the split PCS
support are already manually calling into their PCS drivers to
perform validation. E.g. stmmac with xpcs.
2. Logically, for some network drivers such as mvneta and mvpp2, the
restriction we impose in the validate() callback is a property of
the "PCS" block that we provide rather than the MAC.
This series:
1. Gives phylink a mechanism to query the MAC driver which PCS it
wishes to use for the PHY interface mode. This is necessary to allow
the PCS to be involved in the validation step without making changes
to the configuration.
2. Provide a pcs_validate() method that PCS can implement. This follows
a similar model to the MAC's validate() callback, but with some minor
differences due to observations from the various implementations.
E.g. returning an error code for not-supported and the way the
advertising bitmap is masked.
3. Convert mvpp2 and mvneta to this as examples of its use. Further
conversions are in the pipeline, including for stmmac+xpcs, as well
as some DSA drivers. Note that DSA conversion to this is conditional
upon all DSA drivers populating their supported_interfaces bitmap,
since this is required before mac_select_pcs() can be used.
Existing drivers that set a PCS in mac_prepare() or mac_config(), or
shortly after phylink_create() will continue to work. However, it should
be noted that mac_select_pcs() will be called during phylink_create(),
and thus any PCS returned by mac_select_pcs() must be available by this
time - or we drop the check in phylink_create().
v2: fix kerneldoc typo in patch 1.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert mvneta to validate the autoneg state for 1000base-X in the
pcs_validate() operation, rather than the MAC validate() operation.
This allows us to switch the MAC validate() to use
phylink_generic_validate().
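A rough user-space model of such a PCS-side check. The types and names (enum exv_interface, struct exv_link_state, exv_pcs_validate()) are hypothetical and much simpler than the phylink structures; the use of -EINVAL as the "not supported" error code is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum exv_interface {
	EXV_IF_SGMII,
	EXV_IF_1000BASEX,
};

/* Reduced link state: interface mode plus the Autoneg advertisement. */
struct exv_link_state {
	enum exv_interface interface;
	bool autoneg_advertised;
};

/*
 * PCS-side validation: in this model 1000base-X is only usable with
 * autonegotiation, so reject the configuration otherwise.
 */
int exv_pcs_validate(const struct exv_link_state *state)
{
	assert(state != NULL);

	if (state->interface == EXV_IF_1000BASEX &&
	    !state->autoneg_advertised)
		return -EINVAL;

	return 0;
}
```

With the restriction expressed in the PCS hook, the MAC-side validation no longer needs any interface-specific special case.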
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
An initial stab at converting mvneta to PCS operations. There are a few
FIXMEs to be solved.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert mvneta to use the mac_prepare() and mac_finish() methods in
preparation for converting mvneta to split-PCS support.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert mvpp2 to validate the autoneg state for 1000base-X in the
pcs_validate() operation, rather than the MAC validate() operation.
This allows us to switch the MAC validate() to use
phylink_generic_validate().
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the mac_select_pcs() method to choose between the GMAC and XLG
PCS implementations.
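In outline, the selection could look like the sketch below. The types are hypothetical mocks (struct ex_port, struct ex_pcs, enum ex_interface), and the mapping of interface modes to the two PCS blocks is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum ex_interface {
	EX_IF_SGMII,
	EX_IF_1000BASEX,
	EX_IF_10GBASER,
};

struct ex_pcs {
	const char *name;
};

/* A port that embeds one PCS instance per hardware block. */
struct ex_port {
	struct ex_pcs gmac_pcs;
	struct ex_pcs xlg_pcs;
};

/*
 * Return the PCS to use for the given interface mode. The function
 * only answers the query and does not touch any configuration.
 */
struct ex_pcs *ex_mac_select_pcs(struct ex_port *port, enum ex_interface mode)
{
	assert(port != NULL);

	switch (mode) {
	case EX_IF_10GBASER:
		return &port->xlg_pcs;
	case EX_IF_SGMII:
	case EX_IF_1000BASEX:
		return &port->gmac_pcs;
	}
	return NULL;
}
```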
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a hook for PCS to validate the link parameters. This avoids MAC
drivers having to have knowledge of their PCS in their validate()
method, thereby allowing several MAC drivers to be simplified.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
mac_select_pcs() allows us to have an explicit point to query which
PCS the MAC wishes to use for a particular PHY interface mode, thereby
allowing us to add support to validate the link settings with the PCS.
Phylink will also use this to select the PCS to be used during a major
configuration event without the MAC driver needing to call
phylink_set_pcs().
Note that if mac_select_pcs() is present, the supported_interfaces
bitmap must be filled in; this avoids mac_select_pcs() being called
with PHY_INTERFACE_MODE_NA when we want to get support for all
interface types. Phylink will return an error in phylink_create()
unless this condition is satisfied.
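The creation-time condition can be modelled roughly as below. Everything here is a hypothetical mock (struct ex_mac_ops, struct ex_config, ex_create_check(), ex_no_pcs()); the bitmap is reduced to one unsigned long and the -EINVAL return value is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct ex_pcs;

/* MAC operations; select_pcs is optional. */
struct ex_mac_ops {
	struct ex_pcs *(*select_pcs)(void *priv, int mode);
};

/* Reduced config: one bit per supported interface mode. */
struct ex_config {
	unsigned long supported_interfaces;
};

/* Example callback that never provides a PCS. */
struct ex_pcs *ex_no_pcs(void *priv, int mode)
{
	(void)priv;
	(void)mode;
	return NULL;
}

/*
 * Refuse creation when select_pcs is implemented but the MAC did not
 * say which interface modes it supports.
 */
int ex_create_check(const struct ex_mac_ops *ops, const struct ex_config *cfg)
{
	assert(ops && cfg);

	if (ops->select_pcs && cfg->supported_interfaces == 0)
		return -EINVAL;

	return 0;
}
```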
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tony Nguyen says:
====================
100GbE Intel Wired LAN Driver Updates 2021-12-15
This series contains updates to ice driver only.
Jake makes changes to flash update. This includes the following:
* a new shadow-ram region similar to NVM region but for the device shadow
RAM contents. This is distinct from NVM region because shadow RAM is
built up during device init and may be different from the raw NVM flash
data.
* refactoring of the ice_flash_pldm_image to become the main flash update
entry point. This is simpler than having both an
ice_devlink_flash_update and an ice_flash_pldm_image. It will make
additions like dry-run easier in the future.
* reducing time to read Option ROM version information.
* adding support for firmware activation via devlink reload, when
possible.
The major new work is the reload support, which allows activating firmware
immediately without a reboot when possible. Reload support only supports
firmware activation.
Jesse improves transmit code: utilizing newer netif_tx* API, adding some
prefetch calls, correcting expected conditions when calling ice_vsi_down(),
and utilizing __netdev_tx_sent_queue() call.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed says:
====================
mlx5-next branch 2021-12-15
Hi Dave, Jakub, Jason
This pulls mlx5-next branch into net-next and rdma branches.
All patches already reviewed on both rdma and netdev mailing lists.
Please pull and let me know if there's any problem.
1) Add multiple FDB steering priorities [1]
2) Introduce HW bits needed to configure MAC list size of VF/SF.
Required for ("net/mlx5: Memory optimizations") upcoming series [2].
[1] https://lore.kernel.org/netdev/20211201193621.9129-1-saeed@kernel.org/
[2] https://lore.kernel.org/lkml/20211208141722.13646-1-shayd@nvidia.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next, mostly
rather small housekeeping patches:
1) Remove unused variable in IPVS, from GuoYong Zheng.
2) Use memset_after in conntrack, from Kees Cook.
3) Remove leftover function in nfnetlink_queue, from Florian Westphal.
4) Remove redundant test on bool in conntrack, from Bernard Zhao.
5) egress support for nft_fwd, from Lukas Wunner.
6) Make pppoe work for br_netfilter, from Florian Westphal.
7) Remove unused variable in conntrack resize routine, from luo penghao.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next:
netfilter: conntrack: Remove useless assignment statements
netfilter: bridge: add support for pppoe filtering
netfilter: nft_fwd_netdev: Support egress hook
netfilter: ctnetlink: remove useless type conversion to bool
netfilter: nf_queue: remove leftover synchronize_rcu
netfilter: conntrack: Use memset_startat() to zero struct nf_conn
ipvs: remove unused variable for ip_vs_new_dest
====================
Link: https://lore.kernel.org/r/20211215234911.170741-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The old_size assignment here will not be used anymore.
The clang_analyzer complains as follows:
Value stored to 'old_size' is never read
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A downstream patch will use this bit to know whether the device
supports changing max_uc_list.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The kernel gained a new interface for drivers to use to combine tail
bump (doorbell) and BQL updates. Attempt to use that new interface.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>