This patch removes a duplicate define from
net/netfilter/ipset/ip_set_hash_gen.h
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
At the restructuring of the bitmap types creation in ipset, for the
bitmap:port type wrong (too large) memory allocation was copied
(netfilter bugzilla id #859).
Reported-by: Quentin Armitage <quentin@armitage.org.uk>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Don't verify checksum for outgoing packets because checksum calculation
may be done by the device.
Without this patch:
$ ip6tables -I OUTPUT -p tcp --dport 80 -j REJECT --reject-with tcp-reset
$ time telnet ipv6.google.com 80
Trying 2a00:1450:4010:c03::67...
telnet: Unable to connect to remote host: Connection timed out
real 0m7.201s
user 0m0.000s
sys 0m0.000s
With the patch applied:
$ ip6tables -I OUTPUT -p tcp --dport 80 -j REJECT --reject-with tcp-reset
$ time telnet ipv6.google.com 80
Trying 2a00:1450:4010:c03::67...
telnet: Unable to connect to remote host: Connection refused
real 0m0.085s
user 0m0.000s
sys 0m0.000s
Signed-off-by: Stanislav Fomichev <stfomichev@yandex-team.ru>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The unnamed union should be possible to be initialized directly, but
unfortunately it's not so:
/usr/src/ipset/kernel/net/netfilter/ipset/ip_set_hash_netnet.c: In
function ?hash_netnet4_kadt?:
/usr/src/ipset/kernel/net/netfilter/ipset/ip_set_hash_netnet.c:141:
error: unknown field ?cidr? specified in initializer
Reported-by: Husnu Demir <hdemir@metu.edu.tr>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Instead of cb->data, use callback dump args only and introduce symbolic
names instead of plain numbers at accessing the argument members.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
we can allow users in uninit net namespace to operate ipt_CLUSTERIP
now.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Create proc entries under the ipt_CLUSTERIP directory of proper
net namespace.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Inorder to find clusterip_config in net namespace.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
this lock is used for protecting clusterip_configs of per
net namespace, it should be per net namespace too.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
clusterip_configs should be per net namespace, so operate
cluster in one net namespace won't affect other net
namespace. right now, only allow to operate the clusterip_configs
of init net namespace.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Create /proc/net/ipt_CLUSTERIP directory for per net namespace.
Right now,only allow to create entries under the ipt_CLUSTERIP
in init net namespace.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
TCP listener refactoring, part 7 :
Use sock_gen_put() instead of xt_socket_put_sk() for future
SYN_RECV support.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Improve the SH fallback realserver selection strategy.
With sh and sh-fallback, if a realserver is down, this attempts to
distribute the traffic that would have gone to that server evenly
among the remaining servers.
Signed-off-by: Alexander Frolkin <avf@eldamar.org.uk>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
commit 578bc3ef1e ("ipvs: reorganize dest trash") added
rcu_barrier() on cleanup to wait dest users and schedulers
like LBLC and LBLCR to put their last dest reference.
Using rcu_barrier with many namespaces is problematic.
Trying to fix it by freeing dest with kfree_rcu is not
a solution, RCU callbacks can run in parallel and execution
order is random.
Fix it by creating new function ip_vs_dest_put_and_free()
which is heavier than ip_vs_dest_put(). We will use it just
for schedulers like LBLC, LBLCR that can delay their dest
release.
By default, dests reference is above 0 if they are present in
service and it is 0 when deleted but still in trash list.
Change the dest trash code to use ip_vs_dest_put_and_free(),
so that refcnt -1 can be used for freeing. As result,
such checks remain in slow path and the rcu_barrier() from
netns cleanup can be removed.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Jeff Kirsher says:
====================
This series contains updates to i40e only.
Alex provides the majority of the patches against i40e, where he does
cleanup of the Tx and RX queues and to align the code with the known
good Tx/Rx queue code in the ixgbe driver.
Anjali provides an i40e patch to update link events to not print to
the log until the device is administratively up.
Catherine provides a patch to update the driver version.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 634fb979e8 ("inet: includes a sock_common in request_sock")
I forgot that the two ports in sock_common do not have same byte order :
skc_dport is __be16 (network order), but skc_num is __u16 (host order)
So sparse complains because ir_loc_port (mapped into skc_num) is
considered as __u16 while it should be __be16
Let rename ir_loc_port to ireq->ir_num (analogy with inet->inet_num),
and perform appropriate htons/ntohs conversions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update the version number of the driver.
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This change brings support for 64 bit netstats to the driver. Previously
the stats were 64 bit but highly racy due to the fact that 64 bit
transactions are not atomic on 32 bit systems. This change makes is so
that the 64 bit byte and packet stats are reliable on all architectures.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Allocate the queue pairs individually instead of as a group. This
allows for much easier queue management as it is possible to dynamically
resize the queues without having to free and allocate the entire block.
Ease statistic collection by treating Tx/Rx queue pairs as a single
unit. Each pair is allocated together and starts with a Tx queue and
ends with an Rx queue. By ordering them this way it is possible to know
the Rx offset based on a pointer to the Tx queue.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This replaces the ring container array with a linked list. The idea is
to make the logic much easier to deal with since this will allow us to
call a simple helper function from the q_vectors to go through the
entire list.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Allocate the q_vectors individually. The advantage to this is that it
allows for easier freeing and allocation. In addition it makes it so
that we could do node specific allocations at some point in the future.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This makes it so that the Tx and Rx byte and packet counts are
separated from the rest of the statistics. This allows for better
isolation of these stats when we move them into the 64 bit statistics.
Simplify things by re-ordering how the stats display in ethtool.
Instead of displaying all of the Tx queues as a block, followed by all
the Rx queues, the new order is Tx[0], Rx[0], Tx[1], Rx[1], ..., Tx[n],
Rx[n]. This reduces the loops and cleans up the display for testing
purposes since it is very easy to verify if flow director is doing the
right thing as the Tx and Rx queue pair are shown in pairs.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Drop Tx flag and TXSW which is tested but never set.
As a result of this change we can drop a complicated check that always
resulted in the final result of i40e_tx_csum being equal to the
CHECKSUM_PARTIAL value. As such we can replace the entire function call
with just a check for skb->summed == CHECKSUM_PARTIAL.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Sync the fast path for i40e_tx_map and i40e_clean_tx_irq so that they
are similar to igb and ixgbe.
- Only update the Tx descriptor ring in tx_map
- Make skb mapping always on the first buffer in the chain
- Drop the use of MAPPED_AS_PAGE Tx flag
- Only store flags on the first buffer_info structure
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Avoid directly incrementing next_to_use for multiple reasons. The main
reason being that if we directly increment it then it can attain a state
where it is equal to the ring count. Technically this is a state it
should not be able to reach but the way this is written it now can.
This patch pulls the value off into a register and then increments it
and writes back either the value or 0 depending on if the value is equal
to the ring count.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
- drop the mapped_as_page u8 from the Tx buffer info as it was unused
- use the DMA unmap accessors for Tx DMA
- replace checks of DMA with checks of the unmap length to verify if an
unmap is needed
- update the Tx buffer layout to make it consistent with igb, ixgbe
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
sk_pacing_rate is read by sch_fq packet scheduler at any time,
with no synchronization, so make sure we update it in a
sensible way. ACCESS_ONCE() is how we instruct compiler
to not do stupid things, like using the memory location
as a temporary variable.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP listener refactoring, part 5 :
We want to be able to insert request sockets (SYN_RECV) into main
ehash table instead of the per listener hash table to allow RCU
lookups and remove listener lock contention.
This patch includes the needed struct sock_common in front
of struct request_sock
This means there is no more inet6_request_sock IPv6 specific
structure.
Following inet_request_sock fields were renamed as they became
macros to reference fields from struct sock_common.
Prefix ir_ was chosen to avoid name collisions.
loc_port -> ir_loc_port
loc_addr -> ir_loc_addr
rmt_addr -> ir_rmt_addr
rmt_port -> ir_rmt_port
iif -> ir_iif
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_gro_receive() is currently limited to 16 or 17 MSS per GRO skb,
typically 24616 bytes, because it fills up to MAX_SKB_FRAGS frags.
It's relatively easy to extend the skb using frag_list to allow
more frags to be appended into the last sk_buff.
This still builds very efficient skbs, and allows reaching 45 MSS per
skb.
(45 MSS GRO packet uses one skb plus a frag_list containing 2 additional
sk_buff)
High speed TCP flows benefit from this extension by lowering TCP stack
cpu usage (less packets stored in receive queue, less ACK packets
processed)
Forwarding setups could be hurt, as such skbs will need to be
linearized, although its not a new problem, as GRO could already
provide skbs with a frag_list.
We could make the 65536 bytes threshold a tunable to mitigate this.
(First time we need to linearize skb in skb_needs_linearize(), we could
lower the tunable to ~16*1460 so that following skb_gro_receive() calls
build smaller skbs)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a enhancement.
for the first node in fib_trie, newpos is 0, bit is 1.
Only for the leaf or node with unmatched key need calc pos.
Signed-off-by: baker.zhang <baker.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can only setup multicast address for network device when
net_device_ops->ndo_set_rx_mode is not null.
Some configurations need to add multicast address for net
device, such as netfilter cluster match module.
Add a fake ndo_set_rx_mode function to allow this operation.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Tx "completed" stat was part of the original rewrite for detecting
Tx hangs. However some time ago in ixgbe I determined that we could
just use the packets stat instead. Since then this stat was
removed from ixgbe and it serves no purpose in i40e so it can be
dropped.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Link events should not print to the log until the device is
administratively up.
Signed-off-by: Anjali Singhai <anjali.singhai@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
SkyHawk-R can support RoCE. Add code to display RoCE specific
counters maintained in hardware.
Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Moving to version 2 of GET_STATS command as SkyHawk-R supports
higher number of rings.
Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each network interface (either PF or VF) is identified by its port's MAC id.
Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
CONFIG_IPV6=n is still a valid choice ;)
It appears we can remove dead code.
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP listener refactoring, part 4 :
To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common
Now is time to do the same for IPv6, because it permits us to have fast
lookups for all kind of sockets, including upcoming SYN_RECV.
Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).
inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6
This patch is way bigger than its IPv4 counter part, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not doable easily.
inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr
And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.
We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP listener refactoring, part 3 :
Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
and parallel SYN processing.
Current inet_ehash_bucket contains two chains, one for ESTABLISH (and
friend states) sockets, another for TIME_WAIT sockets only.
As the hash table is sized to get at most one socket per bucket, it
makes little sense to have separate twchain, as it makes the lookup
slightly more complicated, and doubles hash table memory usage.
If we make sure all socket types have the lookup keys at the same
offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
and ESTABLISHED sockets already have common lookup fields for IPv4.
[ INET_TW_MATCH() is no longer needed ]
I'll provide a follow-up to factorize IPv6 lookup as well, to remove
INET6_TW_MATCH()
This way, SYN_RECV pseudo sockets will be supported the same.
A new sock_gen_put() helper is added, doing either a sock_put() or
inet_twsk_put() [ and will support SYN_RECV later ].
Note this helper should only be called in real slow path, when rcu
lookup found a socket that was moved to another identity (freed/reused
immediately), but could eventually be used in other contexts, like
sock_edemux()
Before patch :
dmesg | grep "TCP established"
TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
After patch :
TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
include/linux/netdevice.h
net/core/sock.c
Trivial merge issues.
Removal of "extern" for functions declaration in netdevice.h
at the same time "const" was added to an argument.
Two parallel line additions in net/core/sock.c
Signed-off-by: David S. Miller <davem@davemloft.net>
Ben Hutchings says:
====================
Some more fixes for EF10 support; hopefully the last lot:
1. Fixes for reading statistics, from Edward Cree and Jon Cooper.
2. Addition of ethtool statistics for packets dropped by the hardware
before they were associated with a specific function, from Edward Cree.
3. Only bind to functions that are in control of their associated port,
as the driver currently assumes this is the case.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Steinar reported FQ pacing was not working for UDP flows.
It looks like the initial sk->sk_pacing_rate value of 0 was
a wrong choice. We should init it to ~0U (unlimited)
Then, TCA_FQ_FLOW_DEFAULT_RATE should be removed because it makes
no real sense. The default rate is really unlimited, and we
need to avoid a zero divide.
Reported-by: Steinar H. Gunderson <sesse@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 612c337306.
As per Stephen Hemminger, the layout of the netlink attribute
is not implemented correctly so revert this for now.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the missing destroy_workqueue() before return from
qlcnic_probe() in the error handling case.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fix the error handling in moxart_mac_probe():
- return -ENOMEM in some memory alloc fail cases
- add missing free_netdev() in the error handling case
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the calculation of the nlmsg size, by adding the missing
nla_total_size().
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>