As part of an effort to move skb->dropcount to skb->cb[] use a common
macro in protocol families using skb->cb[] for ancillary data to
validate available room in skb->cb[].
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_fastopen_create_child() is static and should not be exported.
tcp4_gso_segment() and tcp6_gso_segment() should be static.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Another TCP issue is triggered by ECN.
Under pressure, receiver gets ECN marks, and send back ACK packets
with ECE TCP flag. Senders enter CA_CWR state.
In this state, tcp_tso_should_defer() is short cut :
if (icsk->icsk_ca_state != TCP_CA_Open)
goto send_now;
This means that about all ACK packets we receive are triggering
a partial send, and because cwnd is kept small, we can only send
a small amount of data for each incoming ACK,
which in return generate more ACK packets.
Allowing CA_Open and CA_CWR states to enable TSO defer in
tcp_tso_should_defer() brings performance back :
TSO autodefer has more chance to defer under pressure.
This patch increases TSO and LRO/GRO efficiency back to normal levels,
and does not impact overall ECN behavior.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With sysctl_tcp_min_tso_segs being 4, it is very possible
that tcp_tso_should_defer() decides not sending last 2 MSS
of initial window of 10 packets. This also applies if
autosizing decides to send X MSS per GSO packet, and cwnd
is not a multiple of X.
This patch implements an heuristic based on age of first
skb in write queue : If it was sent very recently (less than half srtt),
we can predict that no ACK packet will come in less than half rtt,
so deferring might cause an under utilization of our window.
This is visible on initial send (IW10) on web servers,
but more generally on some RPC, as the last part of the message
might need an extra RTT to get delivered.
Tested:
Ran following packetdrill test
// A simple server-side test that sends exactly an initial window (IW10)
// worth of packets.
`sysctl -e -q net.ipv4.tcp_min_tso_segs=4`
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+.1 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.1 < . 1:1(0) ack 1 win 257
+0 accept(3, ..., ...) = 4
+0 write(4, ..., 14600) = 14600
+0 > . 1:5841(5840) ack 1 win 457
+0 > . 5841:11681(5840) ack 1 win 457
// Following packet should be sent right now.
+0 > P. 11681:14601(2920) ack 1 win 457
+.1 < . 1:1(0) ack 14601 win 257
+0 close(4) = 0
+0 > F. 14601:14601(0) ack 1
+.1 < F. 1:1(0) ack 14602 win 257
+0 > . 14602:14602(0) ack 2
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TSO relies on ability to defer sending a small amount of packets.
Heuristic is to wait for future ACKS in hope to send more packets at once.
Current algorithm uses a per socket tso_deferred field as a pseudo timer.
This pseudo timer relies on future ACK, but there is no guarantee
we receive them in time.
Fix would be to use a real timer, but cost of such timer is probably too
expensive for typical cases.
This patch changes the logic to test the time of last transmit,
because we should not add bursts of more than 1ms for any given flow.
We've used this patch for about two years at Google, before FQ/pacing
as it would reduce a fair amount of bursts.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At this point the leaf_info hash is redundant. By adding the suffix length
to the fib_alias hash list we no longer have need of leaf_info as we can
determine the prefix length from fa_slen. So we can compress things by
dropping the leaf_info structure from fib_trie and instead directly connect
the leaves to the fib_alias hash list.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make use of an empty spot in the alias to store the suffix length so that
we don't need to pull that information from the leaf_info structure.
This patch also makes a slight change to the user statistics. Instead of
incrementing semantic_match_miss once per leaf_info miss we now just
increment it once per leaf if a match was not found.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This replaces the prefix length variable in the leaf_info structure with a
suffix length value, or host identifier length in bits. By doing this it
makes it easier to sort out since the tnodes and leaf are carrying this
value as well since it is compatible with the ->pos field in tnodes.
I also cleaned up one spot that had some list manipulation that could be
simplified. I basically updated it so that we just use hlist_add_head_rcu
instead of calling hlist_add_before_rcu on the first node in the list.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There isn't any advantage to having it as a list and by making it an hlist
we make the fib_alias more compatible with the list_info in terms of the
type of list used.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joining multicast group on ethernet level via "ip maddr" command would
not work if we have an Ethernet switch that does igmp snooping since
the switch would not replicate multicast packets on ports that did not
have IGMP reports for the multicast addresses.
Linux vxlan interfaces created via "ip link add vxlan" have the group option
that enables then to do the required join.
By extending ip address command with option "autojoin" we can get similar
functionality for openvswitch vxlan interfaces as well as other tunneling
mechanisms that need to receive multicast traffic. The kernel code is
structured similar to how the vxlan driver does a group join / leave.
example:
ip address add 224.1.1.10/24 dev eth5 autojoin
ip address del 224.1.1.10/24 dev eth5
Signed-off-by: Madhu Challa <challa@noironetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the unlikely event that skb_get_hash is unable to deduce a hash
in udp_flow_src_port we use a consistent random value instead.
This is specified in GRE/UDP draft section 3.2.1:
https://tools.ietf.org/html/draft-ietf-tsvwg-gre-in-udp-encap-04
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_should_expand_sndbuf() does not expand the send buffer if we have
filled the congestion window.
However, it should use tcp_packets_in_flight() instead of
tp->packets_out to make this check.
Testing has established that the difference matters a lot if there are
many SACKed packets, causing a needless performance shortfall.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a need to perform igmp join/leave operations while RTNL is
held.
Make ip_mc_{join|leave}_group() wrappers around
__ip_mc_{join|leave}_group() to avoid the proliferation of work queues.
For example, vxlan_igmp_join() could possibly be removed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_copy_bits() returns zero on success and negative value on error,
so it is needed to invert the condition in ip_check_defrag().
Fixes: 1bf3751ec9 ("ipv4: ip_check_defrag must not modify skb before unsharing")
Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This message isn't really needed it justs waits time/space.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) Missing netlink attribute validation in nft_lookup, from Patrick
McHardy.
2) Restrict ipv6 partial checksum handling to UDP, since that's the
only case it works for. From Vlad Yasevich.
3) Clear out silly device table sentinal macros used by SSB and BCMA
drivers. From Joe Perches.
4) Make sure the remote checksum code never creates a situation where
the remote checksum is applied yet the tunneling metadata describing
the remote checksum transformation is still present. Otherwise an
external entity might see this and apply the checksum again. From
Tom Herbert.
5) Use msecs_to_jiffies() where applicable, from Nicholas Mc Guire.
6) Don't explicitly initialize timer struct fields, use setup_timer()
and mod_timer() instead. From Vaishali Thakkar.
7) Don't invoke tg3_halt() without the tp->lock held, from Jun'ichi
Nomura.
8) Missing __percpu annotation in ipvlan driver, from Eric Dumazet.
9) Don't potentially perform skb_get() on shared skbs, also from Eric
Dumazet.
10) Fix COW'ing of metrics for non-DST_HOST routes in ipv6, from Martin
KaFai Lau.
11) Fix merge resolution error between the iov_iter changes in vhost and
some bug fixes that occurred at the same time. From Jason Wang.
12) If rtnl_configure_link() fails we have to perform a call to
->dellink() before unregistering the device. From WANG Cong.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (39 commits)
net: dsa: Set valid phy interface type
rtnetlink: call ->dellink on failure when ->newlink exists
com20020-pci: add support for eae single card
vhost_net: fix wrong iter offset when setting number of buffers
net: spelling fixes
net/core: Fix warning while make xmldocs caused by dev.c
net: phy: micrel: disable NAND-tree for KSZ8021, KSZ8031, KSZ8051, KSZ8081
ipv6: fix ipv6_cow_metrics for non DST_HOST case
openvswitch: Fix key serialization.
r8152: restore hw settings
hso: fix rx parsing logic when skb allocation fails
tcp: make sure skb is not shared before using skb_get()
bridge: netfilter: Move sysctl-specific error code inside #ifdef
ipv6: fix possible deadlock in ip6_fl_purge / ip6_fl_gc
ipvlan: add a missing __percpu pcpu_stats
tg3: Hold tp->lock before calling tg3_halt() from tg3_init_one()
bgmac: fix device initialization on Northstar SoCs (condition typo)
qlcnic: Delete existing multicast MAC list before adding new
net/mlx5_core: Fix configuration of log_uar_page_sz
sunvnet: don't change gso data on clones
...
Spelling errors caught by codespell.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 can keep a copy of SYN message using skb_get() in
tcp_v6_conn_request() so that caller wont free the skb when calling
kfree_skb() later.
Therefore TCP fast open has to clone the skb it is queuing in
child->sk_receive_queue, as all skbs consumed from receive_queue are
freed using __kfree_skb() (ie assuming skb->users == 1)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Fixes: 5b7ed0892f ("tcp: move fastopen functions to tcp_fastopen.c")
Signed-off-by: David S. Miller <davem@davemloft.net>
Move memcg_socket_limit_enabled decrement to tcp_destroy_cgroup (called
from memcg_destroy_kmem -> mem_cgroup_sockets_destroy) and zap a bunch of
wrapper functions.
Although this patch moves static keys decrement from __mem_cgroup_free to
mem_cgroup_css_free, it does not introduce any functional changes, because
the keys are incremented on setting the limit (tcp or kmem), which can
only happen after successful mem_cgroup_css_online.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull security layer updates from James Morris:
"Highlights:
- Smack adds secmark support for Netfilter
- /proc/keys is now mandatory if CONFIG_KEYS=y
- TPM gets its own device class
- Added TPM 2.0 support
- Smack file hook rework (all Smack users should review this!)"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (64 commits)
cipso: don't use IPCB() to locate the CIPSO IP option
SELinux: fix error code in policydb_init()
selinux: add security in-core xattr support for pstore and debugfs
selinux: quiet the filesystem labeling behavior message
selinux: Remove unused function avc_sidcmp()
ima: /proc/keys is now mandatory
Smack: Repair netfilter dependency
X.509: silence asn1 compiler debug output
X.509: shut up about included cert for silent build
KEYS: Make /proc/keys unconditional if CONFIG_KEYS=y
MAINTAINERS: email update
tpm/tpm_tis: Add missing ifdef CONFIG_ACPI for pnp_acpi_device
smack: fix possible use after frees in task_security() callers
smack: Add missing logging in bidirectional UDS connect check
Smack: secmark support for netfilter
Smack: Rework file hooks
tpm: fix format string error in tpm-chip.c
char/tpm/tpm_crb: fix build error
smack: Fix a bidirectional UDS connect check typo
smack: introduce a special case for tmpfs in smack_d_instantiate()
...
The unified hierarchy interface for memory cgroups will no longer use "-1"
to mean maximum possible resource value. In preparation for this, make
the string an argument and let the caller supply it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change remote checksum handling to set checksum partial as default
behavior. Added an iflink parameter to configure not using
checksum partial (calling csum_partial to update checksum).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds infrastructure so that remote checksum offload can
set CHECKSUM_PARTIAL instead of calling csum_partial and writing
the modfied checksum field.
Add skb_remcsum_adjust_partial function to set an skb for using
CHECKSUM_PARTIAL with remote checksum offload. Changed
skb_remcsum_process and skb_gro_remcsum_process to take a boolean
argument to indicate if checksum partial can be set or the
checksum needs to be modified using the normal algorithm.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Properly set GSO types and skb->encapsulation in the UDP tunnel GRO
complete so that packets are properly represented for GSO. This sets
SKB_GSO_UDP_TUNNEL or SKB_GSO_UDP_TUNNEL_CSUM depending on whether
non-zero checksums were received, and sets SKB_GSO_TUNNEL_REMCSUM if
the remote checksum option was processed.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remote checksum offload processing is currently the same for both
the GRO and non-GRO path. When the remote checksum offload option
is encountered, the checksum field referred to is modified in
the packet. So in the GRO case, the packet is modified in the
GRO path and then the operation is skipped when the packet goes
through the normal path based on skb->remcsum_offload. There is
a problem in that the packet may be modified in the GRO path, but
then forwarded off host still containing the remote checksum option.
A remote host will again perform RCO but now the checksum verification
will fail since GRO RCO already modified the checksum.
To fix this, we ensure that GRO restores a packet to it's original
state before returning. In this model, when GRO processes a remote
checksum option it still changes the checksum per the algorithm
but on return from lower layer processing the checksum is restored
to its original value.
In this patch we add define gro_remcsum structure which is passed
to skb_gro_remcsum_process to save offset and delta for the checksum
being changed. After lower layer processing, skb_gro_remcsum_cleanup
is called to restore the checksum before returning from GRO.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using the IPCB() macro to get the IPv4 options is convenient, but
unfortunately NetLabel often needs to examine the CIPSO option outside
of the scope of the IP layer in the stack. While historically IPCB()
worked above the IP layer, due to the inclusion of the inet_skb_param
struct at the head of the {tcp,udp}_skb_cb structs, recent commit
971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
reordered the tcp_skb_cb struct and invalidated this IPCB() trick.
This patch fixes the problem by creating a new function,
cipso_v4_optptr(), which locates the CIPSO option inside the IP header
without calling IPCB(). Unfortunately, this isn't as fast as a simple
lookup so some additional tweaks were made to limit the use of this
new function.
Cc: <stable@vger.kernel.org> # 3.18
Reported-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Packetization Layer Path MTU Discovery works separately beside
Path MTU Discovery at IP level, different net namespace has
various requirements on which one to chose, e.g., a virutalized
container instance would require TCP PMTU to probe an usable
effective mtu for underlying tunnel, while the host would
employ classical ICMP based PMTU to function.
Hence making TCP PMTU mechanism per net namespace to decouple
two functionality. Furthermore the probe base MSS should also
be configured separately for each namespace.
Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a server has enabled Fast Open and it receives a pure SYN-data
packet (without a Fast Open option), it won't accept the data but it
incorrectly returns a SYN-ACK with a Fast Open cookie and also
increments the SNMP stat LINUX_MIB_TCPFASTOPENPASSIVEFAIL.
This patch makes the server include a Fast Open cookie in SYN-ACK
only if the SYN has some Fast Open option (i.e., when client
requests or presents a cookie).
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We set the outer mode protocol too early. As a result, the
local error handler might dispatch to the wrong address family
and report the error to a wrong socket type. We fix this by
setting the outer protocol to the skb after we accessed the
inner mode for the last time, right before we do the atcual
encapsulation where we switch finally to the outer mode.
Reported-by: Chris Ruehl <chris.ruehl@gtsys.com.hk>
Tested-by: Chris Ruehl <chris.ruehl@gtsys.com.hk>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Receive Flow Steering is a nice solution but suffers from
hash collisions when a mix of connected and unconnected traffic
is received on the host, when flow hash table is populated.
Also, clearing flow in inet_release() makes RFS not very good
for short lived flows, as many packets can follow close().
(FIN , ACK packets, ...)
This patch extends the information stored into global hash table
to not only include cpu number, but upper part of the hash value.
I use a 32bit value, and dynamically split it in two parts.
For host with less than 64 possible cpus, this gives 6 bits for the
cpu number, and 26 (32-6) bits for the upper part of the hash.
Since hash bucket selection use low order bits of the hash, we have
a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
enough.
If the hash found in flow table does not match, we fallback to RPS (if
it is enabled for the rxqueue).
This means that a packet for an non connected flow can avoid the
IPI through a unrelated/victim CPU.
This also means we no longer have to clear the table at socket
close time, and this helps short lived flows performance.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
encap.sport and encap.dport are __be16, use nla_{get,put}_be16 instead
of nla_{get,put}_u16.
Fixes the sparse warnings:
warning: incorrect type in assignment (different base types)
expected restricted __be32 [addressable] [usertype] o_key
got restricted __be16 [addressable] [usertype] i_flags
warning: incorrect type in assignment (different base types)
expected restricted __be16 [usertype] sport
got unsigned short
warning: incorrect type in assignment (different base types)
expected restricted __be16 [usertype] dport
got unsigned short
warning: incorrect type in argument 3 (different base types)
expected unsigned short [unsigned] [usertype] value
got restricted __be16 [usertype] sport
warning: incorrect type in argument 3 (different base types)
expected unsigned short [unsigned] [usertype] value
got restricted __be16 [usertype] dport
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ensure that in state FIN_WAIT2 or TIME_WAIT, where the connection is
represented by a tcp_timewait_sock, we rate limit dupacks in response
to incoming packets (a) with TCP timestamps that fail PAWS checks, or
(b) with sequence numbers that are out of the acceptable window.
We do not send a dupack in response to out-of-window packets if it has
been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
last sent a dupack in response to an out-of-window packet.
Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ensure that in state ESTABLISHED, where the connection is represented
by a tcp_sock, we rate limit dupacks in response to incoming packets
(a) with TCP timestamps that fail PAWS checks, or (b) with sequence
numbers or ACK numbers that are out of the acceptable window.
We do not send a dupack in response to out-of-window packets if it has
been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
last sent a dupack in response to an out-of-window packet.
There is already a similar (although global) rate-limiting mechanism
for "challenge ACKs". When deciding whether to send a challence ACK,
we first consult the new per-connection rate limit, and then the
global rate limit.
Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the SYN_RECV state, where the TCP connection is represented by
tcp_request_sock, we now rate-limit SYNACKs in response to a client's
retransmitted SYNs: we do not send a SYNACK in response to client SYN
if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms)
since we last sent a SYNACK in response to a client's retransmitted
SYN.
This allows the vast majority of legitimate client connections to
proceed unimpeded, even for the most aggressive platforms, iOS and
MacOS, which actually retransmit SYNs 1-second intervals for several
times in a row. They use SYN RTO timeouts following the progression:
1,1,1,1,1,2,4,8,16,32.
Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Helpers for mitigating ACK loops by rate-limiting dupacks sent in
response to incoming out-of-window packets.
This patch includes:
- rate-limiting logic
- sysctl to control how often we allow dupacks to out-of-window packets
- SNMP counter for cases where we rate-limited our dupack sending
The rate-limiting logic in this patch decides to not send dupacks in
response to out-of-window segments if (a) they are SYNs or pure ACKs
and (b) the remote endpoint is sending them faster than the configured
rate limit.
We rate-limit our responses rather than blocking them entirely or
resetting the connection, because legitimate connections can rely on
dupacks in response to some out-of-window segments. For example, zero
window probes are typically sent with a sequence number that is below
the current window, and ZWPs thus expect to thus elicit a dupack in
response.
We allow dupacks in response to TCP segments with data, because these
may be spurious retransmissions for which the remote endpoint wants to
receive DSACKs. This is safe because segments with data can't
realistically be part of ACK loops, which by their nature consist of
each side sending pure/data-less ACKs to each other.
The dupack interval is controlled by a new sysctl knob,
tcp_invalid_ratelimit, given in milliseconds, in case an administrator
needs to dial this upward in the face of a high-rate DoS attack. The
name and units are chosen to be analogous to the existing analogous
knob for ICMP, icmp_ratelimit.
The default value for tcp_invalid_ratelimit is 500ms, which allows at
most one such dupack per 500ms. This is chosen to be 2x faster than
the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
2.4). We allow the extra 2x factor because network delay variations
can cause packets sent at 1 second intervals to be compressed and
arrive much closer.
Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/vxlan.c
drivers/vhost/net.c
include/linux/if_vlan.h
net/core/dev.c
The net/core/dev.c conflict was the overlap of one commit marking an
existing function static whilst another was adding a new function.
In the include/linux/if_vlan.h case, the type used for a local
variable was changed in 'net', whereas the function got rewritten
to fix a stacked vlan bug in 'net-next'.
In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
overlapped with an endainness fix for VHOST 1.0 in 'net'.
In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
in 'net-next' whereas in 'net' there was a bug fix to pass in the
correct network namespace pointer in calls to this function.
Signed-off-by: David S. Miller <davem@davemloft.net>
When we added pacing to TCP, we decided to let sch_fq take care
of actual pacing.
All TCP had to do was to compute sk->pacing_rate using simple formula:
sk->pacing_rate = 2 * cwnd * mss / rtt
It works well for senders (bulk flows), but not very well for receivers
or even RPC :
cwnd on the receiver can be less than 10, rtt can be around 100ms, so we
can end up pacing ACK packets, slowing down the sender.
Really, only the sender should pace, according to its own logic.
Instead of adding a new bit in skb, or call yet another flow
dissection, we tweak skb->truesize to a small value (2), and
we instruct sch_fq to use new helper and not pace pure ack.
Note this also helps TCP small queue, as ack packets present
in qdisc/NIC do not prevent sending a data packet (RPC workload)
This helps to reduce tx completion overhead, ack packets can use regular
sock_wfree() instead of tcp_wfree() which is a bit more expensive.
This has no impact in the case packets are sent to loopback interface,
as we do not coalesce ack packets (were we would detect skb->truesize
lie)
In case netem (with a delay) is used, skb_orphan_partial() also sets
skb->truesize to 1.
This patch is a combination of two patches we used for about one year at
Google.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds skb_remcsum_process and skb_gro_remcsum_process to
perform the appropriate adjustments to the skb when receiving
remote checksum offload.
Updated vxlan and gue to use these functions.
Tested: Ran TCP_RR and TCP_STREAM netperf for VXLAN and GUE, did
not see any change in performance.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
That takes care of the majority of ->sendmsg() instances - most of them
via memcpy_to_msg() or assorted getfrag() callbacks. One place where we
still keep memcpy_fromiovecend() is tipc - there we potentially read the
same data over and over; separate patch, that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
patch is actually smaller than it seems to be - most of it is unindenting
the inner loop body in tcp_sendmsg() itself...
the bit in tcp_input.c is going to get reverted very soon - that's what
memcpy_from_msg() will become, but not in this commit; let's keep it
reasonably contained...
There's one potentially subtle change here: in case of short copy from
userland, mainline tcp_send_syn_data() discards the skb it has allocated
and falls back to normal path, where we'll send as much as possible after
rereading the same data again. This patch trims SYN+data skb instead -
that way we don't need to copy from the same place twice.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
One deployment requirement of DCTCP is to be able to run
in a DC setting along with TCP traffic. As Glenn Judd's
NSDI'15 paper "Attaining the Promise and Avoiding the Pitfalls
of TCP in the Datacenter" [1] (tba) explains, one way to
solve this on switch side is to split DCTCP and TCP traffic
in two queues per switch port based on the DSCP: one queue
soley intended for DCTCP traffic and one for non-DCTCP traffic.
For the DCTCP queue, there's the marking threshold K as
explained in commit e3118e8359 ("net: tcp: add DCTCP congestion
control algorithm") for RED marking ECT(0) packets with CE.
For the non-DCTCP queue, there's f.e. a classic tail drop queue.
As already explained in e3118e8359, running DCTCP at scale
when not marking SYN/SYN-ACK packets with ECT(0) has severe
consequences as for non-ECT(0) packets, traversing the RED
marking DCTCP queue will result in a severe reduction of
connection probability.
This is due to the DCTCP queue being dominated by ECT(0) traffic
and switches handle non-ECT traffic in the RED marking queue
after passing K as drops, where K is usually a low watermark
in order to leave enough tailroom for bursts. Splitting DCTCP
traffic among several queues (ECN and non-ECN queue) is being
considered a terrible idea in the network community as it
splits single flows across multiple network paths.
Therefore, commit e3118e8359 implements this on Linux as
ECT(0) marked traffic, as we argue that marking all packets
of a DCTCP flow is the only viable solution and also doesn't
speak against the draft.
However, recently, a DCTCP implementation for FreeBSD hit also
their mainline kernel [2]. In order to let them play well
together with Linux' DCTCP, we would need to loosen the
requirement that ECT(0) has to be asserted during the 3WHS as
not implemented in FreeBSD. This simplifies the ECN test and
lets DCTCP work together with FreeBSD.
Joint work with Daniel Borkmann.
[1] https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/judd
[2] 8ad8794452
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Glenn Judd <glenn.judd@morganstanley.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
timestamps, this loops timestamps on top of empty packets.
Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
cmsg reception (aside from timestamps) are no longer possible. This
works together with a follow on patch that allows administrators to
only allow tx timestamping if it does not loop payload or metadata.
Signed-off-by: Willem de Bruijn <willemb@google.com>
----
Changes (rfc -> v1)
- add documentation
- remove unnecessary skb->len test (thanks to Richard Cochran)
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit be9f4a44e7 ("ipv4: tcp: remove per net tcp_sock")
I tried to address contention on a socket lock, but the solution
I chose was horrible :
commit 3a7c384ffd ("ipv4: tcp: unicast_sock should not land outside
of TCP stack") addressed a selinux regression.
commit 0980e56e50 ("ipv4: tcp: set unicast_sock uc_ttl to -1")
took care of another regression.
commit b5ec8eeac4 ("ipv4: fix ip_send_skb()") fixed another regression.
commit 811230cd85 ("tcp: ipv4: initialize unicast_sock sk_pacing_rate")
was another shot in the dark.
Really, just use a proper socket per cpu, and remove the skb_orphan()
call, to re-enable flow control.
This solves a serious problem with FQ packet scheduler when used in
hostile environments, as we do not want to allocate a flow structure
for every RST packet sent in response to a spoofed packet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Get rid of nr_cpu_ids and use modern percpu allocation.
Note that the sockets themselves are not yet allocated
using NUMA affinity.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current behavior only passes RTTs from sequentially acked data to CC.
If sender gets a combined ACK for segment 1 and SACK for segment 3, then the
computed RTT for CC is the time between sending segment 1 and receiving SACK
for segment 3.
Pass the minimum computed RTT from any acked data to CC, i.e. time between
sending segment 3 and receiving SACK for segment 3.
Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
They are all either written once or extremly rarely (e.g. from init
code), so we can move them to the .data..read_mostly section.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 1191 said, "a host MUST not increase its estimate of the Path
MTU in response to the contents of a Datagram Too Big message."
Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I added sk_pacing_rate field, I forgot to initialize its value
in the per cpu unicast_sock used in ip_send_unicast_reply()
This means that for sch_fq users, RST packets, or ACK packets sent
on behalf of TIME_WAIT sockets might be sent to slowly or even dropped
once we reach the per flow limit.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 95bd09eb27 ("tcp: TSO packets automatic sizing")
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, it isn't possible to request checksums on the outer UDP
header of tunnels - the TUNNEL_CSUM flag is ignored. This adds
support for requesting that UDP checksums be computed on transmit
and properly reported if they are present on receive.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a bug in CUBIC that causes cwnd to increase slightly
too slowly when multiple ACKs arrive in the same jiffy.
If cwnd is supposed to increase at a rate of more than once per jiffy,
then CUBIC was sometimes too slow. Because the bic_target is
calculated for a future point in time, calculated with time in
jiffies, the cwnd can increase over the course of the jiffy while the
bic_target calculated as the proper CUBIC cwnd at time
t=tcp_time_stamp+rtt does not increase, because tcp_time_stamp only
increases on jiffy tick boundaries.
So since the cnt is set to:
ca->cnt = cwnd / (bic_target - cwnd);
as cwnd increases but bic_target does not increase due to jiffy
granularity, the cnt becomes too large, causing cwnd to increase
too slowly.
For example:
- suppose at the beginning of a jiffy, cwnd=40, bic_target=44
- so CUBIC sets:
ca->cnt = cwnd / (bic_target - cwnd) = 40 / (44 - 40) = 40/4 = 10
- suppose we get 10 acks, each for 1 segment, so tcp_cong_avoid_ai()
increases cwnd to 41
- so CUBIC sets:
ca->cnt = cwnd / (bic_target - cwnd) = 41 / (44 - 41) = 41 / 3 = 13
So now CUBIC will wait for 13 packets to be ACKed before increasing
cwnd to 42, insted of 10 as it should.
The fix is to avoid adjusting the slope (determined by ca->cnt)
multiple times within a jiffy, and instead skip to compute the Reno
cwnd, the "TCP friendliness" code path.
Reported-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change CUBIC to properly handle stretch ACKs in additive increase mode
by passing in the count of ACKed packets to tcp_cong_avoid_ai().
In addition, because we are now precisely accounting for stretch ACKs,
including delayed ACKs, we can now remove the delayed ACK tracking and
estimation code that tracked recent delayed ACK behavior in
ca->delayed_ack.
Reported-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change Reno to properly handle stretch ACKs in additive increase mode
by passing in the count of ACKed packets to tcp_cong_avoid_ai().
In addition, if snd_cwnd crosses snd_ssthresh during slow start
processing, and we then exit slow start mode, we need to carry over
any remaining "credit" for packets ACKed and apply that to additive
increase by passing this remaining "acked" count to
tcp_cong_avoid_ai().
Reported-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_cong_avoid_ai() was too timid (snd_cwnd increased too slowly) on
"stretch ACKs" -- cases where the receiver ACKed more than 1 packet in
a single ACK. For example, suppose w is 10 and we get a stretch ACK
for 20 packets, so acked is 20. We ought to increase snd_cwnd by 2
(since acked/w = 20/10 = 2), but instead we were only increasing cwnd
by 1. This patch fixes that behavior.
Reported-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
LRO, GRO, delayed ACKs, and middleboxes can cause "stretch ACKs" that
cover more than the RFC-specified maximum of 2 packets. These stretch
ACKs can cause serious performance shortfalls in common congestion
control algorithms that were designed and tuned years ago with
receiver hosts that were not using LRO or GRO, and were instead
politely ACKing every other packet.
This patch series fixes Reno and CUBIC to handle stretch ACKs.
This patch prepares for the upcoming stretch ACK bug fix patches. It
adds an "acked" parameter to tcp_cong_avoid_ai() to allow for future
fixes to tcp_cong_avoid_ai() to correctly handle stretch ACKs, and
changes all congestion control algorithms to pass in 1 for the ACKed
count. It also changes tcp_slow_start() to return the number of packet
ACK "credits" that were not processed in slow start mode, and can be
processed by the congestion control module in additive increase mode.
In future patches we will fix tcp_cong_avoid_ai() to handle stretch
ACKs, and fix Reno and CUBIC handling of stretch ACKs in slow start
and additive increase mode.
Reported-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
arch/arm/boot/dts/imx6sx-sdb.dts
net/sched/cls_bpf.c
Two simple sets of overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
An exception is seen in ICMP ping receive path where the skb
destructor sock_rfree() tries to access a freed socket. This happens
because ping_rcv() releases socket reference with sock_put() and this
internally frees up the socket. Later icmp_rcv() will try to free the
skb and as part of this, skb destructor is called and which leads
to a kernel panic as the socket is freed already in ping_rcv().
-->|exception
-007|sk_mem_uncharge
-007|sock_rfree
-008|skb_release_head_state
-009|skb_release_all
-009|__kfree_skb
-010|kfree_skb
-011|icmp_rcv
-012|ip_local_deliver_finish
Fix this incorrect free by cloning this skb and processing this cloned
skb instead.
This patch was suggested by Eric Dumazet
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While working on rhashtable walking I noticed that the UDP diag
dumping code is buggy. In particular, the socket skipping within
a chain never happens, even though we record the number of sockets
that should be skipped.
As this code was supposedly copied from TCP, this patch does what
TCP does and resets num before we walk a chain.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not caching dst_entries which cause redirects could be exploited by hosts
on the same subnet, causing a severe DoS attack. This effect aggravated
since commit f886497212 ("ipv4: fix dst race in sk_dst_get()").
Lookups causing redirects will be allocated with DST_NOCACHE set which
will force dst_release to free them via RCU. Unfortunately waiting for
RCU grace period just takes too long, we can end up with >1M dst_entries
waiting to be released and the system will run OOM. rcuos threads cannot
catch up under high softirq load.
Attaching the flag to emit a redirect later on to the specific skb allows
us to cache those dst_entries thus reducing the pressure on allocation
and deallocation.
This issue was discovered by Marcelo Leitner.
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Marcelo Leitner <mleitner@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
While doing further work on the fib_trie I noted a few items.
First I was using calls that were far more complicated than they needed to
be for determining when to push/pull the suffix length. I have updated the
code to reflect the simplier logic.
The second issue is that I realised we weren't necessarily handling the
case of a leaf_info struct surviving a flush. I have updated the logic so
that now we will call pull_suffix in the event of having a leaf info value
left in the leaf after flushing it.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function fib_find_alias is only accessed by functions in fib_trie.c as
such it makes sense to relocate it and cast it as static so that the
compiler can take advantage of optimizations it can do to it as a local
function.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It doesn't make much sense to count the pointers ourselves when
empty_children already has a count for the number of NULL pointers stored
in the tnode. As such save ourselves the cycles and just use
empty_children.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch really does two things.
First it pulls the logic for determining if we should collapse one node out
of the tree and the actual code doing the collapse into a separate pair of
functions. This helps to make the changes to these areas more readable.
Second it encodes the upper 32b of the empty_children value onto the
full_children value in the case of bits == KEYLENGTH. By doing this we are
able to handle the case of a 32b node where empty_children would appear to
be 0 when it was actually 1ul << 32.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change corrects an issue where if inflate or halve fails we were
exiting the resize function without at least updating the slen for the
node. To correct this I have moved the update of max_size into the while
loop so that it is only decremented on a successful call to either inflate
or halve.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch addresses two issues.
The first issue is the fact that I believe I had the RCU freeing sequence
slightly out of order. As a result we could get into an issue if a caller
went into a child of a child of the new node, then backtraced into the to be
freed parent, and then attempted to access a child of a child that may have
been consumed in a resize of one of the new nodes children. To resolve this I
have moved the resize after we have freed the oldtnode. The only side effect
of this is that we will now be calling resize on more nodes in the case of
inflate due to the fact that we don't have a good way to test to see if a
full_tnode on the new node was there before or after the allocation. This
should have minimal impact however since the node should already be
correctly size so it is just the cost of calling should_inflate that we
will be taking on the node which is only a couple of cycles.
The second issue is the fact that inflate and halve were essentially doing
the same thing after the new node was added to the trie replacing the old
one. As such it wasn't really necessary to keep the code in both functions
so I have split it out into two other functions, called replace and
update_children.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In doing performance testing and analysis of the changes I recently found
that by shifting the index I had created an unnecessary dependency.
I have updated the code so that we instead shift a mask by bits and then
just test against that as that should save us about 2 CPU cycles since we
can generate the mask while the key and pos are being processed.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The UDP tunnel transmit functions udp_tunnel_xmit_skb and
udp_tunnel6_xmit_skb include a socket argument. The socket being
passed to the functions (from VXLAN) is a UDP created for receive
side. The only thing that the socket is used for in the transmit
functions is to get the setting for checksum (enabled or zero).
This patch removes the argument and and adds a nocheck argument
for checksum setting. This eliminates the unnecessary dependency
on a UDP socket for UDP tunnel transmit.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The logic to configure a network interface for kernel IP
auto-configuration is very simplistic, and does not handle the case
where a device is stacked onto another such as with DSA. This causes the
kernel not to open and configure the master network device in a DSA
switch tree, and therefore slave network devices using this master
network devices as conduit device cannot be open.
This restriction comes from a check in net/dsa/slave.c, which is
basically checking the master netdev flags for IFF_UP and returns
-ENETDOWN if it is not the case.
Automatically bringing-up DSA master network devices allows DSA slave
network devices to be used as valid interfaces for e.g: NFS root booting
by allowing kernel IP autoconfiguration to succeed on these interfaces.
On the reverse path, make sure we do not attempt to close a DSA-enabled
device as this would implicitely prevent the slave DSA network device
from operating.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
added to rtnetlink messages.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 053c095a82 ("netlink: make nlmsg_end() and genlmsg_end()
void") didn't catch all of the cases where callers were breaking out
on the return value being equal to zero, which they no longer should
when zero means success.
Fix all such cases.
Reported-by: Marcel Holtmann <marcel@holtmann.org>
Reported-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Contrary to common expectations for an "int" return, these functions
return only a positive value -- if used correctly they cannot even
return 0 because the message header will necessarily be in the skb.
This makes the very common pattern of
if (genlmsg_end(...) < 0) { ... }
be a whole bunch of dead code. Many places also simply do
return nlmsg_end(...);
and the caller is expected to deal with it.
This also commonly (at least for me) causes errors, because it is very
common to write
if (my_function(...))
/* error condition */
and if my_function() does "return nlmsg_end()" this is of course wrong.
Additionally, there's not a single place in the kernel that actually
needs the message length returned, and if anyone needs it later then
it'll be very easy to just use skb->len there.
Remove this, and make the functions void. This removes a bunch of dead
code as described above. The patch adds lines because I did
- return nlmsg_end(...);
+ nlmsg_end(...);
+ return 0;
I could have preserved all the function's return values by returning
skb->len, but instead I've audited all the places calling the affected
functions and found that none cared. A few places actually compared
the return value with <= 0 in dump functionality, but that could just
be changed to < 0 with no change in behaviour, so I opted for the more
efficient version.
One instance of the error I've made numerous times now is also present
in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
check for <0 or <=0 and thus broke out of the loop every single time.
I've preserved this since it will (I think) have caused the messages to
userspace to be formatted differently with just a single message for
every SKB returned to userspace. It's possible that this isn't needed
for the tools that actually use this, but I don't even know what they
are so couldn't test that changing this behaviour would be acceptable.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sockaddr is returned in IP(V6)_RECVERR as part of errhdr. That
structure is defined and allocated on the stack as
struct {
struct sock_extended_err ee;
struct sockaddr_in(6) offender;
} errhdr;
The second part is only initialized for certain SO_EE_ORIGIN values.
Always initialize it completely.
An MTU exceeded error on a SOCK_RAW/IPPROTO_RAW is one example that
would return uninitialized bytes.
Signed-off-by: Willem de Bruijn <willemb@google.com>
----
Also verified that there is no padding between errhdr.ee and
errhdr.offender that could leak additional kernel data.
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RAW sockets with hdrinc suffer from contention on rt_uncached_lock
spinlock.
One solution is to use percpu lists, since most routes are destroyed
by the cpu that created them.
It is unclear why we even have to put these routes in uncached_list,
as all outgoing packets should be freed when a device is dismantled.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: caacf05e5a ("ipv4: Properly purge netdev references on uncached routes.")
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/xen-netfront.c
Minor overlapping changes in xen-netfront.c, mostly to do
with some buffer management changes alongside the split
of stats into TX and RX.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces udp_offload_callbacks which has the same
GRO functions (but not a GSO function) as offload_callbacks,
except there is an argument to a udp_offload struct passed to
gro_receive and gro_complete functions. This additional argument
can be used to retrieve the per port structure of the encapsulation
for use in gro processing (mostly by doing container_of on the
structure).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The same macros are used for rx as well. So rename it.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
With TLP, the peer may reply to a probe with an
ACK+D-SACK, with ack value set to tlp_high_seq. In the current code,
such ACK+DSACK will be missed and only at next, higher ack will the TLP
episode be considered done. Since the DSACK is not present anymore,
this will cost a cwnd reduction.
This patch ensures that this scenario does not cause a cwnd reduction, since
receiving an ACK+DSACK indicates that both the initial segment and the probe
have been received by the peer.
The following packetdrill test, from Neal Cardwell, validates this patch:
// Establish a connection.
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.020 < . 1:1(0) ack 1 win 257
+0 accept(3, ..., ...) = 4
// Send 1 packet.
+0 write(4, ..., 1000) = 1000
+0 > P. 1:1001(1000) ack 1
// Loss probe retransmission.
// packets_out == 1 => schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
// In this case, this means: 1.5*RTT + 200ms = 230ms
+.230 > P. 1:1001(1000) ack 1
+0 %{ assert tcpi_snd_cwnd == 10 }%
// Receiver ACKs at tlp_high_seq with a DSACK,
// indicating they received the original packet and probe.
+.020 < . 1:1(0) ack 1001 win 257 <sack 1:1001,nop,nop>
+0 %{ assert tcpi_snd_cwnd == 10 }%
// Send another packet.
+0 write(4, ..., 1000) = 1000
+0 > P. 1001:2001(1000) ack 1
// Receiver ACKs above tlp_high_seq, which should end the TLP episode
// if we haven't already. We should not reduce cwnd.
+.020 < . 1:1(0) ack 2001 win 257
+0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
Credits:
-Gregory helped in finding that tcp_process_tlp_ack was where the cwnd
got reduced in our MPTCP tests.
-Neal wrote the packetdrill test above
-Yuchung reworked the patch to make it more readable.
Cc: Gregory Detal <gregory.detal@uclouvain.be>
Cc: Nandita Dukkipati <nanditad@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Sébastien Barré <sebastien.barre@uclouvain.be>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
netfilter/ipvs fixes for net
The following patchset contains netfilter/ipvs fixes, they are:
1) Small fix for the FTP helper in IPVS, a diff variable may be left
unset when CONFIG_IP_VS_IPV6 is set. Patch from Dan Carpenter.
2) Fix nf_tables port NAT in little endian archs, patch from leroy
christophe.
3) Fix race condition between conntrack confirmation and flush from
userspace. This is the second reincarnation to resolve this problem.
4) Make sure inner messages in the batch come with the nfnetlink header.
5) Relax strict check from nfnetlink_bind() that may break old userspace
applications using all 1s group mask.
6) Schedule removal of chains once no sets and rules refer to them in
the new nf_tables ruleset flush command. Reported by Asbjoern Sloth
Toennesen.
Note that this batch comes later than usual because of the short
winter holidays.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This work adds the possibility to define a per route/destination
congestion control algorithm. Generally, this opens up the possibility
for a machine with different links to enforce specific congestion
control algorithms with optimal strategies for each of them based
on their network characteristics, even transparently for a single
application listening on all links.
For our specific use case, this additionally facilitates deployment
of DCTCP, for example, applications can easily serve internal
traffic/dsts in DCTCP and external one with CUBIC. Other scenarios
would also allow for utilizing e.g. long living, low priority
background flows for certain destinations/routes while still being
able for normal traffic to utilize the default congestion control
algorithm. We also thought about a per netns setting (where different
defaults are possible), but given its actually a link specific
property, we argue that a per route/destination setting is the most
natural and flexible.
The administrator can utilize this through ip-route(8) by appending
"congctl [lock] <name>", where <name> denotes the name of a
congestion control algorithm and the optional lock parameter allows
to enforce the given algorithm so that applications in user space
would not be allowed to overwrite that algorithm for that destination.
The dst metric lookups are being done when a dst entry is already
available in order to avoid a costly lookup and still before the
algorithms are being initialized, thus overhead is very low when the
feature is not being used. While the client side would need to drop
the current reference on the module, on server side this can actually
even be avoided as we just got a flat-copied socket clone.
Joint work with Florian Westphal.
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
control metric to be set up and dumped back to user space.
While the internal representation of RTAX_CC_ALGO is handled as a u32
key, we avoided to expose this implementation detail to user space, thus
instead, we chose the netlink attribute that is being exchanged between
user space to be the actual congestion control algorithm name, similarly
as in the setsockopt(2) API in order to allow for maximum flexibility,
even for 3rd party modules.
It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot as
it should have been stored in RTAX_FEATURES instead, we first thought
about reusing it for the congestion control key, but it brings more
complications and/or confusion than worth it.
Joint work with Florian Westphal.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds necessary infrastructure to the congestion control
framework for later per route congestion control support.
For a per route congestion control possibility, our aim is to store
a unique u32 key identifier into dst metrics, which can then be
mapped into a tcp_congestion_ops struct. We argue that having a
RTAX key entry is the most simple, generic and easy way to manage,
and also keeps the memory footprint of dst entries lower on 64 bit
than with storing a pointer directly, for example. Having a unique
key id also allows for decoupling actual TCP congestion control
module management from the FIB layer, i.e. we don't have to care
about expensive module refcounting inside the FIB at this point.
We first thought of using an IDR store for the realization, which
takes over dynamic assignment of unused key space and also performs
the key to pointer mapping in RCU. While doing so, we stumbled upon
the issue that due to the nature of dynamic key distribution, it
just so happens, arguably in very rare occasions, that excessive
module loads and unloads can lead to a possible reuse of previously
used key space. Thus, previously stale keys in the dst metric are
now being reassigned to a different congestion control algorithm,
which might lead to unexpected behaviour. One way to resolve this
would have been to walk FIBs on the actually rare occasion of a
module unload and reset the metric keys for each FIB in each netns,
but that's just very costly.
Therefore, we argue a better solution is to reuse the unique
congestion control algorithm name member and map that into u32 key
space through jhash. For that, we split the flags attribute (as it
currently uses 2 bits only anyway) into two u32 attributes, flags
and key, so that we can keep the cacheline boundary of 2 cachelines
on x86_64 and cache the precalculated key at registration time for
the fast path. On average we might expect 2 - 4 modules being loaded
worst case perhaps 15, so a key collision possibility is extremely
low, and guaranteed collision-free on LE/BE for all in-tree modules.
Overall this results in much simpler code, and all without the
overhead of an IDR. Due to the deterministic nature, modules can
now be unloaded, the congestion control algorithm for a specific
but unloaded key will fall back to the default one, and on module
reload time it will switch back to the expected algorithm
transparently.
Joint work with Florian Westphal.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can just move this to an extra function and make the code
a bit more readable, no functional change.
Joint work with Florian Westphal.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ip_cmsg_recv_offset function which takes an offset argument
that indicates the starting offset in skb where data is being received
from. This will be useful in the case of UDP and provided checksum
to user space.
ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of
zero.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ip_cmsg_recv_offset function which takes an offset argument
that indicates the starting offset in skb where data is being received
from. This will be useful in the case of UDP and provided checksum
to user space.
ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of
zero.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the IP_CMSG_* constants from ip_sockglue.c to inet_sock.h so that
they can be referenced in other source files.
Restructure ip_cmsg_recv to not go through flags using shift, check
for flags by 'and'. This eliminates both the shift and a conditional
per flag check.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move convert_csum from udp_sock to inet_sock. This allows the
possibility that we can use convert checksum for different types
of sockets and also allows convert checksum to be enabled from
inet layer (what we'll want to do when enabling IP_CHECKSUM cmsg).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When searching for an existing socket to reuse, the address family
is not taken into account - only port number. This means that an
IPv4 socket could be used for IPv6 traffic and vice versa, which
is sure to cause problems when passing packets.
It is not possible to trigger this problem currently because the
only user of Geneve creates just IPv4 sockets. However, that is
likely to change in the near future.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The hash table for open Geneve ports is used only on creation and
deletion time. It is not performance critical and is not likely to
grow to a large number of items. Therefore, this can be changed
to use a simple linked list.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The existing Geneve locking scheme was pulled over directly from
VXLAN. However, VXLAN has a number of built in mechanisms which make
the locking more complex and are unlikely to be necessary with Geneve.
This simplifies the locking to use a basic scheme of a mutex
when doing updates plus RCU on receive.
In addition to making the code easier to read, this also avoids the
possibility of a race when creating or destroying sockets since
UDP sockets and the list of Geneve sockets are protected by different
locks. After this change, the entire operation is atomic.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The work queue is used only to free the UDP socket upon destruction.
This is not necessary with Geneve and generally makes the code more
difficult to reason about. It also introduces nondeterministic
behavior such as when a socket is rapidly deleted and recreated, which
could fail as the the deletion happens asynchronously.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Jarosch reported IPsec TCP stalls when a PMTU event occurs.
In fact the problem was completely unrelated to IPsec. The bug is
also reproducible if you just disable TSO/GSO.
The problem is that when the MSS goes down, existing queued packet
on the TX queue that have not been transmitted yet all look like
TSO packets and get treated as such.
This then triggers a bug where tcp_mss_split_point tells us to
generate a zero-sized packet on the TX queue. Once that happens
we're screwed because the zero-sized packet can never be removed
by ACKs.
Fixes: 1485348d24 ("tcp: Apply device TSO segment limit earlier")
Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Cheers,
Signed-off-by: David S. Miller <davem@davemloft.net>
This results in an approximately 30% increase in throughput
when handling encapsulated bulk traffic.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change adds a tracking value for the maximum suffix length of all
prefixes stored in any given tnode. With this value we can determine if we
need to backtrace or not based on if the suffix is greater than the pos
value.
By doing this we can reduce the CPU overhead for lookups in the local table
as many of the prefixes there are 32b long and have a suffix length of 0
meaning we can immediately backtrace to the root node without needing to
test any of the nodes between it and where we ended up.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For some reason the compiler doesn't seem to understand that when we are in
a loop that runs from tnode_child_length - 1 to 0 we don't expect the value
of tn->bits to change. As such every call to tnode_get_child was rerunning
tnode_chile_length which ended up consuming quite a bit of space in the
resultant assembly code.
I have gone though and verified that in all cases where tnode_get_child
is used we are either winding though a fixed loop from tnode_child_length -
1 to 0, or are in a fastpath case where we are verifying the value by
either checking for any remaining bits after shifting index by bits and
testing for leaf, or by using tnode_child_length.
size net/ipv4/fib_trie.o
Before:
text data bss dec hex filename
15506 376 8 15890 3e12 net/ipv4/fib_trie.o
After:
text data bss dec hex filename
14827 376 8 15211 3b6b net/ipv4/fib_trie.o
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change pulls the node_set_parent functionality out of put_child_reorg
and instead leaves that to the function to take care of as well. By doing
this we can fully construct the new cluster of tnodes and all of the
pointers out of it before we start routing pointers into it.
I am suspecting this will likely fix some concurency issues though I don't
have a good test to show as such.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change pushes the tnode freeing down into the inflate and halve
functions. It makes more sense here as we have a better grasp of what is
going on and when a given cluster of nodes is ready to be freed.
I believe this may address a bug in the freeing logic as well. For some
reason if the freelist got to a certain size we would call
synchronize_rcu(). I'm assuming that what they meant to do is call
synchronize_rcu() after they had handed off that much memory via
call_rcu(). As such that is what I have updated the behavior to be.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change makes it so that the assignment of the tnode to the parent is
handled directly within whatever function is currently handling the node be
it inflate, halve, or resize. By doing this we can avoid some of the need
to set NULL pointers in the tree while we are resizing the subnodes.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change pulls the logic for if we should inflate/halve the nodes out
into separate functions. It also addresses what I believe is a bug where 1
full node is all that is needed to keep a node from ever being halved.
Simple script to reproduce the issue:
modprobe dummy; ifconfig dummy0 up
for i in `seq 0 255`; do ifconfig dummy0:$i 10.0.${i}.1/24 up; done
ifconfig dummy0:256 10.0.255.33/16 up
for i in `seq 0 254`; do ifconfig dummy0:$i down; done
Results from /proc/net/fib_triestat
Before:
Local:
Aver depth: 3.00
Max depth: 4
Leaves: 17
Prefixes: 18
Internal nodes: 11
1: 8 2: 2 10: 1
Pointers: 1048
Null ptrs: 1021
Total size: 11 kB
After:
Local:
Aver depth: 3.41
Max depth: 5
Leaves: 17
Prefixes: 18
Internal nodes: 12
1: 8 2: 3 3: 1
Pointers: 36
Null ptrs: 8
Total size: 3 kB
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change consists of a cut/paste of resize to behind inflate and halve
so that I could remove the two function prototypes.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change is to start cleaning up some of the rcu_read_lock/unlock
handling. I realized while reviewing the code there are several spots that
I don't believe are being handled correctly or are masking warnings by
locally calling rcu_read_lock/unlock instead of calling them at the correct
level.
A common example is a call to fib_get_table followed by fib_table_lookup.
The rcu_read_lock/unlock ought to wrap both but there are several spots where
they were not wrapped.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change makes it so that anything that can be shifted by, or compared
to a value shifted by bits is updated to be an unsigned long. This is
mostly a precaution against an insanely huge address space that somehow
starts coming close to the 2^32 root node size which would require
something like 1.5 billion addresses.
I chose unsigned long instead of unsigned long long since I do not believe
it is possible to allocate a 32 bit tnode on a 32 bit system as the memory
consumed would be 16GB + 28B which exceeds the addressible space for any
one process.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change moves the pos value to the other side of the "bits" field. By
doing this it actually simplifies a significant amount of code in the trie.
For example when halving a tree we know that the bit lost exists at
oldnode->pos, and if we inflate the tree the new bit being add is at
tn->pos. Previously to find those bits you would have to subtract pos and
bits from the keylength or start with a value of (1 << 31) and then shift
that.
There are a number of spots throughout the code that benefit from this. In
the case of the hot-path searches the main advantage is that we can drop 2
or more operations from the search path as we no longer need to compute the
value for the index to be shifted by and can instead just use the raw pos
value.
In addition the tkey_extract_bits is now defunct and can be replaced by
get_index since the two operations were doing the same thing, but now
get_index does it much more quickly as it is only an xor and shift versus a
pair of shifts and a subtraction.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch updates the fib_table_insert function to take advantage of the
changes made to improve the performance of fib_table_lookup. As a result
the code should be smaller and run faster then the original.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes use of the same features I made use of for
fib_table_lookup to streamline fib_find_node. The resultant code should be
smaller and run faster than the original.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is meant to reduce the complexity of fib_table_lookup by reducing
the number of variables to the bare minimum while still keeping the same if
not improved functionality versus the original.
Most of this change was started off by the desire to rid the function of
chopped_off and current_prefix_length as they actually added very little to
the function since they only applied when computing the cindex. I was able
to replace them mostly with just a check for the prefix match. As long as
the prefix between the key and the node being tested was the same we know
we can search the tnode fully versus just testing cindex 0.
The second portion of the change ended up being a massive reordering.
Originally the calls to check_leaf were up near the start of the loop, and
the backtracing and descending into lower levels of tnodes was later. This
didn't make much sense as the structure of the tree means the leaves are
always the last thing to be tested. As such I reordered things so that we
instead have a loop that will delve into the tree and only exit when we
have either found a leaf or we have exhausted the tree. The advantage of
rearranging things like this is that we can fully inline check_leaf since
there is now only one reference to it in the function.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change makes it so that leaf and tnode are the same struct. As a
result there is no need for rt_trie_node anymore since everyting can be
merged into tnode.
On 32b systems this results in the leaf being 4 bytes larger, however I
don't know if that is really an issue as this and an eariler patch that
added bits & pos have increased the size from 20 to 28. If I am not
mistaken slub/slab allocate on power of 2 sizes so 20 was likely being
rounded up to 32 anyway.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both the leaf and the tnode had an rcu_head in them, but they had them in
slightly different places. Since we now have them in the same spot and
know that any node with bits == 0 is a leaf and the rest are either vmalloc
or kmalloc tnodes depending on the value of bits it makes it easy to combine
the functions and reduce overhead.
In addition I have taken advantage of the rcu_head pointer to go ahead and
put together a simple linked list instead of using the tnode pointer as
this way we can merge either type of structure for freeing.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change makes some fundamental changes to the way leaves and tnodes are
constructed. The big differences are:
1. Leaves now populate pos and bits indicating their full key size.
2. Trie nodes now mask out their lower bits to be consistent with the leaf
3. Both structures have been reordered so that rt_trie_node now consisists
of a much larger region including the pos, bits, and rcu portions of
the tnode structure.
On 32b systems this will result in the leaf being 4B larger as the pos and
bits values were added to a hole created by the key as it was only 4B in
length.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The trie usage stats were currently being shared by all threads that were
calling fib_table_lookup. As a result when multiple threads were
performing lookups simultaneously the trie would begin to cache bounce
between those threads.
In order to prevent this I have updated the usage stats to use a set of
percpu variables. By doing this we should be able to avoid the cache
bouncing and still make use of these stats.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The GRE tap device supports Ethernet over GRE, but doesn't
care about the source address of the tunnel, therefore it
can be changed without bring device down.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Today vport-send has complex error handling because it involves
freeing skb and updating stats depending on return value from
vport send implementation.
This can be simplified by delegating responsibility of freeing
skb to the vport implementation for all cases. So that
vport-send needs just update stats.
Fixes: 91b7514cdf ("openvswitch: Unify vport error stats
handling")
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure this fetches 16-bits port data from the register.
Remove casting to make sparse happy, not needed anymore.
Signed-off-by: leroy christophe <christophe.leroy@c-s.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently, searching for a socket to add a reference to is not
synchronized with deletion of sockets. This can result in use
after free if there is another operation that is removing a
socket at the same time. Solving this requires both holding the
appropriate lock and checking the refcount to ensure that it
has not already hit zero.
Inspired by a related (but not exactly the same) issue in the
VXLAN driver.
Fixes: 0b5e8b8e ("net: Add Geneve tunneling protocol driver")
CC: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sockets aren't currently removed from the the global list when
they are destroyed. In addition, offload handlers need to be cleaned
up as well.
Fixes: 0b5e8b8e ("net: Add Geneve tunneling protocol driver")
CC: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The encap->type comes straight from Netlink. Validate it against
max supported encap types just like ip_encap_hlen() already does.
Fixes: a8c5f9 ("ip_tunnel: Ops registration for secondary encap (fou, gue)")
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The symbols are exported and could be used by external modules.
Fixes: a8c5f9 ("ip_tunnel: Ops registration for secondary encap (fou, gue)")
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The NBMA GRE tunnels temporarily push GRE header that contain the
per-packet NBMA destination on the skb via header ops early in xmit
path. It is the later pulled before the real GRE header is constructed.
The inner mac was thus set differently in nbma case: the GRE header
has been pushed by neighbor layer, and mac header points to beginning
of the temporary gre header (set by dev_queue_xmit).
Now that the offloads expect mac header to point to the gre payload,
fix the xmit patch to:
- pull first the temporary gre header away
- and reset mac header to point to gre payload
This fixes tso to work again with nbma tunnels.
Fixes: 14051f0452 ("gre: Use inner mac length when computing tunnel length")
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Cc: Tom Herbert <therbert@google.com>
Cc: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch addresses an issue with the level compression of the fib_trie.
Specifically in the case of adding a new leaf that triggers a new node to
be added that takes the place of the old node. The result is a trie where
the 1 child tnode is on one side and one leaf is on the other which gives
you a very deep trie. Below is the script I used to generate a trie on
dummy0 with a 10.X.X.X family of addresses.
ip link add type dummy
ipval=184549374
bit=2
for i in `seq 1 23`
do
ifconfig dummy0:$bit $ipval/8
ipval=`expr $ipval - $bit`
bit=`expr $bit \* 2`
done
cat /proc/net/fib_triestat
Running the script before the patch:
Local:
Aver depth: 10.82
Max depth: 23
Leaves: 29
Prefixes: 30
Internal nodes: 27
1: 26 2: 1
Pointers: 56
Null ptrs: 1
Total size: 5 kB
After applying the patch and repeating:
Local:
Aver depth: 4.72
Max depth: 9
Leaves: 29
Prefixes: 30
Internal nodes: 12
1: 3 2: 2 3: 7
Pointers: 70
Null ptrs: 30
Total size: 4 kB
What this fix does is start the rebalance at the newly created tnode
instead of at the parent tnode. This way if there is a gap between the
parent and the new node it doesn't prevent the new tnode from being
coalesced with any pre-existing nodes that may have been pushed into one
of the new nodes child branches.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) New offloading infrastructure and example 'rocker' driver for
offloading of switching and routing to hardware.
This work was done by a large group of dedicated individuals, not
limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend,
Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu
2) Start making the networking operate on IOV iterators instead of
modifying iov objects in-situ during transfers. Thanks to Al Viro
and Herbert Xu.
3) A set of new netlink interfaces for the TIPC stack, from Richard
Alpe.
4) Remove unnecessary looping during ipv6 routing lookups, from Martin
KaFai Lau.
5) Add PAUSE frame generation support to gianfar driver, from Matei
Pavaluca.
6) Allow for larger reordering levels in TCP, which are easily
achievable in the real world right now, from Eric Dumazet.
7) Add a variable of napi_schedule that doesn't need to disable cpu
interrupts, from Eric Dumazet.
8) Use a doubly linked list to optimize neigh_parms_release(), from
Nicolas Dichtel.
9) Various enhancements to the kernel BPF verifier, and allow eBPF
programs to actually be attached to sockets. From Alexei
Starovoitov.
10) Support TSO/LSO in sunvnet driver, from David L Stevens.
11) Allow controlling ECN usage via routing metrics, from Florian
Westphal.
12) Remote checksum offload, from Tom Herbert.
13) Add split-header receive, BQL, and xmit_more support to amd-xgbe
driver, from Thomas Lendacky.
14) Add MPLS support to openvswitch, from Simon Horman.
15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen
Klassert.
16) Do gro flushes on a per-device basis using a timer, from Eric
Dumazet. This tries to resolve the conflicting goals between the
desired handling of bulk vs. RPC-like traffic.
17) Allow userspace to ask for the CPU upon what a packet was
received/steered, via SO_INCOMING_CPU. From Eric Dumazet.
18) Limit GSO packets to half the current congestion window, from Eric
Dumazet.
19) Add a generic helper so that all drivers set their RSS keys in a
consistent way, from Eric Dumazet.
20) Add xmit_more support to enic driver, from Govindarajulu
Varadarajan.
21) Add VLAN packet scheduler action, from Jiri Pirko.
22) Support configurable RSS hash functions via ethtool, from Eyal
Perry.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits)
Fix race condition between vxlan_sock_add and vxlan_sock_release
net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header
net/mlx4: Add support for A0 steering
net/mlx4: Refactor QUERY_PORT
net/mlx4_core: Add explicit error message when rule doesn't meet configuration
net/mlx4: Add A0 hybrid steering
net/mlx4: Add mlx4_bitmap zone allocator
net/mlx4: Add a check if there are too many reserved QPs
net/mlx4: Change QP allocation scheme
net/mlx4_core: Use tasklet for user-space CQ completion events
net/mlx4_core: Mask out host side virtualization features for guests
net/mlx4_en: Set csum level for encapsulated packets
be2net: Export tunnel offloads only when a VxLAN tunnel is created
gianfar: Fix dma check map error when DMA_API_DEBUG is enabled
cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call
net: fec: only enable mdio interrupt before phy device link up
net: fec: clear all interrupt events to support i.MX6SX
net: fec: reset fep link status in suspend function
net: sock: fix access via invalid file descriptor
net: introduce helper macro for_each_cmsghdr
...
Introduce helper macro for_each_cmsghdr as a wrapper of the enumerating
cmsghdr from msghdr, just cleanup.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge first patchbomb from Andrew Morton:
- a few minor cifs fixes
- dma-debug upadtes
- ocfs2
- slab
- about half of MM
- procfs
- kernel/exit.c
- panic.c tweaks
- printk upates
- lib/ updates
- checkpatch updates
- fs/binfmt updates
- the drivers/rtc tree
- nilfs
- kmod fixes
- more kernel/exit.c
- various other misc tweaks and fixes
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
exit: pidns: fix/update the comments in zap_pid_ns_processes()
exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
exit: exit_notify: re-use "dead" list to autoreap current
exit: reparent: call forget_original_parent() under tasklist_lock
exit: reparent: avoid find_new_reaper() if no children
exit: reparent: introduce find_alive_thread()
exit: reparent: introduce find_child_reaper()
exit: reparent: document the ->has_child_subreaper checks
exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper()
exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting
exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
exit: proc: don't try to flush /proc/tgid/task/tgid
exit: release_task: fix the comment about group leader accounting
exit: wait: drop tasklist_lock before psig->c* accounting
exit: wait: don't use zombie->real_parent
exit: wait: cleanup the ptrace_reparented() checks
usermodehelper: kill the kmod_thread_locker logic
usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper()
fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp
nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races
...
Memory is internally accounted in bytes, using spinlock-protected 64-bit
counters, even though the smallest accounting delta is a page. The
counter interface is also convoluted and does too many things.
Introduce a new lockless word-sized page counter API, then change all
memory accounting over to it. The translation from and to bytes then only
happens when interfacing with userspace.
The removed locking overhead is noticable when scaling beyond the per-cpu
charge caches - on a 4-socket machine with 144-threads, the following test
shows the performance differences of 288 memcgs concurrently running a
page fault benchmark:
vanilla:
18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% )
1,380,638 context-switches # 0.074 K/sec ( +- 0.75% )
24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% )
1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% )
50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% )
1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% )
1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% )
132.474343877 seconds time elapsed ( +- 0.21% )
lockless:
12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% )
832,850 context-switches # 0.068 K/sec ( +- 0.54% )
15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% )
1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% )
32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% )
2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% )
1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% )
91.369330729 seconds time elapsed ( +- 0.45% )
On top of improved scalability, this also gets rid of the icky long long
types in the very heart of memcg, which is great for 32 bit and also makes
the code a lot more readable.
Notable differences between the old and new API:
- res_counter_charge() and res_counter_charge_nofail() become
page_counter_try_charge() and page_counter_charge() resp. to match
the more common kernel naming scheme of try_do()/do()
- res_counter_uncharge_until() is only ever used to cancel a local
counter and never to uncharge bigger segments of a hierarchy, so
it's replaced by the simpler page_counter_cancel()
- res_counter_set_limit() is replaced by page_counter_limit(), which
expects its callers to serialize against themselves
- res_counter_memparse_write_strategy() is replaced by
page_counter_limit(), which rounds down to the nearest page size -
rather than up. This is more reasonable for explicitely requested
hard upper limits.
- to keep charging light-weight, page_counter_try_charge() charges
speculatively, only to roll back if the result exceeds the limit.
Because of this, a failing bigger charge can temporarily lock out
smaller charges that would otherwise succeed. The error is bounded
to the difference between the smallest and the biggest possible
charge size, so for memcg, this means that a failing THP charge can
send base page charges into reclaim upto 2MB (4MB) before the limit
would have been reached. This should be acceptable.
[akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
[akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull VFS changes from Al Viro:
"First pile out of several (there _definitely_ will be more). Stuff in
this one:
- unification of d_splice_alias()/d_materialize_unique()
- iov_iter rewrite
- killing a bunch of ->f_path.dentry users (and f_dentry macro).
Getting that completed will make life much simpler for
unionmount/overlayfs, since then we'll be able to limit the places
sensitive to file _dentry_ to reasonably few. Which allows to have
file_inode(file) pointing to inode in a covered layer, with dentry
pointing to (negative) dentry in union one.
Still not complete, but much closer now.
- crapectomy in lustre (dead code removal, mostly)
- "let's make seq_printf return nothing" preparations
- assorted cleanups and fixes
There _definitely_ will be more piles"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
copy_from_iter_nocache()
new helper: iov_iter_kvec()
csum_and_copy_..._iter()
iov_iter.c: handle ITER_KVEC directly
iov_iter.c: convert copy_to_iter() to iterate_and_advance
iov_iter.c: convert copy_from_iter() to iterate_and_advance
iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
iov_iter.c: convert iov_iter_zero() to iterate_and_advance
iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
iov_iter.c: iterate_and_advance
iov_iter.c: macros for iterating over iov_iter
kill f_dentry macro
dcache: fix kmemcheck warning in switch_names
new helper: audit_file()
nfsd_vfs_write(): use file_inode()
ncpfs: use file_inode()
kill f_dentry uses
lockd: get rid of ->f_path.dentry->d_sb
...
Conflicts:
drivers/net/ethernet/amd/xgbe/xgbe-desc.c
drivers/net/ethernet/renesas/sh_eth.c
Overlapping changes in both conflict cases.
Signed-off-by: David S. Miller <davem@davemloft.net>
When I cooked commit c3658e8d0f ("tcp: fix possible NULL dereference in
tcp_vX_send_reset()") I missed other spots we could deref a NULL
skb_dst(skb)
Again, if a socket is provided, we do not need skb_dst() to get a
pointer to network namespace : sock_net(sk) is good enough.
Reported-by: Dann Frazier <dann.frazier@canonical.com>
Bisected-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: ca777eff51 ("tcp: remove dst refcount false sharing for prequeue mode")
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 95bd09eb27 ("tcp: TSO packets automatic sizing") tried to
control TSO size, but did this at the wrong place (sendmsg() time)
At sendmsg() time, we might have a pessimistic view of flow rate,
and we end up building very small skbs (with 2 MSS per skb).
This is bad because :
- It sends small TSO packets even in Slow Start where rate quickly
increases.
- It tends to make socket write queue very big, increasing tcp_ack()
processing time, but also increasing memory needs, not necessarily
accounted for, as fast clones overhead is currently ignored.
- Lower GRO efficiency and more ACK packets.
Servers with a lot of small lived connections suffer from this.
Lets instead fill skbs as much as possible (64KB of payload), but split
them at xmit time, when we have a precise idea of the flow rate.
skb split is actually quite efficient.
Patch looks bigger than necessary, because TCP Small Queue decision now
has to take place after the eventual split.
As Neal suggested, introduce a new tcp_tso_autosize() helper, so that
tcp_tso_should_defer() can be synchronized on same goal.
Rename tp->xmit_size_goal_segs to tp->gso_segs, as this variable
contains number of mss that we can put in GSO packet, and is not
related to the autosizing goal anymore.
Tested:
40 ms rtt link
nstat >/dev/null
netperf -H remote -l -2000000 -- -s 1000000
nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets"
Before patch :
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/s
87380 2000000 2000000 0.36 44.22
IpInReceives 600 0.0
IpOutRequests 599 0.0
TcpOutSegs 1397 0.0
IpExtOutOctets 2033249 0.0
After patch :
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 2000000 2000000 0.36 44.27
IpInReceives 221 0.0
IpOutRequests 232 0.0
TcpOutSegs 1397 0.0
IpExtOutOctets 2013953 0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Note that the code _using_ ->msg_iter at that point will be very
unhappy with anything other than unshifted iovec-backed iov_iter.
We still need to convert users to proper primitives.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In commit 2b4636a5f8 ("tcp_cubic: make the delay threshold of HyStart
less sensitive"), HYSTART_DELAY_MIN was changed to 4 ms.
The remaining problem is that using delay_min + (delay_min/16) as the
threshold is too sensitive.
6.25 % of variation is too small for rtt above 60 ms, which are not
uncommon.
Lets use 12.5 % instead (delay_min + (delay_min/8))
Tested:
80 ms RTT between peers, FQ/pacing packet scheduler on sender.
10 bulk transfers of 10 seconds :
nstat >/dev/null
for i in `seq 1 10`
do
netperf -H remote -- -k THROUGHPUT | grep THROUGHPUT
done
nstat | grep Hystart
With the 6.25 % threshold :
THROUGHPUT=20.66
THROUGHPUT=249.38
THROUGHPUT=254.10
THROUGHPUT=14.94
THROUGHPUT=251.92
THROUGHPUT=237.73
THROUGHPUT=19.18
THROUGHPUT=252.89
THROUGHPUT=21.32
THROUGHPUT=15.58
TcpExtTCPHystartTrainDetect 2 0.0
TcpExtTCPHystartTrainCwnd 4756 0.0
TcpExtTCPHystartDelayDetect 5 0.0
TcpExtTCPHystartDelayCwnd 180 0.0
With the 12.5 % threshold
THROUGHPUT=251.09
THROUGHPUT=247.46
THROUGHPUT=250.92
THROUGHPUT=248.91
THROUGHPUT=250.88
THROUGHPUT=249.84
THROUGHPUT=250.51
THROUGHPUT=254.15
THROUGHPUT=250.62
THROUGHPUT=250.89
TcpExtTCPHystartTrainDetect 1 0.0
TcpExtTCPHystartTrainCwnd 3175 0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When deploying FQ pacing, one thing we noticed is that CUBIC Hystart
triggers too soon.
Having SNMP counters to have an idea of how often the various Hystart
methods trigger is useful prior to any modifications.
This patch adds SNMP counters tracking, how many time "ack train" or
"Delay" based Hystart triggers, and cumulative sum of cwnd at the time
Hystart decided to end SS (Slow Start)
myhost:~# nstat -a | grep Hystart
TcpExtTCPHystartTrainDetect 9 0.0
TcpExtTCPHystartTrainCwnd 20650 0.0
TcpExtTCPHystartDelayDetect 10 0.0
TcpExtTCPHystartDelayCwnd 360 0.0
->
Train detection was triggered 9 times, and average cwnd was
20650/9=2294,
Delay detection was triggered 10 times and average cwnd was 36
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The compute_score functions are a bit difficult to read.
Neaten them a bit to reduce object sizes and make them a
bit more intelligible.
Return early to avoid indentation and avoid unnecessary
initializations.
(allyesconfig, but w/ -O2 and no profiling)
$ size net/ipv[46]/udp.o.*
text data bss dec hex filename
28680 1184 25 29889 74c1 net/ipv4/udp.o.new
28756 1184 25 29965 750d net/ipv4/udp.o.old
17600 1010 2 18612 48b4 net/ipv6/udp.o.new
17632 1010 2 18644 48d4 net/ipv6/udp.o.old
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow reading of timestamps and cmsg at the same time on all relevant
socket families. One use is to correlate timestamps with egress
device, by asking for cmsg IP_PKTINFO.
on AF_INET sockets, call the relevant function (ip_cmsg_recv). To
avoid changing legacy expectations, only do so if the caller sets a
new timestamping flag SOF_TIMESTAMPING_OPT_CMSG.
on AF_INET6 sockets, IPV6_PKTINFO and all other recv cmsg are already
returned for all origins. only change is to set ifindex, which is
not initialized for all error origins.
In both cases, only generate the pktinfo message if an ifindex is
known. This is not the case for ACK timestamps.
The difference between the protocol families is probably a historical
accident as a result of the different conditions for generating cmsg
in the relevant ip(v6)_recv_error function:
ipv4: if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) {
ipv6: if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) {
At one time, this was the same test bar for the ICMP/ICMP6
distinction. This is no longer true.
Signed-off-by: Willem de Bruijn <willemb@google.com>
----
Changes
v1 -> v2
large rewrite
- integrate with existing pktinfo cmsg generation code
- on ipv4: only send with new flag, to maintain legacy behavior
- on ipv6: send at most a single pktinfo cmsg
- on ipv6: initialize fields if not yet initialized
The recv cmsg interfaces are also relevant to the discussion of
whether looping packet headers is problematic. For v6, cmsgs that
identify many headers are already returned. This patch expands
that to v4. If it sounds reasonable, I will follow with patches
1. request timestamps without payload with SOF_TIMESTAMPING_OPT_TSONLY
(http://patchwork.ozlabs.org/patch/366967/)
2. sysctl to conditionally drop all timestamps that have payload or
cmsg from users without CAP_NET_RAW.
Signed-off-by: David S. Miller <davem@davemloft.net>
One line change, in response to catching an occurrence of this bug.
See also fix f4713a3dfa ("net-timestamp: make tcp_recvmsg call ...")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set the inner mac header to point to the GRE payload when
doing GRO. This is needed if we proceed to send the packet
through GRE GSO which now uses the inner mac header instead
of inner network header to determine the length of encapsulation
headers.
Fixes: 14051f0452 ("gre: Use inner mac length when computing tunnel length")
Reported-by: Wolfgang Walter <linux@stwm.de>
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following batch contains netfilter updates for net-next. Basically,
enhancements for xt_recent, skip zeroing of timer in conntrack, fix
linking problem with recent redirect support for nf_tables, ipset
updates and a couple of cleanups. More specifically, they are:
1) Rise maximum number per IP address to be remembered in xt_recent
while retaining backward compatibility, from Florian Westphal.
2) Skip zeroing timer area in nf_conn objects, also from Florian.
3) Inspect IPv4 and IPv6 traffic from the bridge to allow filtering using
using meta l4proto and transport layer header, from Alvaro Neira.
4) Fix linking problems in the new redirect support when CONFIG_IPV6=n
and IP6_NF_IPTABLES=n.
And ipset updates from Jozsef Kadlecsik:
5) Support updating element extensions when the set is full (fixes
netfilter bugzilla id 880).
6) Fix set match with 32-bits userspace / 64-bits kernel.
7) Indicate explicitly when /0 networks are supported in ipset.
8) Simplify cidr handling for hash:*net* types.
9) Allocate the proper size of memory when /0 networks are supported.
10) Explicitly add padding elements to hash:net,net and hash:net,port,
because the elements must be u32 sized for the used hash function.
Jozsef is also cooking ipset RCU conversion which should land soon if
they reach the merge window in time.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This resolves linking problems with CONFIG_IPV6=n:
net/built-in.o: In function `redirect_tg6':
xt_REDIRECT.c:(.text+0x6d021): undefined reference to `nf_nat_redirect_ipv6'
Reported-by: Andreas Ruprecht <rupran@einserver.de>
Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
TCP timestamping introduced MSG_ERRQUEUE handling for TCP sockets.
If the socket is of family AF_INET6, call ipv6_recv_error instead
of ip_recv_error.
This change is more complex than a single branch due to the loadable
ipv6 module. It reuses a pre-existing indirect function call from
ping. The ping code is safe to call, because it is part of the core
ipv6 module and always present when AF_INET6 sockets are active.
Fixes: 4ed2d765 (net-timestamp: TCP timestamping)
Signed-off-by: Willem de Bruijn <willemb@google.com>
----
It may also be worthwhile to add WARN_ON_ONCE(sk->family == AF_INET6)
to ip_recv_error.
Signed-off-by: David S. Miller <davem@davemloft.net>
Change remote checksum offload to call remcsum_adjust. This also
eliminates the optimization to skip an IP header as part of the
adjustment (really does not seem to be much of a win).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit ca777eff51 ("tcp: remove dst refcount false sharing for
prequeue mode") we have to relax check against skb dst in
tcp_v[46]_send_reset() if prequeue dropped the dst.
If a socket is provided, a full lookup was done to find this socket,
so the dst test can be skipped.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=88191
Reported-by: Jaša Bartelj <jasa.bartelj@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Daniel Borkmann <dborkman@redhat.com>
Fixes: ca777eff51 ("tcp: remove dst refcount false sharing for prequeue mode")
Signed-off-by: David S. Miller <davem@davemloft.net>
ping_lookup() may return a wrong sock if sk_buff's and sock's protocols
dont' match. For example, sk_buff's protocol is ETH_P_IPV6, but sock's
sk_family is AF_INET, in that case, if sk->sk_bound_dev_if is zero, a wrong
sock will be returned.
the fix is to "continue" the searching, if no matching, return NULL.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Jane Zhou <a17711@motorola.com>
Signed-off-by: Yiwei Zhao <gbjc64@motorola.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
netfilter/ipvs updates for net-next
The following patchset contains Netfilter updates for your net-next
tree, this includes the NAT redirection support for nf_tables, the
cgroup support for nft meta and conntrack zone support for the connlimit
match. Coming after those, a bunch of sparse warning fixes, missing
netns bits and cleanups. More specifically, they are:
1) Prepare IPv4 and IPv6 NAT redirect code to use it from nf_tables,
patches from Arturo Borrero.
2) Introduce the nf_tables redir expression, from Arturo Borrero.
3) Remove an unnecessary assignment in ip_vs_xmit/__ip_vs_get_out_rt().
Patch from Alex Gartrell.
4) Add nft_log_dereference() macro to the nf_log infrastructure, patch
from Marcelo Leitner.
5) Add some extra validation when registering logger families, also
from Marcelo.
6) Some spelling cleanups from stephen hemminger.
7) Fix sparse warning in nf_logger_find_get().
8) Add cgroup support to nf_tables meta, patch from Ana Rey.
9) A Kconfig fix for the new redir expression and fix sparse warnings in
the new redir expression.
10) Fix several sparse warnings in the netfilter tree, from
Florian Westphal.
11) Reduce verbosity when OOM in nfnetlink_log. User can basically do
nothing when this situation occurs.
12) Add conntrack zone support to xt_connlimit, again from Florian.
13) Add netnamespace support to the h323 conntrack helper, contributed
by Vasily Averin.
14) Remove unnecessary nul-pointer checks before free_percpu() and
module_put(), from Markus Elfring.
15) Use pr_fmt in nfnetlink_log, again patch from Marcelo Leitner.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Now the vti_link_ops do not point the .dellink, for fb tunnel device
(ip_vti0), the net_device will be removed as the default .dellink is
unregister_netdevice_queue,but the tunnel still in the tunnel list,
then if we add a new vti tunnel, in ip_tunnel_find():
hlist_for_each_entry_rcu(t, head, hash_node) {
if (local == t->parms.iph.saddr &&
remote == t->parms.iph.daddr &&
link == t->parms.link &&
==> type == t->dev->type &&
ip_tunnel_key_match(&t->parms, flags, key))
break;
}
the panic will happen, cause dev of ip_tunnel *t is null:
[ 3835.072977] IP: [<ffffffffa04103fd>] ip_tunnel_find+0x9d/0xc0 [ip_tunnel]
[ 3835.073008] PGD b2c21067 PUD b7277067 PMD 0
[ 3835.073008] Oops: 0000 [#1] SMP
.....
[ 3835.073008] Stack:
[ 3835.073008] ffff8800b72d77f0 ffffffffa0411924 ffff8800bb956000 ffff8800b72d78e0
[ 3835.073008] ffff8800b72d78a0 0000000000000000 ffffffffa040d100 ffff8800b72d7858
[ 3835.073008] ffffffffa040b2e3 0000000000000000 0000000000000000 0000000000000000
[ 3835.073008] Call Trace:
[ 3835.073008] [<ffffffffa0411924>] ip_tunnel_newlink+0x64/0x160 [ip_tunnel]
[ 3835.073008] [<ffffffffa040b2e3>] vti_newlink+0x43/0x70 [ip_vti]
[ 3835.073008] [<ffffffff8150d4da>] rtnl_newlink+0x4fa/0x5f0
[ 3835.073008] [<ffffffff812f68bb>] ? nla_strlcpy+0x5b/0x70
[ 3835.073008] [<ffffffff81508fb0>] ? rtnl_link_ops_get+0x40/0x60
[ 3835.073008] [<ffffffff8150d11f>] ? rtnl_newlink+0x13f/0x5f0
[ 3835.073008] [<ffffffff81509cf4>] rtnetlink_rcv_msg+0xa4/0x270
[ 3835.073008] [<ffffffff8126adf5>] ? sock_has_perm+0x75/0x90
[ 3835.073008] [<ffffffff81509c50>] ? rtnetlink_rcv+0x30/0x30
[ 3835.073008] [<ffffffff81529e39>] netlink_rcv_skb+0xa9/0xc0
[ 3835.073008] [<ffffffff81509c48>] rtnetlink_rcv+0x28/0x30
....
modprobe ip_vti
ip link del ip_vti0 type vti
ip link add ip_vti0 type vti
rmmod ip_vti
do that one or more times, kernel will panic.
fix it by assigning ip_tunnel_dellink to vti_link_ops' dellink, in
which we skip the unregister of fb tunnel device. do the same on ip6_vti.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ieee802154/fakehard.c
A bug fix went into 'net' for ieee802154/fakehard.c, which is removed
in 'net-next'.
Add build fix into the merge from Stephen Rothwell in openvswitch, the
logging macros take a new initial 'log' argument, a new call was added
in 'net' so when we merge that in here we have to explicitly add the
new 'log' arg to it else the build fails.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit c3ae62af8e ("tcp: should drop incoming frames without ACK
flag set") was created to mitigate a security vulnerability in which a
local attacker is able to inject data into locally-opened sockets by
using TCP protocol statistics in procfs to quickly find the correct
sequence number.
This broke the RFC5961 requirement to send a challenge ACK in response
to spurious RST packets, which was subsequently fixed by commit
7b514a886b ("tcp: accept RST without ACK flag").
Unfortunately, the RFC5961 requirement that spurious SYN packets be
handled in a similar manner remains broken.
RFC5961 section 4 states that:
... the handling of the SYN in the synchronized state SHOULD be
performed as follows:
1) If the SYN bit is set, irrespective of the sequence number, TCP
MUST send an ACK (also referred to as challenge ACK) to the remote
peer:
<SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
After sending the acknowledgment, TCP MUST drop the unacceptable
segment and stop processing further.
By sending an ACK, the remote peer is challenged to confirm the loss
of the previous connection and the request to start a new connection.
A legitimate peer, after restart, would not have a TCB in the
synchronized state. Thus, when the ACK arrives, the peer should send
a RST segment back with the sequence number derived from the ACK
field that caused the RST.
This RST will confirm that the remote peer has indeed closed the
previous connection. Upon receipt of a valid RST, the local TCP
endpoint MUST terminate its connection. The local TCP endpoint
should then rely on SYN retransmission from the remote end to
re-establish the connection.
This patch lets SYN packets through the discard added in c3ae62af8e,
so that spurious SYN packets are properly dealt with as per the RFC.
The challenge ACK is sent unconditionally and is rate-limited, so the
original vulnerability is not reintroduced by this patch.
Signed-off-by: Calvin Owens <calvinowens@fb.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use them to push skb->vlan_tci into the payload and avoid code
duplication.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Name fits better. Plus there's going to be introduced
__vlan_insert_tag later on.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While working on sk_forward_alloc problems reported by Denys
Fedoryshchenko, we found that tcp connect() (and fastopen) do not call
sk_wmem_schedule() for SYN packet (and/or SYN/DATA packet), so
sk_forward_alloc is negative while connect is in progress.
We can fix this by calling regular sk_stream_alloc_skb() both for the
SYN packet (in tcp_connect()) and the syn_data packet in
tcp_send_syn_data()
Then, tcp_send_syn_data() can avoid copying syn_data as we simply
can manipulate syn_data->cb[] to remove SYN flag (and increment seq)
Instead of open coding memcpy_fromiovecend(), simply use this helper.
This leaves in socket write queue clean fast clone skbs.
This was tested against our fastopen packetdrill tests.
Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If icmp_rcv() has successfully processed the incoming ICMP datagram, we
should use consume_skb() rather than kfree_skb() because a hit on the likes
of perf -e skb:kfree_skb is not called-for.
Signed-off-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It has been reported that generating an MLD listener report on
devices with large MTUs (e.g. 9000) and a high number of IPv6
addresses can trigger a skb_over_panic():
skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
dev:port1
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:100!
invalid opcode: 0000 [#1] SMP
Modules linked in: ixgbe(O)
CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
[...]
Call Trace:
<IRQ>
[<ffffffff80578226>] ? skb_put+0x3a/0x3b
[<ffffffff80612a5d>] ? add_grhead+0x45/0x8e
[<ffffffff80612e3a>] ? add_grec+0x394/0x3d4
[<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
[<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
[<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
[<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
[<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
[<ffffffff8025112b>] ? irq_exit+0x4e/0xd3
[<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
[<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70
mld_newpack() skb allocations are usually requested with dev->mtu
in size, since commit 72e09ad107 ("ipv6: avoid high order allocations")
we have changed the limit in order to be less likely to fail.
However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
macros, which determine if we may end up doing an skb_put() for
adding another record. To avoid possible fragmentation, we check
the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
assumption as the actual max allocation size can be much smaller.
The IGMP case doesn't have this issue as commit 57e1ab6ead
("igmp: refine skb allocations") stores the allocation size in
the cb[].
Set a reserved_tailroom to make it fit into the MTU and use
skb_availroom() helper instead. This also allows to get rid of
igmp_skb_size().
Reported-by: Wei Liu <lw1a2.jing@gmail.com>
Fixes: 72e09ad107 ("ipv6: avoid high order allocations")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: David L Stevens <david.stevens@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter updates for your net tree,
they are:
1) Fix missing initialization of the range structure (allocated in the
stack) in nft_masq_{ipv4, ipv6}_eval, from Daniel Borkmann.
2) Make sure the data we receive from userspace contains the req_version
structure, otherwise return an error incomplete on truncated input.
From Dan Carpenter.
3) Fix handling og skb->sk which may cause incorrect handling
of connections from a local process. Via Simon Horman, patch from
Calvin Owens.
4) Fix wrong netns in nft_compat when setting target and match params
structure.
5) Relax chain type validation in nft_compat that was recently included,
this broke the matches that need to be run from the route chain type.
Now iptables-test.py automated regression tests report success again
and we avoid the only possible problematic case, which is the use of
nat targets out of nat chain type.
6) Use match->table to validate the tablename, instead of the match->name.
Again patch for nft_compat.
7) Restore the synchronous release of objects from the commit and abort
path in nf_tables. This is causing two major problems: splats when using
nft_compat, given that matches and targets may sleep and call_rcu is
invoked from softirq context. Moreover Patrick reported possible event
notification reordering when rules refer to anonymous sets.
8) Fix race condition in between packets that are being confirmed by
conntrack and the ctnetlink flush operation. This happens since the
removal of the central spinlock. Thanks to Jesper D. Brouer to looking
into this.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Trying to add an unreachable route incorrectly returns -ESRCH if
if custom FIB rules are present:
[root@localhost ~]# ip route add 74.125.31.199 dev eth0 via 1.2.3.4
RTNETLINK answers: Network is unreachable
[root@localhost ~]# ip rule add to 55.66.77.88 table 200
[root@localhost ~]# ip route add 74.125.31.199 dev eth0 via 1.2.3.4
RTNETLINK answers: No such process
[root@localhost ~]#
Commit 83886b6b63 ("[NET]: Change "not found"
return value for rule lookup") changed fib_rules_lookup()
to use -ESRCH as a "not found" code internally, but for user space it
should be translated into -ENETUNREACH. Handle the translation centrally in
ipv4-specific fib_lookup(), leaving the DECnet case alone.
On a related note, commit b7a71b51ee
("ipv4: removed redundant conditional") removed a similar translation from
ip_route_input_slow() prematurely AIUI.
Fixes: b7a71b51ee ("ipv4: removed redundant conditional")
Signed-off-by: Panu Matilainen <pmatilai@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/chelsio/cxgb4vf/sge.c
drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c
sge.c was overlapping two changes, one to use the new
__dev_alloc_page() in net-next, and one to use s->fl_pg_order in net.
ixgbe_phy.c was a set of overlapping whitespace changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
In DC world, GSO packets initially cooked by tcp_sendmsg() are usually
big, as sk_pacing_rate is high.
When network is congested, cwnd can be smaller than the GSO packets
found in socket write queue. tcp_write_xmit() splits GSO packets
using the available cwnd, and we end up sending a single GSO packet,
consuming all available cwnd.
With GRO aggregation on the receiver, we might handle a single GRO
packet, sending back a single ACK.
1) This single ACK might be lost
TLP or RTO are forced to attempt a retransmit.
2) This ACK releases a full cwnd, sender sends another big GSO packet,
in a ping pong mode.
This behavior does not fill the pipes in the best way, because of
scheduling artifacts.
Make sure we always have at least two GSO packets in flight.
This allows us to safely increase GRO efficiency without risking
spurious retransmits.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ipv4/fou.c: In function ‘ip_tunnel_encap_del_fou_ops’:
net/ipv4/fou.c:861:1: warning: no return statement in function returning non-void [-Wreturn-type]
Fixes: a8c5f90fb5 ("ip_tunnel: Ops registration for secondary encap (fou, gue)")
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/bridge/br_netfilter.c:870:6: symbol 'br_netfilter_enable' was not declared. Should it be static?
no; add include
net/ipv4/netfilter/nft_reject_ipv4.c:22:6: symbol 'nft_reject_ipv4_eval' was not declared. Should it be static?
yes
net/ipv6/netfilter/nf_reject_ipv6.c:16:6: symbol 'nf_send_reset6' was not declared. Should it be static?
no; add include
net/ipv6/netfilter/nft_reject_ipv6.c:22:6: symbol 'nft_reject_ipv6_eval' was not declared. Should it be static?
yes
net/netfilter/core.c:33:32: symbol 'nf_ipv6_ops' was not declared. Should it be static?
no; add include
net/netfilter/xt_DSCP.c:40:57: cast truncates bits from constant value (ffffff03 becomes 3)
net/netfilter/xt_DSCP.c:57:59: cast truncates bits from constant value (ffffff03 becomes 3)
add __force, 3 is what we want.
net/ipv4/netfilter/nf_log_arp.c:77:6: symbol 'nf_log_arp_packet' was not declared. Should it be static?
yes
net/ipv4/netfilter/nf_reject_ipv4.c:17:6: symbol 'nf_send_reset' was not declared. Should it be static?
no; add include
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Instead of calling fou and gue functions directly from ip_tunnel
use ops for these that were previously registered. This patch adds the
logic to add and remove encapsulation operations for ip_tunnel,
and modified fou (and gue) to register with ip_tunnels.
This patch also addresses a circular dependency between ip_tunnel
and fou that was causing link errors when CONFIG_NET_IP_TUNNEL=y
and CONFIG_NET_FOU=m. References to fou an gue have been removed from
ip_tunnel.c
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Standardize function pointer uses.
Convert calling style from:
(*foo)(args...);
to:
foo(args...);
Other miscellanea:
o Add braces around loops with single ifs on multiple lines
o Realign arguments around these functions
o Invert logic in if to return immediately.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use IS_ENABLED(CONFIG_IPV6), to enable this code if IPv6 is
a module.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: c8e6ad0829 ("ipv6: honor IPV6_PKTINFO with v4 mapped addresses on sendmsg")
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently there are only three neigh tables in the whole kernel:
arp table, ndisc table and decnet neigh table. What's more,
we don't support registering multiple tables per family.
Therefore we can just make these tables statically built-in.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the more common dynamic_debug capable net_dbg_ratelimited
and remove the LIMIT_NETDEBUG macro.
All messages are still ratelimited.
Some KERN_<LEVEL> uses are changed to KERN_DEBUG.
This may have some negative impact on messages that were
emitted at KERN_INFO that are not not enabled at all unless
DEBUG is defined or dynamic_debug is enabled. Even so,
these messages are now _not_ emitted by default.
This also eliminates the use of the net_msg_warn sysctl
"/proc/sys/net/core/warnings". For backward compatibility,
the sysctl is not removed, but it has no function. The extern
declaration of net_msg_warn is removed from sock.h and made
static in net/core/sysctl_net_core.c
Miscellanea:
o Update the sysctl documentation
o Remove the embedded uses of pr_fmt
o Coalesce format fragments
o Realign arguments
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alternative to RPS/RFS is to use hardware support for multiple
queues.
Then split a set of million of sockets into worker threads, each
one using epoll() to manage events on its own socket pool.
Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
know after accept() or connect() on which queue/cpu a socket is managed.
We normally use one cpu per RX queue (IRQ smp_affinity being properly
set), so remembering on socket structure which cpu delivered last packet
is enough to solve the problem.
After accept(), connect(), or even file descriptor passing around
processes, applications can use :
int cpu;
socklen_t len = sizeof(cpu);
getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
And use this information to put the socket into the right silo
for optimal performance, as all networking stack should run
on the appropriate cpu, without need to send IPI (RPS/RFS).
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_mark_napi_id() is used to record for a flow napi id of incoming
packets for busypoll sake.
We should do this only on established flows, not on listeners.
This was 'working' by virtue of the socket cloning, but doing
this on SYN packets in unecessary cache line dirtying.
Even if we move sk_napi_id in the same cache line than sk_lock,
we are working to make SYN processing lockless, so it is desirable
to set sk_napi_id only for established flows.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When doing GRO processing for UDP tunnels, we never add
SKB_GSO_UDP_TUNNEL to gso_type - only the type of the inner protocol
is added (such as SKB_GSO_TCPV4). The result is that if the packet is
later resegmented we will do GSO but not treat it as a tunnel. This
results in UDP fragmentation of the outer header instead of (i.e.) TCP
segmentation of the inner header as was originally on the wire.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ever since raw_probe_proto_opt was added it had the problem of
causing the user iov to be read twice, once during the probe for
the protocol header and once again in ip_append_data.
This is a potential security problem since it means that whatever
we're probing may be invalid. This patch plugs the hole by
firstly advancing the iov so we don't read the same spot again,
and secondly saving what we read the first time around for use
by ip_append_data.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function raw_probe_proto_opt tries to extract the first two
bytes from the user input in order to seed the IPsec lookup for
ICMP packets. In doing so it's processing iovec by hand and
overcomplicating things.
This patch replaces the manual iovec processing with a call to
memcpy_fromiovecend.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When transferring from the original range in nf_nat_masquerade_{ipv4,ipv6}()
we copy over values from stack in from min_proto/max_proto due to uninitialized
range variable in both, nft_masq_{ipv4,ipv6}_eval. As we only initialize
flags at this time from nft_masq struct, just zero out the rest.
Fixes: 9ba1f726be ("netfilter: nf_tables: add new nft_masq expression")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
As NIC multicast filtering isn't perfect, and some platforms are
quite content to spew broadcasts, we should not trigger an event
for skb:kfree_skb when we do not have a match for such an incoming
datagram. We do though want to avoid sweeping the matter under the
rug entirely, so increment a suitable statistic.
This incorporates feedback from David L. Stevens, Karl Neiss and Eric
Dumazet.
V3 - use bool per David Miller
Signed-off-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pravin B Shelar says:
====================
Open vSwitch
First two patches are related to OVS MPLS support. Rest of patches
are mostly refactoring and minor improvements to openvswitch.
v1-v2:
- Fix conflicts due to "gue: Remote checksum offload"
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 64ce207306 ("[NET]: Make NETDEBUG pure printk wrappers")
originally had these NETDEBUG printks as always emitting.
Commit a2a316fd06 ("[NET]: Replace CONFIG_NET_DEBUG with sysctl")
added a net_msg_warn sysctl to these NETDEBUG uses.
Convert these NETDEBUG uses to normal pr_info calls.
This changes the output prefix from "ESP: " to include
"IPSec: " for the ipv4 case and "IPv6: " for the ipv6 case.
These output lines are now like the other messages in the files.
Other miscellanea:
Neaten the arithmetic spacing to be consistent with other
arithmetic spacing in the files.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These messages aren't useful as there's a generic dump_stack()
on OOM.
Neaten the comment and if test above the OOM by separating the
assign in if into an allocation then if test.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Device can export MPLS GSO support in dev->mpls_features same way
it export vlan features in dev->vlan_features. So it is safe to
remove NETIF_F_GSO_MPLS redundant flag.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
When filling netlink info, dport is being returned as flags. Fix
instances to return correct value.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It has been reported that generating an MLD listener report on
devices with large MTUs (e.g. 9000) and a high number of IPv6
addresses can trigger a skb_over_panic():
skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
dev:port1
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:100!
invalid opcode: 0000 [#1] SMP
Modules linked in: ixgbe(O)
CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
[...]
Call Trace:
<IRQ>
[<ffffffff80578226>] ? skb_put+0x3a/0x3b
[<ffffffff80612a5d>] ? add_grhead+0x45/0x8e
[<ffffffff80612e3a>] ? add_grec+0x394/0x3d4
[<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
[<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
[<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
[<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
[<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
[<ffffffff8025112b>] ? irq_exit+0x4e/0xd3
[<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
[<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70
mld_newpack() skb allocations are usually requested with dev->mtu
in size, since commit 72e09ad107 ("ipv6: avoid high order allocations")
we have changed the limit in order to be less likely to fail.
However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
macros, which determine if we may end up doing an skb_put() for
adding another record. To avoid possible fragmentation, we check
the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
assumption as the actual max allocation size can be much smaller.
The IGMP case doesn't have this issue as commit 57e1ab6ead
("igmp: refine skb allocations") stores the allocation size in
the cb[].
Set a reserved_tailroom to make it fit into the MTU and use
skb_availroom() helper instead. This also allows to get rid of
igmp_skb_size().
Reported-by: Wei Liu <lw1a2.jing@gmail.com>
Fixes: 72e09ad107 ("ipv6: avoid high order allocations")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: David L Stevens <david.stevens@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using a single fixed string is smaller code size than using
a format and many string arguments.
Reduces overall code size a little.
$ size net/ipv4/igmp.o* net/ipv6/mcast.o* net/ipv6/ip6_flowlabel.o*
text data bss dec hex filename
34269 7012 14824 56105 db29 net/ipv4/igmp.o.new
34315 7012 14824 56151 db57 net/ipv4/igmp.o.old
30078 7869 13200 51147 c7cb net/ipv6/mcast.o.new
30105 7869 13200 51174 c7e6 net/ipv6/mcast.o.old
11434 3748 8580 23762 5cd2 net/ipv6/ip6_flowlabel.o.new
11491 3748 8580 23819 5d0b net/ipv6/ip6_flowlabel.o.old
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ueki Kohei reported that when we are using NewReno with connections that
have a very low traffic, we may timeout the connection too early if a
second loss occurs after the first one was successfully acked but no
data was transfered later. Below is his description of it:
When SACK is disabled, and a socket suffers multiple separate TCP
retransmissions, that socket's ETIMEDOUT value is calculated from the
time of the *first* retransmission instead of the *latest*
retransmission.
This happens because the tcp_sock's retrans_stamp is set once then never
cleared.
Take the following connection:
Linux remote-machine
| |
send#1---->(*1)|--------> data#1 --------->|
| | |
RTO : :
| | |
---(*2)|----> data#1(retrans) ---->|
| (*3)|<---------- ACK <----------|
| | |
| : :
| : :
| : :
16 minutes (or more) :
| : :
| : :
| : :
| | |
send#2---->(*4)|--------> data#2 --------->|
| | |
RTO : :
| | |
---(*5)|----> data#2(retrans) ---->|
| | |
| | |
RTO*2 : :
| | |
| | |
ETIMEDOUT<----(*6)| |
(*1) One data packet sent.
(*2) Because no ACK packet is received, the packet is retransmitted.
(*3) The ACK packet is received. The transmitted packet is acknowledged.
At this point the first "retransmission event" has passed and been
recovered from. Any future retransmission is a completely new "event".
(*4) After 16 minutes (to correspond with retries2=15), a new data
packet is sent. Note: No data is transmitted between (*3) and (*4).
The socket's timeout SHOULD be calculated from this point in time, but
instead it's calculated from the prior "event" 16 minutes ago.
(*5) Because no ACK packet is received, the packet is retransmitted.
(*6) At the time of the 2nd retransmission, the socket returns
ETIMEDOUT.
Therefore, now we clear retrans_stamp as soon as all data during the
loss window is fully acked.
Reported-by: Ueki Kohei
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This encapsulates all of the skb_copy_datagram_iovec() callers
with call argument signature "skb, offset, msghdr->msg_iov, length".
When we move to iov_iters in the networking, the iov_iter object will
sit in the msghdr.
Having a helper like this means there will be less places to touch
during that transformation.
Based upon descriptions and patch from Al Viro.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add processing of the remote checksum offload option in both the normal
path as well as the GRO path. The implements patching the affected
checksum to derive the offloaded checksum.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add if_tunnel flag TUNNEL_ENCAP_FLAG_REMCSUM to configure
remote checksum offload on an IP tunnel. Add logic in gue_build_header
to insert remote checksum offload option.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new GSO type, SKB_GSO_TUNNEL_REMCSUM, which indicates remote
checksum offload being done (in this case inner checksum must not
be offloaded to the NIC).
Added logic in __skb_udp_tunnel_segment to handle remote checksum
offload case.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add functions and basic definitions for processing standard flags,
private flags, and control messages. This includes definitions
to compute length of optional fields corresponding to a set of flags.
Flag validation is in validate_gue_flags function. This checks for
unknown flags, and that length of optional fields is <= length
in guehdr hlen.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In __skb_udp_tunnel_segment if outer UDP checksums are enabled and
ip_summed is not already CHECKSUM_PARTIAL, set up checksum offload
if device features allow it.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move fou_build_header out of ip_tunnel.c and into fou.c splitting
it up into fou_build_header, gue_build_header, and fou_build_udp.
This allows for other users for TX of FOU or GUE. Change ip_tunnel_encap
to call fou_build_header or gue_build_header based on the tunnel
encapsulation type. Similarly, added fou_encap_hlen and gue_encap_hlen
functions which are called by ip_encap_hlen. New net/fou.h has
prototypes and defines for this.
Added NET_FOU_IP_TUNNELS configuration. When this is set, IP tunnels
can use FOU/GUE and fou module is also selected.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The pernet ops aren't ever unregistered, which causes a memory
leak and an OOPs if the module is ever reinserted.
Fixes: 0b5e8b8eea ("net: Add Geneve tunneling protocol driver")
CC: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geneve does not currently set the inner protocol type when
transmitting packets. This causes GSO segmentation to fail on NICs
that do not support Geneve offloading.
CC: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The return value of seq_printf() is soon to be removed. Remove the
checks from seq_printf() in favor of seq_has_overflowed().
Link: http://lkml.kernel.org/r/20141104142236.GA10239@salvia
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: netfilter-devel@vger.kernel.org
Cc: coreteam@netfilter.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since adding a new function to seq_file (seq_has_overflowed())
there isn't any value for functions called from seq_show to
return anything. Remove the int returns of the various
print_tuple/<foo>_print_tuple functions.
Link: http://lkml.kernel.org/p/f2e8cf8df433a197daa62cbaf124c900c708edc7.1412031505.git.joe@perches.com
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: netfilter-devel@vger.kernel.org
Cc: coreteam@netfilter.org
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The seq_printf() and friends are having their return values removed.
The print_conntrack() returns the result of seq_printf(), which is
meaningless when seq_printf() returns void. Might as well remove the
return values of print_conntrack() as well.
Link: http://lkml.kernel.org/r/20141029220107.465008329@goodmis.org
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: netfilter-devel@vger.kernel.org
Cc: coreteam@netfilter.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch allows to set ECN on a per-route basis in case the sysctl
tcp_ecn is not set to 1. In other words, when ECN is set for specific
routes, it provides a tcp_ecn=1 behaviour for that route while the rest
of the stack acts according to the global settings.
One can use 'ip route change dev $dev $net features ecn' to toggle this.
Having a more fine-grained per-route setting can be beneficial for various
reasons, for example, 1) within data centers, or 2) local ISPs may deploy
ECN support for their own video/streaming services [1], etc.
There was a recent measurement study/paper [2] which scanned the Alexa's
publicly available top million websites list from a vantage point in US,
Europe and Asia:
Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely
blamed to commit 255cac91c3 ("tcp: extend ECN sysctl to allow server-side
only ECN") ;)); the break in connectivity on-path was found is about
1 in 10,000 cases. Timeouts rather than receiving back RSTs were much
more common in the negotiation phase (and mostly seen in the Alexa
middle band, ranks around 50k-150k): from 12-thousand hosts on which
there _may_ be ECN-linked connection failures, only 79 failed with RST
when _not_ failing with RST when ECN is not requested.
It's unclear though, how much equipment in the wild actually marks CE
when buffers start to fill up.
We thought about a fallback to non-ECN for retransmitted SYNs as another
global option (which could perhaps one day be made default), but as Eric
points out, there's much more work needed to detect broken middleboxes.
Two examples Eric mentioned are buggy firewalls that accept only a single
SYN per flow, and middleboxes that successfully let an ECN flow establish,
but later mark CE for all packets (so cwnd converges to 1).
[1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
[2] http://ecn.ethz.ch/
Joint work with Daniel Borkmann.
Reference: http://thread.gmane.org/gmane.linux.network/335797
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function cookie_check_timestamp(), both called from IPv4/6 context,
is being used to decode the echoed timestamp from the SYN/ACK into TCP
options used for follow-up communication with the peer.
We can remove ECN handling from that function, split it into a separate
one, and simply rename the original function into cookie_decode_options().
cookie_decode_options() just fills in tcp_option struct based on the
echoed timestamp received from the peer. Anything that fails in this
function will actually discard the request socket.
While this is the natural place for decoding options such as ECN which
commit 172d69e63c ("syncookies: add support for ECN") added, we argue
that in particular for ECN handling, it can be checked at a later point
in time as the request sock would actually not need to be dropped from
this, but just ECN support turned off.
Therefore, we split this functionality into cookie_ecn_ok(), which tells
us if the timestamp indicates ECN support AND the tcp_ecn sysctl is enabled.
This prepares for per-route ECN support: just looking at the tcp_ecn sysctl
won't be enough anymore at that point; if the timestamp indicates ECN
and sysctl tcp_ecn == 0, we will also need to check the ECN dst metric.
This would mean adding a route lookup to cookie_check_timestamp(), which
we definitely want to avoid. As we already do a route lookup at a later
point in cookie_{v4,v6}_check(), we can simply make use of that as well
for the new cookie_ecn_ok() function w/o any additional cost.
Joint work with Daniel Borkmann.
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Was a bit more difficult to read than needed due to magic shifts;
add defines and document the used encoding scheme.
Joint work with Daniel Borkmann.
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
else is unnecessary after return 0 in __udp4_lib_rcv()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
remove __inline__ / inline and let compiler decide what to do
with static functions
Inspired-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
netfilter/ipvs fixes for net
The following patchset contains fixes for netfilter/ipvs. This round of
fixes is larger than usual at this stage, specifically because of the
nf_tables bridge reject fixes that I would like to see in 3.18. The
patches are:
1) Fix a null-pointer dereference that may occur when logging
errors. This problem was introduced by 4a4739d56b ("ipvs: Pull
out crosses_local_route_boundary logic") in v3.17-rc5.
2) Update hook mask in nft_reject_bridge so we can also filter out
packets from there. This fixes 36d2af5 ("netfilter: nf_tables: allow
to filter from prerouting and postrouting"), which needs this chunk
to work.
3) Two patches to refactor common code to forge the IPv4 and IPv6
reject packets from the bridge. These are required by the nf_tables
reject bridge fix.
4) Fix nft_reject_bridge by avoiding the use of the IP stack to reject
packets from the bridge. The idea is to forge the reject packets and
inject them to the original port via br_deliver() which is now
exported for that purpose.
5) Restrict nft_reject_bridge to bridge prerouting and input hooks.
the original skbuff may cloned after prerouting when the bridge stack
needs to flood it to several bridge ports, it is too late to reject
the traffic.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
That can be reused by the reject bridge expression to build the reject
packet. The new functions are:
* nf_reject_ip_tcphdr_get(): to sanitize and to obtain the TCP header.
* nf_reject_iphdr_put(): to build the IPv4 header.
* nf_reject_ip_tcphdr_put(): to build the TCP header.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Some drivers are unable to perform TX completions in a bound time.
They instead call skb_orphan()
Problem is skb_fclone_busy() has to detect this case, otherwise
we block TCP retransmits and can freeze unlucky tcp sessions on
mostly idle hosts.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 1f3279ae0c ("tcp: avoid retransmits of TCP packets hanging in host queues")
Signed-off-by: David S. Miller <davem@davemloft.net>
Challenge ACK is described in RFC 5961, fix typo.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, skb_inner_network_header is used but this does not account
for Ethernet header for ETH_P_TEB. Use skb_inner_mac_header which
handles TEB and also should work with IP encapsulation in which case
inner mac and inner network headers are the same.
Tested: Ran TCP_STREAM over GRE, worked as expected.
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we cache them, the kernel will reuse them, independently of
whether forwarding is enabled or not. Which means that if forwarding is
disabled on the input interface where the first routing request comes
from, then that unreachable result will be cached and reused for
other interfaces, even if forwarding is enabled on them. The opposite
is also true.
This can be verified with two interfaces A and B and an output interface
C, where B has forwarding enabled, but not A and trying
ip route get $dst iif A from $src && ip route get $dst iif B from $src
Signed-off-by: Nicolas Cavallari <nicolas.cavallari@green-communications.fr>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only count packets that failed cookie-authentication.
We can get SYNCOOKIESFAILED > 0 while we never even sent a single cookie.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, despite the comment right before the function,
nf_log_register allows registering two loggers on with the same type and
end up overwriting the previous register.
Not a real issue today as current tree doesn't have two loggers for the
same type but it's better to get this protected.
Also make sure that all of its callers do error checking.
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The WARN_ON in inet_evict_bucket can be triggered by a valid case:
inet_frag_kill and inet_evict_bucket can be running in parallel on the
same queue which means that there has been at least one more ref added
by a previous inet_frag_find call, but inet_frag_kill can delete the
timer before inet_evict_bucket which will cause the WARN_ON() there to
trigger since we'll have refcnt!=1. Now, this case is valid because the
queue is being "killed" for some reason (removed from the chain list and
its timer deleted) so it will get destroyed in the end by one of the
inet_frag_put() calls which reaches 0 i.e. refcnt is still valid.
CC: Florian Westphal <fw@strlen.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McLean <chutzpah@gentoo.org>
Fixes: b13d3cbfb8 ("inet: frag: move eviction of queues to work queue")
Reported-by: Patrick McLean <chutzpah@gentoo.org>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the evictor is running it adds some chosen frags to a local list to
be evicted once the chain lock has been released but at the same time
the *frag_queue can be running for some of the same queues and it
may call inet_frag_kill which will wait on the chain lock and
will then delete the queue from the wrong list since it was added in the
eviction one. The fix is simple - check if the queue has the evict flag
set under the chain lock before deleting it, this is safe because the
evict flag is set only under that lock and having the flag set also means
that the queue has been detached from the chain list, so no need to delete
it again.
An important note to make is that we're safe w.r.t refcnt because
inet_frag_kill and inet_evict_bucket will sync on the del_timer operation
where only one of the two can succeed (or if the timer is executing -
none of them), the cases are:
1. inet_frag_kill succeeds in del_timer
- then the timer ref is removed, but inet_evict_bucket will not add
this queue to its expire list but will restart eviction in that chain
2. inet_evict_bucket succeeds in del_timer
- then the timer ref is kept until the evictor "expires" the queue, but
inet_frag_kill will remove the initial ref and will set
INET_FRAG_COMPLETE which will make the frag_expire fn just to remove
its ref.
In the end all of the queue users will do an inet_frag_put and the one
that reaches 0 will free it. The refcount balance should be okay.
CC: Florian Westphal <fw@strlen.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McLean <chutzpah@gentoo.org>
Fixes: b13d3cbfb8 ("inet: frag: move eviction of queues to work queue")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Patrick McLean <chutzpah@gentoo.org>
Tested-by: Patrick McLean <chutzpah@gentoo.org>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
While testing upcoming Yaogong patch (converting out of order queue
into an RB tree), I hit the max reordering level of linux TCP stack.
Reordering level was limited to 127 for no good reason, and some
network setups [1] can easily reach this limit and get limited
throughput.
Allow a new max limit of 300, and add a sysctl to allow admins to even
allow bigger (or lower) values if needed.
[1] Aggregation of links, per packet load balancing, fabrics not doing
deep packet inspections, alternative TCP congestion modules...
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yaogong Wang <wygivan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This new expression provides NAT in the redirect flavour, which is to
redirect packets to local machine.
Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch refactors the IPv4 code so it can be usable both from xt and
nf_tables.
A similar patch follows-up to handle IPv6.
Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
unsigned char *sha (source) was already in original git version
but was never used.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
percpu tcp_md5sig_pool contains memory blobs that ultimately
go through sg_set_buf().
-> sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
This requires that whole area is in a physically contiguous portion
of memory. And that @buf is not backed by vmalloc().
Given that alloc_percpu() can use vmalloc() areas, this does not
fit the requirements.
Replace alloc_percpu() by a static DEFINE_PER_CPU() as tcp_md5sig_pool
is small anyway, there is no gain to dynamically allocate it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 765cf9976e ("tcp: md5: remove one indirection level in tcp_md5sig_pool")
Reported-by: Crestez Dan Leonard <cdleonard@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit "net: Save TX flow hash in sock and set in skbuf on xmit"
introduced the inet_set_txhash() and ip6_set_txhash() routines to calculate
and record flow hash(sk_txhash) in the socket structure. sk_txhash is used
to set skb->hash which is used to spread flows across multiple TXQs.
But, the above routines are invoked before the source port of the connection
is created. Because of this all outgoing connections that just differ in the
source port get hashed into the same TXQ.
This patch fixes this problem for IPv4/6 by invoking the the above routines
after the source port is available for the socket.
Fixes: b73c3d0e4("net: Save TX flow hash in sock and set in skbuf on xmit")
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_gso_segment has three possible return values:
1. a pointer to the first segmented skb
2. an errno value (IS_ERR())
3. NULL. This can happen when GSO is used for header verification.
However, several callers currently test IS_ERR instead of IS_ERR_OR_NULL
and would oops when NULL is returned.
Note that these call sites should never actually see such a NULL return
value; all callers mask out the GSO bits in the feature argument.
However, there have been issues with some protocol handlers erronously not
respecting the specified feature mask in some cases.
It is preferable to get 'have to turn off hw offloading, else slow' reports
rather than 'kernel crashes'.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_gso_segment() has a 'features' argument representing offload features
available to the output path.
A few handlers, e.g. GRE, instead re-fetch the features of skb->dev and use
those instead of the provided ones when handing encapsulation/tunnels.
Depending on dev->hw_enc_features of the output device skb_gso_segment() can
then return NULL even when the caller has disabled all GSO feature bits,
as segmentation of inner header thinks device will take care of segmentation.
This e.g. affects the tbf scheduler, which will silently drop GRE-encap GSO skbs
that did not fit the remaining token quota as the segmentation does not work
when device supports corresponding hw offload capabilities.
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
netfilter fixes for net
The following patchset contains netfilter fixes for your net tree,
they are:
1) Fix missing MODULE_LICENSE() in the new nf_reject_ipv{4,6} modules.
2) Restrict nat and masq expressions to the nat chain type. Otherwise,
users may crash their kernel if they attach a nat/masq rule to a non
nat chain.
3) Fix hook validation in nft_compat when non-base chains are used.
Basically, initialize hook_mask to zero.
4) Make sure you use match/targets in nft_compat from the right chain
type. The existing validation relies on the table name which can be
avoided by
5) Better netlink attribute validation in nft_nat. This expression has
to reject the configuration when no address and proto configurations
are specified.
6) Interpret NFTA_NAT_REG_*_MAX if only if NFTA_NAT_REG_*_MIN is set.
Yet another sanity check to reject incorrect configurations from
userspace.
7) Conditional NAT attribute dumping depending on the existing
configuration.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"A quick batch of bug fixes:
1) Fix build with IPV6 disabled, from Eric Dumazet.
2) Several more cases of caching SKB data pointers across calls to
pskb_may_pull(), thus referencing potentially free'd memory. From
Li RongQing.
3) DSA phy code tests operation presence improperly, instead of going:
if (x->ops->foo)
r = x->ops->foo(args);
it was going:
if (x->ops->foo(args))
r = x->ops->foo(args);
Fix from Andew Lunn"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
Net: DSA: Fix checking for get_phy_flags function
ipv6: fix a potential use after free in sit.c
ipv6: fix a potential use after free in ip6_offload.c
ipv4: fix a potential use after free in gre_offload.c
tcp: fix build error if IPv6 is not enabled
pskb_may_pull() may change skb->data and make greh pointer oboslete;
so need to reassign greh;
but since first calling pskb_may_pull already ensured that skb->data
has enough space for greh, so move the reference of greh before second
calling pskb_may_pull(), to avoid reassign greh.
Fixes: 7a7ffbabf9("ipv4: fix tunneled VM traffic over hw VXLAN/GRE GSO NIC")
Cc: Wei-Chun Chao <weichunc@plumgrid.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Include fixes for netrom and dsa (Fabian Frederick and Florian
Fainelli)
2) Fix FIXED_PHY support in stmmac, from Giuseppe CAVALLARO.
3) Several SKB use after free fixes (vxlan, openvswitch, vxlan,
ip_tunnel, fou), from Li ROngQing.
4) fec driver PTP support fixes from Luwei Zhou and Nimrod Andy.
5) Use after free in virtio_net, from Michael S Tsirkin.
6) Fix flow mask handling for megaflows in openvswitch, from Pravin B
Shelar.
7) ISDN gigaset and capi bug fixes from Tilman Schmidt.
8) Fix route leak in ip_send_unicast_reply(), from Vasily Averin.
9) Fix two eBPF JIT bugs on x86, from Alexei Starovoitov.
10) TCP_SKB_CB() reorganization caused a few regressions, fixed by Cong
Wang and Eric Dumazet.
11) Don't overwrite end of SKB when parsing malformed sctp ASCONF
chunks, from Daniel Borkmann.
12) Don't call sock_kfree_s() with NULL pointers, this function also has
the side effect of adjusting the socket memory usage. From Cong Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
bna: fix skb->truesize underestimation
net: dsa: add includes for ethtool and phy_fixed definitions
openvswitch: Set flow-key members.
netrom: use linux/uaccess.h
dsa: Fix conversion from host device to mii bus
tipc: fix bug in bundled buffer reception
ipv6: introduce tcp_v6_iif()
sfc: add support for skb->xmit_more
r8152: return -EBUSY for runtime suspend
ipv4: fix a potential use after free in fou.c
ipv4: fix a potential use after free in ip_tunnel_core.c
hyperv: Add handling of IP header with option field in netvsc_set_hash()
openvswitch: Create right mask with disabled megaflows
vxlan: fix a free after use
openvswitch: fix a use after free
ipv4: dst_entry leak in ip_send_unicast_reply()
ipv4: clean up cookie_v4_check()
ipv4: share tcp_v4_save_options() with cookie_v4_check()
ipv4: call __ip_options_echo() in cookie_v4_check()
atm: simplify lanai.c by using module_pci_driver
...
pskb_may_pull() maybe change skb->data and make uh pointer oboslete,
so reload uh and guehdr
Fixes: 37dd0247 ("gue: Receive side for Generic UDP Encapsulation")
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pskb_may_pull() maybe change skb->data and make eth pointer oboslete,
so set eth after pskb_may_pull()
Fixes:3d7b46cd("ip_tunnel: push generic protocol handling to ip_tunnel module")
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_setup_cork() called inside ip_append_data() steals dst entry from rt to cork
and in case errors in __ip_append_data() nobody frees stolen dst entry
Fixes: 2e77d89b2f ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()")
Signed-off-by: Vasily Averin <vvs@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can retrieve opt from skb, no need to pass it as a parameter.
And opt should always be non-NULL, no need to check.
Cc: Krzysztof Kolasa <kkolasa@winsoft.pl>
Cc: Eric Dumazet <edumazet@google.com>
Tested-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cookie_v4_check() allocates ip_options_rcu in the same way
with tcp_v4_save_options(), we can just make it a helper function.
Cc: Krzysztof Kolasa <kkolasa@winsoft.pl>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 971f10eca1 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
missed that cookie_v4_check() still calls ip_options_echo() which uses
IPCB(). It should use TCPCB() at TCP layer, so call __ip_options_echo()
instead.
Fixes: commit 971f10eca1 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Cc: Krzysztof Kolasa <kkolasa@winsoft.pl>
Cc: Eric Dumazet <edumazet@google.com>
Reported-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
Tested-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull percpu consistent-ops changes from Tejun Heo:
"Way back, before the current percpu allocator was implemented, static
and dynamic percpu memory areas were allocated and handled separately
and had their own accessors. The distinction has been gone for many
years now; however, the now duplicate two sets of accessors remained
with the pointer based ones - this_cpu_*() - evolving various other
operations over time. During the process, we also accumulated other
inconsistent operations.
This pull request contains Christoph's patches to clean up the
duplicate accessor situation. __get_cpu_var() uses are replaced with
with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().
Unfortunately, the former sometimes is tricky thanks to C being a bit
messy with the distinction between lvalues and pointers, which led to
a rather ugly solution for cpumask_var_t involving the introduction of
this_cpu_cpumask_var_ptr().
This converts most of the uses but not all. Christoph will follow up
with the remaining conversions in this merge window and hopefully
remove the obsolete accessors"
* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
irqchip: Properly fetch the per cpu offset
percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
Revert "powerpc: Replace __get_cpu_var uses"
percpu: Remove __this_cpu_ptr
clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
sparc: Replace __get_cpu_var uses
avr32: Replace __get_cpu_var with __this_cpu_write
blackfin: Replace __get_cpu_var uses
tile: Use this_cpu_ptr() for hardware counters
tile: Replace __get_cpu_var uses
powerpc: Replace __get_cpu_var uses
alpha: Replace __get_cpu_var
ia64: Replace __get_cpu_var uses
s390: cio driver &__get_cpu_var replacements
s390: Replace __get_cpu_var uses
mips: Replace __get_cpu_var uses
MIPS: Replace __get_cpu_var uses in FPU emulator.
arm: Replace __this_cpu_ptr with raw_cpu_ptr
...
TCP Small queues tries to keep number of packets in qdisc
as small as possible, and depends on a tasklet to feed following
packets at TX completion time.
Choice of tasklet was driven by latencies requirements.
Then, TCP stack tries to avoid reorders, by locking flows with
outstanding packets in qdisc in a given TX queue.
What can happen is that many flows get attracted by a low performing
TX queue, and cpu servicing TX completion has to feed packets for all of
them, making this cpu 100% busy in softirq mode.
This became particularly visible with latest skb->xmit_more support
Strategy adopted in this patch is to detect when tcp_wfree() is called
from ksoftirqd and let the outstanding queue for this flow being drained
before feeding additional packets, so that skb->ooo_okay can be set
to allow select_queue() to select the optimal queue :
Incoming ACKS are normally handled by different cpus, so this patch
gives more chance for these cpus to take over the burden of feeding
qdisc with future packets.
Tested:
lpaa23:~# ./super_netperf 1400 --google-pacing-rate 3028000 -H lpaa24 -l 3600 &
lpaa23:~# sar -n DEV 1 10 | grep eth1
06:16:18 AM eth1 595448.00 1190564.00 38381.09 1760253.12 0.00 0.00 1.00
06:16:19 AM eth1 594858.00 1189686.00 38340.76 1758952.72 0.00 0.00 0.00
06:16:20 AM eth1 597017.00 1194019.00 38480.79 1765370.29 0.00 0.00 1.00
06:16:21 AM eth1 595450.00 1190936.00 38380.19 1760805.05 0.00 0.00 0.00
06:16:22 AM eth1 596385.00 1193096.00 38442.56 1763976.29 0.00 0.00 1.00
06:16:23 AM eth1 598155.00 1195978.00 38552.97 1768264.60 0.00 0.00 0.00
06:16:24 AM eth1 594405.00 1188643.00 38312.57 1757414.89 0.00 0.00 1.00
06:16:25 AM eth1 593366.00 1187154.00 38252.16 1755195.83 0.00 0.00 0.00
06:16:26 AM eth1 593188.00 1186118.00 38232.88 1753682.57 0.00 0.00 1.00
06:16:27 AM eth1 596301.00 1192241.00 38440.94 1762733.09 0.00 0.00 0.00
Average: eth1 595457.30 1190843.50 38381.69 1760664.84 0.00 0.00 0.50
lpaa23:~# ./tc -s -d qd sh dev eth1 | grep backlog
backlog 7606336b 2513p requeues 167982
backlog 224072b 74p requeues 566
backlog 581376b 192p requeues 5598
backlog 181680b 60p requeues 1070
backlog 5305056b 1753p requeues 110166 // Here, this TX queue is attracting flows
backlog 157456b 52p requeues 1758
backlog 672216b 222p requeues 3025
backlog 60560b 20p requeues 24541
backlog 448144b 148p requeues 21258
lpaa23:~# echo 1 >/proc/sys/net/ipv4/tcp_tsq_enable_tcp_wfree_ksoftirqd_detect
Immediate jump to full bandwidth, and traffic is properly
shard on all tx queues.
lpaa23:~# sar -n DEV 1 10 | grep eth1
06:16:46 AM eth1 1397632.00 2795397.00 90081.87 4133031.26 0.00 0.00 1.00
06:16:47 AM eth1 1396874.00 2793614.00 90032.99 4130385.46 0.00 0.00 0.00
06:16:48 AM eth1 1395842.00 2791600.00 89966.46 4127409.67 0.00 0.00 1.00
06:16:49 AM eth1 1395528.00 2791017.00 89946.17 4126551.24 0.00 0.00 0.00
06:16:50 AM eth1 1397891.00 2795716.00 90098.74 4133497.39 0.00 0.00 1.00
06:16:51 AM eth1 1394951.00 2789984.00 89908.96 4125022.51 0.00 0.00 0.00
06:16:52 AM eth1 1394608.00 2789190.00 89886.90 4123851.36 0.00 0.00 1.00
06:16:53 AM eth1 1395314.00 2790653.00 89934.33 4125983.09 0.00 0.00 0.00
06:16:54 AM eth1 1396115.00 2792276.00 89984.25 4128411.21 0.00 0.00 1.00
06:16:55 AM eth1 1396829.00 2793523.00 90030.19 4130250.28 0.00 0.00 0.00
Average: eth1 1396158.40 2792297.00 89987.09 4128439.35 0.00 0.00 0.50
lpaa23:~# tc -s -d qd sh dev eth1 | grep backlog
backlog 7900052b 2609p requeues 173287
backlog 878120b 290p requeues 589
backlog 1068884b 354p requeues 5621
backlog 996212b 329p requeues 1088
backlog 984100b 325p requeues 115316
backlog 956848b 316p requeues 1781
backlog 1080996b 357p requeues 3047
backlog 975016b 322p requeues 24571
backlog 990156b 327p requeues 21274
(All 8 TX queues get a fair share of the traffic)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_nh_match does not match nexthops correctly. Example:
ip route add 172.16.10/24 nexthop via 192.168.122.12 dev eth0 \
nexthop via 192.168.122.13 dev eth0
ip route del 172.16.10/24 nexthop via 192.168.122.14 dev eth0 \
nexthop via 192.168.122.15 dev eth0
Del command is successful and route is removed. After this patch
applied, the route is correctly matched and result is:
RTNETLINK answers: No such process
Please consider this for stable trees as well.
Fixes: 4e902c5741 ("[IPv4]: FIB configuration using struct fib_config")
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We worked hard to improve tcp_ack() performance, by not accessing
skb_shinfo() in fast path (cd7d8498c9 tcp: change tcp_skb_pcount()
location)
We still have one spurious access because of ACK timestamping,
added in commit e1c8a607b2 ("net-timestamp: ACK timestamp for
bytestreams")
By checking if sk_tsflags has SOF_TIMESTAMPING_TX_ACK set,
we can avoid two cache line misses for the common case.
While we are at it, add two prefetchw() :
One in tcp_ack() to bring skb at the head of write queue.
One in tcp_clean_rtx_queue() loop to bring following skb,
as we will delete skb from the write queue and dirty skb->next->prev.
Add a couple of [un]likely() clauses.
After this patch, tcp_ack() is no longer the most consuming
function in tcp stack.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP Small Queues (tcp_tsq_handler()) can hold one reference on
sk->sk_wmem_alloc, preventing skb->ooo_okay being set.
We should relax test done to set skb->ooo_okay to take care
of this extra reference.
Minimal truesize of skb containing one byte of payload is
SKB_TRUESIZE(1)
Without this fix, we have more chance locking flows into the wrong
transmit queue.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds the missing validation code to avoid the use of nat/masq from
non-nat chains. The validation assumes two possible configuration
scenarios:
1) Use of nat from base chain that is not of nat type. Reject this
configuration from the nft_*_init() path of the expression.
2) Use of nat from non-base chain. In this case, we have to wait until
the non-base chain is referenced by at least one base chain via
jump/goto. This is resolved from the nft_*_validate() path which is
called from nf_tables_check_loops().
The user gets an -EOPNOTSUPP in both cases.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull percpu updates from Tejun Heo:
"A lot of activities on percpu front. Notable changes are...
- percpu allocator now can take @gfp. If @gfp doesn't contain
GFP_KERNEL, it tries to allocate from what's already available to
the allocator and a work item tries to keep the reserve around
certain level so that these atomic allocations usually succeed.
This will replace the ad-hoc percpu memory pool used by
blk-throttle and also be used by the planned blkcg support for
writeback IOs.
Please note that I noticed a bug in how @gfp is interpreted while
preparing this pull request and applied the fix 6ae833c7fe
("percpu: fix how @gfp is interpreted by the percpu allocator")
just now.
- percpu_ref now uses longs for percpu and global counters instead of
ints. It leads to more sparse packing of the percpu counters on
64bit machines but the overhead should be negligible and this
allows using percpu_ref for refcnting pages and in-memory objects
directly.
- The switching between percpu and single counter modes of a
percpu_ref is made independent of putting the base ref and a
percpu_ref can now optionally be initialized in single or killed
mode. This allows avoiding percpu shutdown latency for cases where
the refcounted objects may be synchronously created and destroyed
in rapid succession with only a fraction of them reaching fully
operational status (SCSI probing does this when combined with
blk-mq support). It's also planned to be used to implement forced
single mode to detect underflow more timely for debugging.
There's a separate branch percpu/for-3.18-consistent-ops which cleans
up the duplicate percpu accessors. That branch causes a number of
conflicts with s390 and other trees. I'll send a separate pull
request w/ resolutions once other branches are merged"
* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
percpu: fix how @gfp is interpreted by the percpu allocator
blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
percpu_ref: add PERCPU_REF_INIT_* flags
percpu_ref: decouple switching to percpu mode and reinit
percpu_ref: decouple switching to atomic mode and killing
percpu_ref: add PCPU_REF_DEAD
percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
percpu_ref: replace pcpu_ prefix with percpu_
percpu_ref: minor code and comment updates
percpu_ref: relocate percpu_ref_reinit()
Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
Revert "percpu: free percpu allocation info for uniprocessor system"
percpu-refcount: make percpu_ref based on longs instead of ints
percpu-refcount: improve WARN messages
percpu: fix locking regression in the failure path of pcpu_alloc()
percpu-refcount: add @gfp to percpu_ref_init()
proportions: add @gfp to init functions
percpu_counter: add @gfp to percpu_counter_init()
percpu_counter: make percpu_counters_lock irq-safe
...
Pull networking updates from David Miller:
"Most notable changes in here:
1) By far the biggest accomplishment, thanks to a large range of
contributors, is the addition of multi-send for transmit. This is
the result of discussions back in Chicago, and the hard work of
several individuals.
Now, when the ->ndo_start_xmit() method of a driver sees
skb->xmit_more as true, it can choose to defer the doorbell
telling the driver to start processing the new TX queue entires.
skb->xmit_more means that the generic networking is guaranteed to
call the driver immediately with another SKB to send.
There is logic added to the qdisc layer to dequeue multiple
packets at a time, and the handling mis-predicted offloads in
software is now done with no locks held.
Finally, pktgen is extended to have a "burst" parameter that can
be used to test a multi-send implementation.
Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4,
virtio_net
Adding support is almost trivial, so export more drivers to
support this optimization soon.
I want to thank, in no particular or implied order, Jesper
Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal
Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann,
David Tat, Hannes Frederic Sowa, and Rusty Russell.
2) PTP and timestamping support in bnx2x, from Michal Kalderon.
3) Allow adjusting the rx_copybreak threshold for a driver via
ethtool, and add rx_copybreak support to enic driver. From
Govindarajulu Varadarajan.
4) Significant enhancements to the generic PHY layer and the bcm7xxx
driver in particular (EEE support, auto power down, etc.) from
Florian Fainelli.
5) Allow raw buffers to be used for flow dissection, allowing drivers
to determine the optimal "linear pull" size for devices that DMA
into pools of pages. The objective is to get exactly the
necessary amount of headers into the linear SKB area pre-pulled,
but no more. The new interface drivers use is eth_get_headlen().
From WANG Cong, with driver conversions (several had their own
by-hand duplicated implementations) by Alexander Duyck and Eric
Dumazet.
6) Support checksumming more smoothly and efficiently for
encapsulations, and add "foo over UDP" facility. From Tom
Herbert.
7) Add Broadcom SF2 switch driver to DSA layer, from Florian
Fainelli.
8) eBPF now can load programs via a system call and has an extensive
testsuite. Alexei Starovoitov and Daniel Borkmann.
9) Major overhaul of the packet scheduler to use RCU in several major
areas such as the classifiers and rate estimators. From John
Fastabend.
10) Add driver for Intel FM10000 Ethernet Switch, from Alexander
Duyck.
11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric
Dumazet.
12) Add Datacenter TCP congestion control algorithm support, From
Florian Westphal.
13) Reorganize sk_buff so that __copy_skb_header() is significantly
faster. From Eric Dumazet"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits)
netlabel: directly return netlbl_unlabel_genl_init()
net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers
net: description of dma_cookie cause make xmldocs warning
cxgb4: clean up a type issue
cxgb4: potential shift wrapping bug
i40e: skb->xmit_more support
net: fs_enet: Add NAPI TX
net: fs_enet: Remove non NAPI RX
r8169:add support for RTL8168EP
net_sched: copy exts->type in tcf_exts_change()
wimax: convert printk to pr_foo()
af_unix: remove 0 assignment on static
ipv6: Do not warn for informational ICMP messages, regardless of type.
Update Intel Ethernet Driver maintainers list
bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
tipc: fix bug in multicast congestion handling
net: better IFF_XMIT_DST_RELEASE support
net/mlx4_en: remove NETDEV_TX_BUSY
3c59x: fix bad split of cpu_to_le32(pci_map_single())
net: bcmgenet: fix Tx ring priority programming
...
1/ Step down as dmaengine maintainer see commit 08223d80df "dmaengine
maintainer update"
2/ Removal of net_dma, as it has been marked 'broken' since 3.13 (commit
7787380336 "net_dma: mark broken"), without reports of performance
regression.
3/ Miscellaneous fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUKDLKAAoJEB7SkWpmfYgC7wwP/iNHqRjf1suMUTBIF3P6Hgbe
VCUwh0IkuujMPDG46WRn6cYzarRxVPLoGaLHLPszgjI6pmGPVv19wqeDOlUxtcmr
0iQWEWv/zqseaAIW+4gj/WYCyMgKil49EUBJKCZCfNmIaad+e0pr8f0uE5yOkHPM
tqWoZERu9A4dlXGr1TjeOZVzdnPrCt92MrLDN6ZZ6tMuJaEc5PauaLxKTeGy5fYj
UB+k1xJQzECbsYfpB+uCVYl5/qPO1rNyuBYS8THCsW+JYmrbbfH2kkF2lo2FaUpO
8Yd50FtzXHKWwAt7BzfIwU2M7x0wRmryrC/xsQi6M+WmVeHYvvHUIpzaA66xRZ5x
fCy3Fu8sEnmnmboAbh2v2c5uTycqRl2xPzbpLAuxglloXIxzi3ckp6ESF/Z4SldH
oxIoEievN7lah3vKgvlHZYcWDzrYr8EKf/EzFe9RqDBQDKtzDzre1H9Uivr387Vm
uFUcGHYG/GXuX47C7EUsMtaSW2UEoR2ytw/HR6CKFPTVXwAzEO6kA9vg0EqL0iIq
2wVLgavlZuwegmaUBgnr+bgVZMvVN7OU7fAIRVe5xNO6itrPKvheSlQthmRiiq9C
uzOu4PS6PexqzHUNPCcJpCsj+lawmCSrE0bxtPzTA/CQInVgWs219V9+W5Gn/0YA
EARN9k6ueX9PZPQrPQLm
=BBBv
-----END PGP SIGNATURE-----
Merge tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine
Pull dmaengine updates from Dan Williams:
"Even though this has fixes marked for -stable, given the size and the
needed conflict resolutions this is 3.18-rc1/merge-window material.
These patches have been languishing in my tree for a long while. The
fact that I do not have the time to do proper/prompt maintenance of
this tree is a primary factor in the decision to step down as
dmaengine maintainer. That and the fact that the bulk of drivers/dma/
activity is going through Vinod these days.
The net_dma removal has not been in -next. It has developed simple
conflicts against mainline and net-next (for-3.18).
Continuing thanks to Vinod for staying on top of drivers/dma/.
Summary:
1/ Step down as dmaengine maintainer see commit 08223d80df
"dmaengine maintainer update"
2/ Removal of net_dma, as it has been marked 'broken' since 3.13
(commit 7787380336 "net_dma: mark broken"), without reports of
performance regression.
3/ Miscellaneous fixes"
* tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine:
net: make tcp_cleanup_rbuf private
net_dma: revert 'copied_early'
net_dma: simple removal
dmaengine maintainer update
dmatest: prevent memory leakage on error path in thread
ioat: Use time_before_jiffies()
dmaengine: fix xor sources continuation
dma: mv_xor: Rename __mv_xor_slot_cleanup() to mv_xor_slot_cleanup()
dma: mv_xor: Remove all callers of mv_xor_slot_cleanup()
dma: mv_xor: Remove unneeded mv_xor_clean_completed_slots() call
ioat: Use pci_enable_msix_exact() instead of pci_enable_msix()
drivers: dma: Include appropriate header file in dca.c
drivers: dma: Mark functions as static in dma_v3.c
dma: mv_xor: Add DMA API error checks
ioat/dca: Use dev_is_pci() to check whether it is pci device
Testing xmit_more support with netperf and connected UDP sockets,
I found strange dst refcount false sharing.
Current handling of IFF_XMIT_DST_RELEASE is not optimal.
Dropping dst in validate_xmit_skb() is certainly too late in case
packet was queued by cpu X but dequeued by cpu Y
The logical point to take care of drop/force is in __dev_queue_xmit()
before even taking qdisc lock.
As Julian Anastasov pointed out, need for skb_dst() might come from some
packet schedulers or classifiers.
This patch adds new helper to cleanly express needs of various drivers
or qdiscs/classifiers.
Drivers that need skb_dst() in their ndo_start_xmit() should call
following helper in their setup instead of the prior :
dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
->
netif_keep_dst(dev);
Instead of using a single bit, we use two bits, one being
eventually rebuilt in bonding/team drivers.
The other one, is permanent and blocks IFF_XMIT_DST_RELEASE being
rebuilt in bonding/team. Eventually, we could add something
smarter later.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a openvswitch compilation error when CONFIG_INET is not set:
=====================================================
In file included from include/net/geneve.h:4:0,
from net/openvswitch/flow_netlink.c:45:
include/net/udp_tunnel.h: In function 'udp_tunnel_handle_offloads':
>> include/net/udp_tunnel.h💯2: error: implicit declaration of function 'iptunnel_handle_offloads' [-Werror=implicit-function-declaration]
>> return iptunnel_handle_offloads(skb, udp_csum, type);
>> ^
>> >> include/net/udp_tunnel.h💯2: warning: return makes pointer from integer without a cast
>> >> cc1: some warnings being treated as errors
=====================================================
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case we find a general query with non-zero number of sources, we
are dropping the skb as it's malformed.
RFC3376, section 4.1.8. Number of Sources (N):
This number is zero in a General Query or a Group-Specific Query,
and non-zero in a Group-and-Source-Specific Query.
Therefore, reflect that by using kfree_skb() instead of consume_skb().
Fixes: d679c5324d ("igmp: avoid drop_monitor false positives")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a device level support for Geneve -- Generic Network
Virtualization Encapsulation. The protocol is documented at
http://tools.ietf.org/html/draft-gross-geneve-01
Only protocol layer Geneve support is provided by this driver.
Openvswitch can be used for configuring, set up and tear down
functional Geneve tunnels.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains another batch with Netfilter/IPVS updates
for net-next, they are:
1) Add abstracted ICMP codes to the nf_tables reject expression. We
introduce four reasons to reject using ICMP that overlap in IPv4
and IPv6 from the semantic point of view. This should simplify the
maintainance of dual stack rule-sets through the inet table.
2) Move nf_send_reset() functions from header files to per-family
nf_reject modules, suggested by Patrick McHardy.
3) We have to use IS_ENABLED(CONFIG_BRIDGE_NETFILTER) everywhere in the
code now that br_netfilter can be modularized. Convert remaining spots
in the network stack code.
4) Use rcu_barrier() in the nf_tables module removal path to ensure that
we don't leave object that are still pending to be released via
call_rcu (that may likely result in a crash).
5) Remove incomplete arch 32/64 compat from nft_compat. The original (bad)
idea was to probe the word size based on the xtables match/target info
size, but this assumption is wrong when you have to dump the information
back to userspace.
6) Allow to filter from prerouting and postrouting in the nf_tables bridge.
In order to emulate the ebtables NAT chains (which are actually simple
filter chains with no special semantics), we have support filtering from
this hooks too.
7) Add explicit module dependency between xt_physdev and br_netfilter.
This provides a way to detect if the user needs br_netfilter from
the configuration path. This should reduce the breakage of the
br_netfilter modularization.
8) Cleanup coding style in ip_vs.h, from Simon Horman.
9) Fix crash in the recently added nf_tables masq expression. We have
to register/unregister the notifiers to clean up the conntrack table
entries from the module init/exit path, not from the rule addition /
deletion path. From Arturo Borrero.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows configuring IPIP, sit, and GRE tunnels to use GUE.
This is very similar to fou excpet that we need to insert the GUE header
in addition to the UDP header on transmit.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support receiving for GUE packets in the fou module. The
fou module now supports direct foo-over-udp (no encapsulation header)
and GUE. To support this a type parameter is added to the fou netlink
parameters.
For a GUE socket we define gue_udp_recv, gue_gro_receive, and
gue_gro_complete to handle the specifics of the GUE protocol. Most
of the code to manage and configure sockets is common with the fou.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes fou[46]_gro_receive and fou[46]_gro_complete
functions. The v4 or v6 variants were chosen for the UDP offloads
based on the address family of the socket this is not necessary
or correct. Alternatively, this patch adds is_ipv6 to napi_gro_skb.
This is set in udp6_gro_receive and unset in udp4_gro_receive. In
fou_gro_receive the value is used to select the correct inet_offloads
for the protocol of the outer IP header.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When adjusting max_header for the tunnel interface based on egress
device we need to account for any extra bytes in secondary encapsulation
(e.g. FOU).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have to register the notifiers in the masquerade expression from
the the module _init and _exit path.
This fixes crashes when removing the masquerade rule with no
ipt_MASQUERADE support in place (which was masking the problem).
Fixes: 9ba1f72 ("netfilter: nf_tables: add new nft_masq expression")
Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
drivers/net/usb/r8152.c
net/netfilter/nfnetlink.c
Both r8152 and nfnetlink conflicts were simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
In 34666d4 ("netfilter: bridge: move br_netfilter out of the core"),
the bridge netfilter code has been modularized.
Use IS_ENABLED instead of ifdef to cover the module case.
Fixes: 34666d4 ("netfilter: bridge: move br_netfilter out of the core")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Move nf_send_reset() and nf_send_reset6() to nf_reject_ipv4 and
nf_reject_ipv6 respectively. This code is shared by x_tables and
nf_tables.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch introduces the NFT_REJECT_ICMPX_UNREACH type which provides
an abstraction to the ICMP and ICMPv6 codes that you can use from the
inet and bridge tables, they are:
* NFT_REJECT_ICMPX_NO_ROUTE: no route to host - network unreachable
* NFT_REJECT_ICMPX_PORT_UNREACH: port unreachable
* NFT_REJECT_ICMPX_HOST_UNREACH: host unreachable
* NFT_REJECT_ICMPX_ADMIN_PROHIBITED: administratevely prohibited
You can still use the specific codes when restricting the rule to match
the corresponding layer 3 protocol.
I decided to not overload the existing NFT_REJECT_ICMP_UNREACH to have
different semantics depending on the table family and to allow the user
to specify ICMP family specific codes if they restrict it to the
corresponding family.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Call skb_set_inner_protocol to set inner Ethernet protocol to
protocol being encapsulation by GRE before tunnel_xmit. This is
needed for GSO if UDP encapsulation (fou) is being done.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Call skb_set_inner_ipproto to set inner IP protocol to IPPROTO_IPV4
before tunnel_xmit. This is needed if UDP encapsulation (fou) is
being done.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_udp_segment is the function called from udp4_ufo_fragment to
segment a UDP tunnel packet. This function currently assumes
segmentation is transparent Ethernet bridging (i.e. VXLAN
encapsulation). This patch generalizes the function to
operate on either Ethertype or IP protocol.
The inner_protocol field must be set to the protocol of the inner
header. This can now be either an Ethertype or an IP protocol
(in a union). A new flag in the skbuff indicates which type is
effective. skb_set_inner_protocol and skb_set_inner_ipproto
helper functions were added to set the inner_protocol. These
functions are called from the point where the tunnel encapsulation
is occuring.
When skb_udp_tunnel_segment is called, the function to segment the
inner packet is selected based on the inner IP or Ethertype. In the
case of an IP protocol encapsulation, the function is derived from
inet[6]_offloads. In the case of Ethertype, skb->protocol is
set to the inner_protocol and skb_mac_gso_segment is called. (GRE
currently does this, but it might be possible to lookup the protocol
in offload_base and call the appropriate segmenation function
directly).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lets use a proper structure to clearly document and implement
skb fast clones.
Then, we might experiment more easily alternative layouts.
This patch adds a new skb_fclone_busy() helper, used by tcp and xfrm,
to stop leaking of implementation details.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we have two different policies for orphan sockets
that repeatedly stall on zero window ACKs. If a socket gets
a zero window ACK when it is transmitting data, the RTO is
used to probe the window. The socket is aborted after roughly
tcp_orphan_retries() retries (as in tcp_write_timeout()).
But if the socket was idle when it received the zero window ACK,
and later wants to send more data, we use the probe timer to
probe the window. If the receiver always returns zero window ACKs,
icsk_probes keeps getting reset in tcp_ack() and the orphan socket
can stall forever until the system reaches the orphan limit (as
commented in tcp_probe_timer()). This opens up a simple attack
to create lots of hanging orphan sockets to burn the memory
and the CPU, as demonstrated in the recent netdev post "TCP
connection will hang in FIN_WAIT1 after closing if zero window is
advertised." http://www.spinics.net/lists/netdev/msg296539.html
This patch follows the design in RTO-based probe: we abort an orphan
socket stalling on zero window when the probe timer reaches both
the maximum backoff and the maximum RTO. For example, an 100ms RTT
connection will timeout after roughly 153 seconds (0.3 + 0.6 +
.... + 76.8) if the receiver keeps the window shut. If the orphan
socket passes this check, but the system already has too many orphans
(as in tcp_out_of_resources()), we still abort it but we'll also
send an RST packet as the connection may still be active.
In addition, we change TCP_USER_TIMEOUT to cover (life or dead)
sockets stalled on zero-window probes. This changes the semantics
of TCP_USER_TIMEOUT slightly because it previously only applies
when the socket has pending transmission.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reported-by: Andrey Dmitrov <andrey.dmitrov@oktetlabs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
cipso_v4_cache_init is only called by __init cipso_v4_init
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip4_frags_ctl_register is only called by __init ipfrag_init
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Proper CHECKSUM_COMPLETE support needs to adjust skb->csum
when we remove one header. Its done using skb_gro_postpull_rcsum()
In the case of IPv4, we know that the adjustment is not really needed,
because the checksum over IPv4 header is 0. Lets add a comment to
ease code comprehension and avoid copy/paste errors.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No caller uses the return value, so make this function return void.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
pull request: netfilter/ipvs updates for net-next
The following patchset contains Netfilter/IPVS updates for net-next,
most relevantly they are:
1) Four patches to make the new nf_tables masquerading support
independent of the x_tables infrastructure. This also resolves a
compilation breakage if the masquerade target is disabled but the
nf_tables masq expression is enabled.
2) ipset updates via Jozsef Kadlecsik. This includes the addition of the
skbinfo extension that allows you to store packet metainformation in the
elements. This can be used to fetch and restore this to the packets through
the iptables SET target, patches from Anton Danilov.
3) Add the hash:mac set type to ipset, from Jozsef Kadlecsick.
4) Add simple weighted fail-over scheduler via Simon Horman. This provides
a fail-over IPVS scheduler (unlike existing load balancing schedulers).
Connections are directed to the appropriate server based solely on
highest weight value and server availability, patch from Kenny Mathis.
5) Support IPv6 real servers in IPv4 virtual-services and vice versa.
Simon Horman informs that the motivation for this is to allow more
flexibility in the choice of IP version offered by both virtual-servers
and real-servers as they no longer need to match: An IPv4 connection
from an end-user may be forwarded to a real-server using IPv6 and
vice versa. No ip_vs_sync support yet though. Patches from Alex Gartrell
and Julian Anastasov.
6) Add global generation ID to the nf_tables ruleset. When dumping from
several different object lists, we need a way to identify that an update
has ocurred so userspace knows that it needs to refresh its lists. This
also includes a new command to obtain the 32-bits generation ID. The
less significant 16-bits of this ID is also exposed through res_id field
in the nfnetlink header to quickly detect the interference and retry when
there is no risk of ID wraparound.
7) Move br_netfilter out of the bridge core. The br_netfilter code is
built in the bridge core by default. This causes problems of different
kind to people that don't want this: Jesper reported performance drop due
to the inconditional hook registration and I remember to have read complains
on netdev from people regarding the unexpected behaviour of our bridging
stack when br_netfilter is enabled (fragmentation handling, layer 3 and
upper inspection). People that still need this should easily undo the
damage by modprobing the new br_netfilter module.
8) Dump the set policy nf_tables that allows set parameterization. So
userspace can keep user-defined preferences when saving the ruleset.
From Arturo Borrero.
9) Use __seq_open_private() helper function to reduce boiler plate code
in x_tables, From Rob Jones.
10) Safer default behaviour in case that you forget to load the protocol
tracker. Daniel Borkmann and Florian Westphal detected that if your
ruleset is stateful, you allow traffic to at least one single SCTP port
and the SCTP protocol tracker is not loaded, then any SCTP traffic may
be pass through unfiltered. After this patch, the connection tracking
classifies SCTP/DCCP/UDPlite/GRE packets as invalid if your kernel has
been compiled with support for these modules.
====================
Trivially resolved conflict in include/linux/skbuff.h, Eric moved some
netfilter skbuff members around, and the netfilter tree adjusted the
ifdef guards for the bridging info pointer.
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested by Stephen. Also drop inline keyword and let compiler decide.
gcc 4.7.3 decides to no longer inline tcp_ecn_check_ce, so split it up.
The actual evaluation is not inlined anymore while the ECN_OK test is.
Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
After Octavian Purdilas tcp ipv4/ipv6 unification work this helper only
has a single callsite.
While at it, convert name to lowercase, suggested by Stephen.
Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This variable i is overwritten to 0 by following code
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This work adds the DataCenter TCP (DCTCP) congestion control
algorithm [1], which has been first published at SIGCOMM 2010 [2],
resp. follow-up analysis at SIGMETRICS 2011 [3] (and also, more
recently as an informational IETF draft available at [4]).
DCTCP is an enhancement to the TCP congestion control algorithm for
data center networks. Typical data center workloads are i.e.
i) partition/aggregate (queries; bursty, delay sensitive), ii) short
messages e.g. 50KB-1MB (for coordination and control state; delay
sensitive), and iii) large flows e.g. 1MB-100MB (data update;
throughput sensitive). DCTCP has therefore been designed for such
environments to provide/achieve the following three requirements:
* High burst tolerance (incast due to partition/aggregate)
* Low latency (short flows, queries)
* High throughput (continuous data updates, large file
transfers) with commodity, shallow buffered switches
The basic idea of its design consists of two fundamentals: i) on the
switch side, packets are being marked when its internal queue
length > threshold K (K is chosen so that a large enough headroom
for marked traffic is still available in the switch queue); ii) the
sender/host side maintains a moving average of the fraction of marked
packets, so each RTT, F is being updated as follows:
F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
alpha := (1 - g) * alpha + g * F, where g is a smoothing constant
The resulting alpha (iow: probability that switch queue is congested)
is then being used in order to adaptively decrease the congestion
window W:
W := (1 - (alpha / 2)) * W
The means for receiving marked packets resp. marking them on switch
side in DCTCP is the use of ECN.
RFC3168 describes a mechanism for using Explicit Congestion Notification
from the switch for early detection of congestion, rather than waiting
for segment loss to occur.
However, this method only detects the presence of congestion, not
the *extent*. In the presence of mild congestion, it reduces the TCP
congestion window too aggressively and unnecessarily affects the
throughput of long flows [4].
DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
processing to estimate the fraction of bytes that encounter congestion,
rather than simply detecting that some congestion has occurred. DCTCP
then scales the TCP congestion window based on this estimate [4],
thus it can derive multibit feedback from the information present in
the single-bit sequence of marks in its control law. And thus act in
*proportion* to the extent of congestion, not its *presence*.
Switches therefore set the Congestion Experienced (CE) codepoint in
packets when internal queue lengths exceed threshold K. Resulting,
DCTCP delivers the same or better throughput than normal TCP, while
using 90% less buffer space.
It was found in [2] that DCTCP enables the applications to handle 10x
the current background traffic, without impacting foreground traffic.
Moreover, a 10x increase in foreground traffic did not cause any
timeouts, and thus largely eliminates TCP incast collapse problems.
The algorithm itself has already seen deployments in large production
data centers since then.
We did a long-term stress-test and analysis in a data center, short
summary of our TCP incast tests with iperf compared to cubic:
This test measured DCTCP throughput and latency and compared it with
CUBIC throughput and latency for an incast scenario. In this test, 19
senders sent at maximum rate to a single receiver. The receiver simply
ran iperf -s.
The senders ran iperf -c <receiver> -t 30. All senders started
simultaneously (using local clocks synchronized by ntp).
This test was repeated multiple times. Below shows the results from a
single test. Other tests are similar. (DCTCP results were extremely
consistent, CUBIC results show some variance induced by the TCP timeouts
that CUBIC encountered.)
For this test, we report statistics on the number of TCP timeouts,
flow throughput, and traffic latency.
1) Timeouts (total over all flows, and per flow summaries):
CUBIC DCTCP
Total 3227 25
Mean 169.842 1.316
Median 183 1
Max 207 5
Min 123 0
Stddev 28.991 1.600
Timeout data is taken by measuring the net change in netstat -s
"other TCP timeouts" reported. As a result, the timeout measurements
above are not restricted to the test traffic, and we believe that it
is likely that all of the "DCTCP timeouts" are actually timeouts for
non-test traffic. We report them nevertheless. CUBIC will also include
some non-test timeouts, but they are drawfed by bona fide test traffic
timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing
TCP timeouts. DCTCP reduces timeouts by at least two orders of
magnitude and may well have eliminated them in this scenario.
2) Throughput (per flow in Mbps):
CUBIC DCTCP
Mean 521.684 521.895
Median 464 523
Max 776 527
Min 403 519
Stddev 105.891 2.601
Fairness 0.962 0.999
Throughput data was simply the average throughput for each flow
reported by iperf. By avoiding TCP timeouts, DCTCP is able to
achieve much better per-flow results. In CUBIC, many flows
experience TCP timeouts which makes flow throughput unpredictable and
unfair. DCTCP, on the other hand, provides very clean predictable
throughput without incurring TCP timeouts. Thus, the standard deviation
of CUBIC throughput is dramatically higher than the standard deviation
of DCTCP throughput.
Mean throughput is nearly identical because even though cubic flows
suffer TCP timeouts, other flows will step in and fill the unused
bandwidth. Note that this test is something of a best case scenario
for incast under CUBIC: it allows other flows to fill in for flows
experiencing a timeout. Under situations where the receiver is issuing
requests and then waiting for all flows to complete, flows cannot fill
in for timed out flows and throughput will drop dramatically.
3) Latency (in ms):
CUBIC DCTCP
Mean 4.0088 0.04219
Median 4.055 0.0395
Max 4.2 0.085
Min 3.32 0.028
Stddev 0.1666 0.01064
Latency for each protocol was computed by running "ping -i 0.2
<receiver>" from a single sender to the receiver during the incast
test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used to ensure
that traffic traversed the DCTCP queue and was not dropped when the
queue size was greater than the marking threshold. The summary
statistics above are over all ping metrics measured between the single
sender, receiver pair.
The latency results for this test show a dramatic difference between
CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer
which incurs the maximum queue latency (more buffer memory will lead
to high latency.) DCTCP, on the other hand, deliberately attempts to
keep queue occupancy low. The result is a two orders of magnitude
reduction of latency with DCTCP - even with a switch with relatively
little RAM. Switches with larger amounts of RAM will incur increasing
amounts of latency for CUBIC, but not for DCTCP.
4) Convergence and stability test:
This test measured the time that DCTCP took to fairly redistribute
bandwidth when a new flow commences. It also measured DCTCP's ability
to remain stable at a fair bandwidth distribution. DCTCP is compared
with CUBIC for this test.
At the commencement of this test, a single flow is sending at maximum
rate (near 10 Gbps) to a single receiver. One second after that first
flow commences, a new flow from a distinct server begins sending to
the same receiver as the first flow. After the second flow has sent
data for 10 seconds, the second flow is terminated. The first flow
sends for an additional second. Ideally, the bandwidth would be evenly
shared as soon as the second flow starts, and recover as soon as it
stops.
The results of this test are shown below. Note that the flow bandwidth
for the two flows was measured near the same time, but not
simultaneously.
DCTCP performs nearly perfectly within the measurement limitations
of this test: bandwidth is quickly distributed fairly between the two
flows, remains stable throughout the duration of the test, and
recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth
fairly, and has trouble remaining stable.
CUBIC DCTCP
Seconds Flow 1 Flow 2 Seconds Flow 1 Flow 2
0 9.93 0 0 9.92 0
0.5 9.87 0 0.5 9.86 0
1 8.73 2.25 1 6.46 4.88
1.5 7.29 2.8 1.5 4.9 4.99
2 6.96 3.1 2 4.92 4.94
2.5 6.67 3.34 2.5 4.93 5
3 6.39 3.57 3 4.92 4.99
3.5 6.24 3.75 3.5 4.94 4.74
4 6 3.94 4 5.34 4.71
4.5 5.88 4.09 4.5 4.99 4.97
5 5.27 4.98 5 4.83 5.01
5.5 4.93 5.04 5.5 4.89 4.99
6 4.9 4.99 6 4.92 5.04
6.5 4.93 5.1 6.5 4.91 4.97
7 4.28 5.8 7 4.97 4.97
7.5 4.62 4.91 7.5 4.99 4.82
8 5.05 4.45 8 5.16 4.76
8.5 5.93 4.09 8.5 4.94 4.98
9 5.73 4.2 9 4.92 5.02
9.5 5.62 4.32 9.5 4.87 5.03
10 6.12 3.2 10 4.91 5.01
10.5 6.91 3.11 10.5 4.87 5.04
11 8.48 0 11 8.49 4.94
11.5 9.87 0 11.5 9.9 0
SYN/ACK ECT test:
This test demonstrates the importance of ECT on SYN and SYN-ACK packets
by measuring the connection probability in the presence of competing
flows for a DCTCP connection attempt *without* ECT in the SYN packet.
The test was repeated five times for each number of competing flows.
Competing Flows 1 | 2 | 4 | 8 | 16
------------------------------
Mean Connection Probability 1 | 0.67 | 0.45 | 0.28 | 0
Median Connection Probability 1 | 0.65 | 0.45 | 0.25 | 0
As the number of competing flows moves beyond 1, the connection
probability drops rapidly.
Enabling DCTCP with this patch requires the following steps:
DCTCP must be running both on the sender and receiver side in your
data center, i.e.:
sysctl -w net.ipv4.tcp_congestion_control=dctcp
Also, ECN functionality must be enabled on all switches in your
data center for DCTCP to work. The default ECN marking threshold (K)
heuristic on the switch for DCTCP is e.g., 20 packets (30KB) at
1Gbps, and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).
In above tests, for each switch port, traffic was segregated into two
queues. For any packet with a DSCP of 0x01 - or equivalently a TOS of
0x04 - the packet was placed into the DCTCP queue. All other packets
were placed into the default drop-tail queue. For the DCTCP queue,
RED/ECN marking was enabled, here, with a marking threshold of 75 KB.
More details however, we refer you to the paper [2] under section 3).
There are no code changes required to applications running in user
space. DCTCP has been implemented in full *isolation* of the rest of
the TCP code as its own congestion control module, so that it can run
without a need to expose code to the core of the TCP stack, and thus
nothing changes for non-DCTCP users.
Changes in the CA framework code are minimal, and DCTCP algorithm
operates on mechanisms that are already available in most Silicon.
The gain (dctcp_shift_g) is currently a fixed constant (1/16) from
the paper, but we leave the option that it can be chosen carefully
to a different value by the user.
In case DCTCP is being used and ECN support on peer site is off,
DCTCP falls back after 3WHS to operate in normal TCP Reno mode.
ss {-4,-6} -t -i diag interface:
... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054
ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584
send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15
reordering:101 rcv_space:29200
... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448
cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate
325.5Mbps rcv_rtt:1.5 rcv_space:29200
More information about DCTCP can be found in [1-4].
[1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
[2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
[3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
[4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
Joint work with Florian Westphal and Glenn Judd.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
DataCenter TCP (DCTCP) determines cwnd growth based on ECN information
and ACK properties, e.g. ACK that updates window is treated differently
than DUPACK.
Also DCTCP needs information whether ACK was delayed ACK. Furthermore,
DCTCP also implements a CE state machine that keeps track of CE markings
of incoming packets.
Therefore, extend the congestion control framework to provide these
event types, so that DCTCP can be properly implemented as a normal
congestion algorithm module outside of the core stack.
Joint work with Daniel Borkmann and Glenn Judd.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The congestion control ops "cwnd_event" currently supports
CA_EVENT_FAST_ACK and CA_EVENT_SLOW_ACK events (among others).
Both FAST and SLOW_ACK are only used by Westwood congestion
control algorithm.
This removes both flags from cwnd_event and adds a new
in_ack_event callback for this. The goal is to be able to
provide more detailed information about ACKs, such as whether
ECE flag was set, or whether the ACK resulted in a window
update.
It is required for DataCenter TCP (DCTCP) congestion control
algorithm as it makes a different choice depending on ECE being
set or not.
Joint work with Daniel Borkmann and Glenn Judd.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a flag to TCP congestion algorithms that allows
for requesting to mark IPv4/IPv6 sockets with transport as ECN
capable, that is, ECT(0), when required by a congestion algorithm.
It is currently used and needed in DataCenter TCP (DCTCP), as it
requires both peers to assert ECT on all IP packets sent - it
uses ECN feedback (i.e. CE, Congestion Encountered information)
from switches inside the data center to derive feedback to the
end hosts.
Therefore, simply add a new flag to icsk_ca_ops. Note that DCTCP's
algorithm/behaviour slightly diverges from RFC3168, therefore this
is only (!) enabled iff the assigned congestion control ops module
has requested this. By that, we can tightly couple this logic really
only to the provided congestion control ops.
Joint work with Florian Westphal and Glenn Judd.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split assignment and initialization from one into two functions.
This is required by followup patches that add Datacenter TCP
(DCTCP) congestion control algorithm - we need to be able to
determine if the connection is moderated by DCTCP before the
3WHS has finished.
As we walk the available congestion control list during the
assignment, we are always guaranteed to have Reno present as
it's fixed compiled-in. Therefore, since we're doing the
early assignment, we don't have a real use for the Reno alias
tcp_init_congestion_ops anymore and can thus remove it.
Actual usage of the congestion control operations are being
made after the 3WHS has finished, in some cases however we
can access get_info() via diag if implemented, therefore we
need to zero out the private area for those modules.
Joint work with Daniel Borkmann and Glenn Judd.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We do not wish to disturb dropwatch or perf drop profiles with an ARP
we will ignore.
Signed-off-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2014-09-25
1) Remove useless hash_resize_mutex in xfrm_hash_resize().
This mutex is used only there, but xfrm_hash_resize()
can't be called concurrently at all. From Ying Xue.
2) Extend policy hashing to prefixed policies based on
prefix lenght thresholds. From Christophe Gouault.
3) Make the policy hash table thresholds configurable
via netlink. From Christophe Gouault.
4) Remove the maximum authentication length for AH.
This was needed to limit stack usage. We switched
already to allocate space, so no need to keep the
limit. From Herbert Xu.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: commit f187bc6efb ("ipv4: No need to set generic neighbour pointer")
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is a cleanup which follows the idea in commit e11ecddf51 (tcp: use
TCP_SKB_CB(skb)->tcp_flags in input path),
and it may reduce register pressure since skb->cb[] access is fast,
bacause skb is probably in a register.
v2: remove variable th
v3: reword the changelog
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Our goal is to access no more than one cache line access per skb in
a write or receive queue when doing the various walks.
After recent TCP_SKB_CB() reorganizations, it is almost done.
Last part is tcp_skb_pcount() which currently uses
skb_shinfo(skb)->gso_segs, which is a terrible choice, because it needs
3 cache lines in current kernel (skb->head, skb->end, and
shinfo->gso_segs are all in 3 different cache lines, far from skb->cb)
This very simple patch reuses space currently taken by tcp_tw_isn
only in input path, as tcp_skb_pcount is only needed for skb stored in
write queue.
This considerably speeds up tcp_ack(), granted we avoid shinfo->tx_flags
to get SKBTX_ACK_TSTAMP, which seems possible.
This also speeds up all sack processing in general.
This speeds up tcp_sendmsg() because it no longer has to access/dirty
shinfo.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP maintains lists of skb in write queue, and in receive queues
(in order and out of order queues)
Scanning these lists both in input and output path usually requires
access to skb->next, TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq
These fields are currently in two different cache lines, meaning we
waste lot of memory bandwidth when these queues are big and flows
have either packet drops or packet reorders.
We can move TCP_SKB_CB(skb)->header at the end of TCP_SKB_CB, because
this header is not used in fast path. This allows TCP to search much faster
in the skb lists.
Even with regular flows, we save one cache line miss in fast path.
Thanks to Christoph Paasch for noticing we need to cleanup
skb->cb[] (IPCB/IP6CB) before entering IP stack in tx path,
and that I forgot IPCB use in tcp_v4_hnd_req() and tcp_v4_save_options().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_options_echo() assumes struct ip_options is provided in &IPCB(skb)->opt
Lets break this assumption, but provide a helper to not change all call points.
ip_send_unicast_reply() gets a new struct ip_options pointer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_dma was the only external user so this can become local to tcp.c
again.
Cc: James Morris <jmorris@namei.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Now that tcp_dma_try_early_copy() is gone nothing ever sets
copied_early.
Also reverts "53240c208776 tcp: Fix possible double-ack w/ user dma"
since it is no longer necessary.
Cc: Ali Saidi <saidi@engin.umich.edu>
Cc: James Morris <jmorris@namei.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Neal Cardwell <ncardwell@google.com>
Reported-by: Dave Jones <davej@redhat.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Per commit "77873803363c net_dma: mark broken" net_dma is no longer used
and there is no plan to fix it.
This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards.
Reverting the remainder of the net_dma induced changes is deferred to
subsequent patches.
Marked for stable due to Roman's report of a memory leak in
dma_pin_iovec_pages():
https://lkml.org/lkml/2014/9/3/177
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: David Whipple <whipple@securedatainnovations.ch>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: <stable@vger.kernel.org>
Reported-by: Roman Gushchin <klamm@yandex-team.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
While profiling TCP stack, I noticed one useless atomic operation
in tcp_sendmsg(), caused by skb_header_release().
It turns out all current skb_header_release() users have a fresh skb,
that no other user can see, so we can avoid one atomic operation.
Introduce __skb_header_release() to clearly document this.
This gave me a 1.5 % improvement on TCP_RR workload.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we try to add an already existing tunnel, we don't return
an error. Instead we continue and call ip_tunnel_update().
This means that we can change existing tunnels by adding
the same tunnel multiple times. It is even possible to change
the tunnel endpoints of the fallback device.
We fix this by returning an error if we try to add an existing
tunnel.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The send_check logic was only interesting in cases of TCP offload and
UDP UFO where the checksum needed to be initialized to the pseudo
header checksum. Now we've moved that logic into the related
gso_segment functions so gso_send_check is no longer needed.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In udp[46]_ufo_send_check the UDP checksum initialized to the pseudo
header checksum. We can move this logic into udp[46]_ufo_fragment.
After this change udp[64]_ufo_send_check is a no-op.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In tcp_v[46]_gso_send_check the TCP checksum is initialized to the
pseudo header checksum using __tcp_v[46]_send_check. We can move this
logic into new tcp[46]_gso_segment functions to be done when
ip_summed != CHECKSUM_PARTIAL (ip_summed == CHECKSUM_PARTIAL should be
the common case, possibly always true when taking GSO path). After this
change tcp_v[46]_gso_send_check is no-op.
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is to receive 0a30288da1 ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The
commit reverted and patches to implement proper fix will be added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
In order to make TCP more resilient in presence of reorders, we need
to allow coalescing to happen when skbs from out of order queue are
transferred into receive queue. LRO/GRO can be completely canceled
in some pathological cases, like per packet load balancing on aggregated
links.
I had to move tcp_try_coalesce() up in the file above tcp_ofo_queue()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current ICMP rate limiting uses inetpeer cache, which is an RBL tree
protected by a lock, meaning that hosts can be stuck hard if all cpus
want to check ICMP limits.
When say a DNS or NTP server process is restarted, inetpeer tree grows
quick and machine comes to its knees.
iptables can not help because the bottleneck happens before ICMP
messages are even cooked and sent.
This patch adds a new global limitation, using a token bucket filter,
controlled by two new sysctl :
icmp_msgs_per_sec - INTEGER
Limit maximal number of ICMP packets sent per second from this host.
Only messages whose type matches icmp_ratemask are
controlled by this limit.
Default: 1000
icmp_msgs_burst - INTEGER
icmp_msgs_per_sec controls number of ICMP packets sent per second,
while icmp_msgs_burst controls the burst size of these packets.
Default: 50
Note that if we really want to send millions of ICMP messages per
second, we might extend idea and infra added in commit 04ca6973f7
("ip: make IP identifiers less predictable") :
add a token bucket in the ip_idents hash and no longer rely on inetpeer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
arch/mips/net/bpf_jit.c
drivers/net/can/flexcan.c
Both the flexcan and MIPS bpf_jit conflicts were cases of simple
overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
this_cpu_ptr() in preemptible context is generally bad
Sep 22 05:05:55 br kernel: [ 94.608310] BUG: using smp_processor_id()
in
preemptible [00000000] code: ip/2261
Sep 22 05:05:55 br kernel: [ 94.608316] caller is
tunnel_dst_set.isra.28+0x20/0x60 [ip_tunnel]
Sep 22 05:05:55 br kernel: [ 94.608319] CPU: 3 PID: 2261 Comm: ip Not
tainted
3.17.0-rc5 #82
We can simply use raw_cpu_ptr(), as preemption is safe in these
contexts.
Should fix https://bugzilla.kernel.org/show_bug.cgi?id=84991
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Joe <joe9mail@gmail.com>
Fixes: 9a4aa9af44 ("ipv4: Use percpu Cache route in IP tunnels")
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
icsk_rto is a 32bit field, and icsk_backoff can reach 15 by default,
or more if some sysctl (eg tcp_retries2) are changed.
Better use 64bit to perform icsk_rto << icsk_backoff operations
As Joe Perches suggested, add a helper for this.
Yuchung spotted the tcp_v4_err() case.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added netlink attrs to configure FOU encapsulation for GRE, netlink
handling of these flags, and properly adjust MTU for encapsulation.
ip_tunnel_encap is called from ip_tunnel_xmit to actually perform FOU
encapsulation.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add netlink handling for IP tunnel encapsulation parameters and
and adjustment of MTU for encapsulation. ip_tunnel_encap is called
from ip_tunnel_xmit to actually perform FOU encapsulation.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes IP tunnel to support (secondary) encapsulation,
Foo-over-UDP. Changes include:
1) Adding tun_hlen as the tunnel header length, encap_hlen as the
encapsulation header length, and hlen becomes the grand total
of these.
2) Added common netlink define to support FOU encapsulation.
3) Routines to perform FOU encapsulation.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement fou_gro_receive and fou_gro_complete, and populate these
in the correponsing udp_offloads for the socket. Added ipproto to
udp_offloads and pass this from UDP to the fou GRO routine in proto
field of napi_gro_cb structure.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch provides a receive path for foo-over-udp. This allows
direct encapsulation of IP protocols over UDP. The bound destination
port is used to map to an IP protocol, and the XFRM framework
(udp_encap_rcv) is used to receive encapsulated packets. Upon
reception, the encapsulation header is logically removed (pointer
to transport header is advanced) and the packet is reinjected into
the receive path with the IP protocol indicated by the mapping.
Netlink is used to configure FOU ports. The configuration information
includes the port number to bind to and the IP protocol corresponding
to that port.
This should support GRE/UDP
(http://tools.ietf.org/html/draft-yong-tsvwg-gre-in-udp-encap-02),
as will as the other IP tunneling protocols (IPIP, SIT).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Want to be able to use these in foo-over-udp offloads, etc.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we no longer rely on having tcp headers for skbs in receive queue,
tcp repair do not need to build fake ones.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added a few more UDP tunnel APIs that can be shared by UDP based
tunnel protocol implementation. The main ones are highlighted below.
setup_udp_tunnel_sock() configures UDP listener socket for
receiving UDP encapsulated packets.
udp_tunnel_xmit_skb() and upd_tunnel6_xmit_skb() transmit skb
using UDP encapsulation.
udp_tunnel_sock_release() closes the UDP tunnel listener socket.
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ip6_udp_tunnel.c for ipv6 UDP tunnel functions to avoid ifdefs
in udp_tunnel.c
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While tracking down the MAX_AH_AUTH_LEN crash in an old kernel
I thought that this limit was rather arbitrary and we should
just get rid of it.
In fact it seems that we've already done all the work needed
to remove it apart from actually removing it. This limit was
there in order to limit stack usage. Since we've already
switched over to allocating scratch space using kmalloc, there
is no longer any need to limit the authentication length.
This patch kills all references to it, including the BUG_ONs
that led me here.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Currently we genarate a blackhole route route whenever we have
matching policies but can not resolve the states. Here we assume
that dst_output() is called to kill the balckholed packets.
Unfortunately this assumption is not true in all cases, so
it is possible that these packets leave the system unwanted.
We fix this by generating blackhole routes only from the
route lookup functions, here we can guarantee a call to
dst_output() afterwards.
Fixes: 2774c131b1 ("xfrm: Handle blackhole route creation via afinfo.")
Reported-by: Konstantinos Kolelis <k.kolelis@sirrix.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
tcp_collapse() wants to shrink skb so that the overhead is minimal.
Now we store tcp flags into TCP_SKB_CB(skb)->tcp_flags, we no longer
need to keep around full headers.
Whole available space is dedicated to the payload.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can allow a segment with FIN to be aggregated,
if we take care to add tcp flags,
and if skb_try_coalesce() takes care of zero sized skbs.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Input path of TCP do not currently uses TCP_SKB_CB(skb)->tcp_flags,
which is only used in output path.
tcp_recvmsg(), looks at tcp_hdr(skb)->syn for every skb found in receive queue,
and its unfortunate because this bit is located in a cache line right before
the payload.
We can simplify TCP by copying tcp flags into TCP_SKB_CB(skb)->tcp_flags.
This patch does so, and avoids the cache line miss in tcp_recvmsg()
Following patches will
- allow a segment with FIN being coalesced in tcp_try_coalesce()
- simplify tcp_collapse() by not copying the headers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 2abb7cdc0d ("udp: Add support for doing checksum unnecessary
conversion") caused napi_gro_cb structs with the "flush" field zero to
take the "udp_gro_receive" path rather than the "set flush to 1" path
that they would previously take. As a result I saw booting from an NFS
root hang shortly after starting userspace, with "server not
responding" messages.
This change to the handling of "flush == 0" packets appears to be
incidental to the goal of adding new code in the case where
skb_gro_checksum_validate_zero_check() returns zero. Based on that and
the fact that it breaks things, I'm assuming that it is unintentional.
Fixes: 2abb7cdc0d ("udp: Add support for doing checksum unnecessary conversion")
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Users are starting to test nf_tables with no x_tables support. Therefore,
masquerading needs to be indenpendent of it from Kconfig.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Now that we have masquerading support in nf_tables, the NAT chain can
be use with it, not only for SNAT/DNAT. So make this chain type
independent of it.
While at it, move it inside the scope of 'if NF_NAT_IPV*' to simplify
dependencies.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
nf-next pull request
The following patchset contains Netfilter/IPVS updates for your
net-next tree. Regarding nf_tables, most updates focus on consolidating
the NAT infrastructure and adding support for masquerading. More
specifically, they are:
1) use __u8 instead of u_int8_t in arptables header, from
Mike Frysinger.
2) Add support to match by skb->pkttype to the meta expression, from
Ana Rey.
3) Add support to match by cpu to the meta expression, also from
Ana Rey.
4) A smatch warning about IPSET_ATTR_MARKMASK validation, patch from
Vytas Dauksa.
5) Fix netnet and netportnet hash types the range support for IPv4,
from Sergey Popovich.
6) Fix missing-field-initializer warnings resolved, from Mark Rustad.
7) Dan Carperter reported possible integer overflows in ipset, from
Jozsef Kadlecsick.
8) Filter out accounting objects in nfacct by type, so you can
selectively reset quotas, from Alexey Perevalov.
9) Move specific NAT IPv4 functions to the core so x_tables and
nf_tables can share the same NAT IPv4 engine.
10) Use the new NAT IPv4 functions from nft_chain_nat_ipv4.
11) Move specific NAT IPv6 functions to the core so x_tables and
nf_tables can share the same NAT IPv4 engine.
12) Use the new NAT IPv6 functions from nft_chain_nat_ipv6.
13) Refactor code to add nft_delrule(), which can be reused in the
enhancement of the NFT_MSG_DELTABLE to remove a table and its
content, from Arturo Borrero.
14) Add a helper function to unregister chain hooks, from
Arturo Borrero.
15) A cleanup to rename to nft_delrule_by_chain for consistency with
the new nft_*() functions, also from Arturo.
16) Add support to match devgroup to the meta expression, from Ana Rey.
17) Reduce stack usage for IPVS socket option, from Julian Anastasov.
18) Remove unnecessary textsearch state initialization in xt_string,
from Bojan Prtvar.
19) Add several helper functions to nf_tables, more work to prepare
the enhancement of NFT_MSG_DELTABLE, again from Arturo Borrero.
20) Enhance NFT_MSG_DELTABLE to delete a table and its content, from
Arturo Borrero.
21) Support NAT flags in the nat expression to indicate the flavour,
eg. random fully, from Arturo.
22) Add missing audit code to ebtables when replacing tables, from
Nicolas Dichtel.
23) Generalize the IPv4 masquerading code to allow its re-use from
nf_tables, from Arturo.
24) Generalize the IPv6 masquerading code, also from Arturo.
25) Add the new masq expression to support IPv4/IPv6 masquerading
from nf_tables, also from Arturo.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add inet_gro_receive and inet_gro_complete to ipip_offload to
support GRO.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ipv4/udp_offload.c:339:5: warning: symbol 'udp4_gro_complete' was
not declared. Should it be static?
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Fixes: 57c67ff4bd ("udp: additional GRO support")
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove one sparse warning :
net/ipv4/ip_sockglue.c:328:22: warning: incorrect type in assignment (different address spaces)
net/ipv4/ip_sockglue.c:328:22: expected struct ip_ra_chain [noderef] <asn:4>*next
net/ipv4/ip_sockglue.c:328:22: got struct ip_ra_chain *[assigned] ra
And replace one rcu_assign_ptr() by RCU_INIT_POINTER() where applicable.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck reported high false sharing on dst refcount in tcp stack
when prequeue is used. prequeue is the mechanism used when a thread is
blocked in recvmsg()/read() on a TCP socket, using a blocking model
rather than select()/poll()/epoll() non blocking one.
We already try to use RCU in input path as much as possible, but we were
forced to take a refcount on the dst when skb escaped RCU protected
region. When/if the user thread runs on different cpu, dst_release()
will then touch dst refcount again.
Commit 093162553c (tcp: force a dst refcount when prequeue packet)
was an example of a race fix.
It turns out the only remaining usage of skb->dst for a packet stored
in a TCP socket prequeue is IP early demux.
We can add a logic to detect when IP early demux is probably going
to use skb->dst. Because we do an optimistic check rather than duplicate
existing logic, we need to guard inet_sk_rx_dst_set() and
inet6_sk_rx_dst_set() from using a NULL dst.
Many thanks to Alexander for providing a nice bug report, git bisection,
and reproducer.
Tested using Alexander script on a 40Gb NIC, 8 RX queues.
Hosts have 24 cores, 48 hyper threads.
echo 0 >/proc/sys/net/ipv4/tcp_autocorking
for i in `seq 0 47`
do
for j in `seq 0 2`
do
netperf -H $DEST -t TCP_STREAM -l 1000 \
-c -C -T $i,$i -P 0 -- \
-m 64 -s 64K -D &
done
done
Before patch : ~6Mpps and ~95% cpu usage on receiver
After patch : ~9Mpps and ~35% cpu usage on receiver.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net.ipv4.ip_nonlocal_bind sysctl was global to all network
namespaces. This patch allows to set a different value for each
network namespace.
Signed-off-by: Vincent Bernat <vincent@bernat.im>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nft_masq expression is intended to perform NAT in the masquerade flavour.
We decided to have the masquerade functionality in a separated expression other
than nft_nat.
Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Let's refactor the code so we can reach the masquerade functionality
from outside the xt context (ie. nftables).
The patch includes the addition of an atomic counter to the masquerade
notifier: the stuff to be done by the notifier is the same for xt and
nftables. Therefore, only one notification handler is needed.
This factorization only involves IPv4; a similar patch follows to
handle IPv6.
Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
inetpeer sequence numbers are no longer incremented, so no need to
check and flush the tree. The function that increments the sequence
number was already dead code and removed in in "ipv4: remove unused
function" (068a6e18). Remove the code that checks for a change, too.
Verifying that v4_seq and v6_seq are never incremented and thus that
flush_check compares bp->flush_seq to 0 is trivial.
The second part of the change removes flush_check completely even
though bp->flush_seq is exactly !0 once, at initialization. This
change is correct because the time this branch is true is when
bp->root == peer_avl_empty_rcu, in which the branch and
inetpeer_invalidate_tree are a NOOP.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRE assumes that the GRE header is at skb_network_header +
ip_hrdlen(skb). It is more general to use skb_transport_header
and this allows the possbility of inserting additional header
between IP and GRE (which is what we will done in Generic UDP
Encapsulation for GRE).
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Percpu allocator now supports allocation mask. Add @gfp to
percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
with percpu_counters too.
We could have left percpu_counter_init() alone and added
percpu_counter_init_gfp(); however, the number of users isn't that
high and introducing _gfp variants to all percpu data structures would
be quite ugly, so let's just do the conversion. This is the one with
the most users. Other percpu data structures are a lot easier to
convert.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: x86@kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
The TCP_SKB_CB(skb)->when field no longer exists as of recent change
7faee5c0d5 ("tcp: remove TCP_SKB_CB(skb)->when"). And in any case,
tcp_fragment() is called on already-transmitted packets from the
__tcp_retransmit_skb() call site, so copying timestamps of any kind
in this spot is quite sensible.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reported-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 740b0f1841 ("tcp: switch rtt estimations to usec resolution"),
we no longer need to maintain timestamps in two different fields.
TCP_SKB_CB(skb)->when can be removed, as same information sits in skb_mstamp.stamp_jiffies
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP_SKB_CB(skb)->when has different meaning in output and input paths.
In output path, it contains a timestamp.
In input path, it contains an ISN, chosen by tcp_timewait_state_process()
Lets add a different name to ease code comprehension.
Note that 'when' field will disappear in following patch,
as skb_mstamp already contains timestamp, the anonymous
union will promptly disappear as well.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since sock_efree and sock_demux are essentially the same code for non-TCP
sockets and the case where CONFIG_INET is not defined we can combine the
code or replace the call to sock_edemux in several spots. As a result we
can avoid a bit of unnecessary code or code duplication.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lets make this hash function a bit secure, as ICMP attacks are still
in the wild.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nh_exceptions is effectively used under rcu, but lacks proper
barriers. Between kzalloc() and setting of nh->nh_exceptions(),
we need a proper memory barrier.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 4895c771c7 ("ipv4: Add FIB nexthop exceptions.")
Signed-off-by: David S. Miller <davem@davemloft.net>
As in IPv6 people might increase the igmp query robustness variable to
make sure unsolicited state change reports aren't lost on the network. Add
and document this new knob to igmp code.
RFCs allow tuning this parameter back to first IGMP RFC, so we also use
this setting for all counters, including source specific multicast.
Also take over sysctl value when upping the interface and don't reuse
the last one seen on the interface.
Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the exported IPv4 NAT functions that are provided by the core. This
removes duplicated code so iptables and nft use the same NAT codebase.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Move the specific NAT IPv4 core functions that are called from the
hooks from iptable_nat.c to nf_nat_l3proto_ipv4.c. This prepares the
ground to allow iptables and nft to use the same NAT engine code that
comes in a follow up patch.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
sk->sk_error_queue is dequeued in four locations. All share the
exact same logic. Deduplicate.
Also collapse the two critical sections for dequeue (at the top of
the recv handler) and signal (at the bottom).
This moves signal generation for the next packet forward, which should
be harmless.
It also changes the behavior if the recv handler exits early with an
error. Previously, a signal for follow-up packets on the errqueue
would then not be scheduled. The new behavior, to always signal, is
arguably a bug fix.
For rxrpc, the change causes the same function to be called repeatedly
for each queued packet (because the recv handler == sk_error_report).
It is likely that all packets will fail for the same reason (e.g.,
memory exhaustion).
This code runs without sk_lock held, so it is not safe to trust that
sk->sk_err is immutable inbetween releasing q->lock and the subsequent
test. Introduce int err just to avoid this potential race.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Call skb_checksum_try_convert and skb_gro_checksum_try_convert
after checksum is found present and validated in the GRE header
for normal and GRO paths respectively.
In GRO path, call skb_gro_checksum_try_convert
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for doing CHECKSUM_UNNECESSARY to CHECKSUM_COMPLETE
conversion in UDP tunneling path.
In the normal UDP path, we call skb_checksum_try_convert after locating
the UDP socket. The check is that checksum conversion is enabled for
the socket (new flag in UDP socket) and that checksum field is
non-zero.
In the UDP GRO path, we call skb_gro_checksum_try_convert after
checksum is validated and checksum field is non-zero. Since this is
already in GRO we assume that checksum conversion is always wanted.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix places where there is space before tab, long lines, and
awkward if(){, double spacing etc. Add blank line after declaration/initialization.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow GRO path to "consume" checksums provided in CHECKSUM_UNNECESSARY
and to report new checksums verfied for use in fallback to normal
path.
Change GRO checksum path to track csum_level using a csum_cnt field
in NAPI_GRO_CB. On GRO initialization, if ip_summed is
CHECKSUM_UNNECESSARY set NAPI_GRO_CB(skb)->csum_cnt to
skb->csum_level + 1. For each checksum verified, decrement
NAPI_GRO_CB(skb)->csum_cnt while its greater than zero. If a checksum
is verfied and NAPI_GRO_CB(skb)->csum_cnt == 0, we have verified a
deeper checksum than originally indicated in skbuf so increment
csum_level (or initialize to CHECKSUM_UNNECESSARY if ip_summed is
CHECKSUM_NONE or CHECKSUM_COMPLETE).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch:
- Clarifies the specific requirements of devices returning
CHECKSUM_UNNECESSARY (comments in skbuff.h).
- Adds csum_level field to skbuff. This is used to express how
many checksums are covered by CHECKSUM_UNNECESSARY (stores n - 1).
This replaces the overloading of skb->encapsulation, that field is
is now only used to indicate inner headers are valid.
- Change __skb_checksum_validate_needed to "consume" each checksum
as indicated by csum_level as layers of the the packet are parsed.
- Remove skb_pop_rcv_encapsulation, no longer needed in the new
csum_level model.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace uses of get_cpu_var for address calculation through this_cpu_ptr.
Cc: netdev@vger.kernel.org
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In GRE demux if the GRE checksum pop rcv encapsulation so that any
encapsulated checksums are treated as tunnel checksums.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement GRO for UDPv6. Add UDP checksum verification in gro_receive
for both UDP4 and UDP6 calling skb_gro_checksum_validate_zero_check.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In tcp[64]_gro_receive call skb_gro_checksum_validate to validate TCP
checksum in the gro context.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace open codings of (((u64) <x> * <y>) >> 32) with reciprocal_scale().
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Upon timeout, undo (via both timestamps/Eifel and DSACKs) was
disabled if any retransmits were still in flight. The concern was
perhaps that spurious retransmission sent in a previous recovery
episode may trigger DSACKs to falsely undo the current recovery.
However, this inadvertently misses undo opportunities (using either
TCP timestamps or DSACKs) when timeout occurs during a loss episode,
i.e. recurring timeouts or timeout during fast recovery. In these
cases some retransmissions will be in flight but we should allow
undo. Furthermore, we should only reset undo_marker and undo_retrans
upon timeout if we are starting a new recovery episode. Finally,
when we do reset our undo state, we now do so in a manner similar
to tcp_enter_recovery(), so that we require a DSACK for each of
the outstsanding retransmissions. This will achieve the original
goal by requiring that we receive the same number of DSACKs as
retransmissions.
This patch increases the undo events by 50% on Google servers.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The functions time_before, time_before_eq, time_after, and time_after_eq
are more robust for comparing jiffies against other values.
A simplified version of the Coccinelle semantic patch making this change
is as follows:
@change@
expression E1,E2;
@@
- jiffies - E1 < E2
+ time_before(jiffies, E1+E2)
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
The "rcu_dereference()" call is used directly in a condition.
Since its return value is never dereferenced it is recommended to use
"rcu_access_pointer()" instead of "rcu_dereference()".
Therefore, this patch makes the replacement.
The following Coccinelle semantic patch was used:
@@
@@
(
if(
(<+...
- rcu_dereference
+ rcu_access_pointer
(...)
...+>)) {...}
|
while(
(<+...
- rcu_dereference
+ rcu_access_pointer
(...)
...+>)) {...}
)
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7a9bc9b81a ("ipv4: Elide fib_validate_source() completely when possible.")
introduced a short-circuit to avoid calling fib_validate_source when not
needed. That change took rp_filter into account, but not accept_local.
This resulted in a change of behaviour: with rp_filter and accept_local
off, incoming packets with a local address in the source field should be
dropped.
Here is how to reproduce the change pre/post 7a9bc9b81a commit:
-configure the same IPv4 address on hosts A and B.
-try to send an ARP request from B to A.
-The ARP request will be dropped before that commit, but accepted and answered
after that commit.
This adds a check for ACCEPT_LOCAL, to maintain full
fib validation in case it is 0. We also leave __fib_validate_source() earlier
when possible, based on the same check as fib_validate_source(), once the
accept_local stuff is verified.
Cc: Gregory Detal <gregory.detal@uclouvain.be>
Cc: Christoph Paasch <christoph.paasch@uclouvain.be>
Cc: Hannes Frederic Sowa <hannes@redhat.com>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Sébastien Barré <sebastien.barre@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the NAT configs depend on iptables and ip6tables. However,
users should be capable of enabling NAT for nft without having to
switch on iptables.
Fix this by adding new specific IP_NF_NAT and IP6_NF_NAT config
switches for iptables and ip6tables NAT support. I have also moved
the original NF_NAT_IPV4 and NF_NAT_IPV6 configs out of the scope
of iptables to make them independent of it.
This patch also adds NETFILTER_XT_NAT which selects the xt_nat
combo that provides snat/dnat for iptables. We cannot use NF_NAT
anymore since nf_tables can select this.
Reported-by: Matteo Croce <technoboy85@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fix TCP FRTO logic so that it always notices when snd_una advances,
indicating that any RTO after that point will be a new and distinct
loss episode.
Previously there was a very specific sequence that could cause FRTO to
fail to notice a new loss episode had started:
(1) RTO timer fires, enter FRTO and retransmit packet 1 in write queue
(2) receiver ACKs packet 1
(3) FRTO sends 2 more packets
(4) RTO timer fires again (should start a new loss episode)
The problem was in step (3) above, where tcp_process_loss() returned
early (in the spot marked "Step 2.b"), so that it never got to the
logic to clear icsk_retransmits. Thus icsk_retransmits stayed
non-zero. Thus in step (4) tcp_enter_loss() would see the non-zero
icsk_retransmits, decide that this RTO is not a new episode, and
decide not to cut ssthresh and remember the current cwnd and ssthresh
for undo.
There were two main consequences to the bug that we have
observed. First, ssthresh was not decreased in step (4). Second, when
there was a series of such FRTO (1-4) sequences that happened to be
followed by an FRTO undo, we would restore the cwnd and ssthresh from
before the entire series started (instead of the cwnd and ssthresh
from before the most recent RTO). This could result in cwnd and
ssthresh being restored to values much bigger than the proper values.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Fixes: e33099f96d ("tcp: implement RFC5682 F-RTO")
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_tw_recycle heavily relies on tcp timestamps to build a per-host
ordering of incoming connections and teardowns without the need to
hold state on a specific quadruple for TCP_TIMEWAIT_LEN, but only for
the last measured RTO. To do so, we keep the last seen timestamp in a
per-host indexed data structure and verify if the incoming timestamp
in a connection request is strictly greater than the saved one during
last connection teardown. Thus we can verify later on that no old data
packets will be accepted by the new connection.
During moving a socket to time-wait state we already verify if timestamps
where seen on a connection. Only if that was the case we let the
time-wait socket expire after the RTO, otherwise normal TCP_TIMEWAIT_LEN
will be used. But we don't verify this on incoming SYN packets. If a
connection teardown was less than TCP_PAWS_MSL seconds in the past we
cannot guarantee to not accept data packets from an old connection if
no timestamps are present. We should drop this SYN packet. This patch
closes this loophole.
Please note, this patch does not make tcp_tw_recycle in any way more
usable but only adds another safety check:
Sporadic drops of SYN packets because of reordering in the network or
in the socket backlog queues can happen. Users behing NAT trying to
connect to a tcp_tw_recycle enabled server can get caught in blackholes
and their connection requests may regullary get dropped because hosts
behind an address translator don't have synchronized tcp timestamp clocks.
tcp_tw_recycle cannot work if peers don't have tcp timestamps enabled.
In general, use of tcp_tw_recycle is disadvised.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure we use the correct address-family-specific function for
handling MTU reductions from within tcp_release_cb().
Previously AF_INET6 sockets were incorrectly always using the IPv6
code path when sometimes they were handling IPv4 traffic and thus had
an IPv4 dst.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Diagnosed-by: Willem de Bruijn <willemb@google.com>
Fixes: 563d34d057 ("tcp: dont drop MTU reduction indications")
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bytestream timestamps are correlated with a single byte in the skbuff,
recorded in skb_shinfo(skb)->tskey. When fragmenting skbuffs, ensure
that the tskey is set for the fragment in which the tskey falls
(seqno <= tskey < end_seqno).
The original implementation did not address fragmentation in
tcp_fragment or tso_fragment. Add code to inspect the sequence numbers
and move both tskey and the relevant tx_flags if necessary.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ACK timestamps are generated in tcp_clean_rtx_queue. The TSO datapath
can break out early, causing the timestamp code to be skipped. Move
the code up before the break.
Reported-by: David S. Miller <davem@davemloft.net>
Also fix a boundary condition: tp->snd_una is the next unacknowledged
byte and between tests inclusive (a <= b <= c), so generate a an ACK
timestamp if (prior_snd_una <= tskey <= tp->snd_una - 1).
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since fib_lookup cannot return ESRCH no longer,
checking for this error code is no longer neccesary.
Signed-off-by: Niv Yehezkel <executerx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge incoming from Andrew Morton:
- Various misc things.
- arch/sh updates.
- Part of ocfs2. Review is slow.
- Slab updates.
- Most of -mm.
- printk updates.
- lib/ updates.
- checkpatch updates.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (226 commits)
checkpatch: update $declaration_macros, add uninitialized_var
checkpatch: warn on missing spaces in broken up quoted
checkpatch: fix false positives for --strict "space after cast" test
checkpatch: fix false positive MISSING_BREAK warnings with --file
checkpatch: add test for native c90 types in unusual order
checkpatch: add signed generic types
checkpatch: add short int to c variable types
checkpatch: add for_each tests to indentation and brace tests
checkpatch: fix brace style misuses of else and while
checkpatch: add --fix option for a couple OPEN_BRACE misuses
checkpatch: use the correct indentation for which()
checkpatch: add fix_insert_line and fix_delete_line helpers
checkpatch: add ability to insert and delete lines to patch/file
checkpatch: add an index variable for fixed lines
checkpatch: warn on break after goto or return with same tab indentation
checkpatch: emit a warning on file add/move/delete
checkpatch: add test for commit id formatting style in commit log
checkpatch: emit fewer kmalloc_array/kcalloc conversion warnings
checkpatch: improve "no space after cast" test
checkpatch: allow multiple const * types
...
All other add functions for lists have the new item as first argument
and the position where it is added as second argument. This was changed
for no good reason in this function and makes using it unnecessary
confusing.
The name was changed to hlist_add_behind() to cause unconverted code to
generate a compile error instead of using the wrong parameter order.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Ken Helias <kenhelias@firemail.de>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [intel driver bits]
Cc: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since a8afca032 (tcp: md5: protects md5sig_info with RCU) tcp_md5_do_lookup
doesn't require socket lock, rcu_read_lock is enough. Therefore socket lock is
no longer required for tcp_v{4,6}_inbound_md5_hash too, so we can move these
calls (wrapped with rcu_read_{,un}lock) before bh_lock_sock:
from tcp_v{4,6}_do_rcv to tcp_v{4,6}_rcv.
Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
A set of small fixes pointed out just after the merge:
- make tcp_tx_timestamp static
- make tcp_gso_tstamp static
- use before() to compare TCP seqno, instead of cast to u64
- add tstamp to tx_flags in GSO, instead of overwrite tx_flags
- record skb_shinfo(skb)->tskey for all timestamps, also HW.
- optimization in tcp_tx_timestamp:
call sock_tx_timestamp only if a tstamp option is set.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Fixes: 4ed2d765df ("net-timestamp: TCP timestamping")
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
"Highlights:
1) Steady transitioning of the BPF instructure to a generic spot so
all kernel subsystems can make use of it, from Alexei Starovoitov.
2) SFC driver supports busy polling, from Alexandre Rames.
3) Take advantage of hash table in UDP multicast delivery, from David
Held.
4) Lighten locking, in particular by getting rid of the LRU lists, in
inet frag handling. From Florian Westphal.
5) Add support for various RFC6458 control messages in SCTP, from
Geir Ola Vaagland.
6) Allow to filter bridge forwarding database dumps by device, from
Jamal Hadi Salim.
7) virtio-net also now supports busy polling, from Jason Wang.
8) Some low level optimization tweaks in pktgen from Jesper Dangaard
Brouer.
9) Add support for ipv6 address generation modes, so that userland
can have some input into the process. From Jiri Pirko.
10) Consolidate common TCP connection request code in ipv4 and ipv6,
from Octavian Purdila.
11) New ARP packet logger in netfilter, from Pablo Neira Ayuso.
12) Generic resizable RCU hash table, with intial users in netlink and
nftables. From Thomas Graf.
13) Maintain a name assignment type so that userspace can see where a
network device name came from (enumerated by kernel, assigned
explicitly by userspace, etc.) From Tom Gundersen.
14) Automatic flow label generation on transmit in ipv6, from Tom
Herbert.
15) New packet timestamping facilities from Willem de Bruijn, meant to
assist in measuring latencies going into/out-of the packet
scheduler, latency from TCP data transmission to ACK, etc"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1536 commits)
cxgb4 : Disable recursive mailbox commands when enabling vi
net: reduce USB network driver config options.
tg3: Modify tg3_tso_bug() to handle multiple TX rings
amd-xgbe: Perform phy connect/disconnect at dev open/stop
amd-xgbe: Use dma_set_mask_and_coherent to set DMA mask
net: sun4i-emac: fix memory leak on bad packet
sctp: fix possible seqlock seadlock in sctp_packet_transmit()
Revert "net: phy: Set the driver when registering an MDIO bus device"
cxgb4vf: Turn off SGE RX/TX Callback Timers and interrupts in PCI shutdown routine
team: Simplify return path of team_newlink
bridge: Update outdated comment on promiscuous mode
net-timestamp: ACK timestamp for bytestreams
net-timestamp: TCP timestamping
net-timestamp: SCHED timestamp on entering packet scheduler
net-timestamp: add key to disambiguate concurrent datagrams
net-timestamp: move timestamp flags out of sk_flags
net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
cxgb4i : Move stray CPL definitions to cxgb4 driver
tcp: reduce spurious retransmits due to transient SACK reneging
qlcnic: Initialize dcbnl_ops before register_netdev
...
Pull security subsystem updates from James Morris:
"In this release:
- PKCS#7 parser for the key management subsystem from David Howells
- appoint Kees Cook as seccomp maintainer
- bugfixes and general maintenance across the subsystem"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (94 commits)
X.509: Need to export x509_request_asymmetric_key()
netlabel: shorter names for the NetLabel catmap funcs/structs
netlabel: fix the catmap walking functions
netlabel: fix the horribly broken catmap functions
netlabel: fix a problem when setting bits below the previously lowest bit
PKCS#7: X.509 certificate issuer and subject are mandatory fields in the ASN.1
tpm: simplify code by using %*phN specifier
tpm: Provide a generic means to override the chip returned timeouts
tpm: missing tpm_chip_put in tpm_get_random()
tpm: Properly clean sysfs entries in error path
tpm: Add missing tpm_do_selftest to ST33 I2C driver
PKCS#7: Use x509_request_asymmetric_key()
Revert "selinux: fix the default socket labeling in sock_graft()"
X.509: x509_request_asymmetric_keys() doesn't need string length arguments
PKCS#7: fix sparse non static symbol warning
KEYS: revert encrypted key change
ima: add support for measuring and appraising firmware
firmware_class: perform new LSM checks
security: introduce kernel_fw_from_file hook
PKCS#7: Missing inclusion of linux/err.h
...
Conflicts:
drivers/net/Makefile
net/ipv6/sysctl_net_ipv6.c
Two ipv6_table_template[] additions overlap, so the index
of the ipv6_table[x] assignments needed to be adjusted.
In the drivers/net/Makefile case, we've gotten rid of the
garbage whereby we had to list every single USB networking
driver in the top-level Makefile, there is just one
"USB_NETWORKING" that guards everything.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add SOF_TIMESTAMPING_TX_ACK, a request for a tstamp when the last byte
in the send() call is acknowledged. It implements the feature for TCP.
The timestamp is generated when the TCP socket cumulative ACK is moved
beyond the tracked seqno for the first time. The feature ignores SACK
and FACK, because those acknowledge the specific byte, but not
necessarily the entire contents of the buffer up to that byte.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP timestamping extends SO_TIMESTAMPING to bytestreams.
Bytestreams do not have a 1:1 relationship between send() buffers and
network packets. The feature interprets a send call on a bytestream as
a request for a timestamp for the last byte in that send() buffer.
The choice corresponds to a request for a timestamp when all bytes in
the buffer have been sent. That assumption depends on in-order kernel
transmission. This is the common case. That said, it is possible to
construct a traffic shaping tree that would result in reordering.
The guarantee is strong, then, but not ironclad.
This implementation supports send and sendpages (splice). GSO replaces
one large packet with multiple smaller packets. This patch also copies
the option into the correct smaller packet.
This patch does not yet support timestamping on data in an initial TCP
Fast Open SYN, because that takes a very different data path.
If ID generation in ee_data is enabled, bytestream timestamps return a
byte offset, instead of the packet counter for datagrams.
The implementation supports a single timestamp per packet. It silenty
replaces requests for previous timestamps. To avoid missing tstamps,
flush the tcp queue by disabling Nagle, cork and autocork. Missing
tstamps can be detected by offset when the ee_data ID is enabled.
Implementation details:
- On GSO, the timestamping code can be included in the main loop. I
moved it into its own loop to reduce the impact on the common case
to a single branch.
- To avoid leaking the absolute seqno to userspace, the offset
returned in ee_data must always be relative. It is an offset between
an skb and sk field. The first is always set (also for GSO & ACK).
The second must also never be uninitialized. Only allow the ID
option on sockets in the ESTABLISHED state, for which the seqno
is available. Never reset it to zero (instead, move it to the
current seqno when reenabling the option).
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Datagrams timestamped on transmission can coexist in the kernel stack
and be reordered in packet scheduling. When reading looped datagrams
from the socket error queue it is not always possible to unique
correlate looped data with original send() call (for application
level retransmits). Even if possible, it may be expensive and complex,
requiring packet inspection.
Introduce a data-independent ID mechanism to associate timestamps with
send calls. Pass an ID alongside the timestamp in field ee_data of
sock_extended_err.
The ID is a simple 32 bit unsigned int that is associated with the
socket and incremented on each send() call for which software tx
timestamp generation is enabled.
The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to
avoid changing ee_data for existing applications that expect it 0.
The counter is reset each time the flag is reenabled. Reenabling
does not change the ID of already submitted data. It is possible
to receive out of order IDs if the timestamp stream is not quiesced
first.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit reduces spurious retransmits due to apparent SACK reneging
by only reacting to SACK reneging that persists for a short delay.
When a sequence space hole at snd_una is filled, some TCP receivers
send a series of ACKs as they apparently scan their out-of-order queue
and cumulatively ACK all the packets that have now been consecutiveyly
received. This is essentially misbehavior B in "Misbehaviors in TCP
SACK generation" ACM SIGCOMM Computer Communication Review, April
2011, so we suspect that this is from several common OSes (Windows
2000, Windows Server 2003, Windows XP). However, this issue has also
been seen in other cases, e.g. the netdev thread "TCP being hoodwinked
into spurious retransmissions by lack of timestamps?" from March 2014,
where the receiver was thought to be a BSD box.
Since snd_una would temporarily be adjacent to a previously SACKed
range in these scenarios, this receiver behavior triggered the Linux
SACK reneging code path in the sender. This led the sender to clear
the SACK scoreboard, enter CA_Loss, and spuriously retransmit
(potentially) every packet from the entire write queue at line rate
just a few milliseconds before the ACK for each packet arrives at the
sender.
To avoid such situations, now when a sender sees apparent reneging it
does not yet retransmit, but rather adjusts the RTO timer to give the
receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs
that will restore sanity to the SACK scoreboard. If the reneging
persists until this RTO then, as before, we clear the SACK scoreboard
and enter CA_Loss.
A 10ms delay tolerates a receiver sending such a stream of ACKs at
56Kbit/sec. And to allow for receivers with slower or more congested
paths, we wait for at least RTT/2.
We validated the resulting max(RTT/2, 10ms) delay formula with a mix
of North American and South American Google web server traffic, and
found that for ACKs displaying transient reneging:
(1) 90% of inter-ACK delays were less than 10ms
(2) 99% of inter-ACK delays were less than RTT/2
In tests on Google web servers this commit reduced reneging events by
75%-90% (as measured by the TcpExtTCPSACKReneging counter), without
any measurable impact on latency for user HTTP and SPDY requests.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcpm_key is an array inside struct tcp_md5sig, there is no need to check it
against NULL.
Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull cgroup changes from Tejun Heo:
"Mostly changes to get the v2 interface ready. The core features are
mostly ready now and I think it's reasonable to expect to drop the
devel mask in one or two devel cycles at least for a subset of
controllers.
- cgroup added a controller dependency mechanism so that block cgroup
can depend on memory cgroup. This will be used to finally support
IO provisioning on the writeback traffic, which is currently being
implemented.
- The v2 interface now uses a separate table so that the interface
files for the new interface are explicitly declared in one place.
Each controller will explicitly review and add the files for the
new interface.
- cpuset is getting ready for the hierarchical behavior which is in
the similar style with other controllers so that an ancestor's
configuration change doesn't change the descendants' configurations
irreversibly and processes aren't silently migrated when a CPU or
node goes down.
All the changes are to the new interface and no behavior changed for
the multiple hierarchies"
* 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
cpuset: fix the WARN_ON() in update_nodemasks_hier()
cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
cgroup: distinguish the default and legacy hierarchies when handling cftypes
cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
cpuset: export effective masks to userspace
cpuset: allow writing offlined masks to cpuset.cpus/mems
cpuset: enable onlined cpu/node in effective masks
cpuset: refactor cpuset_hotplug_update_tasks()
cpuset: make cs->{cpus, mems}_allowed as user-configured masks
cpuset: apply cs->effective_{cpus,mems}
cpuset: initialize top_cpuset's configured masks at mount
cpuset: use effective cpumask to build sched domains
cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
cpuset: update cs->effective_{cpus, mems} when config changes
cpuset: update cpuset->effective_{cpus,mems} at hotplug
cpuset: add cs->effective_cpus and cs->effective_mems
cgroup: clean up sane_behavior handling
...
Use kmem_cache to allocate/free inet_frag_queue objects since they're
all the same size per inet_frags user and are alloced/freed in high volumes
thus making it a perfect case for kmem_cache.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have INET_FRAG_EVICTED we might as well use it to stop
sending icmp messages in the "frag_expire" functions instead of
stripping INET_FRAG_FIRST_IN from their flags when evicting.
Also fix the comment style in ip6_expire_frag_queue().
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a couple of functions' declaration alignments.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The last_in field has been used to store various flags different from
first/last frag in so give it a more descriptive name: flags.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_local_deliver_finish() already have a rcu_read_lock/unlock, so
the rcu_read_lock/unlock is unnecessary.
See the stack below:
ip_local_deliver_finish
|
|
->icmp_rcv
|
|
->icmp_socket_deliver
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Historically the NetLabel LSM secattr catmap functions and data
structures have had very long names which makes a mess of the NetLabel
code and anyone who uses NetLabel. This patch renames the catmap
functions and structures from "*_secattr_catmap_*" to just "*_catmap_*"
which improves things greatly.
There are no substantial code or logic changes in this patch.
Signed-off-by: Paul Moore <pmoore@redhat.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
The NetLabel secattr catmap functions, and the SELinux import/export
glue routines, were broken in many horrible ways and the SELinux glue
code fiddled with the NetLabel catmap structures in ways that we
probably shouldn't allow. At some point this "worked", but that was
likely due to a bit of dumb luck and sub-par testing (both inflicted
by yours truly). This patch corrects these problems by basically
gutting the code in favor of something less obtuse and restoring the
NetLabel abstractions in the SELinux catmap glue code.
Everything is working now, and if it decides to break itself in the
future this code will be much easier to debug than the code it
replaces.
One noteworthy side effect of the changes is that it is no longer
necessary to allocate a NetLabel catmap before calling one of the
NetLabel APIs to set a bit in the catmap. NetLabel will automatically
allocate the catmap nodes when needed, resulting in less allocations
when the lowest bit is greater than 255 and less code in the LSMs.
Cc: stable@vger.kernel.org
Reported-by: Christian Evans <frodox@zoho.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
The NetLabel category (catmap) functions have a problem in that they
assume categories will be set in an increasing manner, e.g. the next
category set will always be larger than the last. Unfortunately, this
is not a valid assumption and could result in problems when attempting
to set categories less than the startbit in the lowest catmap node.
In some cases kernel panics and other nasties can result.
This patch corrects the problem by checking for this and allocating a
new catmap node instance and placing it at the front of the list.
Cc: stable@vger.kernel.org
Reported-by: Christian Evans <frodox@zoho.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
When dealing with ICMPv[46] Error Message, function icmp_socket_deliver()
and icmpv6_notify() do some valid checks on packet's length, but then some
protocols check packet's length redaudantly. So remove those duplicated
statements, and increase counter ICMP_MIB_INERRORS/ICMP6_MIB_INERRORS in
function icmp_socket_deliver() and icmpv6_notify() respectively.
In addition, add missed counter in udp6/udplite6 when socket is NULL.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains netfilter updates for net-next, they are:
1) Add the reject expression for the nf_tables bridge family, this
allows us to send explicit reject (TCP RST / ICMP dest unrech) to
the packets matching a rule.
2) Simplify and consolidate the nf_tables set dumping logic. This uses
netlink control->data to filter out depending on the request.
3) Perform garbage collection in xt_hashlimit using a workqueue instead
of a timer, which is problematic when many entries are in place in
the tables, from Eric Dumazet.
4) Remove leftover code from the removed ulog target support, from
Paul Bolle.
5) Dump unmodified flags in the netfilter packet accounting when resetting
counters, so userspace knows that a counter was in overquota situation,
from Alexey Perevalov.
6) Fix wrong usage of the bitwise functions in nfnetlink_acct, also from
Alexey.
7) Fix a crash when adding new set element with an empty NFTA_SET_ELEM_LIST
attribute.
This patchset also includes a couple of cleanups for xt_LED from
Duan Jiong and for nf_conntrack_ipv4 (using coccinelle) from
Himangi Saraogi.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
commit d23ff7016 (tcp: add generic netlink support for tcp_metrics) introduced
netlink support for the new tcp_metrics, however it restricted getting of
tcp_metrics to root user only. This is a change from how these values could
have been fetched when in the old route cache. Unless there's a legitimate
reason to restrict the reading of these values it would be better if normal
users could fetch them.
Cc: Julian Anastasov <ja@ssi.bg>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2014-07-30
This is the last pull request for ipsec-next before I'll be
off for two weeks starting on friday. David, can you please
take urgent ipsec patches directly into net/net-next during
this time?
1) Error handling simplifications for vti and vti6.
From Mathias Krause.
2) Remove a duplicate semicolon after a return statement.
From Christoph Paasch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In vegas we do a multiplication of the cwnd and the rtt. This
may overflow and thus their result is stored in a u64. However, we first
need to cast the cwnd so that actually 64-bit arithmetic is done.
Then, we need to do do_div to allow this to be used on 32-bit arches.
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Doug Leith <doug.leith@nuim.ie>
Fixes: 8d3a564da3 (tcp: tcp_vegas cong avoid fix)
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
In veno we do a multiplication of the cwnd and the rtt. This
may overflow and thus their result is stored in a u64. However, we first
need to cast the cwnd so that actually 64-bit arithmetic is done.
A first attempt at fixing 76f1017757 ([TCP]: TCP Veno congestion
control) was made by 159131149c (tcp: Overflow bug in Vegas), but it
failed to add the required cast in tcp_veno_cong_avoid().
Fixes: 76f1017757 ([TCP]: TCP Veno congestion control)
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ipv4 tunnels created with "local any remote $ip" didn't work properly since
7d442fab0 (ipv4: Cache dst in tunnels). 99% of packets sent via those tunnels
had src addr = 0.0.0.0. That was because only dst_entry was cached, although
fl4.saddr has to be cached too. Every time ip_tunnel_xmit used cached dst_entry
(tunnel_rtable_get returned non-NULL), fl4.saddr was initialized with
tnl_params->saddr (= 0 in our case), and wasn't changed until iptunnel_xmit().
This patch adds saddr to ip_tunnel->dst_cache, fixing this issue.
Reported-by: Sergey Popov <pinkbyte@gentoo.org>
Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparse warns because of implicit pointer cast.
v2: subject line correction, space between "void" and "*"
Signed-off-by: Karoly Kemeny <karoly.kemeny@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces the use of the macro IS_ERR_OR_NULL in place of
tests for NULL and IS_ERR.
The following Coccinelle semantic patch was used for making the change:
@@
expression e;
@@
- e == NULL || IS_ERR(e)
+ IS_ERR_OR_NULL(e)
|| ...
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces the use of the macro IS_ERR_OR_NULL in place of
tests for NULL and IS_ERR.
The following Coccinelle semantic patch was used for making the change:
@@
expression e;
@@
- e == NULL || IS_ERR(e)
+ IS_ERR_OR_NULL(e)
|| ...
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
We create a proc dir for each network device, this will cause
conflicts when the devices have name "all" or "default".
Rather than emitting an ugly kernel warning, we could just
fail earlier by checking the device name.
Reported-by: Stephane Chazelas <stephane.chazelas@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
Jedidiah describe ways exploiting linux IP identifier generation to
infer whether two machines are exchanging packets.
With commit 73f156a6e8 ("inetpeer: get rid of ip_id_count"), we
changed IP id generation, but this does not really prevent this
side-channel technique.
This patch adds a random amount of perturbation so that IP identifiers
for a given destination [1] are no longer monotonically increasing after
an idle period.
Note that prandom_u32_max(1) returns 0, so if generator is used at most
once per jiffy, this patch inserts no hole in the ID suite and do not
increase collision probability.
This is jiffies based, so in the worst case (HZ=1000), the id can
rollover after ~65 seconds of idle time, which should be fine.
We also change the hash used in __ip_select_ident() to not only hash
on daddr, but also saddr and protocol, so that ICMP probes can not be
used to infer information for other protocols.
For IPv6, adds saddr into the hash as well, but not nexthdr.
If I ping the patched target, we can see ID are now hard to predict.
21:57:11.008086 IP (...)
A > target: ICMP echo request, seq 1, length 64
21:57:11.010752 IP (... id 2081 ...)
target > A: ICMP echo reply, seq 1, length 64
21:57:12.013133 IP (...)
A > target: ICMP echo request, seq 2, length 64
21:57:12.015737 IP (... id 3039 ...)
target > A: ICMP echo reply, seq 2, length 64
21:57:13.016580 IP (...)
A > target: ICMP echo request, seq 3, length 64
21:57:13.019251 IP (... id 3437 ...)
target > A: ICMP echo reply, seq 3, length 64
[1] TCP sessions uses a per flow ID generator not changed by this patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jeffrey Knockel <jeffk@cs.unm.edu>
Reported-by: Jedidiah R. Crandall <crandall@cs.unm.edu>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Hannes Frederic Sowa <hannes@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes init_net's high_thresh limit to be the maximum for all
namespaces, thus introducing a global memory limit threshold equal to the
sum of the individual high_thresh limits which are capped.
It also introduces some sane minimums for low_thresh as it shouldn't be
able to drop below 0 (or > high_thresh in the unsigned case), and
overall low_thresh should not ever be above high_thresh, so we make the
following relations for a namespace:
init_net:
high_thresh - max(not capped), min(init_net low_thresh)
low_thresh - max(init_net high_thresh), min (0)
all other namespaces:
high_thresh = max(init_net high_thresh), min(namespace's low_thresh)
low_thresh = max(namespace's high_thresh), min(0)
The major issue with having low_thresh > high_thresh is that we'll
schedule eviction but never evict anything and thus rely only on the
timers.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rehash is rare operation, don't force readers to take
the read-side rwlock.
Instead, we only have to detect the (rare) case where
the secret was altered while we are trying to insert
a new inetfrag queue into the table.
If it was changed, drop the bucket lock and recompute
the hash to get the 'new' chain bucket that we have to
insert into.
Joint work with Nikolay Aleksandrov.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
merge functionality into the eviction workqueue.
Instead of rebuilding every n seconds, take advantage of the upper
hash chain length limit.
If we hit it, mark table for rebuild and schedule workqueue.
To prevent frequent rebuilds when we're completely overloaded,
don't rebuild more than once every 5 seconds.
ipfrag_secret_interval sysctl is now obsolete and has been marked as
deprecated, it still can be changed so scripts won't be broken but it
won't have any effect. A comment is left above each unused secret_timer
variable to avoid confusion.
Joint work with Nikolay Aleksandrov.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'nqueues' counter is protected by the lru list lock,
once thats removed this needs to be converted to atomic
counter. Given this isn't used for anything except for
reporting it to userspace via /proc, just remove it.
We still report the memory currently used by fragment
reassembly queues.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the high_thresh limit is reached we try to toss the 'oldest'
incomplete fragment queues until memory limits are below the low_thresh
value. This happens in softirq/packet processing context.
This has two drawbacks:
1) processors might evict a queue that was about to be completed
by another cpu, because they will compete wrt. resource usage and
resource reclaim.
2) LRU list maintenance is expensive.
But when constantly overloaded, even the 'least recently used' element is
recent, so removing 'lru' queue first is not 'fairer' than removing any
other fragment queue.
This moves eviction out of the fast path:
When the low threshold is reached, a work queue is scheduled
which then iterates over the table and removes the queues that exceed
the memory limits of the namespace. It sets a new flag called
INET_FRAG_EVICTED on the evicted queues so the proper counters will get
incremented when the queue is forcefully expired.
When the high threshold is reached, no more fragment queues are
created until we're below the limit again.
The LRU list is now unused and will be removed in a followup patch.
Joint work with Nikolay Aleksandrov.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
First step to move eviction handling into a work queue.
We lose two spots that accounted evicted fragments in MIB counters.
Accounting will be restored since the upcoming work-queue evictor
invokes the frag queue timer callbacks instead.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
hide actual hash size from individual users: The _find
function will now fold the given hash value into the required range.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ulog targets were recently killed. A few references to the Kconfig
macros CONFIG_IP_NF_TARGET_ULOG and CONFIG_BRIDGE_EBT_ULOG were left
untouched. Kill these too.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In this file, function names are otherwise used as pointers without &.
A simplified version of the Coccinelle semantic patch that makes this
change is as follows:
// <smpl>
@r@
identifier f;
@@
f(...) { ... }
@@
identifier r.f;
@@
- &f
+ f
// </smpl>
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In this file, function names are otherwise used as pointers without &.
A simplified version of the Coccinelle semantic patch that makes this
change is as follows:
// <smpl>
@r@
identifier f;
@@
f(...) { ... }
@@
identifier r.f;
@@
- &f
+ f
// </smpl>
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, although IP_MULTICAST_ALL and IP_MSFILTER ioctl calls succeed on
raw sockets, there is no code to implement the functionality on received
packets; it is only implemented for UDP sockets. The raw(7) man page states:
"In addition, all ip(7) IPPROTO_IP socket options valid for datagram sockets
are supported", which implies these ioctls should work on raw sockets.
To fix this, add a call to ip_mc_sf_allow on raw sockets.
This should not break any existing code, since the current position of
not calling ip_mc_sf_filter makes it behave as if neither the IP_MULTICAST_ALL
nor the IP_MSFILTER ioctl had been called. Adding the call to ip_mc_sf_allow
will therefore maintain the current behaviour so long as IP_MULTICAST_ALL and
IP_MSFILTER ioctls are not called. Any code that currently is calling
IP_MULTICAST_ALL or IP_MSFILTER ioctls on raw sockets presumably is wanting
the filter to be applied, although no filtering will currently be occurring.
Signed-off-by: Quentin Armitage <quentin@armitage.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
It hasn't been used since commit 0fd7bac(net: relax rcvbuf limits).
Signed-off-by: Sorin Dumitru <sorin@returnze.ro>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/infiniband/hw/cxgb4/device.c
The cxgb4 conflict was simply overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains updates for your net-next tree,
they are:
1) Use kvfree() helper function from x_tables, from Eric Dumazet.
2) Remove extra timer from the conntrack ecache extension, use a
workqueue instead to redeliver lost events to userspace instead,
from Florian Westphal.
3) Removal of the ulog targets for ebtables and iptables. The nflog
infrastructure superseded this almost 9 years ago, time to get rid
of this code.
4) Replace the list of loggers by an array now that we can only have
two possible non-overlapping logger flavours, ie. kernel ring buffer
and netlink logging.
5) Move Eric Dumazet's log buffer code to nf_log to reuse it from
all of the supported per-family loggers.
6) Consolidate nf_log_packet() as an unified interface for packet logging.
After this patch, if the struct nf_loginfo is available, it explicitly
selects the logger that is used.
7) Move ip and ip6 logging code from xt_LOG to the corresponding
per-family loggers. Thus, x_tables and nf_tables share the same code
for packet logging.
8) Add generic ARP packet logger, which is used by nf_tables. The
format aims to be consistent with the output of xt_LOG.
9) Add generic bridge packet logger. Again, this is used by nf_tables
and it routes the packets to the real family loggers. As a result,
we get consistent logging format for the bridge family. The ebt_log
logging code has been intentionally left in place not to break
backward compatibility since the logging output differs from xt_LOG.
10) Update nft_log to explicitly request the required family logger when
needed.
11) Finish nft_log so it supports arp, ip, ip6, bridge and inet families.
Allowing selection between netlink and kernel buffer ring logging.
12) Several fixes coming after the netfilter core logging changes spotted
by robots.
13) Use IS_ENABLED() macros whenever possible in the netfilter tree,
from Duan Jiong.
14) Removal of a couple of unnecessary branch before kfree, from Fabian
Frederick.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Many multicast sources can have the same port which can result in a very
large list when hashing by port only. Hash by address and port instead
if this is the case. This makes multicast more similar to unicast.
On a 24-core machine receiving from 500 multicast sockets on the same
port, before this patch 80% of system CPU was used up by spin locking
and only ~25% of packets were successfully delivered.
With this patch, all packets are delivered and kernel overhead is ~8%
system CPU on spinlocks.
Signed-off-by: David Held <drheld@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch to using sk_nulls_for_each which shortens the code and makes it
easier to update.
Signed-off-by: David Held <drheld@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixed a bug that was introduced by my GRE-GRO patch
(bf5a755f5e net-gre-gro: Add GRE
support to the GRO stack) that breaks the forwarding path
because various GSO related fields were not set. The bug will
cause on the egress path either the GSO code to fail, or a
GRE-TSO capable (NETIF_F_GSO_GRE) NICs to choke. The following
fix has been tested for both cases.
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add SO_TIMESTAMPING to sockets of type PF_INET[6]/SOCK_RAW:
Add the necessary sock_tx_timestamp calls to the datapath for RAW
sockets (ping sockets already had these calls).
Fix the IP output path to pass the timestamp flags on the first
fragment also for these sockets. The existing code relies on
transhdrlen != 0 to indicate a first fragment. For these sockets,
that assumption does not hold.
This fixes http://bugzilla.kernel.org/show_bug.cgi?id=77221
Tested SOCK_RAW on IPv4 and IPv6, not PING.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since Yuchung's 9b44190dc1 (tcp: refactor F-RTO), tcp_enter_cwr is always
called with set_ssthresh = 1. Thus, we can remove this argument from
tcp_enter_cwr. Further, as we remove this one, tcp_init_cwnd_reduction
is then always called with set_ssthresh = true, and so we can get rid of
this argument as well.
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, cftypes added by cgroup_add_cftypes() are used for both the
unified default hierarchy and legacy ones and subsystems can mark each
file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
appear only on one of them. This is quite hairy and error-prone.
Also, we may end up exposing interface files to the default hierarchy
without thinking it through.
cgroup_subsys will grow two separate cftype addition functions and
apply each only on the hierarchies of the matching type. This will
allow organizing cftypes in a lot clearer way and encourage subsystems
to scrutinize the interface which is being exposed in the new default
hierarchy.
In preparation, this patch adds cgroup_add_legacy_cftypes() which
currently is a simple wrapper around cgroup_add_cftypes() and replaces
all cgroup_add_cftypes() usages with it.
While at it, this patch drops a completely spurious return from
__hugetlb_cgroup_file_init().
This patch doesn't introduce any functional differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Added udp_tunnel.c which can contain some common functions for UDP
tunnels. The first function in this is udp_sock_create which is used
to open the listener port for a UDP tunnel.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To get offloads to work with Generic Routing Encapsulation (GRE), the
outer transport header has to be reset after skb_push is done. This
patch has the support for this fix and hence GRE offloading.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Tested-By: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is only tested, and declared, in the bootp code.
So, in ic_dynamic() guard it's setting with IPCONFIG_BOOTP.
Signed-off-by: David S. Miller <davem@davemloft.net>
ic_dev_xid is only used in __init ic_bootp_recv under IPCONFIG_BOOTP
and __init ic_dynamic under IPCONFIG_DYNAMIC(which is itself defined
with the same IPCONFIG_BOOTP)
This patch fixes the following warning when IPCONFIG_BOOTP is not set:
>> net/ipv4/ipconfig.c:146:15: warning: 'ic_dev_xid' defined but not used [-Wunused-variable]
static __be32 ic_dev_xid; /* Device under configuration */
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: netdev@vger.kernel.org
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes 3 similar bugs where incoming packets might be routed into
wrong non-wildcard tunnels:
1) Consider the following setup:
ip address add 1.1.1.1/24 dev eth0
ip address add 1.1.1.2/24 dev eth0
ip tunnel add ipip1 remote 2.2.2.2 local 1.1.1.1 mode ipip dev eth0
ip link set ipip1 up
Incoming ipip packets from 2.2.2.2 were routed into ipip1 even if it has dst =
1.1.1.2. Moreover even if there was wildcard tunnel like
ip tunnel add ipip0 remote 2.2.2.2 local any mode ipip dev eth0
but it was created before explicit one (with local 1.1.1.1), incoming ipip
packets with src = 2.2.2.2 and dst = 1.1.1.2 were still routed into ipip1.
Same issue existed with all tunnels that use ip_tunnel_lookup (gre, vti)
2) ip address add 1.1.1.1/24 dev eth0
ip tunnel add ipip1 remote 2.2.146.85 local 1.1.1.1 mode ipip dev eth0
ip link set ipip1 up
Incoming ipip packets with dst = 1.1.1.1 were routed into ipip1, no matter what
src address is. Any remote ip address which has ip_tunnel_hash = 0 raised this
issue, 2.2.146.85 is just an example, there are more than 4 million of them.
And again, wildcard tunnel like
ip tunnel add ipip0 remote any local 1.1.1.1 mode ipip dev eth0
wouldn't be ever matched if it was created before explicit tunnel like above.
Gre & vti tunnels had the same issue.
3) ip address add 1.1.1.1/24 dev eth0
ip tunnel add gre1 remote 2.2.146.84 local 1.1.1.1 key 1 mode gre dev eth0
ip link set gre1 up
Any incoming gre packet with key = 1 were routed into gre1, no matter what
src/dst addresses are. Any remote ip address which has ip_tunnel_hash = 0 raised
the issue, 2.2.146.84 is just an example, there are more than 4 million of them.
Wildcard tunnel like
ip tunnel add gre2 remote any local any key 1 mode gre dev eth0
wouldn't be ever matched if it was created before explicit tunnel like above.
All this stuff happened because while looking for a wildcard tunnel we didn't
check that matched tunnel is a wildcard one. Fixed.
Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ic_dev_xid is only used in ipconfig.c
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: netdev@vger.kernel.org
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
The undo code assumes that, upon entering loss recovery, TCP
1) always retransmit something
2) the retransmission never fails locally (e.g., qdisc drop)
so undo_marker is set in tcp_enter_recovery() and undo_retrans is
incremented only when tcp_retransmit_skb() is successful.
When the assumption is broken because TCP's cwnd is too small to
retransmit or the retransmit fails locally. The next (DUP)ACK
would incorrectly revert the cwnd and the congestion state in
tcp_try_undo_dsack() or tcp_may_undo(). Subsequent (DUP)ACKs
may enter the recovery state. The sender repeatedly enter and
(incorrectly) exit recovery states if the retransmits continue to
fail locally while receiving (DUP)ACKs.
The fix is to initialize undo_retrans to -1 and start counting on
the first retransmission. Always increment undo_retrans even if the
retransmissions fail locally because they couldn't cause DSACKs to
undo the cwnd reduction.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The problem was triggered by these steps:
1) create socket, bind and then setsockopt for add mc group.
mreq.imr_multiaddr.s_addr = inet_addr("255.0.0.37");
mreq.imr_interface.s_addr = inet_addr("192.168.1.2");
setsockopt(sockfd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
2) drop the mc group for this socket.
mreq.imr_multiaddr.s_addr = inet_addr("255.0.0.37");
mreq.imr_interface.s_addr = inet_addr("0.0.0.0");
setsockopt(sockfd, IPPROTO_IP, IP_DROP_MEMBERSHIP, &mreq, sizeof(mreq));
3) and then drop the socket, I found the mc group was still used by the dev:
netstat -g
Interface RefCnt Group
--------------- ------ ---------------------
eth2 1 255.0.0.37
Normally even though the IP_DROP_MEMBERSHIP return error, the mc group still need
to be released for the netdev when drop the socket, but this process was broken when
route default is NULL, the reason is that:
The ip_mc_leave_group() will choose the in_dev by the imr_interface.s_addr, if input addr
is NULL, the default route dev will be chosen, then the ifindex is got from the dev,
then polling the inet->mc_list and return -ENODEV, but if the default route dev is NULL,
the in_dev and ifIndex is both NULL, when polling the inet->mc_list, the mc group will be
released from the mc_list, but the dev didn't dec the refcnt for this mc group, so
when dropping the socket, the mc_list is NULL and the dev still keep this group.
v1->v2: According Hideaki's suggestion, we should align with IPv6 (RFC3493) and BSDs,
so I add the checking for the in_dev before polling the mc_list, make sure when
we remove the mc group, dec the refcnt to the real dev which was using the mc address.
The problem would never happened again.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For a connected socket we can precompute the flow hash for setting
in skb->hash on output. This is a performance advantage over
calculating the skb->hash for every packet on the connection. The
computation is done using the common hash algorithm to be consistent
with computations done for packets of the connection in other states
where thers is no socket (e.g. time-wait, syn-recv, syn-cookies).
This patch adds sk_txhash to the sock structure. inet_set_txhash and
ip6_set_txhash functions are added which are called from points in
TCP and UDP where socket moves to established state.
skb_set_hash_from_sk is a function which sets skb->hash from the
sock txhash value. This is called in UDP and TCP transmit path when
transmitting within the context of a socket.
Tested: ran super_netperf with 200 TCP_RR streams over a vxlan
interface (in this case skb_get_hash called on every TX packet to
create a UDP source port).
Before fix:
95.02% CPU utilization
154/256/505 90/95/99% latencies
1.13042e+06 tps
Time in functions:
0.28% skb_flow_dissect
0.21% __skb_get_hash
After fix:
94.95% CPU utilization
156/254/485 90/95/99% latencies
1.15447e+06
Neither __skb_get_hash nor skb_flow_dissect appear in perf
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Always store in snt_synack the time at which the server received the
first client SYN and attempted to send the first SYNACK.
Recent commit aa27fc501 ("tcp: tcp_v[46]_conn_request: fix snt_synack
initialization") resolved an inconsistency between IPv4 and IPv6 in
the initialization of snt_synack. This commit brings back the idea
from 843f4a55e (tcp: use tcp_v4_send_synack on first SYN-ACK), which
was going for the original behavior of snt_synack from the commit
where it was added in 9ad7c049f0 ("tcp: RFC2988bis + taking RTT
sample from 3WHS for the passive open side") in v3.1.
In addition to being simpler (and probably a tiny bit faster),
unconditionally storing the time of the first SYNACK attempt has been
useful because it allows calculating a performance metric quantifying
how long it took to establish a passive TCP connection.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Cc: Octavian Purdila <octavian.purdila@intel.com>
Cc: Jerry Chu <hkchu@google.com>
Acked-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some older router implementations still send Fragmentation Needed
errors with the Next-Hop MTU field set to zero. This is explicitly
described as an eventuality that hosts must deal with by the
standard (RFC 1191) since older standards specified that those
bits must be zero.
Linux had a generic (for all of IPv4) implementation of the algorithm
described in the RFC for searching a list of MTU plateaus for a good
value. Commit 46517008e1 ("ipv4: Kill ip_rt_frag_needed().")
removed this as part of the changes to remove the routing cache.
Subsequently any Fragmentation Needed packet with a zero Next-Hop
MTU has been discarded without being passed to the per-protocol
handlers or notifying userspace for raw sockets.
When there is a router which does not implement RFC 1191 on an
MTU limited path then this results in stalled connections since
large packets are discarded and the local protocols are not
notified so they never attempt to lower the pMTU.
One example I have seen is an OpenBSD router terminating IPSec
tunnels. It's worth pointing out that this case is distinct from
the BSD 4.2 bug which incorrectly calculated the Next-Hop MTU
since the commit in question dismissed that as a valid concern.
All of the per-protocols handlers implement the simple approach from
RFC 1191 of immediately falling back to the minimum value. Although
this is sub-optimal it is vastly preferable to connections hanging
indefinitely.
Remove the Next-Hop MTU != 0 check and allow such packets
to follow the normal path.
Fixes: 46517008e1 ("ipv4: Kill ip_rt_frag_needed().")
Signed-off-by: Edward Allcutt <edward.allcutt@openmarket.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When in repair-mode and TCP_RECV_QUEUE is set, we end up calling
tcp_push with mss_now being 0. If data is in the send-queue and
tcp_set_skb_tso_segs gets called, we crash because it will divide by
mss_now:
[ 347.151939] divide error: 0000 [#1] SMP
[ 347.152907] Modules linked in:
[ 347.152907] CPU: 1 PID: 1123 Comm: packetdrill Not tainted 3.16.0-rc2 #4
[ 347.152907] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 347.152907] task: f5b88540 ti: f3c82000 task.ti: f3c82000
[ 347.152907] EIP: 0060:[<c1601359>] EFLAGS: 00210246 CPU: 1
[ 347.152907] EIP is at tcp_set_skb_tso_segs+0x49/0xa0
[ 347.152907] EAX: 00000b67 EBX: f5acd080 ECX: 00000000 EDX: 00000000
[ 347.152907] ESI: f5a28f40 EDI: f3c88f00 EBP: f3c83d10 ESP: f3c83d00
[ 347.152907] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 347.152907] CR0: 80050033 CR2: 083158b0 CR3: 35146000 CR4: 000006b0
[ 347.152907] Stack:
[ 347.152907] c167f9d9 f5acd080 000005b4 00000002 f3c83d20 c16013e6 f3c88f00 f5acd080
[ 347.152907] f3c83da0 c1603b5a f3c83d38 c10a0188 00000000 00000000 f3c83d84 c10acc85
[ 347.152907] c1ad5ec0 00000000 00000000 c1ad679c 010003e0 00000000 00000000 f3c88fc8
[ 347.152907] Call Trace:
[ 347.152907] [<c167f9d9>] ? apic_timer_interrupt+0x2d/0x34
[ 347.152907] [<c16013e6>] tcp_init_tso_segs+0x36/0x50
[ 347.152907] [<c1603b5a>] tcp_write_xmit+0x7a/0xbf0
[ 347.152907] [<c10a0188>] ? up+0x28/0x40
[ 347.152907] [<c10acc85>] ? console_unlock+0x295/0x480
[ 347.152907] [<c10ad24f>] ? vprintk_emit+0x1ef/0x4b0
[ 347.152907] [<c1605716>] __tcp_push_pending_frames+0x36/0xd0
[ 347.152907] [<c15f4860>] tcp_push+0xf0/0x120
[ 347.152907] [<c15f7641>] tcp_sendmsg+0xf1/0xbf0
[ 347.152907] [<c116d920>] ? kmem_cache_free+0xf0/0x120
[ 347.152907] [<c106a682>] ? __sigqueue_free+0x32/0x40
[ 347.152907] [<c106a682>] ? __sigqueue_free+0x32/0x40
[ 347.152907] [<c114f0f0>] ? do_wp_page+0x3e0/0x850
[ 347.152907] [<c161c36a>] inet_sendmsg+0x4a/0xb0
[ 347.152907] [<c1150269>] ? handle_mm_fault+0x709/0xfb0
[ 347.152907] [<c15a006b>] sock_aio_write+0xbb/0xd0
[ 347.152907] [<c1180b79>] do_sync_write+0x69/0xa0
[ 347.152907] [<c1181023>] vfs_write+0x123/0x160
[ 347.152907] [<c1181d55>] SyS_write+0x55/0xb0
[ 347.152907] [<c167f0d8>] sysenter_do_call+0x12/0x28
This can easily be reproduced with the following packetdrill-script (the
"magic" with netem, sk_pacing and limit_output_bytes is done to prevent
the kernel from pushing all segments, because hitting the limit without
doing this is not so easy with packetdrill):
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 32792 <mss 1460>
+0 > S. 0:0(0) ack 1 <mss 1460>
+0.1 < . 1:1(0) ack 1 win 65000
+0 accept(3, ..., ...) = 4
// This forces that not all segments of the snd-queue will be pushed
+0 `tc qdisc add dev tun0 root netem delay 10ms`
+0 `sysctl -w net.ipv4.tcp_limit_output_bytes=2`
+0 setsockopt(4, SOL_SOCKET, 47, [2], 4) = 0
+0 write(4,...,10000) = 10000
+0 write(4,...,10000) = 10000
// Set tcp-repair stuff, particularly TCP_RECV_QUEUE
+0 setsockopt(4, SOL_TCP, 19, [1], 4) = 0
+0 setsockopt(4, SOL_TCP, 20, [1], 4) = 0
// This now will make the write push the remaining segments
+0 setsockopt(4, SOL_SOCKET, 47, [20000], 4) = 0
+0 `sysctl -w net.ipv4.tcp_limit_output_bytes=130000`
// Now we will crash
+0 write(4,...,1000) = 1000
This happens since ec34232575 (tcp: fix retransmission in repair
mode). Prior to that, the call to tcp_push was prevented by a check for
tp->repair.
The patch fixes it, by adding the new goto-label out_nopush. When exiting
tcp_sendmsg and a push is not required, which is the case for tp->repair,
we go to this label.
When repairing and calling send() with TCP_RECV_QUEUE, the data is
actually put in the receive-queue. So, no push is required because no
data has been added to the send-queue.
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Fixes: ec34232575 (tcp: fix retransmission in repair mode)
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Acked-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an UDP application switches from AF_INET to AF_INET6 sockets, we
have a small performance degradation for IPv4 communications because of
extra cache line misses to access ipv6only information.
This can also be noticed for TCP listeners, as ipv6_only_sock() is also
used from __inet_lookup_listener()->compute_score()
This is magnified when SO_REUSEPORT is used.
Move ipv6only into struct sock_common so that it is available at
no extra cost in lookups.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have two different ways to handle changes to sk->sk_dst
First way (used by TCP) assumes socket lock is owned by caller, and use
no extra lock : __sk_dst_set() & __sk_dst_reset()
Another way (used by UDP) uses sk_dst_lock because socket lock is not
always taken. Note that sk_dst_lock is not softirq safe.
These ways are not inter changeable for a given socket type.
ipv4_sk_update_pmtu(), added in linux-3.8, added a race, as it used
the socket lock as synchronization, but users might be UDP sockets.
Instead of converting sk_dst_lock to a softirq safe version, use xchg()
as we did for sk_rx_dst in commit e47eb5dfb2 ("udp: ipv4: do not use
sk_dst_lock from softirq context")
In a follow up patch, we probably can remove sk_dst_lock, as it is
only used in IPv6.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Fixes: 9cb3a50c5f ("ipv4: Invalidate the socket cached route on pmtu events if possible")
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes build error introduced by commit 1fb6f159fd (tcp: add
tcp_conn_request):
net/ipv4/tcp_input.c: In function 'pr_drop_req':
net/ipv4/tcp_input.c:5889:130: error: 'struct sock_common' has no member named 'skc_v6_daddr'
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3328715e6c (xfrm4: Add IPsec protocol multiplexer) adds a
duplicate semicolon after the return-statement.
Although it has no negative impact, the second semicolon should be
removed.
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Create tcp_conn_request and remove most of the code from
tcp_v4_conn_request and tcp_v6_conn_request.
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add queue_add_hash member to tcp_request_sock_ops so that we can later
unify tcp_v4_conn_request and tcp_v6_conn_request.
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add mss_clamp member to tcp_request_sock_ops so that we can later
unify tcp_v4_conn_request and tcp_v6_conn_request.
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a new tcp_request_sock_ops method to unify the IPv4/IPv6
signature for tcp_v[46]_send_synack. This allows us to later unify
tcp_v4_rtx_synack with tcp_v6_rtx_synack and tcp_v4_conn_request with
tcp_v4_conn_request.
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
More work in preparation of unifying tcp_v4_conn_request and
tcp_v6_conn_request: indirect the init sequence calls via the
tcp_request_sock_ops.
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create wrappers with same signature for the IPv4/IPv6 request routing
calls and use these wrappers (via route_req method from
tcp_request_sock_ops) in tcp_v4_conn_request and tcp_v6_conn_request
with the purpose of unifying the two functions in a later patch.
We can later drop the wrapper functions and modify inet_csk_route_req
and inet6_cks_route_req to use the same signature.
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the specific IPv4/IPv6 cookie sequence initialization to a new
method in tcp_request_sock_ops in preparation for unifying
tcp_v4_conn_request and tcp_v6_conn_request.
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the specific IPv4/IPv6 intializations to a new method in
tcp_request_sock_ops in preparation for unifying tcp_v4_conn_request
and tcp_v6_conn_request.
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 016818d07 (tcp: TCP Fast Open Server - take SYNACK RTT after
completing 3WHS) changes the code to only take a snt_synack timestamp
when a SYNACK transmit or retransmit succeeds. This behaviour is later
broken by commit 843f4a55e (tcp: use tcp_v4_send_synack on first
SYN-ACK), as snt_synack is now updated even if tcp_v4_send_synack
fails.
Also, commit 3a19ce0ee (tcp: IPv6 support for fastopen server) misses
the required IPv6 updates for 016818d07.
This patch makes sure that snt_synack is updated only when the SYNACK
trasnmit/retransmit succeeds, for both IPv4 and IPv6.
Cc: Cardwell <ncardwell@google.com>
Cc: Daniel Lee <longinus00@gmail.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds the generic plain text packet loggger for ARP packets. It is
based on the ebt_log code. Nevertheless, the output has been modified
to make it consistent with the original xt_LOG output.
This is an example output:
IN=wlan0 OUT= ARP HTYPE=1 PTYPE=0x0800 OPCODE=2 MACSRC=00🆎12:34:55:63 IPSRC=192.168.10.1 MACDST=80:09:12:70:4f:50 IPDST=192.168.10.150
This patch enables packet logging from ARP chains, eg.
nft add rule arp filter input log prefix "input: "
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Before this patch, the nf_loginfo parameter specified the logging
configuration in case the specified default logger was loaded. This
patch updates the semantics of the nf_loginfo parameter in
nf_log_packet() which now indicates the logger that you explicitly
want to use.
Thus, nf_log_packet() is exposed as an unified interface which
internally routes the log message to the corresponding logger type
by family.
The module dependencies are expressed by the new nf_logger_find_get()
and nf_logger_put() functions which bump the logger module refcount.
Thus, you can not remove logger modules that are used by rules anymore.
Another important effect of this change is that the family specific
module is only loaded when required. Therefore, xt_LOG and nft_log
will just trigger the autoload of the nf_log_{ip,ip6} modules
according to the family.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The plain text logging is currently embedded into the xt_LOG target.
In order to be able to use the plain text logging from nft_log, as a
first step, this patch moves the family specific code to the following
files and Kconfig symbols:
1) net/ipv4/netfilter/nf_log_ip.c: CONFIG_NF_LOG_IPV4
2) net/ipv6/netfilter/nf_log_ip6.c: CONFIG_NF_LOG_IPV6
3) net/netfilter/nf_log_common.c: CONFIG_NF_LOG_COMMON
These new modules will be required by xt_LOG and nft_log. This patch
is based on original patch from Arturo Borrero Gonzalez.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add MIB counters for rcvbuferrors in UDP to help diagnose problems.
Signed-off-by: James M Leddy <james.leddy@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The error handling in the module init and exit functions can be
shortened to safe us some code.
1/ Remove the code duplications in the init function, jump straight to
the existing cleanup code by adding some labels. Also give the error
message some more value by telling the reason why loading the module has
failed. Furthermore fix the "IPSec" typo -- it should be "IPsec" instead.
2/ Remove the error handling in the exit function as the only legitimate
reason xfrm4_protocol_deregister() might fail is inet_del_protocol()
returning -1. That, in turn, means some other protocol handler had been
registered for this very protocol in the meantime. But that essentially
means we haven't been handling that protocol any more, anyway. What it
definitely means not is that we "can't deregister tunnel". Therefore
just get rid of that bogus warning. It's plain wrong.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
When IP route cache had been removed in linux-3.6, we broke assumption
that dst entries were all freed after rcu grace period. DST_NOCACHE
dst were supposed to be freed from dst_release(). But it appears
we want to keep such dst around, either in UDP sockets or tunnels.
In sk_dst_get() we need to make sure dst refcount is not 0
before incrementing it, or else we might end up freeing a dst
twice.
DST_NOCACHE set on a dst does not mean this dst can not be attached
to a socket or a tunnel.
Then, before actual freeing, we need to observe a rcu grace period
to make sure all other cpus can catch the fact the dst is no longer
usable.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dormando <dormando@rydia.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This has been marked as deprecated for quite some time and the NFLOG
target replacement has been also available since 2006.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If there is an MSS change (or misbehaving receiver) that causes a SACK
to arrive that covers the end of an skb but is less than one MSS, then
tcp_match_skb_to_sack() was rounding up pkt_len to the full length of
the skb ("Round if necessary..."), then chopping all bytes off the skb
and creating a zero-byte skb in the write queue.
This was visible now because the recently simplified TLP logic in
bef1909ee3 ("tcp: fixing TLP's FIN recovery") could find that 0-byte
skb at the end of the write queue, and now that we do not check that
skb's length we could send it as a TLP probe.
Consider the following example scenario:
mss: 1000
skb: seq: 0 end_seq: 4000 len: 4000
SACK: start_seq: 3999 end_seq: 4000
The tcp_match_skb_to_sack() code will compute:
in_sack = false
pkt_len = start_seq - TCP_SKB_CB(skb)->seq = 3999 - 0 = 3999
new_len = (pkt_len / mss) * mss = (3999/1000)*1000 = 3000
new_len += mss = 4000
Previously we would find the new_len > skb->len check failing, so we
would fall through and set pkt_len = new_len = 4000 and chop off
pkt_len of 4000 from the 4000-byte skb, leaving a 0-byte segment
afterward in the write queue.
With this new commit, we notice that the new new_len >= skb->len check
succeeds, so that we return without trying to fragment.
Fixes: adb92db857 ("tcp: Make SACK code to split only at mss boundaries")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Jarvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ir_mark initialization is done for both TCP v4 and v6, move it in the
common tcp_openreq_init function.
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This variable is overwritten by the child socket assignment before
it ever gets used.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Its too easy to add thousand of UDP sockets on a particular bucket,
and slow down an innocent multicast receiver.
Early demux is supposed to be an optimization, we should avoid spending
too much time in it.
It is interesting to note __udp4_lib_demux_lookup() only tries to
match first socket in the chain.
10 is the threshold we already have in __udp4_lib_lookup() to switch
to secondary hash.
Fixes: 421b3885bf ("udp: ipv4: Add udp early demux")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: David Held <drheld@google.com>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.
2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
Benniston.
3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
Mork.
4) BPF now has a "random" opcode, from Chema Gonzalez.
5) Add more BPF documentation and improve test framework, from Daniel
Borkmann.
6) Support TCP fastopen over ipv6, from Daniel Lee.
7) Add software TSO helper functions and use them to support software
TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.
8) Support software TSO in fec driver too, from Nimrod Andy.
9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.
10) Handle broadcasts more gracefully over macvlan when there are large
numbers of interfaces configured, from Herbert Xu.
11) Allow more control over fwmark used for non-socket based responses,
from Lorenzo Colitti.
12) Do TCP congestion window limiting based upon measurements, from Neal
Cardwell.
13) Support busy polling in SCTP, from Neal Horman.
14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.
15) Bridge promisc mode handling improvements from Vlad Yasevich.
16) Don't use inetpeer entries to implement ID generation any more, it
performs poorly, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
tcp: fixing TLP's FIN recovery
net: fec: Add software TSO support
net: fec: Add Scatter/gather support
net: fec: Increase buffer descriptor entry number
net: fec: Factorize feature setting
net: fec: Enable IP header hardware checksum
net: fec: Factorize the .xmit transmit function
bridge: fix compile error when compiling without IPv6 support
bridge: fix smatch warning / potential null pointer dereference
via-rhine: fix full-duplex with autoneg disable
bnx2x: Enlarge the dorq threshold for VFs
bnx2x: Check for UNDI in uncommon branch
bnx2x: Fix 1G-baseT link
bnx2x: Fix link for KR with swapped polarity lane
sctp: Fix sk_ack_backlog wrap-around problem
net/core: Add VF link state control policy
net/fsl: xgmac_mdio is dependent on OF_MDIO
net/fsl: Make xgmac_mdio read error message useful
net_sched: drr: warn when qdisc is not work conserving
...
Fix to a problem observed when losing a FIN segment that does not
contain data. In such situations, TLP is unable to recover from
*any* tail loss and instead adds at least PTO ms to the
retransmission process, i.e., RTO = RTO + PTO.
Signed-off-by: Per Hurtig <per.hurtig@kau.se>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/core/rtnetlink.c
net/core/skbuff.c
Both conflicts were very simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Need to gro_postpull_rcsum for GRO to work with checksum complete.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In skb_checksum complete, if we need to compute the checksum for the
packet (via skb_checksum) save the result as CHECKSUM_COMPLETE.
Subsequent checksum verification can use this.
Also, added csum_complete_sw flag to distinguish between software and
hardware generated checksum complete, we should always be able to trust
the software computation.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some tunnels (though only vti as for now) can use i_key just for internal use:
for example vti uses it for fwmark'ing incoming packets. So raw i_key value
shouldn't be treated as a distinguisher for them. ip_tunnel_key_match exists for
cases when we want to compare two ip_tunnel_parms' i_keys.
Example bug:
ip link add type vti ikey 1 local 1.0.0.1 remote 2.0.0.2
ip link add type vti ikey 2 local 1.0.0.1 remote 2.0.0.2
spawned two tunnels, although it doesn't make sense.
Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip tunnel add remote 10.2.2.1 local 10.2.2.2 mode vti ikey 1 okey 2
translates to p->iflags = VTI_ISVTI|GRE_KEY and p->i_key = 1, but GRE_KEY !=
TUNNEL_KEY, so ip_tunnel_ioctl would set i_key to 0 (same story with o_key)
making us unable to create vti tunnels with [io]key via ip tunnel.
We cannot simply translate GRE_KEY to TUNNEL_KEY (as GRE module does) because
vti_tunnels with same local/remote addresses but different ikeys will be treated
as different then. So, imo the best option here is to move p->i_flags & *_KEY
check for vti tunnels from ip_tunnel.c to ip_vti.c and to think about [io]_mark
field for ip_tunnel_parm in the future.
Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4_{update_pmtu,redirect} were called with tunnel's ifindex (t->dev is a
tunnel netdevice). It caused wrong route lookup and failure of pmtu update or
redirect. We should use the same ifindex that we use in ip_route_output_* in
*tunnel_xmit code. It is t->parms.link .
Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to require forcing device down on a Ethernet GRE (gretap)
tunnel to change the MAC address.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_fragment can be called from process context (from tso_fragment).
Add a new gfp parameter to allow it to preserve atomic memory if
possible.
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Reviewed-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull cgroup updates from Tejun Heo:
"A lot of activities on cgroup side. Heavy restructuring including
locking simplification took place to improve the code base and enable
implementation of the unified hierarchy, which currently exists behind
a __DEVEL__ mount option. The core support is mostly complete but
individual controllers need further work. To explain the design and
rationales of the the unified hierarchy
Documentation/cgroups/unified-hierarchy.txt
is added.
Another notable change is css (cgroup_subsys_state - what each
controller uses to identify and interact with a cgroup) iteration
update. This is part of continuing updates on css object lifetime and
visibility. cgroup started with reference count draining on removal
way back and is now reaching a point where csses behave and are
iterated like normal refcnted objects albeit with some complexities to
allow distinguishing the state where they're being deleted. The css
iteration update isn't taken advantage of yet but is planned to be
used to simplify memcg significantly"
* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
cgroup: disallow disabled controllers on the default hierarchy
cgroup: don't destroy the default root
cgroup: disallow debug controller on the default hierarchy
cgroup: clean up MAINTAINERS entries
cgroup: implement css_tryget()
device_cgroup: use css_has_online_children() instead of has_children()
cgroup: convert cgroup_has_live_children() into css_has_online_children()
cgroup: use CSS_ONLINE instead of CGRP_DEAD
cgroup: iterate cgroup_subsys_states directly
cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
cgroup: move cgroup->serial_nr into cgroup_subsys_state
cgroup: link all cgroup_subsys_states in their sibling lists
cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
cgroup: remove cgroup->parent
device_cgroup: remove direct access to cgroup->children
memcg: update memcg_has_children() to use css_next_child()
memcg: remove tasks/children test from mem_cgroup_force_empty()
cgroup: remove css_parent()
cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
cgroup: use cgroup->self.refcnt for cgroup refcnting
...
Now that 3.15 is released, this merges the 'next' branch into 'master',
bringing us to the normal situation where my 'master' branch is the
merge window.
* accumulated work in next: (6809 commits)
ufs: sb mutex merge + mutex_destroy
powerpc: update comments for generic idle conversion
cris: update comments for generic idle conversion
idle: remove cpu_idle() forward declarations
nbd: zero from and len fields in NBD_CMD_DISCONNECT.
mm: convert some level-less printks to pr_*
MAINTAINERS: adi-buildroot-devel is moderated
MAINTAINERS: add linux-api for review of API/ABI changes
mm/kmemleak-test.c: use pr_fmt for logging
fs/dlm/debug_fs.c: replace seq_printf by seq_puts
fs/dlm/lockspace.c: convert simple_str to kstr
fs/dlm/config.c: convert simple_str to kstr
mm: mark remap_file_pages() syscall as deprecated
mm: memcontrol: remove unnecessary memcg argument from soft limit functions
mm: memcontrol: clean up memcg zoneinfo lookup
mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
mm/mempool.c: update the kmemleak stack trace for mempool allocations
lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
mm: introduce kmemleak_update_trace()
mm/kmemleak.c: use %u to print ->checksum
...
ip_rt_put(rt) is always called in "error" branches above, but was missed in
skb_cow_head branch. As rt is not yet bound to skb here we have to release it by
hand.
Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is available since v3.15-rc5.
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Call gso_make_checksum. This should have the benefit of using a
checksum that may have been previously computed for the packet.
This also adds NETIF_F_GSO_GRE_CSUM to differentiate devices that
offload GRE GSO with and without the GRE checksum offloaed.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added a new netif feature for GSO_UDP_TUNNEL_CSUM. This indicates
that a device is capable of computing the UDP checksum in the
encapsulating header of a UDP tunnel.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Call common gso_make_checksum when calculating checksum for a
TCP GSO segment.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When creating a GSO packet segment we may need to set more than
one checksum in the packet (for instance a TCP checksum and
UDP checksum for VXLAN encapsulation). To be efficient, we want
to do checksum calculation for any part of the packet at most once.
This patch adds csum_start offset to skb_gso_cb. This tracks the
starting offset for skb->csum which is initially set in skb_segment.
When a protocol needs to compute a transport checksum it calls
gso_make_checksum which computes the checksum value from the start
of transport header to csum_start and then adds in skb->csum to get
the full checksum. skb->csum and csum_start are then updated to reflect
the checksum of the resultant packet starting from the transport header.
This patch also adds a flag to skbuff, encap_hdr_csum, which is set
in *gso_segment fucntions to indicate that a tunnel protocol needs
checksum calculation
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added udp_set_csum and udp6_set_csum functions to set UDP checksums
in packets. These are for simple UDP packets such as those that might
be created in UDP tunnels.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 30f38d2fdd.
fib_triestat is surrounded by a big lie: while it claims that it's a
seq_file (fib_triestat_seq_open, fib_triestat_seq_show), it isn't:
static const struct file_operations fib_triestat_fops = {
.owner = THIS_MODULE,
.open = fib_triestat_seq_open,
.read = seq_read,
.llseek = seq_lseek,
.release = single_release_net,
};
Yes, fib_triestat is just a regular file.
A small detail (assuming CONFIG_NET_NS=y) is that while for seq_files
you could do seq_file_net() to get the net ptr, doing so for a regular
file would be wrong and would dereference an invalid pointer.
The fib_triestat lie claimed a victim, and trying to show the file would
be bad for the kernel. This patch just reverts the issue and fixes
fib_triestat, which still needs a rewrite to either be a seq_file or
stop claiming it is.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
include/net/inetpeer.h
net/ipv6/output_core.c
Changes in net were fixing bugs in code removed in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
This bug is discovered by an recent F-RTO issue on tcpm list
https://www.ietf.org/mail-archive/web/tcpm/current/msg08794.html
The bug is that currently F-RTO does not use DSACK to undo cwnd in
certain cases: upon receiving an ACK after the RTO retransmission in
F-RTO, and the ACK has DSACK indicating the retransmission is spurious,
the sender only calls tcp_try_undo_loss() if some never retransmisted
data is sacked (FLAG_ORIG_DATA_SACKED).
The correct behavior is to unconditionally call tcp_try_undo_loss so
the DSACK information is used properly to undo the cwnd reduction.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make fib_triestat_seq_show consistent with other /proc/net files and
use seq_file_net.
Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ideally, we would need to generate IP ID using a per destination IP
generator.
linux kernels used inet_peer cache for this purpose, but this had a huge
cost on servers disabling MTU discovery.
1) each inet_peer struct consumes 192 bytes
2) inetpeer cache uses a binary tree of inet_peer structs,
with a nominal size of ~66000 elements under load.
3) lookups in this tree are hitting a lot of cache lines, as tree depth
is about 20.
4) If server deals with many tcp flows, we have a high probability of
not finding the inet_peer, allocating a fresh one, inserting it in
the tree with same initial ip_id_count, (cf secure_ip_id())
5) We garbage collect inet_peer aggressively.
IP ID generation do not have to be 'perfect'
Goal is trying to avoid duplicates in a short period of time,
so that reassembly units have a chance to complete reassembly of
fragments belonging to one message before receiving other fragments
with a recycled ID.
We simply use an array of generators, and a Jenkin hash using the dst IP
as a key.
ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
belongs (it is only used from this file)
secure_ip_id() and secure_ipv6_id() no longer are needed.
Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
unnecessary decrement/increment of the number of segments.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
This small patchset contains three accumulated Netfilter/IPVS updates,
they are:
1) Refactorize common NAT code by encapsulating it into a helper
function, similarly to what we do in other conntrack extensions,
from Florian Westphal.
2) A minor format string mismatch fix for IPVS, from Masanari Iida.
3) Add quota support to the netfilter accounting infrastructure, now
you can add quotas to accounting objects via the nfnetlink interface
and use them from iptables. You can also listen to quota
notifications from userspace. This enhancement from Mathieu Poirier.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces a comma between expression statements by a semicolon.
A simplified version of the semantic patch that performs this
transformation is as follows:
// <smpl>
@r@
expression e1,e2,e;
type T;
identifier i;
@@
e1
-,
+;
e2;
// </smpl>
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/bonding/bond_alb.c
drivers/net/ethernet/altera/altera_msgdma.c
drivers/net/ethernet/altera/altera_sgdma.c
net/ipv6/xfrm6_output.c
Several cases of overlapping changes.
The xfrm6_output.c has a bug fix which overlaps the renaming
of skb->local_df to skb->ignore_df.
In the Altera TSE driver cases, the register access cleanups
in net-next overlapped with bug fixes done in net.
Similarly a bug fix to send ALB packets in the bonding driver using
the right source address overlaps with cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 6935 permits zero checksums to be used in IPv6 however this is
recommended only for certain tunnel protocols, it does not make
checksums completely optional like they are in IPv4.
This patch restricts the use of IPv6 zero checksums that was previously
intoduced. no_check6_tx and no_check6_rx have been added to control
the use of checksums in UDP6 RX and TX path. The normal
sk_no_check_{rx,tx} settings are not used (this avoids ambiguity when
dealing with a dual stack socket).
A helper function has been added (udp_set_no_check6) which can be
called by tunnel impelmentations to all zero checksums (send on the
socket, and accept them as valid).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define separate fields in the sock structure for configuring disabling
checksums in both TX and RX-- sk_no_check_tx and sk_no_check_rx.
The SO_NO_CHECK socket option only affects sk_no_check_tx. Also,
removed UDP_CSUM_* defines since they are no longer necessary.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It doesn't seem like an protocols are setting anything other
than the default, and allowing to arbitrarily disable checksums
for a whole protocol seems dangerous. This can be done on a per
socket basis.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
the value of itag is a random value from stack, and may not be initiated by
fib_validate_source, which called fib_combine_itag if CONFIG_IP_ROUTE_CLASSID
is not set
This will make the cached dst uncertainty
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Experience with the recent e114a710aa ("tcp: fix cwnd limited
checking to improve congestion control") has shown that there are
common cases where that commit can cause cwnd to be much larger than
necessary. This leads to TSO autosizing cooking skbs that are too
large, among other things.
The main problems seemed to be:
(1) That commit attempted to predict the future behavior of the
connection by looking at the write queue (if TSO or TSQ limit
sending). That prediction sometimes overestimated future outstanding
packets.
(2) That commit always allowed cwnd to grow to twice the number of
outstanding packets (even in congestion avoidance, where this is not
needed).
This commit improves both of these, by:
(1) Switching to a measurement-based approach where we explicitly
track the largest number of packets in flight during the past window
("max_packets_out"), and remember whether we were cwnd-limited at the
moment we finished sending that flight.
(2) Only allowing cwnd to grow to twice the number of outstanding
packets ("max_packets_out") in slow start. In congestion avoidance
mode we now only allow cwnd to grow if it was fully utilized.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable the module alias hookup to allow tunnel modules to be autoloaded on demand.
This is in line with how most other netdev kinds work, and will allow userspace
to create tunnels without having CAP_SYS_MODULE.
Signed-off-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to initialize the fallback device to have a correct mtu
set on this device. Otherwise the mtu is set to null and the device
is unusable.
Fixes: fd58156e45 ("IPIP: Use ip-tunneling code.")
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The connected check fails to check for ip_gre nbma mode tunnels
properly. ip_gre creates temporary tnl_params with daddr specified
to pass-in the actual target on per-packet basis from neighbor
layer. Detect these tunnels by inspecting the actual tunnel
configuration.
Minimal test case:
ip route add 192.168.1.1/32 via 10.0.0.1
ip route add 192.168.1.2/32 via 10.0.0.2
ip tunnel add nbma0 mode gre key 1 tos c0
ip addr add 172.17.0.0/16 dev nbma0
ip link set nbma0 up
ip neigh add 172.17.0.1 lladdr 192.168.1.1 dev nbma0
ip neigh add 172.17.0.2 lladdr 192.168.1.2 dev nbma0
ping 172.17.0.1
ping 172.17.0.2
The second ping should be going to 192.168.1.2 and head 10.0.0.2;
but cached gre tunnel level route is used and it's actually going
to 192.168.1.1 via 10.0.0.1.
The lladdr's need to go to separate dst for the bug to trigger.
Test case uses separate route entries, but this can also happen
when the route entry is same: if there is a nexthop exception or
the GRE tunnel is IPsec'ed in which case the dst points to xfrm
bundle unique to the gre lladdr.
Fixes: 7d442fab0a ("ipv4: Cache dst in tunnels")
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Cc: Tom Herbert <therbert@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using command "ip tunnel add" to add a tunnel, the tunnel will be added twice,
through ip_tunnel_create() and ip_tunnel_update().
Because the second is unnecessary, so we can just break after adding tunnel
through ip_tunnel_create().
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/ipv4/ip_vti.c
Steffen Klassert says:
====================
pull request (net): ipsec 2014-05-15
This pull request has a merge conflict in net/ipv4/ip_vti.c
between commit 8d89dcdf80 ("vti: don't allow to add the same
tunnel twice") and commit a32452366b ("vti4:Don't count header
length twice"). It can be solved like it is done in linux-next.
1) Fix a ipv6 xfrm output crash when a packet is rerouted
by netfilter to not use IPsec.
2) vti4 counts some header lengths twice leading to an incorrect
device mtu. Fix this by counting these headers only once.
3) We don't catch the case if an unsupported protocol is submitted
to the xfrm protocol handlers, this can lead to NULL pointer
dereferences. Fix this by adding the appropriate checks.
4) vti6 may unregister pernet ops twice on init errors.
Fix this by removing one of the calls to do it only once.
From Mathias Krause.
5) Set the vti tunnel mark before doing a lookup in the error
handlers. Otherwise we don't find the correct xfrm state.
====================
The conflict in ip_vti.c was simple, 'net' had a commit
removing a line from vti_tunnel_init() and this tree
being merged had a commit adding a line to the same
location.
Signed-off-by: David S. Miller <davem@davemloft.net>
Netdev_priv is an accessor function, and has no purpose if its result is
not used.
A simplified version of the semantic match that fixes this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@ local idexpression x; @@
-x = netdev_priv(...);
... when != x
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Missing a colon on definition use is a bit odd so
change the macro for the 32 bit case to declare an
__attribute__((unused)) and __deprecated variable.
The __deprecated attribute will cause gcc to emit
an error if the variable is actually used.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_local_port_range is already per netns, so should ip_local_reserved_ports
be. And since it is none by default we don't actually need it when we don't
enable CONFIG_SYSCTL.
By the way, rename inet_is_reserved_local_port() to inet_is_local_reserved_port()
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using mark-based routing, sockets returned from accept()
may need to be marked differently depending on the incoming
connection request.
This is the case, for example, if different socket marks identify
different networks: a listening socket may want to accept
connections from all networks, but each connection should be
marked with the network that the request came in on, so that
subsequent packets are sent on the correct network.
This patch adds a sysctl to mark TCP sockets based on the fwmark
of the incoming SYN packet. If enabled, and an unmarked socket
receives a SYN, then the SYN packet's fwmark is written to the
connection's inet_request_sock, and later written back to the
accepted socket when the connection is established. If the
socket already has a nonzero mark, then the behaviour is the same
as it is today, i.e., the listening socket's fwmark is used.
Black-box tested using user-mode linux:
- IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the
mark of the incoming SYN packet.
- The socket returned by accept() is marked with the mark of the
incoming SYN packet.
- Tested with syncookies=1 and syncookies=2.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, routing lookups used for Path PMTU Discovery in
absence of a socket or on unmarked sockets use a mark of 0.
This causes PMTUD not to work when using routing based on
netfilter fwmark mangling and fwmark ip rules, such as:
iptables -j MARK --set-mark 17
ip rule add fwmark 17 lookup 100
This patch causes these route lookups to use the fwmark from the
received ICMP error when the fwmark_reflect sysctl is enabled.
This allows the administrator to make PMTUD work by configuring
appropriate fwmark rules to mark the inbound ICMP packets.
Black-box tested using user-mode linux by pointing different
fwmarks at routing tables egressing on different interfaces, and
using iptables mangling to mark packets inbound on each interface
with the interface's fwmark. ICMPv4 and ICMPv6 PMTU discovery
work as expected when mark reflection is enabled and fail when
it is disabled.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kernel-originated IP packets that have no user socket associated
with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.)
are emitted with a mark of zero. Add a sysctl to make them have
the same mark as the packet they are replying to.
This allows an administrator that wishes to do so to use
mark-based routing, firewalling, etc. for these replies by
marking the original packets inbound.
Tested using user-mode linux:
- ICMP/ICMPv6 echo replies and errors.
- TCP RST packets (IPv4 and IPv6).
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After all the preparatory works, supporting IPv6 in Fast Open is now easy.
We pretty much just mirror v4 code. The only difference is how we
generate the Fast Open cookie for IPv6 sockets. Since Fast Open cookie
is 128 bits and we use AES 128, we use CBC-MAC to encrypt both the
source and destination IPv6 addresses since the cookie is a MAC tag.
Signed-off-by: Daniel Lee <longinus00@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Jerry Chu <hkchu@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a fast open socket is already accepted by the user, it should
be treated like a connected socket to record the ICMP error in
sk_softerr, so the user can fetch it. Do that in both tcp_v4_err
and tcp_v6_err.
Also refactor the sequence window check to improve readability
(e.g., there were two local variables named 'req').
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Daniel Lee <longinus00@gmail.com>
Signed-off-by: Jerry Chu <hkchu@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To avoid large code duplication in IPv6, we need to first simplify
the complicate SYN-ACK sending code in tcp_v4_conn_request().
To use tcp_v4(6)_send_synack() to send all SYN-ACKs, we need to
initialize the mini socket's receive window before trying to
create the child socket and/or building the SYN-ACK packet. So we move
that initialization from tcp_make_synack() to tcp_v4_conn_request()
as a new function tcp_openreq_init_req_rwin().
After this refactoring the SYN-ACK sending code is simpler and easier
to implement Fast Open for IPv6.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Daniel Lee <longinus00@gmail.com>
Signed-off-by: Jerry Chu <hkchu@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Consolidate various cookie checking and generation code to simplify
the fast open processing. The main goal is to reduce code duplication
in tcp_v4_conn_request() for IPv6 support.
Removes two experimental sysctl flags TFO_SERVER_ALWAYS and
TFO_SERVER_COOKIE_NOT_CHKD used primarily for developmental debugging
purposes.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Daniel Lee <longinus00@gmail.com>
Signed-off-by: Jerry Chu <hkchu@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move common TFO functions that will be used by both v4 and v6
to tcp_fastopen.c. Create a helper tcp_fastopen_queue_check().
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Daniel Lee <longinus00@gmail.com>
Signed-off-by: Jerry Chu <hkchu@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cftype->trigger() is pointless. It's trivial to ignore the input
buffer from a regular ->write() operation. Convert all ->trigger()
users to ->write() and remove ->trigger().
This patch doesn't introduce any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Convert all cftype->write_string() users to the new cftype->write()
which maps directly to kernfs write operation and has full access to
kernfs and cgroup contexts. The conversions are mostly mechanical.
* @css and @cft are accessed using of_css() and of_cft() accessors
respectively instead of being specified as arguments.
* Should return @nbytes on success instead of 0.
* @buf is not trimmed automatically. Trim if necessary. Note that
blkcg and netprio don't need this as the parsers already handle
whitespaces.
cftype->write_string() has no user left after the conversions and
removed.
While at it, remove unnecessary local variable @p in
cgroup_subtree_control_write() and stale comment about
CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.
This patch doesn't introduce any visible behavior changes.
v2: netprio was missing from conversion. Converted.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
As suggested by several people, rename local_df to ignore_df,
since it means "ignore df bit if it is set".
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/altera/altera_sgdma.c
net/netlink/af_netlink.c
net/sched/cls_api.c
net/sched/sch_api.c
The netlink conflict dealt with moving to netlink_capable() and
netlink_ns_capable() in the 'net' tree vs. supporting 'tc' operations
in non-init namespaces. These were simple transformations from
netlink_capable to netlink_ns_capable.
The Altera driver conflict was simply code removal overlapping some
void pointer cast cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to use the mark we get from the tunnels o_key to
lookup the right vti state in the error handlers. This patch
ensures that.
Fixes: df3893c1 ("vti: Update the ipv4 side to use it's own receive hook.")
Fixes: fa9ad96d ("vti6: Update the ipv6 side to use its own receive hook.")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following batch contains netfilter fixes for your net tree, they are:
1) Fix use after free in nfnetlink when sending a batch for some
unsupported subsystem, from Denys Fedoryshchenko.
2) Skip autoload of the nat module if no binding is specified via
ctnetlink, from Florian Westphal.
3) Set local_df after netfilter defragmentation to avoid a bogus ICMP
fragmentation needed in the forwarding path, also from Florian.
4) Fix potential user after free in ip6_route_me_harder() when returning
the error code to the upper layers, from Sergey Popovich.
5) Skip possible bogus ICMP time exceeded emitted from the router (not
valid according to RFC) if conntrack zones are used, from Vasily Averin.
6) Fix fragment handling when nf_defrag_ipv4 is loaded but nf_conntrack
is not present, also from Vasily.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Moving validation of UDP checksum to be done in UDP not encap layer.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use skb_checksum_simple_validate to verify checksum.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use skb_checksum_simple_validate to verify checksum.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use skb_checksum_simple_validate to verify checksum.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All the callers hold RTNL lock, so there is no need to use inet_addr_hash_lock
to protect the hash list.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similarly, when CONFIG_SYSCTL is not set, ping_group_range should still
work, just that no one can change it. Therefore we should move it out of
sysctl_net_ipv4.c. And, it should not share the same seqlock with
ip_local_port_range.
BTW, rename it to ->ping_group_range instead.
Cc: David S. Miller <davem@davemloft.net>
Cc: Francois Romieu <romieu@fr.zoreil.com>
Reported-by: Stefan de Konink <stefan@konink.de>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When CONFIG_SYSCTL is not set, ip_local_port_range should still work,
just that no one can change it. Therefore we should move it out of sysctl_inet.c.
Also, rename it to ->ip_local_ports instead.
Cc: David S. Miller <davem@davemloft.net>
Cc: Francois Romieu <romieu@fr.zoreil.com>
Reported-by: Stefan de Konink <stefan@konink.de>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Increment fib_info_cnt in fib_create_info() right after successfuly
alllocating fib_info structure, overwise fib_metrics allocation failure
leads to fib_info_cnt incorrectly decremented in free_fib_info(), called
on error path from fib_create_info().
Signed-off-by: Sergey Popovich <popovich_sergei@mail.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 8f0ea0fe3a (snmp: reduce percpu needs by 50%)
reduced snmp array size to 1, so technically it doesn't have to be
an array any more. What's more, after the following commit:
commit 933393f58f
Date: Thu Dec 22 11:58:51 2011 -0600
percpu: Remove irqsafe_cpu_xxx variants
We simply say that regular this_cpu use must be safe regardless of
preemption and interrupt state. That has no material change for x86
and s390 implementations of this_cpu operations. However, arches that
do not provide their own implementation for this_cpu operations will
now get code generated that disables interrupts instead of preemption.
probably no arch wants to have SNMP_ARRAY_SZ == 2. At least after
almost 3 years, no one complains.
So, just convert the array to a single pointer and remove snmp_mib_init()
and snmp_mib_free() as well.
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Doing the segmentation in the forward path has one major drawback:
When using virtio, we may process gso udp packets coming
from host network stack. In that case, netfilter POSTROUTING
will see one packet with udp header followed by multiple ip
fragments.
Delay the segmentation and do it after POSTROUTING invocation
to avoid this.
Fixes: fe6cc55f3a ("net: ip, ipv6: handle gso skbs in forwarding path")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
local_df means 'ignore DF bit if set', so if its set we're
allowed to perform ip fragmentation.
This wasn't noticed earlier because the output path also drops such skbs
(and emits needed icmp error) and because netfilter ip defrag did not
set local_df until couple of days ago.
Only difference is that DF-packets-larger-than MTU now discarded
earlier (f.e. we avoid pointless netfilter postrouting trip).
While at it, drop the repeated test ip_exceeds_mtu, checking it once
is enough...
Fixes: fe6cc55f3a ("net: ip, ipv6: handle gso skbs in forwarding path")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ip_tunnel_rcv(), set skb->network_header to inner IP header
before IP_ECN_decapsulate().
Without the fix, IP_ECN_decapsulate() takes outer IP header as
inner IP header, possibly causing error messages or packet drops.
Note that this skb_reset_network_header() call was in this spot when
the original feature for checking consistency of ECN bits through
tunnels was added in eccc1bb8d4 ("tunnel: drop packet if ECN present
with not-ECT"). It was only removed from this spot in 3d7b46cd20
("ip_tunnel: push generic protocol handling to ip_tunnel module.").
Fixes: 3d7b46cd20 ("ip_tunnel: push generic protocol handling to ip_tunnel module.")
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Ying Cai <ycai@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Call skb_checksum_init instead of private functions.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Defrag user check in ip_expire was not updated after adding support for
"conntrack zones".
This bug manifests as a RFC violation, since the router will send
the icmp time exceeeded message when using conntrack zones.
Signed-off-by: Vasily Averin <vvs@openvz.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
else we may fail to forward skb even if original fragments do fit
outgoing link mtu:
1. remote sends 2k packets in two 1000 byte frags, DF set
2. we want to forward but only see '2k > mtu and DF set'
3. we then send icmp error saying that outgoing link is 1500
But original sender never sent a packet that would not fit
the outgoing link.
Setting local_df makes outgoing path test size vs.
IPCB(skb)->frag_max_size, so we will still send the correct
error in case the largest original size did not fit
outgoing link mtu.
Reported-by: Maxime Bizon <mbizon@freebox.fr>
Suggested-by: Maxime Bizon <mbizon@freebox.fr>
Fixes: 5f2d04f1f9 (ipv4: fix path MTU discovery with connection tracking)
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit e114a710aa ("tcp: fix cwnd limited checking to improve
congestion control") obsoleted in_flight parameter from
tcp_is_cwnd_limited() and its callers.
This patch does the removal as promised.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung discovered tcp_is_cwnd_limited() was returning false in
slow start phase even if the application filled the socket write queue.
All congestion modules take into account tcp_is_cwnd_limited()
before increasing cwnd, so this behavior limits slow start from
probing the bandwidth at full speed.
The problem is that even if write queue is full (aka we are _not_
application limited), cwnd can be under utilized if TSO should auto
defer or TCP Small queues decided to hold packets.
So the in_flight can be kept to smaller value, and we can get to the
point tcp_is_cwnd_limited() returns false.
With TCP Small Queues and FQ/pacing, this issue is more visible.
We fix this by having tcp_cwnd_validate(), which is supposed to track
such things, take into account unsent_segs, the number of segs that we
are not sending at the moment due to TSO or TSQ, but intend to send
real soon. Then when we are cwnd-limited, remember this fact while we
are processing the window of ACKs that comes back.
For example, suppose we have a brand new connection with cwnd=10; we
are in slow start, and we send a flight of 9 packets. By the time we
have received ACKs for all 9 packets we want our cwnd to be 18.
We implement this by setting tp->lsnd_pending to 9, and
considering ourselves to be cwnd-limited while cwnd is less than
twice tp->lsnd_pending (2*9 -> 18).
This makes tcp_is_cwnd_limited() more understandable, by removing
the GSO/TSO kludge, that tried to work around the issue.
Note the in_flight parameter can be removed in a followup cleanup
patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit b9f47a3aae (tcp_cubic: limit delayed_ack ratio to prevent
divide error) try to prevent divide error, but there is still a little
chance that delayed_ack can reach zero. In case the param cnt get
negative value, then ratio+cnt would overflow and may happen to be zero.
As a result, min(ratio, ACK_RATIO_LIMIT) will calculate to be zero.
In some old kernels, such as 2.6.32, there is a bug that would
pass negative param, which then ultimately leads to this divide error.
commit 5b35e1e6e9 (tcp: fix tcp_trim_head() to adjust segment count
with skb MSS) fixed the negative param issue. However,
it's safe that we fix the range of delayed_ack as well,
to make sure we do not hit a divide by zero.
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Liu Yu <allanyuliu@tencent.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both TLP and Fast Open call __tcp_retransmit_skb() instead of
tcp_retransmit_skb() to avoid changing tp->retrans_out.
This has the side effect of missing SNMP counters increments as well
as tcp_info tcpi_total_retrans updates.
Fix this by moving the stats increments of into __tcp_retransmit_skb()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reduce copy-past a bit by adding a common helper.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We don't catch the case if an unsupported protocol is submitted
to the xfrm4 protocol handlers, this can lead to NULL pointer
dereferences. Fix this by adding the appropriate checks.
Fixes: 3328715e ("xfrm4: Add IPsec protocol multiplexer")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Do not initialize list twice.
list_replace_init() already takes care of initializing list.
We don't need to initialize it with LIST_HEAD() beforehand.
Signed-off-by: xiao jin <jin.xiao@intel.com>
Reviewed-by: David Cohen <david.a.cohen@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows to switch the netns when packet is encapsulated or
decapsulated. In other word, the encapsulated packet is received in a netns,
where the lookup is done to find the tunnel. Once the tunnel is found, the
packet is decapsulated and injecting into the corresponding interface which
stands to another netns.
When one of the two netns is removed, the tunnel is destroyed.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 0e280af026 ("tcp: introduce TCPSpuriousRtxHostQueues SNMP
counter") we added a logic to detect when a packet was retransmitted
while the prior clone was still in a qdisc or driver queue.
We are now confident we can do better, and catch the problem before
we fragment a TSO packet before retransmit, or in TLP path.
This patch fully exploits the logic by simply canceling the spurious
retransmit.
Original packet is in a queue and will eventually leave the host.
This helps to avoid network collapses when some events make the RTO
estimations very wrong, particularly when dealing with huge number of
sockets with synchronized blast.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make tcp_cwnd_application_limited() static and move it from tcp_input.c to
tcp_output.c
Signed-off-by: Weiping Pan <wpan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a TCP_FASTOPEN socket option to get a max backlog on its
listener to getsockopt().
Signed-off-by: Kenjiro Nakayama <nakayamakenjiro@gmail.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because the netdevice may be in another netns than the i/o netns, we should
use the i/o netns instead of dev_net(dev).
The variable 'tunnel' was used only to get 'itn', hence to simplify code I
remove it and use 't' instead.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In my special case, when a packet is redirected from veth0 to lo,
its skb->dev->ifindex would be LOOPBACK_IFINDEX. Meanwhile we
pass the hard-coded LOOPBACK_IFINDEX to fib_validate_source()
in ip_route_input_slow(). This would cause the following check
in fib_validate_source() fail:
(dev->ifindex != oif || !IN_DEV_TX_REDIRECTS(idev))
when rp_filter is disabeld on loopback. As suggested by Julian,
the caller should pass 0 here so that we will not end up by
calling __fib_validate_source().
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As suggested by Julian:
Simply, flowi4_iif must not contain 0, it does not
look logical to ignore all ip rules with specified iif.
because in fib_rule_match() we do:
if (rule->iifindex && (rule->iifindex != fl->flowi_iif))
goto out;
flowi4_iif should be LOOPBACK_IFINDEX by default.
We need to move LOOPBACK_IFINDEX to include/net/flow.h:
1) It is mostly used by flowi_iif
2) Fix the following compile error if we use it in flow.h
by the patches latter:
In file included from include/linux/netfilter.h:277:0,
from include/net/netns/netfilter.h:5,
from include/net/net_namespace.h:21,
from include/linux/netdevice.h:43,
from include/linux/icmpv6.h:12,
from include/linux/ipv6.h:61,
from include/net/ipv6.h:16,
from include/linux/sunrpc/clnt.h:27,
from include/linux/nfs_fs.h:30,
from init/do_mounts.c:32:
include/net/flow.h: In function ‘flowi4_init_output’:
include/net/flow.h:84:32: error: ‘LOOPBACK_IFINDEX’ undeclared (first use in this function)
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently count the size of LL_MAX_HEADER and struct iphdr
twice for vti4 devices, this leads to a wrong device mtu.
The size of LL_MAX_HEADER and struct iphdr is already counted in
ip_tunnel_bind_dev(), so don't do it again in vti_tunnel_init().
Fixes: b9959fd3 ("vti: switch to new ip tunnel code")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
In the dst->output() path for ipv4, the code assumes the skb it has to
transmit is attached to an inet socket, specifically via
ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the
provider of the packet is an AF_PACKET socket.
The dst->output() method gets an additional 'struct sock *sk'
parameter. This needs a cascade of changes so that this parameter can
be propagated from vxlan to final consumer.
Fixes: 8f646c922d ("vxlan: keep original skb ownership")
Reported-by: lucien xin <lucien.xin@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_queue_xmit() assumes the skb it has to transmit is attached to an
inet socket. Commit 31c70d5956 ("l2tp: keep original skb ownership")
changed l2tp to not change skb ownership and thus broke this assumption.
One fix is to add a new 'struct sock *sk' parameter to ip_queue_xmit(),
so that we do not assume skb->sk points to the socket used by l2tp
tunnel.
Fixes: 31c70d5956 ("l2tp: keep original skb ownership")
Reported-by: Zhan Jianyu <nasa4836@gmail.com>
Tested-by: Zhan Jianyu <nasa4836@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend commit 13378cad02
("ipv4: Change rt->rt_iif encoding.") from 3.6 to return valid
RTA_IIF on 'ip route get ... iif DEVICE' instead of rt_iif 0
which is displayed as 'iif *'.
inet_iif is not appropriate to use because skb_iif is not set.
Use the skb->dev->ifindex instead.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Plug a group_info refcount leak in ping_init.
group_info is only needed during initialization and
the code failed to release the reference on exit.
While here move grabbing the reference to a place
where it is actually needed.
Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Zhang Dongxing <dongxing.zhang@intel.com>
Signed-off-by: xiaoming wang <xiaoming.wang@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before the patch, it was possible to add two times the same tunnel:
ip l a vti1 type vti remote 10.16.0.121 local 10.16.0.249 key 41
ip l a vti2 type vti remote 10.16.0.121 local 10.16.0.249 key 41
It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the
argument dev->type, which was set only later (when calling ndo_init handler
in register_netdevice()). Let's set this type in the setup handler, which is
called before newlink handler.
Introduced by commit b9959fd3b0 ("vti: switch to new ip tunnel code").
CC: Cong Wang <amwang@redhat.com>
CC: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before the patch, it was possible to add two times the same tunnel:
ip l a gre1 type gre remote 10.16.0.121 local 10.16.0.249
ip l a gre2 type gre remote 10.16.0.121 local 10.16.0.249
It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the
argument dev->type, which was set only later (when calling ndo_init handler
in register_netdevice()). Let's set this type in the setup handler, which is
called before newlink handler.
Introduced by commit c544193214 ("GRE: Refactor GRE tunneling code.").
CC: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several spots in the kernel perform a sequence like:
skb_queue_tail(&sk->s_receive_queue, skb);
sk->sk_data_ready(sk, skb->len);
But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up. So this skb->len access is potentially
to freed up memory.
Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.
And finally, no actual implementation of this callback actually uses
the length argument. And since nobody actually cared about it's
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.
So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.
Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull more networking updates from David Miller:
1) If a VXLAN interface is created with no groups, we can crash on
reception of packets. Fix from Mike Rapoport.
2) Missing includes in CPTS driver, from Alexei Starovoitov.
3) Fix string validations in isdnloop driver, from YOSHIFUJI Hideaki
and Dan Carpenter.
4) Missing irq.h include in bnxw2x, enic, and qlcnic drivers. From
Josh Boyer.
5) AF_PACKET transmit doesn't statistically count TX drops, from Daniel
Borkmann.
6) Byte-Queue-Limit enabled drivers aren't handled properly in
AF_PACKET transmit path, also from Daniel Borkmann.
Same problem exists in pktgen, and Daniel fixed it there too.
7) Fix resource leaks in driver probe error paths of new sxgbe driver,
from Francois Romieu.
8) Truesize of SKBs can gradually get more and more corrupted in NAPI
packet recycling path, fix from Eric Dumazet.
9) Fix uniprocessor netfilter build, from Florian Westphal. In the
longer term we should perhaps try to find a way for ARRAY_SIZE() to
work even with zero sized array elements.
10) Fix crash in netfilter conntrack extensions due to mis-estimation of
required extension space. From Andrey Vagin.
11) Since we commit table rule updates before trying to copy the
counters back to userspace (it's the last action we perform), we
really can't signal the user copy with an error as we are beyond the
point from which we can unwind everything. This causes all kinds of
use after free crashes and other mysterious behavior.
From Thomas Graf.
12) Restore previous behvaior of div/mod by zero in BPF filter
processing. From Daniel Borkmann.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (38 commits)
net: sctp: wake up all assocs if sndbuf policy is per socket
isdnloop: several buffer overflows
netdev: remove potentially harmful checks
pktgen: fix xmit test for BQL enabled devices
net/at91_ether: avoid NULL pointer dereference
tipc: Let tipc_release() return 0
at86rf230: fix MAX_CSMA_RETRIES parameter
mac802154: fix duplicate #include headers
sxgbe: fix duplicate #include headers
net: filter: be more defensive on div/mod by X==0
netfilter: Can't fail and free after table replacement
xen-netback: Trivial format string fix
net: bcmgenet: Remove unnecessary version.h inclusion
net: smc911x: Remove unused local variable
bonding: Inactive slaves should keep inactive flag's value
netfilter: nf_tables: fix wrong format in request_module()
netfilter: nf_tables: set names cannot be larger than 15 bytes
netfilter: nf_conntrack: reserve two bytes for nf_ct_ext->len
netfilter: Add {ipt,ip6t}_osf aliases for xt_osf
netfilter: x_tables: allow to use cgroup match for LOCAL_IN nf hooks
...
The RT_CACHE_STAT_INC macro triggers the new preemption checks
for __this_cpu ops.
I do not see any other synchronization that would allow the use of a
__this_cpu operation here however in commit dbd2915ce8 ("[IPV4]:
RT_CACHE_STAT_INC() warning fix") Andrew justifies the use of
raw_smp_processor_id() here because "we do not care" about races. In
the past we agreed that the price of disabling interrupts here to get
consistent counters would be too high. These counters may be inaccurate
due to race conditions.
The use of __this_cpu op improves the situation already from what commit
dbd2915ce8 did since the single instruction emitted on x86 does not
allow the race to occur anymore. However, non x86 platforms could still
experience a race here.
Trace:
__this_cpu_add operation in preemptible [00000000] code: avahi-daemon/1193
caller is __this_cpu_preempt_check+0x38/0x60
CPU: 1 PID: 1193 Comm: avahi-daemon Tainted: GF 3.12.0-rc4+ #187
Call Trace:
check_preemption_disabled+0xec/0x110
__this_cpu_preempt_check+0x38/0x60
__ip_route_output_key+0x575/0x8c0
ip_route_output_flow+0x27/0x70
udp_sendmsg+0x825/0xa20
inet_sendmsg+0x85/0xc0
sock_sendmsg+0x9c/0xd0
___sys_sendmsg+0x37c/0x390
__sys_sendmsg+0x49/0x90
SyS_sendmsg+0x12/0x20
tracesys+0xe1/0xe6
Signed-off-by: Christoph Lameter <cl@linux.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ipv6 xfrm output path is not aware that packets can be
rerouted by NAT to not use IPsec. We crash in this case
because we expect to have a xfrm state at the dst_entry.
This crash happens if the ipv6 layer does IPsec and NAT
or if we have an interfamily IPsec tunnel with ipv4 NAT.
We fix this by checking for a NAT rerouted packet in each
address family and dst_output() to the new destination
in this case.
Reported-by: Martin Pelikan <martin.pelikan@gmail.com>
Tested-by: Martin Pelikan <martin.pelikan@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
All xtables variants suffer from the defect that the copy_to_user()
to copy the counters to user memory may fail after the table has
already been exchanged and thus exposed. Return an error at this
point will result in freeing the already exposed table. Any
subsequent packet processing will result in a kernel panic.
We can't copy the counters before exposing the new tables as we
want provide the counter state after the old table has been
unhooked. Therefore convert this into a silent error.
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull cgroup updates from Tejun Heo:
"A lot updates for cgroup:
- The biggest one is cgroup's conversion to kernfs. cgroup took
after the long abandoned vfs-entangled sysfs implementation and
made it even more convoluted over time. cgroup's internal objects
were fused with vfs objects which also brought in vfs locking and
object lifetime rules. Naturally, there are places where vfs rules
don't fit and nasty hacks, such as credential switching or lock
dance interleaving inode mutex and cgroup_mutex with object serial
number comparison thrown in to decide whether the operation is
actually necessary, needed to be employed.
After conversion to kernfs, internal object lifetime and locking
rules are mostly isolated from vfs interactions allowing shedding
of several nasty hacks and overall simplification. This will also
allow implmentation of operations which may affect multiple cgroups
which weren't possible before as it would have required nesting
i_mutexes.
- Various simplifications including dropping of module support,
easier cgroup name/path handling, simplified cgroup file type
handling and task_cg_lists optimization.
- Prepatory changes for the planned unified hierarchy, which is still
a patchset away from being actually operational. The dummy
hierarchy is updated to serve as the default unified hierarchy.
Controllers which aren't claimed by other hierarchies are
associated with it, which BTW was what the dummy hierarchy was for
anyway.
- Various fixes from Li and others. This pull request includes some
patches to add missing slab.h to various subsystems. This was
triggered xattr.h include removal from cgroup.h. cgroup.h
indirectly got included a lot of files which brought in xattr.h
which brought in slab.h.
There are several merge commits - one to pull in kernfs updates
necessary for converting cgroup (already in upstream through
driver-core), others for interfering changes in the fixes branch"
* 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
cgroup: remove useless argument from cgroup_exit()
cgroup: fix spurious lockdep warning in cgroup_exit()
cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
cgroup: break kernfs active_ref protection in cgroup directory operations
cgroup: fix cgroup_taskset walking order
cgroup: implement CFTYPE_ONLY_ON_DFL
cgroup: make cgrp_dfl_root mountable
cgroup: drop const from @buffer of cftype->write_string()
cgroup: rename cgroup_dummy_root and related names
cgroup: move ->subsys_mask from cgroupfs_root to cgroup
cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
cgroup: reorganize cgroup bootstrapping
cgroup: relocate setting of CGRP_DEAD
cpuset: use rcu_read_lock() to protect task_cs()
cgroup_freezer: document freezer_fork() subtleties
cgroup: update cgroup_transfer_tasks() to either succeed or fail
cgroup: drop task_lock() protection around task->cgroups
cgroup: update how a newly forked task gets associated with css_set
...
Conflicts:
drivers/net/ethernet/marvell/mvneta.c
The mvneta.c conflict is a case of overlapping changes,
a conversion to devm_ioremap_resource() vs. a conversion
to netdev_alloc_pcpu_stats.
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems I missed one change in get_timewait4_sock() to compute
the remaining time before deletion of IPV4 timewait socket.
This could result in wrong output in /proc/net/tcp for tm->when field.
Fixes: 96f817fede ("tcp: shrink tcp6_timewait_sock by one cache line")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to allocate 15 bytes in excess for a SYNACK packet,
as it contains no data, only headers.
SYNACK are always generated in softirq context, and contain a single
segment, we can use TCP_INC_STATS_BH()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit d4589926d7 (tcp: refine TSO splits), tcp_nagle_check() does
not use parameter mss_now anymore.
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 10ddceb22b (ip_tunnel:multicast process cause panic due
to skb->_skb_refdst NULL pointer) removed dst-drop call from
ip-tunnel-recv.
Following commit reintroduce dst-drop and fix the original bug by
checking loopback packet before releasing dst.
Original bug: https://bugzilla.kernel.org/show_bug.cgi?id=70681
CC: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
Documentation/devicetree/bindings/net/micrel-ks8851.txt
net/core/netpoll.c
The net/core/netpoll.c conflict is a bug fix in 'net' happening
to code which is completely removed in 'net-next'.
In micrel-ks8851.txt we simply have overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_rt_dump do nothing after IPv4 route caches removal, so we can remove it.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4_ifdown_dst does nothing after IPv4 route caches removal,
so we can remove it.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 8cd3ac9f9b ("ipmr: advertise new mfc entries via rtnl") reuses the
function ipmr_fill_mroute() to notify mfc events.
But this function was used only for dump and thus was always setting the
flag NLM_F_MULTI, which is wrong in case of a single notification.
Libraries like libnl will wait forever for NLMSG_DONE.
CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
cftype->write_string() just passes on the writeable buffer from kernfs
and there's no reason to add const restriction on the buffer. The
only thing const achieves is unnecessarily complicating parsing of the
buffer. Drop const from @buffer.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Steffen Klassert says:
====================
One patch to rename a newly introduced struct. The rest is
the rework of the IPsec virtual tunnel interface for ipv6 to
support inter address family tunneling and namespace crossing.
1) Rename the newly introduced struct xfrm_filter to avoid a
conflict with iproute2. From Nicolas Dichtel.
2) Introduce xfrm_input_afinfo to access the address family
dependent tunnel callback functions properly.
3) Add and use a IPsec protocol multiplexer for ipv6.
4) Remove dst_entry caching. vti can lookup multiple different
dst entries, dependent of the configured xfrm states. Therefore
it does not make to cache a dst_entry.
5) Remove caching of flow informations. vti6 does not use the the
tunnel endpoint addresses to do route and xfrm lookups.
6) Update the vti6 to use its own receive hook.
7) Remove the now unused xfrm_tunnel_notifier. This was used from vti
and is replaced by the IPsec protocol multiplexer hooks.
8) Support inter address family tunneling for vti6.
9) Check if the tunnel endpoints of the xfrm state and the vti interface
are matching and return an error otherwise.
10) Enable namespace crossing for vti devices.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for net-next,
most relevantly they are:
* cleanup to remove double semicolon from stephen hemminger.
* calm down sparse warning in xt_ipcomp, from Fan Du.
* nf_ct_labels support for nf_tables, from Florian Westphal.
* new macros to simplify rcu dereferences in the scope of nfnetlink
and nf_tables, from Patrick McHardy.
* Accept queue and drop (including reason for drop) to verdict
parsing in nf_tables, also from Patrick.
* Remove unused random seed initialization in nfnetlink_log, from
Florian Westphal.
* Allow to attach user-specific information to nf_tables rules, useful
to attach user comments to rule, from me.
* Return errors in ipset according to the manpage documentation, from
Jozsef Kadlecsik.
* Fix coccinelle warnings related to incorrect bool type usage for ipset,
from Fengguang Wu.
* Add hash:ip,mark set type to ipset, from Vytas Dauksa.
* Fix message for each spotted by ipset for each netns that is created,
from Ilia Mirkin.
* Add forceadd option to ipset, which evicts a random entry from the set
if it becomes full, from Josh Hunt.
* Minor IPVS cleanups and fixes from Andi Kleen and Tingwei Liu.
* Improve conntrack scalability by removing a central spinlock, original
work from Eric Dumazet. Jesper Dangaard Brouer took them over to address
remaining issues. Several patches to prepare this change come in first
place.
* Rework nft_hash to resolve bugs (leaking chain, missing rcu synchronization
on element removal, etc. from Patrick McHardy.
* Restore context in the rule deletion path, as we now release rule objects
synchronously, from Patrick McHardy. This gets back event notification for
anonymous sets.
* Fix NAT family validation in nft_nat, also from Patrick.
* Improve scalability of xt_connlimit by using an array of spinlocks and
by introducing a rb-tree of hashtables for faster lookup of accounted
objects per network. This patch was preceded by several patches and
refactorizations to accomodate this change including the use of kmem_cache,
from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace the bh safe variant with the hard irq safe variant.
We need a hard irq safe variant to deal with netpoll transmitting
packets from hard irq context, and we need it in most if not all of
the places using the bh safe variant.
Except on 32bit uni-processor the code is exactly the same so don't
bother with a bh variant, just have a hard irq safe variant that
everyone can use.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/r8152.c
drivers/net/xen-netback/netback.c
Both the r8152 and netback conflicts were simple overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 can be build as a module, so we need mechanism to access
the address family dependent callback functions properly.
Therefore we introduce xfrm_input_afinfo, similar to that
what we have for the address family dependent part of
policies and states.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
All skb in socket write queue should be properly timestamped.
In case of FastOpen, we special case the SYN+DATA 'message' as we
queue in socket wrote queue the two fallback skbs:
1) SYN message by itself.
2) DATA segment by itself.
We should make sure these skbs have proper timestamps.
Add a WARN_ON_ONCE() to eventually catch future violations.
Fixes: 740b0f1841 ("tcp: switch rtt estimations to usec resolution")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Usage of skb->tstamp should remain private to TCP stack
(only set on packets on write queue, not on cloned ones)
Otherwise, packets given to loopback interface with a non null tstamp
can confuse netif_rx() / net_timestamp_check()
Other possibility would be to clear tstamp in loopback_xmit(),
as done in skb_scrub_packet()
Fixes: 740b0f1841 ("tcp: switch rtt estimations to usec resolution")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quoting Alexander Aring:
While fragmentation and unloading of 6lowpan module I got this kernel Oops
after few seconds:
BUG: unable to handle kernel paging request at f88bbc30
[..]
Modules linked in: ipv6 [last unloaded: 6lowpan]
Call Trace:
[<c012af4c>] ? call_timer_fn+0x54/0xb3
[<c012aef8>] ? process_timeout+0xa/0xa
[<c012b66b>] run_timer_softirq+0x140/0x15f
Problem is that incomplete frags are still around after unload; when
their frag expire timer fires, we get crash.
When a netns is removed (also done when unloading module), inet_frag
calls the evictor with 'force' argument to purge remaining frags.
The evictor loop terminates when accounted memory ('work') drops to 0
or the lru-list becomes empty. However, the mem accounting is done
via percpu counters and may not be accurate, i.e. loop may terminate
prematurely.
Alter evictor to only stop once the lru list is empty when force is
requested.
Reported-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Reported-by: Alexander Aring <alex.aring@gmail.com>
Tested-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Can be invoked from non-BH context.
Based upon a patch by Eric Dumazet.
Fixes: f19c29e3e3 ("tcp: snmp stats for Fast Open, SYN rtx, and data pkts")
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/wireless/ath/ath9k/recv.c
drivers/net/wireless/mwifiex/pcie.c
net/ipv6/sit.c
The SIT driver conflict consists of a bug fix being done by hand
in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper
was created (netdev_alloc_pcpu_stats()) which takes care of this.
The two wireless conflicts were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
I stumbled upon this very serious bug while hunting for another one,
it's a very subtle race condition between inet_frag_evictor,
inet_frag_intern and the IPv4/6 frag_queue and expire functions
(basically the users of inet_frag_kill/inet_frag_put).
What happens is that after a fragment has been added to the hash chain
but before it's been added to the lru_list (inet_frag_lru_add) in
inet_frag_intern, it may get deleted (either by an expired timer if
the system load is high or the timer sufficiently low, or by the
fraq_queue function for different reasons) before it's added to the
lru_list, then after it gets added it's a matter of time for the
evictor to get to a piece of memory which has been freed leading to a
number of different bugs depending on what's left there.
I've been able to trigger this on both IPv4 and IPv6 (which is normal
as the frag code is the same), but it's been much more difficult to
trigger on IPv4 due to the protocol differences about how fragments
are treated.
The setup I used to reproduce this is: 2 machines with 4 x 10G bonded
in a RR bond, so the same flow can be seen on multiple cards at the
same time. Then I used multiple instances of ping/ping6 to generate
fragmented packets and flood the machines with them while running
other processes to load the attacked machine.
*It is very important to have the _same flow_ coming in on multiple CPUs
concurrently. Usually the attacked machine would die in less than 30
minutes, if configured properly to have many evictor calls and timeouts
it could happen in 10 minutes or so.
An important point to make is that any caller (frag_queue or timer) of
inet_frag_kill will remove both the timer refcount and the
original/guarding refcount thus removing everything that's keeping the
frag from being freed at the next inet_frag_put. All of this could
happen before the frag was ever added to the LRU list, then it gets
added and the evictor uses a freed fragment.
An example for IPv6 would be if a fragment is being added and is at
the stage of being inserted in the hash after the hash lock is
released, but before inet_frag_lru_add executes (or is able to obtain
the lru lock) another overlapping fragment for the same flow arrives
at a different CPU which finds it in the hash, but since it's
overlapping it drops it invoking inet_frag_kill and thus removing all
guarding refcounts, and afterwards freeing it by invoking
inet_frag_put which removes the last refcount added previously by
inet_frag_find, then inet_frag_lru_add gets executed by
inet_frag_intern and we have a freed fragment in the lru_list.
The fix is simple, just move the lru_add under the hash chain locked
region so when a removing function is called it'll have to wait for
the fragment to be added to the lru_list, and then it'll remove it (it
works because the hash chain removal is done before the lru_list one
and there's no window between the two list adds when the frag can get
dropped). With this fix applied I couldn't kill the same machine in 24
hours with the same setup.
Fixes: 3ef0eb0db4 ("net: frag, move LRU list maintenance outside of
rwlock")
CC: Florian Westphal <fw@strlen.de>
CC: Jesper Dangaard Brouer <brouer@redhat.com>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the following snmp stats:
TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed beacuse
the remote does not accept it or the attempts timed out.
TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
retransmissions into SYN, fast-retransmits, timeout retransmits, etc.
TCPOrigDataSent: number of outgoing packets with original data (excluding
retransmission but including data-in-SYN). This counter is different from
TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is
more useful to track the TCP retransmission rate.
Change TCPFastOpenActive to track only successful Fast Opens to be symmetric to
TCPFastOpenPassive.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Lawrence Brakmo <brakmo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
when ip_tunnel process multicast packets, it may check if the packet is looped
back packet though 'rt_is_output_route(skb_rtable(skb))' in ip_tunnel_rcv(),
but before that , skb->_skb_refdst has been dropped in iptunnel_pull_header(),
so which leads to a panic.
fix the bug: https://bugzilla.kernel.org/show_bug.cgi?id=70681
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RTT may be bogus with tall loss probe (TLP) when a packet
is retransmitted and latter (s)acked without TCPCB_SACKED_RETRANS flag.
For example, TLP calls __tcp_retransmit_skb() instead of
tcp_retransmit_skb(). The skb timestamps are updated but the sacked
flag is not marked with TCPCB_SACKED_RETRANS. As a result we'll
get bogus RTT in tcp_clean_rtx_queue() or in tcp_sacktag_one() on
spurious retransmission.
The fix is to apply the sticky flag TCP_EVER_RETRANS to enforce Karn's
check on RTT sampling. However this will disable F-RTO if timeout occurs
after TLP, by resetting undo_marker in tcp_enter_loss(). We relax this
check to only if any pending retransmists are still in-flight.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
This is the rework of the IPsec virtual tunnel interface
for ipv4 to support inter address family tunneling and
namespace crossing. The only change to the last RFC version
is a compile fix for an odd configuration where CONFIG_XFRM
is set but CONFIG_INET is not set.
1) Add and use a IPsec protocol multiplexer.
2) Add xfrm_tunnel_skb_cb to the skb common buffer
to store a receive callback there.
3) Make vti work with i_key set by not including the i_key
when comupting the hash for the tunnel lookup in case of
vti tunnels.
4) Update ip_vti to use it's own receive hook.
5) Remove xfrm_tunnel_notifier, this is replaced by the IPsec
protocol multiplexer.
6) We need to be protocol family indepenent, so use the on xfrm_lookup
returned dst_entry instead of the ipv4 rtable in vti_tunnel_xmit().
7) Add support for inter address family tunneling.
8) Check if the tunnel endpoints of the xfrm state and the vti interface
are matching and return an error otherwise.
8) Enable namespace crossing tor vti devices.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Upcoming congestion controls for TCP require usec resolution for RTT
estimations. Millisecond resolution is simply not enough these days.
FQ/pacing in DC environments also require this change for finer control
and removal of bimodal behavior due to the current hack in
tcp_update_pacing_rate() for 'small rtt'
TCP_CONG_RTT_STAMP is no longer needed.
As Julian Anastasov pointed out, we need to keep user compatibility :
tcp_metrics used to export RTT and RTTVAR in msec resolution,
so we added RTT_US and RTTVAR_US. An iproute2 patch is needed
to use the new attributes if provided by the kernel.
In this example ss command displays a srtt of 32 usecs (10Gbit link)
lpk51:~# ./ss -i dst lpk52
Netid State Recv-Q Send-Q Local Address:Port Peer
Address:Port
tcp ESTAB 0 1 10.246.11.51:42959
10.246.11.52:64614
cubic wscale:6,6 rto:201 rtt:0.032/0.001 ato:40 mss:1448
cwnd:10 send
3620.0Mbps pacing_rate 7240.0Mbps unacked:1 rcv_rtt:993 rcv_space:29559
Updated iproute2 ip command displays :
lpk51:~# ./ip tcp_metrics | grep 10.246.11.52
10.246.11.52 age 561.914sec cwnd 10 rtt 274us rttvar 213us source
10.246.11.51
Old binary displays :
lpk51:~# ip tcp_metrics | grep 10.246.11.52
10.246.11.52 age 561.914sec cwnd 10 rtt 250us rttvar 125us source
10.246.11.51
With help from Julian Anastasov, Stephen Hemminger and Yuchung Cheng
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Larry Brakmo <brakmo@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP_PMTUDISC_INTERFACE has a design error: because it does not allow the
generation of fragments if the interface mtu is exceeded, it is very
hard to make use of this option in already deployed name server software
for which I introduced this option.
This patch adds yet another new IP_MTU_DISCOVER option to not honor any
path mtu information and not accepting new icmp notifications destined for
the socket this option is enabled on. But we allow outgoing fragmentation
in case the packet size exceeds the outgoing interface mtu.
As such this new option can be used as a drop-in replacement for
IP_PMTUDISC_DONT, which is currently in use by most name server software
making the adoption of this option very smooth and easy.
The original advantage of IP_PMTUDISC_INTERFACE is still maintained:
ignoring incoming path MTU updates and not honoring discovered path MTUs
in the output path.
Fixes: 482fc6094a ("ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE")
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_skb_dst_mtu mostly falls back to ip_dst_mtu_maybe_forward if no socket
is attached to the skb (in case of forwarding) or determines the mtu like
we do in ip_finish_output, which actually checks if we should branch to
ip_fragment. Thus use the same function to determine the mtu here, too.
This is important for the introduction of IP_PMTUDISC_OMIT, where we
want the packets getting cut in pieces of the size of the outgoing
interface mtu. IPv6 already does this correctly.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Three counters are added:
- one to track when we went from non-zero to zero window
- one to track the reverse
- one counter incremented when we want to announce zero window,
but can't because we would shrink current window.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES can only be incremented
in tcp_transmit_skb() from softirq (incoming message or timer
activation), it is better to use NET_INC_STATS() instead of
NET_INC_STATS_BH() as tcp_transmit_skb() can be called from process
context.
This will avoid copy/paste confusion when/if we want to add
other SNMP counters in tcp_transmit_skb()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the UFO fragmentation process does not correctly handle inner
UDP frames.
(The following tcpdumps are captured on the parent interface with ufo
disabled while tunnel has ufo enabled, 2000 bytes payload, mtu 1280,
both sit device):
IPv6:
16:39:10.031613 IP (tos 0x0, ttl 64, id 3208, offset 0, flags [DF], proto IPv6 (41), length 1300)
192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 1240) 2001::1 > 2001::8: frag (0x00000001:0|1232) 44883 > distinct: UDP, length 2000
16:39:10.031709 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPv6 (41), length 844)
192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 784) 2001::1 > 2001::8: frag (0x00000001:0|776) 58979 > 46366: UDP, length 5471
We can see that fragmentation header offset is not correctly updated.
(fragmentation id handling is corrected by 916e4cf46d ("ipv6: reuse
ip6_frag_id from ip6_ufo_append_data")).
IPv4:
16:39:57.737761 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPIP (4), length 1296)
192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57034, offset 0, flags [none], proto UDP (17), length 1276)
192.168.99.1.35961 > 192.168.99.2.distinct: UDP, length 2000
16:39:57.738028 IP (tos 0x0, ttl 64, id 3210, offset 0, flags [DF], proto IPIP (4), length 792)
192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57035, offset 0, flags [none], proto UDP (17), length 772)
192.168.99.1.13531 > 192.168.99.2.20653: UDP, length 51109
In this case fragmentation id is incremented and offset is not updated.
First, I aligned inet_gso_segment and ipv6_gso_segment:
* align naming of flags
* ipv6_gso_segment: setting skb->encapsulation is unnecessary, as we
always ensure that the state of this flag is left untouched when
returning from upper gso segmenation function
* ipv6_gso_segment: move skb_reset_inner_headers below updating the
fragmentation header data, we don't care for updating fragmentation
header data
* remove currently unneeded comment indicating skb->encapsulation might
get changed by upper gso_segment callback (gre and udp-tunnel reset
encapsulation after segmentation on each fragment)
If we encounter an IPIP or SIT gso skb we now check for the protocol ==
IPPROTO_UDP and that we at least have already traversed another ip(6)
protocol header.
The reason why we have to special case GSO_IPIP and GSO_SIT is that
we reset skb->encapsulation to 0 while skb_mac_gso_segment the inner
protocol of GSO_UDP_TUNNEL or GSO_GRE packets.
Reported-by: Wolfgang Walter <linux@stwm.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tunnel endpoints of the xfrm_state we got from the xfrm_lookup
must match the tunnel endpoints of the vti interface. This patch
ensures this matching.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
With this patch we can tunnel ipv6 traffic via a vti4
interface. A vti4 interface can now have an ipv6 address
and ipv6 traffic can be routed via a vti4 interface.
The resulting traffic is xfrm transformed and tunneled
throuhg ipv4 if matching IPsec policies and states are
present.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
We need to be protocol family indepenent to support
inter addresss family tunneling with vti. So use a
dst_entry instead of the ipv4 rtable in vti_tunnel_xmit.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This was used from vti and is replaced by the IPsec protocol
multiplexer hooks. It is now unused, so remove it.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
With this patch, vti uses the IPsec protocol multiplexer to
register it's own receive side hooks for ESP, AH and IPCOMP.
Vti now does the following on receive side:
1. Do an input policy check for the IPsec packet we received.
This is required because this packet could be already
prosecces by IPsec, so an inbuond policy check is needed.
2. Mark the packet with the i_key. The policy and the state
must match this key now. Policy and state belong to the outer
namespace and policy enforcement is done at the further layers.
3. Call the generic xfrm layer to do decryption and decapsulation.
4. Wait for a callback from the xfrm layer to properly clean the
skb to not leak informations on namespace and to update the
device statistics.
On transmit side:
1. Mark the packet with the o_key. The policy and the state
must match this key now.
2. Do a xfrm_lookup on the original packet with the mark applied.
3. Check if we got an IPsec route.
4. Clean the skb to not leak informations on namespace
transitions.
5. Attach the dst_enty we got from the xfrm_lookup to the skb.
6. Call dst_output to do the IPsec processing.
7. Do the device statistics.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Vti uses the o_key to mark packets that were transmitted or received
by a vti interface. Unfortunately we can't apply different marks
to in and outbound packets with only one key availabe. Vti interfaces
typically use wildcard selectors for vti IPsec policies. On forwarding,
the same output policy will match for both directions. This generates
a loop between the IPsec gateways until the ttl of the packet is
exceeded.
The gre i_key/o_key are usually there to find the right gre tunnel
during a lookup. When vti uses the i_key to mark packets, the tunnel
lookup does not work any more because vti does not use the gre keys
as a hash key for the lookup.
This patch workarounds this my not including the i_key when comupting
the hash for the tunnel lookup in case of vti tunnels.
With this we have separate keys available for the transmitting and
receiving side of the vti interface.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
IPsec vti_rcv needs to remind the tunnel pointer to
check it later at the vti_rcv_cb callback. So add
this pointer to the IPsec common buffer, initialize
it and check it to avoid transport state matching of
a tunneled packet.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch add an IPsec protocol multiplexer. With this
it is possible to add alternative protocol handlers as
needed for IPsec virtual tunnel interfaces.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
tcp_is_cwnd_limited() allows GSO/TSO enabled flows to increase
their cwnd to allow a full size (64KB) TSO packet to be sent.
Non GSO flows only allow an extra room of 3 MSS.
For most flows with a BDP below 10 MSS, this results in a bloat
of cwnd reaching 90, and an inflate of RTT.
Thanks to TSO auto sizing, we can restrict the bloat to the number
of MSS contained in a TSO packet (tp->xmit_size_goal_segs), to keep
original intent without performance impact.
Because we keep cwnd small, it helps to keep TSO packet size to their
optimal value.
Example for a 10Mbit flow, with low TCP Small queue limits (no more than
2 skb in qdisc/device tx ring)
Before patch :
lpk51:~# ./ss -i dst lpk52:44862 | grep cwnd
cubic wscale:6,6 rto:215 rtt:15.875/2.5 mss:1448 cwnd:96
ssthresh:96
send 70.1Mbps unacked:14 rcv_space:29200
After patch :
lpk51:~# ./ss -i dst lpk52:52916 | grep cwnd
cubic wscale:6,6 rto:206 rtt:5.206/0.036 mss:1448 cwnd:15
ssthresh:14
send 33.4Mbps unacked:4 rcv_space:29200
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Van Jacobson <vanj@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
1) Introduce skb_to_sgvec_nomark function to add further data to the sg list
without calling sg_unmark_end first. Needed to add extended sequence
number informations. From Fan Du.
2) Add IPsec extended sequence numbers support to the Authentication Header
protocol for ipv4 and ipv6. From Fan Du.
3) Make the IPsec flowcache namespace aware, from Fan Du.
4) Avoid creating temporary SA for every packet when no key manager is
registered. From Horia Geanta.
5) Support filtering of SA dumps to show only the SAs that match a
given filter. From Nicolas Dichtel.
6) Remove caching of xfrm_policy_sk_bundles. The cached socket policy bundles
are never used, instead we create a new cache entry whenever xfrm_lookup()
is called on a socket policy. Most protocols cache the used routes to the
socket, so this caching is not needed.
7) Fix a forgotten SADB_X_EXT_FILTER length check in pfkey, from Nicolas
Dichtel.
8) Cleanup error handling of xfrm_state_clone.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes two bugs in fastopen :
1) The tcp_sendmsg(..., @size) argument was ignored.
Code was relying on user not fooling the kernel with iovec mismatches
2) When MTU is about 64KB, tcp_send_syn_data() attempts order-5
allocations, which are likely to fail when memory gets fragmented.
Fixes: 783237e8da ("net-tcp: Fast Open client - sending SYN-data")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Tested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bug introduced by commit 7d442fab0a ("ipv4: Cache dst in tunnels").
Because sit code does not call ip_tunnel_init(), the dst_cache was not
initialized.
CC: Tom Herbert <therbert@google.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/built-in.o:(.rodata+0x1707c): undefined reference to `ip_tunnel_get_stats64'
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the kernel tries to announce a zero window when free_space
is below the current receiver mss estimate.
When a sender is transmitting small packets and reader consumes data
slowly (or not at all), receiver might be unable to shrink the receive
win because
a) we cannot withdraw already-commited receive window, and,
b) we have to round the current rwin up to a multiple of the wscale
factor, else we would shrink the current window.
This causes the receive buffer to fill up until the rmem limit is hit.
When this happens, we start dropping packets.
Moreover, tcp_clamp_window may continue to grow sk_rcvbuf towards rmem[2]
even if socket is not being read from.
As we cannot avoid the "current_win is rounded up to multiple of mss"
issue [we would violate a) above] at least try to prevent the receive buf
growth towards tcp_rmem[2] limit by attempting to move to zero-window
announcement when free_space becomes less than 1/16 of the current
allowed receive buffer maximum. If tcp_rmem[2] is large, this will
increase our chances to get a zero-window announcement out in time.
Reproducer:
On server:
$ nc -l -p 12345
<suspend it: CTRL-Z>
Client:
#!/usr/bin/env python
import socket
import time
sock = socket.socket()
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
sock.connect(("192.168.4.1", 12345));
while True:
sock.send('A' * 23)
time.sleep(0.005)
socket buffer on server-side will grow until tcp_rmem[2] is hit,
at which point the client rexmits data until -EDTIMEOUT:
tcp_data_queue invokes tcp_try_rmem_schedule which will call
tcp_prune_queue which calls tcp_clamp_window(). And that function will
grow sk->sk_rcvbuf up until it eventually hits tcp_rmem[2].
Thanks to Eric Dumazet for running regression tests.
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case we decide in udp6_sendmsg to send the packet down the ipv4
udp_sendmsg path because the destination is either of family AF_INET or
the destination is an ipv4 mapped ipv6 address, we don't honor the
maybe specified ipv4 mapped ipv6 address in IPV6_PKTINFO.
We simply can check for this option in ip_cmsg_send because no calls to
ipv6 module functions are needed to do so.
Reported-by: Gert Doering <gert@space.net>
Cc: Tore Anderson <tore@fud.no>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
* Fix nf_trace in nftables if XT_TRACE=n, from Florian Westphal.
* Don't use the fast payload operation in nf_tables if the length is
not power of 2 or it is not aligned, from Nikolay Aleksandrov.
* Fix missing break statement the inet flavour of nft_reject, which
results in evaluating IPv4 packets with the IPv6 evaluation routine,
from Patrick McHardy.
* Fix wrong kconfig symbol in nft_meta to match the routing realm,
from Paul Bolle.
* Allocate the NAT null binding when creating new conntracks via
ctnetlink to avoid that several packets race at initializing the
the conntrack NAT extension, original patch from Florian Westphal,
revisited version from me.
* Fix DNAT handling in the snmp NAT helper, the same handling was being
done for SNAT and DNAT and 2.4 already contains that fix, from
Francois-Xavier Le Bail.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/bonding/bond_3ad.h
drivers/net/bonding/bond_main.c
Two minor conflicts in bonding, both of which were overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
since commit 89aef8921bf("ipv4: Delete routing cache."), the counter
in_slow_tot can't work correctly.
The counter in_slow_tot increase by one when fib_lookup() return successfully
in ip_route_input_slow(), but actually the dst struct maybe not be created and
cached, so we can increase in_slow_tot after the dst struct is created.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using nftables with CONFIG_NETFILTER_XT_TARGET_TRACE=n, we get
lots of "TRACE: filter:output:policy:1 IN=..." warnings as several
places will leave skb->nf_trace uninitialised.
Unlike iptables tracing functionality is not conditional in nftables,
so always copy/zero nf_trace setting when nftables is enabled.
Move this into __nf_copy() helper.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently this always returns ENOBUFS, because the return value of
__ip_tunnel_create is discarded.
A more common failure is a duplicate name (EEXIST). Propagate the real
error code so userspace can display a more meaningful error message.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
since commit 251da413("ipv4: Cache ip_error() routes even when not forwarding."),
the counter IPSTATS_MIB_INADDRERRORS can't work correctly, because the value of
err was always set to ENETUNREACH.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add two new fields to struct tcp_info, to report sk_pacing_rate
and sk_max_pacing_rate to monitoring applications, as ss from iproute2.
User exported fields are 64bit, even if kernel is currently using 32bit
fields.
lpaa5:~# ss -i
..
skmem:(r0,rb357120,t0,tb2097152,f1584,w1980880,o0,bl0) ts sack cubic
wscale:6,6 rto:400 rtt:0.875/0.75 mss:1448 cwnd:1 ssthresh:12 send
13.2Mbps pacing_rate 3336.2Mbps unacked:15 retrans:1/5448 lost:15
rcv_space:29200
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are many drivers calling alloc_percpu() to allocate pcpu stats
and then initializing ->syncp. So just introduce a helper function for them.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The solution was found by Patrick in 2.4 kernel sources.
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Even if the 'time_before' macro expand with parentheses, the look is bad.
Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Packets which have L2 address different from ours should be
already filtered before entering into ip_forward().
Perform that check at the beginning to avoid processing such packets.
Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
One of my pet coding style peeves is the practice of
adding extra return; at the end of function.
Kill several instances of this in network code.
I suppose some coccinelle wizardy could do this automatically.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 684bad1107 "tcp: use PRR to reduce cwin in CWR state" removed all
calls to min_cwnd, so we can safely remove it.
Also, remove tcp_reno_min_cwnd because it was only used for min_cwnd.
Signed-off-by: Stanislav Fomichev <stfomichev@yandex-team.ru>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner reported problems when the forwarding link path
has a lower mtu than the incoming one if the inbound interface supports GRO.
Given:
Host <mtu1500> R1 <mtu1200> R2
Host sends tcp stream which is routed via R1 and R2. R1 performs GRO.
In this case, the kernel will fail to send ICMP fragmentation needed
messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu
checks in forward path. Instead, Linux tries to send out packets exceeding
the mtu.
When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
not fragment the packets when forwarding, and again tries to send out
packets exceeding R1-R2 link mtu.
This alters the forwarding dstmtu checks to take the individual gso
segment lengths into account.
For ipv6, we send out pkt too big error for gso if the individual
segments are too big.
For ipv4, we either send icmp fragmentation needed, or, if the DF bit
is not set, perform software segmentation and let the output path
create fragments when the packet is leaving the machine.
It is not 100% correct as the error message will contain the headers of
the GRO skb instead of the original/segmented one, but it seems to
work fine in my (limited) tests.
Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid
sofware segmentation.
However it turns out that skb_segment() assumes skb nr_frags is related
to mss size so we would BUG there. I don't want to mess with it considering
Herbert and Eric disagree on what the correct behavior should be.
Hannes Frederic Sowa notes that when we would shrink gso_size
skb_segment would then also need to deal with the case where
SKB_MAX_FRAGS would be exceeded.
This uses sofware segmentation in the forward path when we hit ipv4
non-DF packets and the outgoing link mtu is too small. Its not perfect,
but given the lack of bug reports wrt. GRO fwd being broken this is a
rare case anyway. Also its not like this could not be improved later
once the dust settles.
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add esn support for AH input stage by attaching upper 32bits
sequence number right after packet payload as specified by RFC 4302.
Then the ICV value will guard upper 32bits sequence number as well when
packet getting in.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch add esn support for AH output stage by attaching upper 32bits
sequence number right after packet payload as specified by RFC 4302.
Then the ICV value will guard upper 32bits sequence number as well when
packet going out.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Commit 46d3ceabd8 ("tcp: TCP Small Queues") introduced a possible
regression for applications using TCP_NODELAY.
If TCP session is throttled because of tsq, we should consult
tp->nonagle when TX completion is done and allow us to send additional
segment, especially if this segment is not a full MSS.
Otherwise this segment is sent after an RTO.
[edumazet] : Cooked the changelog, added another fix about testing
sk_wmem_alloc twice because TX completion can happen right before
setting TSQ_THROTTLED bit.
This problem is particularly visible with recent auto corking,
but might also be triggered with low tcp_limit_output_bytes
values or NIC drivers delaying TX completion by hundred of usec,
and very low rtt.
Thomas Glanzmann for example reported an iscsi regression, caused
by tcp auto corking making this bug quite visible.
Fixes: 46d3ceabd8 ("tcp: TCP Small Queues")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Thomas Glanzmann <thomas@glanzmann.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
As far as I can tell we have used a default of 60 seconds for
FIN_WAIT2 timeout for ages (since 2.x times??).
In any case, the timeout these days is 60 seconds, so the 3 min
comment is wrong (and cost me a few minutes of my life when I was
debugging a FIN_WAIT2 related problem in a userspace application and
checked the kernel source for details).
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/nftables/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes, mostly nftables
fixes, most relevantly they are:
* Fix a crash in the h323 conntrack NAT helper due to expectation list
corruption, from Alexey Dobriyan.
* A couple of RCU race fixes for conntrack, one manifests by hitting BUG_ON
in nf_nat_setup_info() and the destroy path, patches from Andrey Vagin and
me.
* Dump direction attribute in nft_ct only if it is set, from Arturo
Borrero.
* Fix IPVS bug in its own connection tracking system that may lead to
copying only 4 bytes of the IPv6 address when initializing the
ip_vs_conn object, from Michal Kubecek.
* Fix -EBUSY errors in nftables when deleting the rules, chain and tables
in a row due mixture of asynchronous and synchronous object releasing,
from me.
* Three fixes for the nf_tables set infrastructure when using intervals and
mappings, from me.
* Four patches to fixing the nf_tables log, reject and ct expressions from
the new inet table, from Patrick McHardy.
* Fix memory overrun in the map that is used to dynamically allocate names
from anonymous sets, also from Patrick.
* Fix a potential oops if you dump a set with NFPROTO_UNSPEC and a table
name, from Patrick McHardy.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
cgroup_subsys is a bit messier than it needs to be.
* The name of a subsys can be different from its internal identifier
defined in cgroup_subsys.h. Most subsystems use the matching name
but three - cpu, memory and perf_event - use different ones.
* cgroup_subsys_id enums are postfixed with _subsys_id and each
cgroup_subsys is postfixed with _subsys. cgroup.h is widely
included throughout various subsystems, it doesn't and shouldn't
have claim on such generic names which don't have any qualifier
indicating that they belong to cgroup.
* cgroup_subsys->subsys_id should always equal the matching
cgroup_subsys_id enum; however, we require each controller to
initialize it and then BUG if they don't match, which is a bit
silly.
This patch cleans up cgroup_subsys names and initialization by doing
the followings.
* cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
cgroup_subsys with _cgrp_subsys.
* With the above, renaming subsys identifiers to match the userland
visible names doesn't cause any naming conflicts. All non-matching
identifiers are renamed to match the official names.
cpu_cgroup -> cpu
mem_cgroup -> memory
perf -> perf_event
* controllers no longer need to initialize ->subsys_id and ->name.
They're generated in cgroup core and set automatically during boot.
* Redundant cgroup_subsys declarations removed.
* While updating BUG_ON()s in cgroup_init_early(), convert them to
WARN()s. BUGging that early during boot is stupid - the kernel
can't print anything, even through serial console and the trap
handler doesn't even link stack frame properly for back-tracing.
This patch doesn't introduce any behavior changes.
v2: Rebased on top of fe1217c4f3 ("net: net_cls: move cgroupfs
classid handling into core").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Ingo Molnar <mingo@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
TCP pacing depends on an accurate srtt estimation.
Current srtt estimation is using jiffie resolution,
and has an artificial offset of at least 1 ms, which can produce
slowdowns when FQ/pacing is used, especially in DC world,
where typical rtt is below 1 ms.
We are planning a switch to usec resolution for linux-3.15,
but in the meantime, this patch removes the 1 ms offset.
All we need is to have tp->srtt minimal value of 1 to differentiate
the case of srtt being initialized or not, not 8.
The problematic behavior was observed on a 40Gbit testbed,
where 32 concurrent netperf were reaching 12Gbps of aggregate
speed, instead of line speed.
This patch also has the effect of reporting more accurate srtt and send
rates to iproute2 ss command as in :
$ ss -i dst cca2
Netid State Recv-Q Send-Q Local Address:Port
Peer Address:Port
tcp ESTAB 0 0 10.244.129.1:56984
10.244.129.2:12865
cubic wscale:6,6 rto:200 rtt:0.25/0.25 ato:40 mss:1448 cwnd:10 send
463.4Mbps rcv_rtt:1 rcv_space:29200
tcp ESTAB 0 390960 10.244.129.1:60247
10.244.129.2:50204
cubic wscale:6,6 rto:200 rtt:0.875/0.75 mss:1448 cwnd:73 ssthresh:51
send 966.4Mbps unacked:73 retrans:0/121 rcv_space:29200
Reported-by: Vytautas Valancius <valas@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a reject module for NFPROTO_INET. It does nothing but dispatch
to the AF-specific modules based on the hook family.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently the nft_reject module depends on symbols from ipv6. This is
wrong since no generic module should force IPv6 support to be loaded.
Split up the module into AF-specific and a generic part.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
RCU writer side should use rcu_dereference_protected() and not
rcu_dereference(), fix that. This also removes the "suspicious RCU usage"
warning seen when running with CONFIG_PROVE_RCU.
Also, don't use rcu_assign_pointer/rcu_dereference for pointers
which are invisible beyond the udp offload code.
Fixes: b582ef0 ('net: Add GRO support for UDP encapsulating protocols')
Reported-by: Eric Dumazet <edumazet@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Setting rt variable to NULL at the beginning of ip_tunnel_xmit()
missed possible use of this variable as a scratch value.
Also fixes a possible dst leak in tunnel_dst_check() :
If we had to call tunnel_dst_reset(), we forgot to
release the reference on dst.
Merges tunnel_dst_get()/tunnel_dst_check() into
a single tunnel_rtable_get() function for clarity.
Many thanks to Tommi for his report and tests.
Fixes: 7d442fab0a ("ipv4: Cache dst in tunnels")
Reported-by: Tommi Rantala <tt.rantala@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Tommi Rantala <tt.rantala@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since udp_add_offload() can be called from non-sleepable context e.g
under this call tree from the vxlan driver use case:
vxlan_socket_create() <-- holds the spinlock
-> vxlan_notify_add_rx_port()
-> udp_add_offload() <-- schedules
we should allocate the udp_offloads structure in atomic manner.
Fixes: b582ef0 ('net: Add GRO support for UDP encapsulating protocols')
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When dealing with icmp messages, the skb->data points the
ip header that triggered the sending of the icmp message.
In gre_cisco_err(), the parse_gre_header() is called, and the
iptunnel_pull_header() is called to pull the skb at the end of
the parse_gre_header(), so the skb->data doesn't point the
inner ip header.
Unfortunately, the ipgre_err still needs those ip addresses in
inner ip header to look up tunnel by ip_tunnel_lookup().
So just use icmp_hdr() to get inner ip header instead of skb->data.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I see a memory leak when using a transparent HTTP proxy using TPROXY
together with TCP early demux and Kernel v3.8.13.15 (Ubuntu stable):
unreferenced object 0xffff88008cba4a40 (size 1696):
comm "softirq", pid 0, jiffies 4294944115 (age 8907.520s)
hex dump (first 32 bytes):
0a e0 20 6a 40 04 1b 37 92 be 32 e2 e8 b4 00 00 .. j@..7..2.....
02 00 07 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff810b710a>] kmem_cache_alloc+0xad/0xb9
[<ffffffff81270185>] sk_prot_alloc+0x29/0xc5
[<ffffffff812702cf>] sk_clone_lock+0x14/0x283
[<ffffffff812aaf3a>] inet_csk_clone_lock+0xf/0x7b
[<ffffffff8129a893>] netlink_broadcast+0x14/0x16
[<ffffffff812c1573>] tcp_create_openreq_child+0x1b/0x4c3
[<ffffffff812c033e>] tcp_v4_syn_recv_sock+0x38/0x25d
[<ffffffff812c13e4>] tcp_check_req+0x25c/0x3d0
[<ffffffff812bf87a>] tcp_v4_do_rcv+0x287/0x40e
[<ffffffff812a08a7>] ip_route_input_noref+0x843/0xa55
[<ffffffff812bfeca>] tcp_v4_rcv+0x4c9/0x725
[<ffffffff812a26f4>] ip_local_deliver_finish+0xe9/0x154
[<ffffffff8127a927>] __netif_receive_skb+0x4b2/0x514
[<ffffffff8127aa77>] process_backlog+0xee/0x1c5
[<ffffffff8127c949>] net_rx_action+0xa7/0x200
[<ffffffff81209d86>] add_interrupt_randomness+0x39/0x157
But there are many more, resulting in the machine going OOM after some
days.
From looking at the TPROXY code, and with help from Florian, I see
that the memory leak is introduced in tcp_v4_early_demux():
void tcp_v4_early_demux(struct sk_buff *skb)
{
/* ... */
iph = ip_hdr(skb);
th = tcp_hdr(skb);
if (th->doff < sizeof(struct tcphdr) / 4)
return;
sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
iph->saddr, th->source,
iph->daddr, ntohs(th->dest),
skb->skb_iif);
if (sk) {
skb->sk = sk;
where the socket is assigned unconditionally to skb->sk, also bumping
the refcnt on it. This is problematic, because in our case the skb
has already a socket assigned in the TPROXY target. This then results
in the leak I see.
The very same issue seems to be with IPv6, but haven't tested.
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
PTR_RET is deprecated. Use PTR_ERR_OR_ZERO instead. While at it
also include missing err.h header.
Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) BPF debugger and asm tool by Daniel Borkmann.
2) Speed up create/bind in AF_PACKET, also from Daniel Borkmann.
3) Correct reciprocal_divide and update users, from Hannes Frederic
Sowa and Daniel Borkmann.
4) Currently we only have a "set" operation for the hw timestamp socket
ioctl, add a "get" operation to match. From Ben Hutchings.
5) Add better trace events for debugging driver datapath problems, also
from Ben Hutchings.
6) Implement auto corking in TCP, from Eric Dumazet. Basically, if we
have a small send and a previous packet is already in the qdisc or
device queue, defer until TX completion or we get more data.
7) Allow userspace to manage ipv6 temporary addresses, from Jiri Pirko.
8) Add a qdisc bypass option for AF_PACKET sockets, from Daniel
Borkmann.
9) Share IP header compression code between Bluetooth and IEEE802154
layers, from Jukka Rissanen.
10) Fix ipv6 router reachability probing, from Jiri Benc.
11) Allow packets to be captured on macvtap devices, from Vlad Yasevich.
12) Support tunneling in GRO layer, from Jerry Chu.
13) Allow bonding to be configured fully using netlink, from Scott
Feldman.
14) Allow AF_PACKET users to obtain the VLAN TPID, just like they can
already get the TCI. From Atzm Watanabe.
15) New "Heavy Hitter" qdisc, from Terry Lam.
16) Significantly improve the IPSEC support in pktgen, from Fan Du.
17) Allow ipv4 tunnels to cache routes, just like sockets. From Tom
Herbert.
18) Add Proportional Integral Enhanced packet scheduler, from Vijay
Subramanian.
19) Allow openvswitch to mmap'd netlink, from Thomas Graf.
20) Key TCP metrics blobs also by source address, not just destination
address. From Christoph Paasch.
21) Support 10G in generic phylib. From Andy Fleming.
22) Try to short-circuit GRO flow compares using device provided RX
hash, if provided. From Tom Herbert.
The wireless and netfilter folks have been busy little bees too.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2064 commits)
net/cxgb4: Fix referencing freed adapter
ipv6: reallocate addrconf router for ipv6 address when lo device up
fib_frontend: fix possible NULL pointer dereference
rtnetlink: remove IFLA_BOND_SLAVE definition
rtnetlink: remove check for fill_slave_info in rtnl_have_link_slave_info
qlcnic: update version to 5.3.55
qlcnic: Enhance logic to calculate msix vectors.
qlcnic: Refactor interrupt coalescing code for all adapters.
qlcnic: Update poll controller code path
qlcnic: Interrupt code cleanup
qlcnic: Enhance Tx timeout debugging.
qlcnic: Use bool for rx_mac_learn.
bonding: fix u64 division
rtnetlink: add missing IFLA_BOND_AD_INFO_UNSPEC
sfc: Use the correct maximum TX DMA ring size for SFC9100
Add Shradha Shah as the sfc driver maintainer.
net/vxlan: Share RX skb de-marking and checksum checks with ovs
tulip: cleanup by using ARRAY_SIZE()
ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called
net/cxgb4: Don't retrieve stats during recovery
...
The two commits 0115e8e30d (net: remove delay at device dismantle) and
748e2d9396 (net: reinstate rtnl in call_netdevice_notifiers()) silently
removed a NULL pointer check for in_dev since Linux 3.7.
This patch re-introduces this check as it causes crashing the kernel when
setting small mtu values on non-ip capable netdevices.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the definition is centralized in <linux/kernel.h>, the
definitions of U32_MAX (and related) elsewhere in the kernel can be
removed.
Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Sage Weil <sage@inktank.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The symbol U32_MAX is defined in several spots. Change these
definitions to be conditional. This is in preparation for the next
patch, which centralizes the definition in <linux/kernel.h>.
Signed-off-by: Alex Elder <elder@linaro.org>
Cc: Sage Weil <sage@inktank.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit a622260254ee48("ip_tunnel: fix kernel panic with icmp_dest_unreach")
clear IPCB in ip_tunnel_xmit() , or else skb->cb[] may contain garbage from
GSO segmentation layer.
But commit 0e6fbc5b6c621("ip_tunnels: extend iptunnel_xmit()") refactor codes,
and it clear IPCB behind the dst_link_failure().
So clear IPCB in ip_tunnel_xmit() just like commti a622260254ee48("ip_tunnel:
fix kernel panic with icmp_dest_unreach").
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixed few issues around using __rcu prefix and rcu_assign_pointer, also
fixed a warning print to use ntohs(port) and not htons(port).
net/ipv4/udp_offload.c:112:9: error: incompatible types in comparison expression (different address spaces)
net/ipv4/udp_offload.c:113:9: error: incompatible types in comparison expression (different address spaces)
net/ipv4/udp_offload.c:176:19: error: incompatible types in comparison expression (different address spaces)
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A socket may be v6/v4-mapped. In that case sk->sk_family is AF_INET6,
but the IP being used is actually an IPv4-address.
Current's tcp-metrics will thus represent it as an IPv6-address:
root@server:~# ip tcp_metrics
::ffff:10.1.1.2 age 22.920sec rtt 18750us rttvar 15000us cwnd 10
10.1.1.2 age 47.970sec rtt 16250us rttvar 10000us cwnd 10
This patch modifies the tcp-metrics so that they are able to handle the
v6/v4-mapped sockets correctly.
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Workqueue used in ipv4 layer have no real dependency of scheduling these on the
cpu which scheduled them.
On a idle system, it is observed that an idle cpu wakes up many times just to
service this work. It would be better if we can schedule it on a cpu which the
scheduler believes to be the most appropriate one.
This patch replaces normal workqueues with power efficient versions. This
doesn't change existing behavior of code unless CONFIG_WQ_POWER_EFFICIENT is
enabled.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In bbf852b96e I introduced the tmlist, which allows to delete
multiple entries from the cache that match a specified destination if no
source-IP is specified.
However, as the cache is an RCU-list, we should not create this tmlist, as
it will change the tcpm_next pointer of the element that will be deleted
and so a thread iterating over the cache's entries while holding the
RCU-lock might get "redirected" to this tmlist.
This patch fixes this, by reverting back to the old behavior prior to
bbf852b96e, which means that we simply change the tcpm_next
pointer of the previous element (pp) to jump over the one we are
deleting.
The difference is that we call kfree_rcu() directly on the cache entry,
which allows us to delete multiple entries from the list.
Fixes: bbf852b96e (tcp: metrics: Delete all entries matching a certain destination)
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add GRO handlers for protocols that do UDP encapsulation, with the intent of
being able to coalesce packets which encapsulate packets belonging to
the same TCP session.
For GRO purposes, the destination UDP port takes the role of the ether type
field in the ethernet header or the next protocol in the IP header.
The UDP GRO handler will only attempt to coalesce packets whose destination
port is registered to have gro handler.
Use a mark on the skb GRO CB data to disallow (flush) running the udp gro receive
code twice on a packet. This solves the problem of udp encapsulated packets whose
inner VM packet is udp and happen to carry a port which has registered offloads.
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some ipv6 protocols cannot handle ipv4 addresses, so we must not allow
connecting and binding to them. sendmsg logic does already check msg->name
for this but must trust already connected sockets which could be set up
for connection to ipv4 address family.
Per-socket flag ipv6only is of no use here, as it is under users control
by setsockopt.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As tcp_rcv_state_process() has already calls tcp_mtup_init() for non-fastopen
sock, we can delete the redundant calls of tcp_mtup_init() in
tcp_{v4,v6}_syn_recv_sock().
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit c544193214("GRE: Refactor GRE tunneling code")
introduced function ip_tunnel_hash(), the argument itn is no
longer in use, so remove it.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull scheduler changes from Ingo Molnar:
- Add the initial implementation of SCHED_DEADLINE support: a real-time
scheduling policy where tasks that meet their deadlines and
periodically execute their instances in less than their runtime quota
see real-time scheduling and won't miss any of their deadlines.
Tasks that go over their quota get delayed (Available to privileged
users for now)
- Clean up and fix preempt_enable_no_resched() abuse all around the
tree
- Do sched_clock() performance optimizations on x86 and elsewhere
- Fix and improve auto-NUMA balancing
- Fix and clean up the idle loop
- Apply various cleanups and fixes
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
sched: Fix __sched_setscheduler() nice test
sched: Move SCHED_RESET_ON_FORK into attr::sched_flags
sched: Fix up attr::sched_priority warning
sched: Fix up scheduler syscall LTP fails
sched: Preserve the nice level over sched_setscheduler() and sched_setparam() calls
sched/core: Fix htmldocs warnings
sched/deadline: No need to check p if dl_se is valid
sched/deadline: Remove unused variables
sched/deadline: Fix sparse static warnings
m68k: Fix build warning in mac_via.h
sched, thermal: Clean up preempt_enable_no_resched() abuse
sched, net: Fixup busy_loop_us_clock()
sched, net: Clean up preempt_enable_no_resched() abuse
sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
sched/preempt, locking: Rework local_bh_{dis,en}able()
sched/clock, x86: Avoid a runtime condition in native_sched_clock()
sched/clock: Fix up clear_sched_clock_stable()
sched/clock, x86: Use a static_key for sched_clock_stable
sched/clock: Remove local_irq_disable() from the clocks
sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
...
We currently don't report IPV6_RECVPKTINFO in cmsg access ancillary data
for IPv4 datagrams on IPv6 sockets.
This patch splits the ip6_datagram_recv_ctl into two functions, one
which handles both protocol families, AF_INET and AF_INET6, while the
ip6_datagram_recv_specific_ctl only handles IPv6 cmsg data.
ip6_datagram_recv_*_ctl never reported back any errors, so we can make
them return void. Also provide a helper for protocols which don't offer dual
personality to further use ip6_datagram_recv_ctl, which is exported to
modules.
I needed to shuffle the code for ping around a bit to make it easier to
implement dual personality for ping ipv6 sockets in future.
Reported-by: Gert Doering <gert@space.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace some dev_kfree_skb() with kfree_skb() calls when
we drop one skb, this might help bug tracking.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a follow-up patch to f3d3342602 ("net: rework recvmsg
handler msg_name and msg_namelen logic").
DECLARE_SOCKADDR validates that the structure we use for writing the
name information to is not larger than the buffer which is reserved
for msg->msg_name (which is 128 bytes). Also use DECLARE_SOCKADDR
consistently in sendmsg code paths.
Signed-off-by: Steffen Hurrle <steffen@hurrle.net>
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
net/ipv4/tcp_metrics.c
Overlapping changes between the "don't create two tcp metrics objects
with the same key" race fix in net and the addition of the destination
address in the lookup key in net-next.
Minor overlapping changes in bnx2x driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch :
1) Remove a dst leak if DST_NOCACHE was set on dst
Fix this by holding a reference only if dst really cached.
2) Remove a lockdep warning in __tunnel_dst_set()
This was reported by Cong Wang.
3) Remove usage of a spinlock where xchg() is enough
4) Remove some spurious inline keywords.
Let compiler decide for us.
Fixes: 7d442fab0a ("ipv4: Cache dst in tunnels")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <cwang@twopensource.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because the tcp-metrics is an RCU-list, it may be that two
soft-interrupts are inside __tcp_get_metrics() for the same
destination-IP at the same time. If this destination-IP is not yet part of
the tcp-metrics, both soft-interrupts will end up in tcpm_new and create
a new entry for this IP.
So, we will have two tcp-metrics with the same destination-IP in the list.
This patch checks twice __tcp_get_metrics(). First without holding the
lock, then while holding the lock. The second one is there to confirm
that the entry has not been added by another soft-irq while waiting for
the spin-lock.
Fixes: 51c5d0c4b1 (tcp: Maintain dynamic metrics in local cache.)
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is following the commit b903d324be (ipv6: tcp: fix TCLASS
value in ACK messages sent from TIME_WAIT).
For the same reason than tclass, we have to store the flow label in the
inet_timewait_sock to provide consistency of flow label on the last ACK.
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent commit 438e38fadc
("gre_offload: statically build GRE offloading support") added
new module_init/module_exit calls to the gre_offload.c file.
The file is obj-y and can't be anything other than built-in.
Currently it can never be built modular, so using module_init
as an alias for __initcall can be somewhat misleading.
Fix this up now, so that we can relocate module_init from
init.h into module.h in the future. If we don't do this, we'd
have to add module.h to obviously non-modular code, and that
would be a worse thing. We also make the inclusion explicit.
Note that direct use of __initcall is discouraged, vs. one
of the priority categorized subgroups. As __initcall gets
mapped onto device_initcall, our use of device_initcall
directly in this change means that the runtime impact is
zero -- it will remain at level 6 in initcall ordering.
As for the module_exit, rather than replace it with __exitcall,
we simply remove it, since it appears only UML does anything
with those, and even for UML, there is no relevant cleanup
to be done here.
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_gso_segment() and tcp_gro_receive() no longer need to be
exported. IPv4 and IPv6 offloads are statically linked.
Note that tcp_gro_complete() is still used by bnx2x, unfortunately.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bob Falken reported that after 4G packets, multicast forwarding stopped
working. This was because of a rule reference counter overflow which
freed the rule as soon as the overflow happend.
This patch solves this by adding the FIB_LOOKUP_NOREF flag to
fib_rules_lookup calls. This is safe even from non-rcu locked sections
as in this case the flag only implies not taking a reference to the rule,
which we don't need at all.
Rules only hold references to the namespace, which are guaranteed to be
available during the call of the non-rcu protected function reg_vif_xmit
because of the interface reference which itself holds a reference to
the net namespace.
Fixes: f0ad0860d0 ("ipv4: ipmr: support multiple tables")
Fixes: d1db275dd3 ("ipv6: ip6mr: support multiple tables")
Reported-by: Bob Falken <NetFestivalHaveFun@gmx.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the net_random and net_srandom macros and replaces
them with direct calls to the prandom ones. As new commits only seem to
use prandom_u32 there is no use to keep them around.
This change makes it easier to grep for users of prandom_u32.
Signed-off-by: Aruna-Hewapathirane <aruna.hewapathirane@gmail.com>
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We still need this notifier even when we don't config
PROC_FS.
It should be rare to have a kernel without PROC_FS,
so just for completeness.
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix inet_diag_dump_icsk() to reflect the fact that both TCP_TIME_WAIT
and TCP_FIN_WAIT2 connections are represented by inet_timewait_sock
(not just TIME_WAIT), and for such sockets the tw_substate field holds
the real state, which can be either TCP_TIME_WAIT or TCP_FIN_WAIT2.
This brings the inet_diag state-matching code in line with the field
it uses to populate idiag_state. This is also analogous to the info
exported in /proc/net/tcp, where get_tcp4_sock() exports sk->sk_state
and get_timewait4_sock() exports tw->tw_substate.
Before fixing this, (a) neither "ss -nemoi" nor "ss -nemoi state
fin-wait-2" would return a socket in TCP_FIN_WAIT2; and (b) "ss -nemoi
state time-wait" would also return sockets in state TCP_FIN_WAIT2.
This is an old bug that predates 05dbc7b ("tcp/dccp: remove twchain").
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify the GRE header length calculation in gre_gso_segment().
Switch to an approach that is simpler, faster, and more general. The
new approach will continue to be correct even if we add support for
the optional variable-length routing info that may be present in a GRE
header.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following sparse warning:
net/ipv4/gre_offload.c:253:5: warning:
symbol 'gre_gro_complete' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
This new ip_no_pmtu_disc mode only allowes fragmentation-needed errors
to be honored by protocols which do more stringent validation on the
ICMP's packet payload. This knob is useful for people who e.g. want to
run an unmodified DNS server in a namespace where they need to use pmtu
for TCP connections (as they are used for zone transfers or fallback
for requests) but don't want to use possibly spoofed UDP pmtu information.
Currently the whitelisted protocols are TCP, SCTP and DCCP as they check
if the returned packet is in the window or if the association is valid.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: John Heffner <johnwheffner@gmail.com>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
While forwarding we should not use the protocol path mtu to calculate
the mtu for a forwarded packet but instead use the interface mtu.
We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was
introduced for multicast forwarding. But as it does not conflict with
our usage in unicast code path it is perfect for reuse.
I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular
dependencies because of IPSKB_FORWARDED.
Because someone might have written a software which does probe
destinations manually and expects the kernel to honour those path mtus
I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone
can disable this new behaviour. We also still use mtus which are locked on a
route for forwarding.
The reason for this change is, that path mtus information can be injected
into the kernel via e.g. icmp_err protocol handler without verification
of local sockets. As such, this could cause the IPv4 forwarding path to
wrongfully emit fragmentation needed notifications or start to fragment
packets along a path.
Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED
won't be set and further fragmentation logic will use the path mtu to
determine the fragmentation size. They also recheck packet size with
help of path mtu discovery and report appropriate errors.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: John Heffner <johnwheffner@gmail.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only valid use of preempt_enable_no_resched() is if the very next
line is schedule() or if we know preemption cannot actually be enabled
by that statement due to known more preempt_count 'refs'.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We want to be able to get/del tcp-metrics based on the src IP. This
patch adds the necessary parsing of the netlink attribute and if the
source address is set, it will match on this one too.
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
As we now can have multiple entries per destination-IP, the "ip
tcp_metrics delete address ADDRESS" command deletes all of them.
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new netlink attribute for the source-IP and appends it
to the netlink reply. Now, iproute2 can have access to the source-IP.
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
We add the source-address to the tcp-metrics, so that different metrics
will be used per source/destination-pair. We use the destination-hash to
store the metric inside the hash-table. That way, deleting and dumping
via "ip tcp_metrics" is easy.
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
As we will add also the source-address, we rename all accesses to the
tcp-metrics address to use "daddr".
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
nf_tables updates for net-next
The following patchset contains the following nf_tables updates,
mostly updates from Patrick McHardy, they are:
* Add the "inet" table and filter chain type for this new netfilter
family: NFPROTO_INET. This special table/chain allows IPv4 and IPv6
rules, this should help to simplify the burden in the administration
of dual stack firewalls. This also includes several patches to prepare
the infrastructure for this new table and a new meta extension to
match the layer 3 and 4 protocol numbers, from Patrick McHardy.
* Load both IPv4 and IPv6 conntrack modules in nft_ct if the rule is used
in NFPROTO_INET, as we don't certainly know which one would be used,
also from Patrick McHardy.
* Do not allow to delete a table that contains sets, otherwise these
sets become orphan, from Patrick McHardy.
* Hold a reference to the corresponding nf_tables family module when
creating a table of that family type, to avoid the module deletion
when in use, from Patrick McHardy.
* Update chain counters before setting the chain policy to ensure that
we don't leave the chain in inconsistent state in case of errors (aka.
restore chain atomicity). This also fixes a possible leak if it fails
to allocate the chain counters if no counters are passed to be restored,
from Patrick McHardy.
* Don't check for overflows in the table counter if we are just renaming
a chain, from Patrick McHardy.
* Replay the netlink request after dropping the nfnl lock to load the
module that supports provides a chain type, from Patrick.
* Fix chain type module references, from Patrick.
* Several cleanups, function renames, constification and code
refactorizations also from Patrick McHardy.
* Add support to set the connmark, this can be used to set it based on
the meta mark (similar feature to -j CONNMARK --restore), from
Kristian Evensen.
* A couple of fixes to the recently added meta/set support and nft_reject,
and fix missing chain type unregistration if we fail to register our
the family table/filter chain type, from myself.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following sparse warning:
net/ipv4/ip_tunnel.c:116:18: warning:
symbol 'tunnel_dst_check' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't encode argument types into function names and since besides
nft_do_chain() there are only AF-specific versions, there is no risk
of confusion.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Minor nf_chain_type cleanups:
- reorder struct to plug a hoe
- rename struct module member to "owner" for consistency
- rename nf_hookfn array to "hooks" for consistency
- reorder initializers for better readability
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In some cases we neither take a reference to the AF info nor to the
chain type, allowing the module to be unloaded while in use.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds a new table family and a new filter chain that you can
use to attach IPv4 and IPv6 rules. This should help to simplify
rule-set maintainance in dual-stack setups.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add support to register chains to multiple hooks for different address
families for mixed IPv4/IPv6 tables.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Currently the AF-specific hook functions override the chain-type specific
hook functions. That doesn't make too much sense since the chain types
are a special case of the AF-specific hooks.
Make the AF-specific hook functions the default and make the optional
chain type hooks override them.
As a side effect, the necessary code restructuring reduces the code size,
f.i. in case of nf_tables_ipv4.o:
nf_tables_ipv4_init_net | -24
nft_do_chain_ipv4 | -113
2 functions changed, 137 bytes removed, diff: -137
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch built on top of Commit 299603e837
("net-gro: Prepare GRO stack for the upcoming tunneling support") to add
the support of the standard GRE (RFC1701/RFC2784/RFC2890) to the GRO
stack. It also serves as an example for supporting other encapsulation
protocols in the GRO stack in the future.
The patch supports version 0 and all the flags (key, csum, seq#) but
will flush any pkt with the S (seq#) flag. This is because the S flag
is not support by GSO, and a GRO pkt may end up in the forwarding path,
thus requiring GSO support to break it up correctly.
Currently the "packet_offload" structure only contains L3 (ETH_P_IP/
ETH_P_IPV6) GRO offload support so the encapped pkts are limited to
IP pkts (i.e., w/o L2 hdr). But support for other protocol type can
be easily added, so is the support for GRE variations like NVGRE.
The patch also support csum offload. Specifically if the csum flag is on
and the h/w is capable of checksumming the payload (CHECKSUM_COMPLETE),
the code will take advantage of the csum computed by the h/w when
validating the GRE csum.
Note that commit 60769a5dcd "ipv4: gre:
add GRO capability" already introduces GRO capability to IPv4 GRE
tunnels, using the gro_cells infrastructure. But GRO is done after
GRE hdr has been removed (i.e., decapped). The following patch applies
GRO when pkts first come in (before hitting the GRE tunnel code). There
is some performance advantage for applying GRO as early as possible.
Also this approach is transparent to other subsystem like Open vSwitch
where GRE decap is handled outside of the IP stack hence making it
harder for the gro_cells stuff to apply. On the other hand, some NICs
are still not capable of hashing on the inner hdr of a GRE pkt (RSS).
In that case the GRO processing of pkts from the same remote host will
all happen on the same CPU and the performance may be suboptimal.
I'm including some rough preliminary performance numbers below. Note
that the performance will be highly dependent on traffic load, mix as
usual. Moreover it also depends on NIC offload features hence the
following is by no means a comprehesive study. Local testing and tuning
will be needed to decide the best setting.
All tests spawned 50 copies of netperf TCP_STREAM and ran for 30 secs.
(super_netperf 50 -H 192.168.1.18 -l 30)
An IP GRE tunnel with only the key flag on (e.g., ip tunnel add gre1
mode gre local 10.246.17.18 remote 10.246.17.17 ttl 255 key 123)
is configured.
The GRO support for pkts AFTER decap are controlled through the device
feature of the GRE device (e.g., ethtool -K gre1 gro on/off).
1.1 ethtool -K gre1 gro off; ethtool -K eth0 gro off
thruput: 9.16Gbps
CPU utilization: 19%
1.2 ethtool -K gre1 gro on; ethtool -K eth0 gro off
thruput: 5.9Gbps
CPU utilization: 15%
1.3 ethtool -K gre1 gro off; ethtool -K eth0 gro on
thruput: 9.26Gbps
CPU utilization: 12-13%
1.4 ethtool -K gre1 gro on; ethtool -K eth0 gro on
thruput: 9.26Gbps
CPU utilization: 10%
The following tests were performed on a different NIC that is capable of
csum offload. I.e., the h/w is capable of computing IP payload csum
(CHECKSUM_COMPLETE).
2.1 ethtool -K gre1 gro on (hence will use gro_cells)
2.1.1 ethtool -K eth0 gro off; csum offload disabled
thruput: 8.53Gbps
CPU utilization: 9%
2.1.2 ethtool -K eth0 gro off; csum offload enabled
thruput: 8.97Gbps
CPU utilization: 7-8%
2.1.3 ethtool -K eth0 gro on; csum offload disabled
thruput: 8.83Gbps
CPU utilization: 5-6%
2.1.4 ethtool -K eth0 gro on; csum offload enabled
thruput: 8.98Gbps
CPU utilization: 5%
2.2 ethtool -K gre1 gro off
2.2.1 ethtool -K eth0 gro off; csum offload disabled
thruput: 5.93Gbps
CPU utilization: 9%
2.2.2 ethtool -K eth0 gro off; csum offload enabled
thruput: 5.62Gbps
CPU utilization: 8%
2.2.3 ethtool -K eth0 gro on; csum offload disabled
thruput: 7.69Gbps
CPU utilization: 8%
2.2.4 ethtool -K eth0 gro on; csum offload enabled
thruput: 8.96Gbps
CPU utilization: 5-6%
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When lo is brought up, new ifa is created. Then, devconf and neigh values
bitfield should be set so later changes of default values would not
affect lo values.
Note that the same behaviour is in ipv6. Also note that this is likely
not an issue in many distros (for example Fedora 19) because userspace
sets address to lo manually before bringing it up.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRO/GSO layers can be enabled on a node, even if said
node is only forwarding packets.
This patch permits GSO (and upcoming GRO) support for GRE
encapsulated packets, even if the host has no GRE tunnel setup.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
net/ipv6/ip6_tunnel.c
net/ipv6/ip6_vti.c
ipv6 tunnel statistic bug fixes conflicting with consolidation into
generic sw per-cpu net stats.
qlogic conflict between queue counting bug fix and the addition
of multiple MAC address support.
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP out_of_order_queue lock is not used, as queue manipulation
happens with socket lock held and we therefore use the lockless
skb queue routines (as __skb_queue_head())
We can use __skb_queue_head_init() instead of skb_queue_head_init()
to make this more consistent.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says: <pablo@netfilter.org>
====================
nftables updates for net-next
The following patchset contains nftables updates for your net-next tree,
they are:
* Add set operation to the meta expression by means of the select_ops()
infrastructure, this allows us to set the packet mark among other things.
From Arturo Borrero Gonzalez.
* Fix wrong format in sscanf in nf_tables_set_alloc_name(), from Daniel
Borkmann.
* Add new queue expression to nf_tables. These comes with two previous patches
to prepare this new feature, one to add mask in nf_tables_core to
evaluate the queue verdict appropriately and another to refactor common
code with xt_NFQUEUE, from Eric Leblond.
* Do not hide nftables from Kconfig if nfnetlink is not enabled, also from
Eric Leblond.
* Add the reject expression to nf_tables, this adds the missing TCP RST
support. It comes with an initial patch to refactor common code with
xt_NFQUEUE, again from Eric Leblond.
* Remove an unused variable assignment in nf_tables_dump_set(), from Michal
Nazarewicz.
* Remove the nft_meta_target code, now that Arturo added the set operation
to the meta expression, from me.
* Add help information for nf_tables to Kconfig, also from me.
* Allow to dump all sets by specifying NFPROTO_UNSPEC, similar feature is
available to other nf_tables objects, requested by Arturo, from me.
* Expose the table usage counter, so we can know how many chains are using
this table without dumping the list of chains, from Tomasz Bursztyka.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
netfilter/IPVS updates for net-next
The following patchset contains Netfilter updates for your net-next tree,
they are:
* Add full port randomization support. Some crazy researchers found a way
to reconstruct the secure ephemeral ports that are allocated in random mode
by sending off-path bursts of UDP packets to overrun the socket buffer of
the DNS resolver to trigger retransmissions, then if the timing for the
DNS resolution done by a client is larger than usual, then they conclude
that the port that received the burst of UDP packets is the one that was
opened. It seems a bit aggressive method to me but it seems to work for
them. As a result, Daniel Borkmann and Hannes Frederic Sowa came up with a
new NAT mode to fully randomize ports using prandom.
* Add a new classifier to x_tables based on the socket net_cls set via
cgroups. These includes two patches to prepare the field as requested by
Zefan Li. Also from Daniel Borkmann.
* Use prandom instead of get_random_bytes in several locations of the
netfilter code, from Florian Westphal.
* Allow to use the CTA_MARK_MASK in ctnetlink when mangling the conntrack
mark, also from Florian Westphal.
* Fix compilation warning due to unused variable in IPVS, from Geert
Uytterhoeven.
* Add support for UID/GID via nfnetlink_queue, from Valentina Giusti.
* Add IPComp extension to x_tables, from Fan Du.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
They are same, so unify them as one, pcpu_sw_netstats.
Define pcpu_sw_netstat in netdevice.h, remove pcpu_tstats
from if_tunnel and remove br_cpu_netstats from br_private.h
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
percpu route cache eliminates share of dst refcnt between CPUs.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid doing a route lookup on every packet being tunneled.
In ip_tunnel.c cache the route returned from ip_route_output if
the tunnel is "connected" so that all the rouitng parameters are
taken from tunnel parms for a packet. Specifically, not NBMA tunnel
and tos is from tunnel parms (not inner packet).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following code is not used in current upstream code.
Some of this seems to be old hooks, other might be used by some
out of tree module (which I don't care about breaking), and
the need_ipv4_conntrack was used by old NAT code but no longer
called.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
VM to VM GSO traffic is broken if it goes through VXLAN or GRE
tunnel and the physical NIC on the host supports hardware VXLAN/GRE
GSO offload (e.g. bnx2x and next-gen mlx4).
Two issues -
(VXLAN) VM traffic has SKB_GSO_DODGY and SKB_GSO_UDP_TUNNEL with
SKB_GSO_TCP/UDP set depending on the inner protocol. GSO header
integrity check fails in udp4_ufo_fragment if inner protocol is
TCP. Also gso_segs is calculated incorrectly using skb->len that
includes tunnel header. Fix: robust check should only be applied
to the inner packet.
(VXLAN & GRE) Once GSO header integrity check passes, NULL segs
is returned and the original skb is sent to hardware. However the
tunnel header is already pulled. Fix: tunnel header needs to be
restored so that hardware can perform GSO properly on the original
packet.
Signed-off-by: Wei-Chun Chao <weichunc@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gratuitous arp packets are useful in switchover scenarios to update
client arp tables as quickly as possible. Currently, the mac address
of a neighbour is only updated after a locktime period has elapsed
since the last update. In most use cases such delays are unacceptable
for network admins. Moreover, the "updated" field of the neighbour
stucture doesn't record the last time the address of a neighbour
changed but records any change that happens to the neighbour. This is
clearly a bug since locktime uses that field as meaning "addr_updated".
With this observation, I was able to perpetuate a stale address by
sending a stream of gratuitous arp packets spaced less than locktime
apart. With this change the address is updated when a gratuitous arp
is received and the arp_accept sysctl is set.
Signed-off-by: Salam Noureddine <noureddine@aristanetworks.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves nft_reject_ipv4 to nft_reject and adds support
for IPv6 protocol. This patch uses functions included in nf_reject.h
to implement reject by TCP reset.
The code has to be build as a module if NF_TABLES_IPV6 is also a
module to avoid compilation error due to usage of IPv6 functions.
This has been done in Kconfig by using the construct:
depends on NF_TABLES_IPV6 || !NF_TABLES_IPV6
This seems a bit weird in terms of syntax but works perfectly.
Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch prepares the addition of TCP reset support in
the nft_reject module by moving reusable code into a header
file.
Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Remove leftover code that is not used anywhere in current tree.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following are only used in one file:
tcp_connect_init
tcp_set_rto
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't export ping_table or ping_v4_sendmsg. Both are only used
inside ping code.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
inetpeer_invalidate_family defined but never used
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't export arp_invalidate, only used in arp.c
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make fib_detect_death function static only used in one file.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This cleanup the comments in tcp_yeah.c.
1.The old link is dead,use a new one to instead.
2.'lin' add nothing useful,remove it.
3.do not use C99 // comments.
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix checkpatch errors like:
ERROR: spaces required around that XXX
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We report different pmtu values back on the first write and on further
writes on an corked socket.
Also don't include the dst.header_len (respectively exthdrlen) as this
should already be dealt with by the interface mtu of the outgoing
(virtual) interface and policy of that interface should dictate if
fragmentation should happen.
Instead reduce the pmtu data by IP options as we do for IPv6. Make the
same changes for ip_append_data, where we did not care about options or
dst.header_len at all.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use same field for both IPv4 (proxy_arp) and IPv6 (proxy_ndp)
so fix it before API is set to be a common name
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem noticed a TCP_RR regression caused by TCP autocorking
on a Mellanox test bed. MLX4_EN_TX_COAL_TIME is 16 us, which can be
right above RTT between hosts.
We can receive a ACK for a packet still in NIC TX ring buffer or in a
softnet completion queue.
Fix this by always pushing the skb if it is at the head of write queue.
Also, as TX completion is lockless, it's safer to perform sk_wmem_alloc
test after setting TSQ_THROTTLED.
erd:~# MIB="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"
erd:~# ./netperf -H remote -t TCP_RR -- -o $MIB | tail -n 1
(repeat 3 times)
Before patch :
18,1049.87,41004,39631,6295.47
17,239.52,40804,48,2912.79
18,348.40,40877,54,3573.39
After patch :
18,22.84,4606,38,16.39
17,21.56,2871,36,13.51
17,22.46,2705,37,11.83
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: f54b311142 ("tcp: auto corking")
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2013-12-19
1) Use the user supplied policy index instead of a generated one
if present. From Fan Du.
2) Make xfrm migration namespace aware. From Fan Du.
3) Make the xfrm state and policy locks namespace aware. From Fan Du.
4) Remove ancient sleeping when the SA is in acquire state,
we now queue packets to the policy instead. This replaces the
sleeping code.
5) Remove FLOWI_FLAG_CAN_SLEEP. This was used to notify xfrm about the
posibility to sleep. The sleeping code is gone, so remove it.
6) Check user specified spi for IPComp. Thr spi for IPcomp is only
16 bit wide, so check for a valid value. From Fan Du.
7) Export verify_userspi_info to check for valid user supplied spi ranges
with pfkey and netlink. From Fan Du.
8) RFC3173 states that if the total size of a compressed payload and the IPComp
header is not smaller than the size of the original payload, the IP datagram
must be sent in the original non-compressed form. These packets are dropped
by the inbound policy check because they are not transformed. Document the need
to set 'level use' for IPcomp to receive such packets anyway. From Fan Du.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub reported while working with nlmon netlink sniffer that parts of
the inet_diag_sockid are not initialized when r->idiag_family != AF_INET6.
That is, fields of r->id.idiag_src[1 ... 3], r->id.idiag_dst[1 ... 3].
In fact, it seems that we can leak 6 * sizeof(u32) byte of kernel [slab]
memory through this. At least, in udp_dump_one(), we allocate a skb in ...
rep = nlmsg_new(sizeof(struct inet_diag_msg) + ..., GFP_KERNEL);
... and then pass that to inet_sk_diag_fill() that puts the whole struct
inet_diag_msg into the skb, where we only fill out r->id.idiag_src[0],
r->id.idiag_dst[0] and leave the rest untouched:
r->id.idiag_src[0] = inet->inet_rcv_saddr;
r->id.idiag_dst[0] = inet->inet_daddr;
struct inet_diag_msg embeds struct inet_diag_sockid that is correctly /
fully filled out in IPv6 case, but for IPv4 not.
So just zero them out by using plain memset (for this little amount of
bytes it's probably not worth the extra check for idiag_family == AF_INET).
Similarly, fix also other places where we fill that out.
Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipgre_header_parse() needs to parse the tunnel's ip header and it
uses mac_header to locate the iphdr. This got broken when gre tunneling
was refactored as mac_header is no longer updated to point to iphdr.
Introduce skb_pop_mac_header() helper to do the mac_header assignment
and use it in ipgre_rcv() to fix msg_name parsing.
Bug introduced in commit c544193214 (GRE: Refactor GRE tunneling code.)
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
This new mode discards all incoming fragmentation-needed notifications
as I guess was originally intended with this knob. To not break backward
compatibility too much, I only added a special case for mode 2 in the
receiving path.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The other field in ipv4_config, log_martians, was converted to a
per-interface setting, so we can just remove the whole structure.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/intel/i40e/i40e_main.c
drivers/net/macvtap.c
Both minor merge hassles, simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
In several places 'skb->rxhash = 0' is being done to clear the
rxhash value in an skb. This does not clear l4_rxhash which could
still be set so that the rxhash wouldn't be recalculated on subsequent
call to skb_get_rxhash. This patch adds an explict function to clear
all the rxhash related information in the skb properly.
skb_clear_hash_if_not_l4 clears the rxhash only if it is not marked as
l4_rxhash.
Fixed up places where 'skb->rxhash = 0' was being called.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While investigating performance problems on small RPC workloads,
I noticed linux TCP stack was always splitting the last TSO skb
into two parts (skbs). One being a multiple of MSS, and a small one
with the Push flag. This split is done even if TCP_NODELAY is set,
or if no small packet is in flight.
Example with request/response of 4K/4K
IP A > B: . ack 68432 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: . 65537:68433(2896) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: P 68433:69633(1200) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
IP B > A: . ack 68433 win 2768 <nop,nop,timestamp 6525001 6524593>
IP B > A: . 69632:72528(2896) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
IP B > A: P 72528:73728(1200) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
IP A > B: . ack 72528 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: . 69633:72529(2896) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: P 72529:73729(1200) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
We can avoid this split by including the Nagle tests at the right place.
Note : If some NIC had trouble sending TSO packets with a partial
last segment, we would have hit the problem in GRO/forwarding workload already.
tcp_minshall_update() is moved to tcp_output.c and is updated as we might
feed a TSO packet with a partial last segment.
This patch tremendously improves performance, as the traffic now looks
like :
IP A > B: . ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
IP A > B: P 94209:98305(4096) ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
IP B > A: . ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
IP B > A: P 98304:102400(4096) ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
IP A > B: . ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
IP A > B: P 98305:102401(4096) ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
IP B > A: . ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
IP B > A: P 102400:106496(4096) ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
IP A > B: . ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
IP A > B: P 102401:106497(4096) ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
IP B > A: . ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
IP B > A: P 106496:110592(4096) ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
Before :
lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280774
Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':
205719.049006 task-clock # 9.278 CPUs utilized
8,449,968 context-switches # 0.041 M/sec
1,935,997 CPU-migrations # 0.009 M/sec
160,541 page-faults # 0.780 K/sec
548,478,722,290 cycles # 2.666 GHz [83.20%]
455,240,670,857 stalled-cycles-frontend # 83.00% frontend cycles idle [83.48%]
272,881,454,275 stalled-cycles-backend # 49.75% backend cycles idle [66.73%]
166,091,460,030 instructions # 0.30 insns per cycle
# 2.74 stalled cycles per insn [83.39%]
29,150,229,399 branches # 141.699 M/sec [83.30%]
1,943,814,026 branch-misses # 6.67% of all branches [83.32%]
22.173517844 seconds time elapsed
lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests 16851063 0.0
IpExtOutOctets 23878580777 0.0
After patch :
lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280877
Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':
107496.071918 task-clock # 4.847 CPUs utilized
5,635,458 context-switches # 0.052 M/sec
1,374,707 CPU-migrations # 0.013 M/sec
160,920 page-faults # 0.001 M/sec
281,500,010,924 cycles # 2.619 GHz [83.28%]
228,865,069,307 stalled-cycles-frontend # 81.30% frontend cycles idle [83.38%]
142,462,742,658 stalled-cycles-backend # 50.61% backend cycles idle [66.81%]
95,227,712,566 instructions # 0.34 insns per cycle
# 2.40 stalled cycles per insn [83.43%]
16,209,868,171 branches # 150.795 M/sec [83.20%]
874,252,952 branch-misses # 5.39% of all branches [83.37%]
22.175821286 seconds time elapsed
lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests 11239428 0.0
IpExtOutOctets 23595191035 0.0
Indeed, the occupancy of tx skbs (IpExtOutOctets/IpOutRequests) is higher :
2099 instead of 1417, thus helping GRO to be more efficient when using FQ packet
scheduler.
Many thanks to Neal for review and ideas.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Van Jacobson <vanj@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains two Netfilter fixes for your net
tree, they are:
* Fix endianness in nft_reject, the NFTA_REJECT_TYPE netlink attributes
was not converted to network byte order as needed by all nfnetlink
subsystems, from Eric Leblond.
* Restrict SYNPROXY target to INPUT and FORWARD chains, this avoid a
possible crash due to misconfigurations, from Patrick McHardy.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Using sk_dst_lock from softirq context is not supported right now.
Instead of adding BH protection everywhere,
udp_sk_rx_dst_set() can instead use xchg(), as suggested
by David.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Fixes: 9750223102 ("udp: ipv4: must add synchronization in udp_sk_rx_dst_set()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to netconf to show changes to proxy-arp status on a per
interface basis via netlink in a manner similar to forwarding
and reverse path state.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch modifies the GRO stack to avoid the use of "network_header"
and associated macros like ip_hdr() and ipv6_hdr() in order to allow
an arbitary number of IP hdrs (v4 or v6) to be used in the
encapsulation chain. This lays the foundation for various IP
tunneling support (IP-in-IP, GRE, VXLAN, SIT,...) to be added later.
With this patch, the GRO stack traversing now is mostly based on
skb_gro_offset rather than special hdr offsets saved in skb (e.g.,
skb->network_header). As a result all but the top layer (i.e., the
the transport layer) must have hdrs of the same length in order for
a pkt to be considered for aggregation. Therefore when adding a new
encap layer (e.g., for tunneling), one must check and skip flows
(e.g., by setting NAPI_GRO_CB(p)->same_flow to 0) that have a
different hdr length.
Note that unlike the network header, the transport header can and
will continue to be set by the GRO code since there will be at
most one "transport layer" in the encap chain.
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The dump function in nft_reject_ipv4 was not converting a u32
field to network order before sending it to userspace, this
needs to happen for consistency with other nf_tables and
nfnetlink subsystems.
Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Unlike TCP, UDP input path does not hold the socket lock.
Before messing with sk->sk_rx_dst, we must use a spinlock, otherwise
multiple cpus could leak a refcount.
This patch also takes care of renewing a stale dst entry.
(When the sk->sk_rx_dst would not be used by IP early demux)
Fixes: 421b3885bf ("udp: ipv4: Add udp early demux")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pskb_may_pull() can reallocate skb->head, we need to move the
initialization of iph and uh pointers after its call.
Fixes: 421b3885bf ("udp: ipv4: Add udp early demux")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Help of this function says: "in_dev: only on this interface, 0=any interface",
but since commit 39a6d06300 ("[NETNS]: Process inet_confirm_addr in the
correct namespace."), the code supposes that it will never be NULL. This
function is never called with in_dev == NULL, but it's exported and may be used
by an external module.
Because this patch restore the ability to call inet_confirm_addr() with in_dev
== NULL, I partially revert the above commit, as suggested by Julian.
CC: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a crash in synproxy_send_tcp() when using the SYNPROXY target in the
PREROUTING chain caused by missing routing information.
Reported-by: Nicki P. <xastx@gmx.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Various spelling fixes in networking stack
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This changes ensures that the routing entry investigated by the suppress
function actually does point to a device struct before following that pointer,
fixing a possible kernel oops situation when verifying the interface group
associated with a routing table entry.
According to Daniel Golle, this Oops can be triggered by a user process trying
to establish an outgoing IPv6 connection while having no real IPv6 connectivity
set up (only autoassigned link-local addresses).
Fixes: 6ef94cfafb ("fib_rules: add route suppression based on ifgroup")
Reported-by: Daniel Golle <daniel.golle@gmail.com>
Tested-by: Daniel Golle <daniel.golle@gmail.com>
Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously inet devices were only constructed when addresses are added.
Therefore the default neigh parms values they get are the ones at the
time of these operations.
Now that we're creating inet devices earlier, this changes the behaviour
of default neigh parms values in an incompatible way (see bug #8519).
This patch creates a compromise by setting the default values at the
same point as before but only for those that have not been explicitly
set by the user since the inet device's creation.
Introduced by:
commit 8030f54499
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu Feb 22 01:53:47 2007 +0900
[IPV4] devinet: Register inetdev earlier.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts the neigh param members to an array. This allows easier
manipulation which will be needed later on to provide better management of
default values.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge 'net' into 'net-next' to get the AF_PACKET bug fix that
Daniel's direct transmit changes depend upon.
Signed-off-by: David S. Miller <davem@davemloft.net>
With the introduction of TCP Small Queues, TSO auto sizing, and TCP
pacing, we can implement Automatic Corking in the kernel, to help
applications doing small write()/sendmsg() to TCP sockets.
Idea is to change tcp_push() to check if the current skb payload is
under skb optimal size (a multiple of MSS bytes)
If under 'size_goal', and at least one packet is still in Qdisc or
NIC TX queues, set the TCP Small Queue Throttled bit, so that the push
will be delayed up to TX completion time.
This delay might allow the application to coalesce more bytes
in the skb in following write()/sendmsg()/sendfile() system calls.
The exact duration of the delay is depending on the dynamics
of the system, and might be zero if no packet for this flow
is actually held in Qdisc or NIC TX ring.
Using FQ/pacing is a way to increase the probability of
autocorking being triggered.
Add a new sysctl (/proc/sys/net/ipv4/tcp_autocorking) to control
this feature and default it to 1 (enabled)
Add a new SNMP counter : nstat -a | grep TcpExtTCPAutoCorking
This counter is incremented every time we detected skb was under used
and its flush was deferred.
Tested:
Interesting effects when using line buffered commands under ssh.
Excellent performance results in term of cpu usage and total throughput.
lpq83:~# echo 1 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
9410.39
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
35209.439626 task-clock # 2.901 CPUs utilized
2,294 context-switches # 0.065 K/sec
101 CPU-migrations # 0.003 K/sec
4,079 page-faults # 0.116 K/sec
97,923,241,298 cycles # 2.781 GHz [83.31%]
51,832,908,236 stalled-cycles-frontend # 52.93% frontend cycles idle [83.30%]
25,697,986,603 stalled-cycles-backend # 26.24% backend cycles idle [66.70%]
102,225,978,536 instructions # 1.04 insns per cycle
# 0.51 stalled cycles per insn [83.38%]
18,657,696,819 branches # 529.906 M/sec [83.29%]
91,679,646 branch-misses # 0.49% of all branches [83.40%]
12.136204899 seconds time elapsed
lpq83:~# echo 0 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
6624.89
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
40045.864494 task-clock # 3.301 CPUs utilized
171 context-switches # 0.004 K/sec
53 CPU-migrations # 0.001 K/sec
4,080 page-faults # 0.102 K/sec
111,340,458,645 cycles # 2.780 GHz [83.34%]
61,778,039,277 stalled-cycles-frontend # 55.49% frontend cycles idle [83.31%]
29,295,522,759 stalled-cycles-backend # 26.31% backend cycles idle [66.67%]
108,654,349,355 instructions # 0.98 insns per cycle
# 0.57 stalled cycles per insn [83.34%]
19,552,170,748 branches # 488.244 M/sec [83.34%]
157,875,417 branch-misses # 0.81% of all branches [83.34%]
12.130267788 seconds time elapsed
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Compiler doesn't know skb_shinfo(skb) pointer is usually constant.
By using a temporary variable, we help generating smaller code.
For example, tcp_init_nondata_skb() is inlined after this patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several files refer to an old address for the Free Software Foundation
in the file header comment. Resolve by replacing the address with
the URL <http://www.gnu.org/licenses/> so that we do not have to keep
updating the header comments anytime the address changes.
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
FLOWI_FLAG_CAN_SLEEP was used to notify xfrm about the posibility
to sleep until the needed states are resolved. This code is gone,
so FLOWI_FLAG_CAN_SLEEP is not needed anymore.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
kill memcg_tcp_enter_memory_pressure. The only function of
memcg_tcp_enter_memory_pressure was to reduce deal with the
unnecessary abstraction that was tcp_memcontrol. Now that struct
tcp_memcontrol is gone remove this unnecessary function, the
unnecessary function pointer, and modify sk_enter_memory_pressure to
set this field directly, just as sk_leave_memory_pressure cleas this
field directly.
This fixes a small bug I intruduced when killing struct tcp_memcontrol
that caused memcg_tcp_enter_memory_pressure to never be called and
thus failed to ever set cg_proto->memory_pressure.
Remove the cg_proto enter_memory_pressure function as it now serves
no useful purpose.
Don't test cg_proto->memory_presser in sk_leave_memory_pressure before
clearing it. The test was originally there to ensure that the pointer
was non-NULL. Now that cg_proto is not a pointer the pointer does not
matter.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
"Here is a pile of bug fixes that accumulated while I was in Europe"
1) In fixing kernel leaks to userspace during copying of socket
addresses, we broke a case that used to work, namely the user
providing a buffer larger than the in-kernel generic socket address
structure. This broke Ruby amongst other things. Fix from Dan
Carpenter.
2) Fix regression added by byte queue limit support in 8139cp driver,
from Yang Yingliang.
3) The addition of MSG_SENDPAGE_NOTLAST buggered up a few sendpage
implementations, they should just treat it the same as MSG_MORE.
Fix from Richard Weinberger and Shawn Landden.
4) Handle icmpv4 errors received on ipv6 SIT tunnels correctly, from
Oussama Ghorbel. In particular we should send an ICMPv6 unreachable
in such situations.
5) Fix some regressions in the recent genetlink fixes, in particular
get the pmcraid driver to use the new safer interfaces correctly.
From Johannes Berg.
6) macvtap was converted to use a per-cpu set of statistics, but some
code was still bumping tx_dropped elsewhere. From Jason Wang.
7) Fix build failure of xen-netback due to missing include on some
architectures, from Andy Whitecroft.
8) macvtap double counts received packets in statistics, fix from Vlad
Yasevich.
9) Fix various cases of using *_STATS_BH() when *_STATS() is more
appropriate. From Eric Dumazet and Hannes Frederic Sowa.
10) Pktgen ipsec mode doesn't update the ipv4 header length and checksum
properly after encapsulation. Fix from Fan Du.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
net/mlx4_en: Remove selftest TX queues empty condition
{pktgen, xfrm} Update IPv4 header total len and checksum after tranformation
virtio_net: make all RX paths handle erors consistently
virtio_net: fix error handling for mergeable buffers
virtio_net: Fixed a trivial typo (fitler --> filter)
netem: fix gemodel loss generator
netem: fix loss 4 state model
netem: missing break in ge loss generator
net/hsr: Support iproute print_opt ('ip -details ...')
net/hsr: Very small fix of comment style.
MAINTAINERS: Added net/hsr/ maintainer
ipv6: fix possible seqlock deadlock in ip6_finish_output2
ixgbe: Make ixgbe_identify_qsfp_module_generic static
ixgbe: turn NETIF_F_HW_L2FW_DOFFLOAD off by default
ixgbe: ixgbe_fwd_ring_down needs to be static
e1000: fix possible reset_task running after adapter down
e1000: fix lockdep warning in e1000_reset_task
e1000: prevent oops when adapter is being closed and reset simultaneously
igb: Fixed Wake On LAN support
inet: fix possible seqlock deadlocks
...
In commit c9e9042994 ("ipv4: fix possible seqlock deadlock") I left
another places where IP_INC_STATS_BH() were improperly used.
udp_sendmsg(), ping_v4_sendmsg() and tcp_v4_connect() are called from
process context, not from softirq context.
This was detected by lockdep seqlock support.
Reported-by: jongman heo <jongman.heo@samsung.com>
Fixes: 584bdf8cbd ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP")
Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 35f9c09fe (tcp: tcp_sendpages() should call tcp_push() once)
added an internal flag MSG_SENDPAGE_NOTLAST, similar to
MSG_MORE.
algif_hash, algif_skcipher, and udp used MSG_MORE from tcp_sendpages()
and need to see the new flag as identical to MSG_MORE.
This fixes sendfile() on AF_ALG.
v3: also fix udp
Cc: Tom Herbert <therbert@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org> # 3.4.x + 3.2.x
Reported-and-tested-by: Shawn Landden <shawnlandden@gmail.com>
Original-patch: Richard Weinberger <richard@nod.at>
Signed-off-by: Shawn Landden <shawn@churchofgit.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
since f9242b6b28
inet: Sanitize inet{,6} protocol demux.
there are not pretended hash tables for ipv4 or
ipv6 protocol handler.
Signed-off-by: Baker Zhang <Baker.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit bceaa90240 ("inet: prevent leakage
of uninitialized memory to user in recv syscalls") conditionally updated
addr_len if the msg_name is written to. The recv_error and rxpmtu
functions relied on the recvmsg functions to set up addr_len before.
As this does not happen any more we have to pass addr_len to those
functions as well and set it to the size of the corresponding sockaddr
length.
This broke traceroute and such.
Fixes: bceaa90240 ("inet: prevent leakage of uninitialized memory to user in recv syscalls")
Reported-by: Brad Spengler <spender@grsecurity.net>
Reported-by: Tom Labanowski
Cc: mpb <mpb.mail@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch simplifies the checksum verification in tcpX_gro_receive
by reusing the CHECKSUM_COMPLETE code for CHECKSUM_NONE. All it
does for CHECKSUM_NONE is compute the partial checksum and then
treat it as if it came from the hardware (CHECKSUM_COMPLETE).
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Cheers,
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some cases we may receive IP packets that are longer than
their stated lengths. Such packets are never merged in GRO.
However, we may end up computing their checksums incorrectly
and end up allowing packets with a bogus checksum enter our
stack with the checksum status set as verified.
Since such packets are rare and not performance-critical, this
patch simply skips the checksum verification for them.
Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Thanks,
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix memory leaks and other issues in mwifiex driver, from Amitkumar
Karwar.
2) skb_segment() can choke on packets using frag lists, fix from
Herbert Xu with help from Eric Dumazet and others.
3) IPv4 output cached route instantiation properly handles races
involving two threads trying to install the same route, but we
forgot to propagate this logic to input routes as well. Fix from
Alexei Starovoitov.
4) Put protections in place to make sure that recvmsg() paths never
accidently copy uninitialized memory back into userspace and also
make sure that we never try to use more that sockaddr_storage for
building the on-kernel-stack copy of a sockaddr. Fixes from Hannes
Frederic Sowa.
5) R8152 driver transmit flow bug fixes from Hayes Wang.
6) Fix some minor fallouts from genetlink changes, from Johannes Berg
and Michael Opdenacker.
7) AF_PACKET sendmsg path can race with netdevice unregister notifier,
fix by using RCU to make sure the network device doesn't go away
from under us. Fix from Daniel Borkmann.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
gso: handle new frag_list of frags GRO packets
genetlink: fix genl_set_err() group ID
genetlink: fix genlmsg_multicast() bug
packet: fix use after free race in send path when dev is released
xen-netback: stop the VIF thread before unbinding IRQs
wimax: remove dead code
net/phy: Add the autocross feature for forced links on VSC82x4
net/phy: Add VSC8662 support
net/phy: Add VSC8574 support
net/phy: Add VSC8234 support
net: add BUG_ON if kernel advertises msg_namelen > sizeof(struct sockaddr_storage)
net: rework recvmsg handler msg_name and msg_namelen logic
bridge: flush br's address entry in fdb when remove the
net: core: Always propagate flag changes to interfaces
ipv4: fix race in concurrent ip_route_input_slow()
r8152: fix incorrect type in assignment
r8152: support stopping/waking tx queue
r8152: modify the tx flow
r8152: fix tx/rx memory overflow
netfilter: ebt_ip6: fix source and destination matching
...
Pablo Neira Ayuso says:
====================
netfilter fixes for net
The following patchset contains fixes for your net tree, they are:
* Remove extra quote from connlimit configuration in Kconfig, from
Randy Dunlap.
* Fix missing mss option in syn packets sent to the backend in our
new synproxy target, from Martin Topholm.
* Use window scale announced by client when sending the forged
syn to the backend, from Martin Topholm.
* Fix IPv6 address comparison in ebtables, from Luís Fernando
Cornachioni Estrozi.
* Fix wrong endianess in sequence adjustment which breaks helpers
in NAT configurations, from Phil Oester.
* Fix the error path handling of nft_compat, from me.
* Make sure the global conntrack counter is decremented after the
object has been released, also from me.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull slave-dmaengine changes from Vinod Koul:
"This brings for slave dmaengine:
- Change dma notification flag to DMA_COMPLETE from DMA_SUCCESS as
dmaengine can only transfer and not verify validaty of dma
transfers
- Bunch of fixes across drivers:
- cppi41 driver fixes from Daniel
- 8 channel freescale dma engine support and updated bindings from
Hongbo
- msx-dma fixes and cleanup by Markus
- DMAengine updates from Dan:
- Bartlomiej and Dan finalized a rework of the dma address unmap
implementation.
- In the course of testing 1/ a collection of enhancements to
dmatest fell out. Notably basic performance statistics, and
fixed / enhanced test control through new module parameters
'run', 'wait', 'noverify', and 'verbose'. Thanks to Andriy and
Linus [Walleij] for their review.
- Testing the raid related corner cases of 1/ triggered bugs in
the recently added 16-source operation support in the ioatdma
driver.
- Some minor fixes / cleanups to mv_xor and ioatdma"
* 'next' of git://git.infradead.org/users/vkoul/slave-dma: (99 commits)
dma: mv_xor: Fix mis-usage of mmio 'base' and 'high_base' registers
dma: mv_xor: Remove unneeded NULL address check
ioat: fix ioat3_irq_reinit
ioat: kill msix_single_vector support
raid6test: add new corner case for ioatdma driver
ioatdma: clean up sed pool kmem_cache
ioatdma: fix selection of 16 vs 8 source path
ioatdma: fix sed pool selection
ioatdma: Fix bug in selftest after removal of DMA_MEMSET.
dmatest: verbose mode
dmatest: convert to dmaengine_unmap_data
dmatest: add a 'wait' parameter
dmatest: add basic performance metrics
dmatest: add support for skipping verification and random data setup
dmatest: use pseudo random numbers
dmatest: support xor-only, or pq-only channels in tests
dmatest: restore ability to start test at module load and init
dmatest: cleanup redundant "dmatest: " prefixes
dmatest: replace stored results mechanism, with uniform messages
Revert "dmatest: append verify result to results"
...
CPUs can ask for local route via ip_route_input_noref() concurrently.
if nh_rth_input is not cached yet, CPUs will proceed to allocate
equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
via rt_cache_route()
Most of the time they succeed, but on occasion the following two lines:
orig = *p;
prev = cmpxchg(p, orig, rt);
in rt_cache_route() do race and one of the cpus fails to complete cmpxchg.
But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
so dst is leaking. dst_destroy() is never called and 'lo' device
refcnt doesn't go to zero, which can be seen in the logs as:
unregister_netdevice: waiting for lo to become free. Usage count = 1
Adding mdelay() between above two lines makes it easily reproducible.
Fix it similar to nh_pcpu_rth_output case.
Fixes: d2d68ba9fe ("ipv4: Cache input routes in fib_info nexthops.")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"Mostly these are fixes for fallout due to merge window changes, as
well as cures for problems that have been with us for a much longer
period of time"
1) Johannes Berg noticed two major deficiencies in our genetlink
registration. Some genetlink protocols we passing in constant
counts for their ops array rather than something like
ARRAY_SIZE(ops) or similar. Also, some genetlink protocols were
using fixed IDs for their multicast groups.
We have to retain these fixed IDs to keep existing userland tools
working, but reserve them so that other multicast groups used by
other protocols can not possibly conflict.
In dealing with these two problems, we actually now use less state
management for genetlink operations and multicast groups.
2) When configuring interface hardware timestamping, fix several
drivers that simply do not validate that the hwtstamp_config value
is one the driver actually supports. From Ben Hutchings.
3) Invalid memory references in mwifiex driver, from Amitkumar Karwar.
4) In dev_forward_skb(), set the skb->protocol in the right order
relative to skb_scrub_packet(). From Alexei Starovoitov.
5) Bridge erroneously fails to use the proper wrapper functions to make
calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid. Fix from Toshiaki
Makita.
6) When detaching a bridge port, make sure to flush all VLAN IDs to
prevent them from leaking, also from Toshiaki Makita.
7) Put in a compromise for TCP Small Queues so that deep queued devices
that delay TX reclaim non-trivially don't have such a performance
decrease. One particularly problematic area is 802.11 AMPDU in
wireless. From Eric Dumazet.
8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts
here. Fix from Eric Dumzaet, reported by Dave Jones.
9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn.
10) When computing mergeable buffer sizes, virtio-net fails to take the
virtio-net header into account. From Michael Dalton.
11) Fix seqlock deadlock in ip4_datagram_connect() wrt. statistic
bumping, this one has been with us for a while. From Eric Dumazet.
12) Fix NULL deref in the new TIPC fragmentation handling, from Erik
Hugne.
13) 6lowpan bit used for traffic classification was wrong, from Jukka
Rissanen.
14) macvlan has the same issue as normal vlans did wrt. propagating LRO
disabling down to the real device, fix it the same way. From Michal
Kubecek.
15) CPSW driver needs to soft reset all slaves during suspend, from
Daniel Mack.
16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet.
17) The xen-netfront RX buffer refill timer isn't properly scheduled on
partial RX allocation success, from Ma JieYue.
18) When ipv6 ping protocol support was added, the AF_INET6 protocol
initialization cleanup path on failure was borked a little. Fix
from Vlad Yasevich.
19) If a socket disconnects during a read/recvmsg/recvfrom/etc that
blocks we can do the wrong thing with the msg_name we write back to
userspace. From Hannes Frederic Sowa. There is another fix in the
works from Hannes which will prevent future problems of this nature.
20) Fix route leak in VTI tunnel transmit, from Fan Du.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
genetlink: make multicast groups const, prevent abuse
genetlink: pass family to functions using groups
genetlink: add and use genl_set_err()
genetlink: remove family pointer from genl_multicast_group
genetlink: remove genl_unregister_mc_group()
hsr: don't call genl_unregister_mc_group()
quota/genetlink: use proper genetlink multicast APIs
drop_monitor/genetlink: use proper genetlink multicast APIs
genetlink: only pass array to genl_register_family_with_ops()
tcp: don't update snd_nxt, when a socket is switched from repair mode
atm: idt77252: fix dev refcnt leak
xfrm: Release dst if this dst is improper for vti tunnel
netlink: fix documentation typo in netlink_set_err()
be2net: Delete secondary unicast MAC addresses during be_close
be2net: Fix unconditional enabling of Rx interface options
net, virtio_net: replace the magic value
ping: prevent NULL pointer dereference on write to msg_name
bnx2x: Prevent "timeout waiting for state X"
bnx2x: prevent CFC attention
bnx2x: Prevent panic during DMAE timeout
...
As suggested by David Miller, make genl_register_family_with_ops()
a macro and pass only the array, evaluating ARRAY_SIZE() in the
macro, this is a little safer.
The openvswitch has some indirection, assing ops/n_ops directly in
that code. This might ultimately just assign the pointers in the
family initializations, saving the struct genl_family_and_ops and
code (once mcast groups are handled differently.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
snd_nxt must be updated synchronously with sk_send_head. Otherwise
tp->packets_out may be updated incorrectly, what may bring a kernel panic.
Here is a kernel panic from my host.
[ 103.043194] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
[ 103.044025] IP: [<ffffffff815aaaaf>] tcp_rearm_rto+0xcf/0x150
...
[ 146.301158] Call Trace:
[ 146.301158] [<ffffffff815ab7f0>] tcp_ack+0xcc0/0x12c0
Before this panic a tcp socket was restored. This socket had sent and
unsent data in the write queue. Sent data was restored in repair mode,
then the socket was switched from reapair mode and unsent data was
restored. After that the socket was switched back into repair mode.
In that moment we had a socket where write queue looks like this:
snd_una snd_nxt write_seq
|_________|________|
|
sk_send_head
After a second switching from repair mode the state of socket was
changed:
snd_una snd_nxt, write_seq
|_________ ________|
|
sk_send_head
This state is inconsistent, because snd_nxt and sk_send_head are not
synchronized.
Bellow you can find a call trace, how packets_out can be incremented
twice for one skb, if snd_nxt and sk_send_head are not synchronized.
In this case packets_out will be always positive, even when
sk_write_queue is empty.
tcp_write_wakeup
skb = tcp_send_head(sk);
tcp_fragment
if (!before(tp->snd_nxt, TCP_SKB_CB(buff)->end_seq))
tcp_adjust_pcount(sk, skb, diff);
tcp_event_new_data_sent
tp->packets_out += tcp_skb_pcount(skb);
I think update of snd_nxt isn't required, when a socket is switched from
repair mode. Because it's initialized in tcp_connect_init. Then when a
write queue is restored, snd_nxt is incremented in tcp_event_new_data_sent,
so it's always is in consistent state.
I have checked, that the bug is not reproduced with this patch and
all tests about restoring tcp connections work fine.
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After searching rt by the vti tunnel dst/src parameter,
if this rt has neither attached to any transformation
nor the transformation is not tunnel oriented, this rt
should be released back to ip layer.
otherwise causing dst memory leakage.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A plain read() on a socket does set msg->msg_name to NULL. So check for
NULL pointer first.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only update *addr_len when we actually fill in sockaddr, otherwise we
can return uninitialized memory from the stack to the caller in the
recvfrom, recvmmsg and recvmsg syscalls. Drop the the (addr_len == NULL)
checks because we only get called with a valid addr_len pointer either
from sock_common_recvmsg or inet_recvmsg.
If a blocking read waits on a socket which is concurrently shut down we
now return zero and set msg_msgnamelen to 0.
Reported-by: mpb <mpb.mail@gmail.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the synproxy_parse_options is called on the client ack the mss
option will not be present. Consequently mss wont be included in the
backend syn packet, which falls back to 536 bytes mss.
Therefore XT_SYNPROXY_OPT_MSS is explicitly flagged when recovering mss
value from cookie.
Signed-off-by: Martin Topholm <mph@one.com>
Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
All seq_printf() users are using "%n" for calculating padding size,
convert them to use seq_setwidth() / seq_pad() pair.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Joe Perches <joe@perches.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ip4_datagram_connect() being called from process context,
it should use IP_INC_STATS() instead of IP_INC_STATS_BH()
otherwise we can deadlock on 32bit arches, or get corruptions of
SNMP counters.
Fixes: 584bdf8cbd ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that genl_ops are no longer modified in place when
registering, they can be made const. This patch was done
mostly with spatch:
@@
identifier ops;
@@
+const
struct genl_ops ops[] = {
...
};
(except the struct thing in net/openvswitch/datapath.c)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We had some reports of crashes using TCP fastopen, and Dave Jones
gave a nice stack trace pointing to the error.
Issue is that tcp_get_metrics() should not be called with a NULL dst
Fixes: 1fe4c481ba ("net-tcp: Fast Open client - cookie cache")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Tested-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit c9eeec26e3 ("tcp: TSQ can use a dynamic limit"), several
users reported throughput regressions, notably on mvneta and wifi
adapters.
802.11 AMPDU requires a fair amount of queueing to be effective.
This patch partially reverts the change done in tcp_write_xmit()
so that the minimal amount is sysctl_tcp_limit_output_bytes.
It also remove the use of this sysctl while building skb stored
in write queue, as TSO autosizing does the right thing anyway.
Users with well behaving NICS and correct qdisc (like sch_fq),
can then lower the default sysctl_tcp_limit_output_bytes value from
128KB to 8KB.
This new usage of sysctl_tcp_limit_output_bytes permits each driver
authors to check how their driver performs when/if the value is set
to a minimum of 4KB.
Normally, line rate for a single TCP flow should be possible,
but some drivers rely on timers to perform TX completion and
too long TX completion delays prevent reaching full throughput.
Fixes: c9eeec26e3 ("tcp: TSQ can use a dynamic limit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Sujith Manoharan <sujith@msujith.org>
Reported-by: Arnaud Ebalard <arno@natisbad.org>
Tested-by: Sujith Manoharan <sujith@msujith.org>
Cc: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 06a23fe31c
("core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()")
and refactoring 64261f230a
("dev: move skb_scrub_packet() after eth_type_trans()")
are forcing pkt_type to be PACKET_HOST when skb traverses veth.
which means that ip forwarding will kick in inside netns
even if skb->eth->h_dest != dev->dev_addr
Fix order of eth_type_trans() and skb_scrub_packet() in dev_forward_skb()
and in ip_tunnel_rcv()
Fixes: 06a23fe31c ("core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()")
CC: Isaku Yamahata <yamahatanetdev@gmail.com>
CC: Maciej Zenczykowski <zenczykowski@gmail.com>
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>