Typically, each TC filter has its own action. All the actions of the
same type are saved in its hash table. But the hash buckets are too
small that it degrades to a list. And the performance is greatly
affected. For example, it takes about 0m11.914s to insert 64K rules.
If we convert the hash table to IDR, it only takes about 0m1.500s.
The improvement is huge.
But please note that the test result is based on previous patch that
cls_flower uses IDR.
Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, all filters with the same priority are linked in a doubly
linked list. Every filter should have a unique handle. To make the
handle unique, we need to iterate the list every time to see if the
handle exists or not when inserting a new filter. It is time-consuming.
For example, it takes about 5m3.169s to insert 64K rules.
This patch changes cls_flower to use IDR. With this patch, it
takes about 0m1.127s to insert 64K rules. The improvement is huge.
But please note that in this testing, all filters share the same action.
If every filter has a unique action, that is another bottleneck.
Follow-up patch in this patchset addresses that.
Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 45f119bf93.
Eric Dumazet says:
We found at Google a significant regression caused by
45f119bf93 tcp: remove header prediction
In typical RPC (TCP_RR), when a TCP socket receives data, we now call
tcp_ack() while we used to not call it.
This touches enough cache lines to cause a slowdown.
so problem does not seem to be HP removal itself but the tcp_ack()
call. Therefore, it might be possible to remove HP after all, provided
one finds a way to elide tcp_ack for most cases.
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change was a followup to the header prediction removal,
so first revert this as a prerequisite to back out hp removal.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian reported UDP xmit drops that could be root caused to the
too small neigh limit.
Current limit is 64 KB, meaning that even a single UDP socket would hit
it, since its default sk_sndbuf comes from net.core.wmem_default
(~212992 bytes on 64bit arches).
Once ARP/ND resolution is in progress, we should allow a little more
packets to be queued, at least for one producer.
Once neigh arp_queue is filled, a rogue socket should hit its sk_sndbuf
limit and either block in sendmsg() or return -EAGAIN.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new nsh/ directory. It currently holds only GSO functions but more
will come: in particular, code shared by openvswitch and tc to manipulate
NSH headers.
For now, assume there's no hardware support for NSH segmentation. We can
always introduce netdev->nsh_features later.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch handles a default IFE type if it's not given by user space
netlink api. The default IFE type will be the registered ethertype by
IEEE for IFE ForCES.
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A few useful tracepoints to trace bridge forwarding
database updates.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Creating as specific xdp_redirect_map variant of the xdp tracepoints
allow users to write simpler/faster BPF progs that get attached to
these tracepoints.
Goal is to still keep the tracepoints in xdp_redirect and xdp_redirect_map
similar enough, that a tool can read the top part of the TP_STRUCT and
produce similar monitor statistics.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a need to separate the xdp_redirect tracepoint into two
tracepoints, for separating the error case from the normal forward
case.
Due to the extreme speeds XDP is operating at, loading a tracepoint
have a measurable impact. Single core XDP REDIRECT (ethtool tuned
rx-usecs 25) can do 13.7 Mpps forwarding, but loading a simple
bpf_prog at the tracepoint (with a return 0) reduce perf to 10.2 Mpps
(CPU E5-1650 v4 @ 3.60GHz, driver: ixgbe)
The overhead of loading a bpf-based tracepoint can be calculated to
cost 25 nanosec ((1/13782002-1/10267937)*10^9 = -24.83 ns).
Using perf record on the tracepoint event, with a non-matching --filter
expression, the overhead is much larger. Performance drops to 8.3 Mpps,
cost 48 nanosec ((1/13782002-1/8312497)*10^9 = -47.74))
Having a separate tracepoint for err cases, which should be less
frequent, allow running a continuous monitor for errors while not
affecting the redirect forward performance (this have also been
verified by measurements).
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
To make sense of the map index, the tracepoint user also need to know
that map we are talking about. Supply the map pointer but only expose
the map->id.
The 'to_index' is renamed 'to_ifindex'. In the xdp_redirect_map case,
this is the result of the devmap lookup. The map lookup key is exposed
as map_index, which is needed to troubleshoot in case the lookup failed.
The 'to_ifindex' is placed after 'err' to keep TP_STRUCT as common as
possible.
This also keeps the TP_STRUCT similar enough, that userspace can write
a monitor program, that doesn't need to care about whether
bpf_redirect or bpf_redirect_map were used.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Supplying the action argument XDP_REDIRECT to the tracepoint xdp_redirect
is redundant as it is only called in-case this action was specified.
Remove the argument, but keep "act" member of the tracepoint struct and
populate it with XDP_REDIRECT. This makes it easier to write a common bpf_prog
processing events.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIVAwUAWaVa1/Sw1s6N8H32AQLHmw/9FZCtcFZ5WwCqu/14+kpXcYFAbDDS3YQ4
tQPmPC5bLMN+M8oVR2LGw3kSPBU4TCMK83uv3mR3TAToDSJrHJ2EMpoZYdfXZAz6
+NqXfmqBli1/iSSoyUj2zrMteXjYnxAOwJ2ZI9LFgu6hxBrhbvMIWhl0FrCn9tUo
4522YWF6W2R/DdSonEXfyv8kFZibYuCJFSUmRA+3S2B5qCfXdziMWrV1Udc5F/Jh
l/vkU4i68hI2f8n23mVa0GRmV197mDcC6laTSytKuNgeVYdDrgNGqkjESTU6ICha
r2yLUJ2XmLURBqjsq0A+4qP//wBJoAQOUeRrnH6lMQhn+twafpRoF32RZNoAgBsu
T8Riw29djDsdrBlOY2Qg+3Sxy7ttxf6N9cW9weNSpSFOil+z0J4yk6lN7s9XiTl8
5aynEwrILfdtvJPWBx0RhFl6dcTOf0f4faR0I3HI1J7vogrAADCeDTdWkTJ+SmGV
OJ3/retdyra2sss39anG3/Rw5uMi+a66X8zNK7t1xjsXaywF+iINmZds8Ld2sQ+M
aPNH2xlu7/PUSZMbyh3Zjw7mm2HR8nYJ0fH/Y1h3lz4kpLd/PKlnYKIM3XLGQWTY
yNi0IXJFIoakh9u/D7/qscp9w5Y+2WUBzajFhe9tfbG0OKWgNgEKMbupdYksXQpB
SrK6CX9NkhM=
=eSSs
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-next-20170829' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Miscellany
Here are a number of patches that make some changes/fixes and add a couple
of extensions to AF_RXRPC for kernel services to use. The changes and
fixes are:
(1) Use time64_t rather than u32 outside of protocol or
UAPI-representative structures.
(2) Use the correct time stamp when loading a key from an XDR-encoded
Kerberos 5 key.
(3) Fix IPv6 support.
(4) Fix some places where the error code is being incorrectly made
positive before returning.
(5) Remove some white space.
And the extensions:
(6) Add an end-of-Tx phase notification, thereby allowing kAFS to
transition the state on its own call record at the correct point,
rather than having to do it in advance and risk non-completion of the
call in the wrong state.
(7) Allow a kernel client call to be retried if it fails on a network
error, thereby making it possible for kAFS to iterate over a number of
IP addresses without having to reload the Tx queue and re-encrypt data
each time.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There appears to be no need to use rtnl, addrlabel entries are refcounted
and add/delete is serialized by the addrlabel table spinlock.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow a client call that failed on network error to be retried, provided
that the Tx queue still holds DATA packet 1. This allows an operation to
be submitted to another server or another address for the same server
without having to repackage and re-encrypt the data so far processed.
Two new functions are provided:
(1) rxrpc_kernel_check_call() - This is used to find out the completion
state of a call to guess whether it can be retried and whether it
should be retried.
(2) rxrpc_kernel_retry_call() - Disconnect the call from its current
connection, reset the state and submit it as a new client call to a
new address. The new address need not match the previous address.
A call may be retried even if all the data hasn't been loaded into it yet;
a partially constructed will be retained at the same point it was at when
an error condition was detected. msg_data_left() can be used to find out
how much data was packaged before the error occurred.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a callback to rxrpc_kernel_send_data() so that a kernel service can get
a notification that the AF_RXRPC call has transitioned out the Tx phase and
is now waiting for a reply or a final ACK.
This is called from AF_RXRPC with the call state lock held so the
notification is guaranteed to come before any reply is passed back.
Further, modify the AFS filesystem to make use of this so that we don't have
to change the afs_call state before sending the last bit of data.
Signed-off-by: David Howells <dhowells@redhat.com>
call->error is stored as 0 or a negative error code. Don't negate this
value (ie. make it positive) before returning it from a kernel function
(though it should still be negated before passing to userspace through a
control message).
Signed-off-by: David Howells <dhowells@redhat.com>
Fix IPv6 support in AF_RXRPC in the following ways:
(1) When extracting the address from a received IPv4 packet, if the local
transport socket is open for IPv6 then fill out the sockaddr_rxrpc
struct for an IPv4-mapped-to-IPv6 AF_INET6 transport address instead
of an AF_INET one.
(2) When sending CHALLENGE or RESPONSE packets, the transport length needs
to be set from the sockaddr_rxrpc::transport_len field rather than
sizeof() on the IPv4 transport address.
(3) When processing an IPv4 ICMP packet received by an IPv6 socket, set up
the address correctly before searching for the affected peer.
Signed-off-by: David Howells <dhowells@redhat.com>
When an XDR-encoded Kerberos 5 ticket is added as an rxrpc-type key, the
expiry time should be drawn from the k5 part of the token union (which was
what was filled in), rather than the kad part of the union.
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Since the 'expiry' variable of 'struct key_preparsed_payload' has been
changed to 'time64_t' type, which is year 2038 safe on 32bits system.
In net/rxrpc subsystem, we need convert 'u32' type to 'time64_t' type
when copying ticket expires time to 'prep->expiry', then this patch
introduces two helper functions to help convert 'u32' to 'time64_t'
type.
This patch also uses ktime_get_real_seconds() to get current time instead
of get_seconds() which is not year 2038 safe on 32bits system.
Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Make use of the ndo_vlan_rx_{add,kill}_vid callbacks to have the NCSI
stack process new VLAN tags and configure the channel VLAN filter
appropriately.
Several VLAN tags can be set and a "Set VLAN Filter" packet must be sent
for each one, meaning the ncsi_dev_state_config_svf state must be
repeated. An internal list of VLAN tags is maintained, and compared
against the current channel's ncsi_channel_filter in order to keep track
within the state. VLAN filters are removed in a similar manner, with the
introduction of the ncsi_dev_state_config_clear_vids state. The maximum
number of VLAN tag filters is determined by the "Get Capabilities"
response from the channel.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's time to get rid of IRDA. It's long been broken, and no one seems
to use it anymore. So move it to staging and after a while, we can
delete it from there.
To start, move the network irda core from net/irda to
drivers/staging/irda/net/
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 520ac30f45 ("net_sched: drop packets after root qdisc lock
is released) made a big change of tc for performance. But there are
some points which are not changed in SFQ enqueue operation.
1. Fail to find the SFQ hash slot;
2. When the queue is full;
Now use qdisc_drop instead free skb directly.
Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
mechanism between the host and the guest. It uses VMBus ringbuffer as the
transportation layer.
With hv_sock, applications between the host (Windows 10, Windows Server
2016 or newer) and the guest can talk with each other using the traditional
socket APIs.
More info about Hyper-V Sockets is available here:
"Make your own integration services":
https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/user-guide/make-integration-service
The patch implements the necessary support in Linux guest by introducing a new
vsock transport for AF_VSOCK.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Andy King <acking@vmware.com>
Cc: Dmitry Torokhov <dtor@vmware.com>
Cc: George Zhang <georgezhang@vmware.com>
Cc: Jorgen Hansen <jhansen@vmware.com>
Cc: Reilly Grant <grantr@vmware.com>
Cc: Asias He <asias@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Cathy Avery <cavery@redhat.com>
Cc: Rolf Neugebauer <rolf.neugebauer@docker.com>
Cc: Marcelo Cerri <marcelo.cerri@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Twice patches trying to constify inet{6}_protocol have been reverted:
39294c3df2 ("Revert "ipv6: constify inet6_protocol structures"") to
revert 3a3a4e3054 and then 03157937fe ("Revert "ipv4: make
net_protocol const"") to revert aa8db499ea.
Add a comment that the structures can not be const because the
early_demux field can change based on a sysctl.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to gre, vxlan, geneve, ipip tunnels, allow ERSPAN tunnels to
operate in 'collect metadata' mode. bpf_skb_[gs]et_tunnel_key() helpers
can make use of it right away. OVS can use it as well in the future.
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch refactors the gre_fb_xmit function, by creating
prepare_fb_xmit function for later ERSPAN collect_md mode patch.
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make this const as it is either used during a copy operation or passed
to a const argument of the function rhltable_init
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make these const as they are only passed to a const argument of the
function inet_add_protocol.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make this const as it is only passed to a const argument of the function
ebt_register_table.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SOCKMAP uses strparser code (compiled with Kconfig option
CONFIG_STREAM_PARSER) to run the parser BPF program. Without this
config option set sockmap wont be compiled. However, at the moment
the only way to pull in the strparser code is to enable KCM.
To resolve this create a BPF specific config option to pull
only the strparser piece in that sockmap needs. This also
allows folks who want to use BPF/syscall/maps but don't need
sockmap to easily opt out.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
syszkaller got a hang in tcp stack, related to a bug in
tcp_sendpage_locked()
root@syzkaller:~# cat /proc/3059/stack
[<ffffffff83de926c>] __lock_sock+0x1dc/0x2f0
[<ffffffff83de9473>] lock_sock_nested+0xf3/0x110
[<ffffffff8408ce01>] tcp_sendmsg+0x21/0x50
[<ffffffff84163b6f>] inet_sendmsg+0x11f/0x5e0
[<ffffffff83dd8eea>] sock_sendmsg+0xca/0x110
[<ffffffff83dd9547>] kernel_sendmsg+0x47/0x60
[<ffffffff83de35dc>] sock_no_sendpage+0x1cc/0x280
[<ffffffff8408916b>] tcp_sendpage_locked+0x10b/0x160
[<ffffffff84089203>] tcp_sendpage+0x43/0x60
[<ffffffff841641da>] inet_sendpage+0x1aa/0x660
[<ffffffff83dd4fcd>] kernel_sendpage+0x8d/0xe0
[<ffffffff83dd50ac>] sock_sendpage+0x8c/0xc0
[<ffffffff81b63300>] pipe_to_sendpage+0x290/0x3b0
[<ffffffff81b67243>] __splice_from_pipe+0x343/0x750
[<ffffffff81b6a459>] splice_from_pipe+0x1e9/0x330
[<ffffffff81b6a5e0>] generic_splice_sendpage+0x40/0x50
[<ffffffff81b6b1d7>] SyS_splice+0x7b7/0x1610
[<ffffffff84d77a01>] entry_SYSCALL_64_fastpath+0x1f/0xbe
Fixes: 306b13eb3c ("proto_ops: Add locked held versions of sendmsg and sendpage")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is ugly to hide a u32-filter-specific pointer inside Qdisc,
this breaks the TC layers:
1. Qdisc is a generic representation, should not have any specific
data of any type
2. Qdisc layer is above filter layer, should only save filters in
the list of struct tcf_proto.
This pointer is used as the head of the chain of u32 hash tables,
that is struct tc_u_hnode, because u32 filter is very special,
it allows to create multiple hash tables within one qdisc and
across multiple u32 filters.
Instead of using this ugly pointer, we can just save it in a global
hash table key'ed by (dev ifindex, qdisc handle), therefore we can
still treat it as a per qdisc basis data structure conceptually.
Of course, because of network namespaces, this key is not unique
at all, but it is fine as we already have a pointer to Qdisc in
struct tc_u_common, we can just compare the pointers when collision.
And this only affects slow paths, has no impact to fast path,
thanks to the pointer ->tp_c.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For TC classes, their ->get() and ->put() are always paired, and the
reference counting is completely useless, because:
1) For class modification and dumping paths, we already hold RTNL lock,
so all of these ->get(),->change(),->put() are atomic.
2) For filter bindiing/unbinding, we use other reference counter than
this one, and they should have RTNL lock too.
3) For ->qlen_notify(), it is special because it is called on ->enqueue()
path, but we already hold qdisc tree lock there, and we hold this
tree lock when graft or delete the class too, so it should not be gone
or changed until we release the tree lock.
Therefore, this patch removes ->get() and ->put(), but:
1) Adds a new ->find() to find the pointer to a class by classid, no
refcnt.
2) Move the original class destroy upon the last refcnt into ->delete(),
right after releasing tree lock. This is fine because the class is
already removed from hash when holding the lock.
For those who also use ->put() as ->unbind(), just rename them to reflect
this change.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like for TC actions, ->delete() is a special case,
we have to prepare and fill the notification before delete
otherwise would get use-after-free after we remove the
reference count.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is not needed if we move them up properly.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements the following seg6local actions.
- SEG6_LOCAL_ACTION_END_T: regular SRH processing and forward to the
next-hop looked up in the specified routing table.
- SEG6_LOCAL_ACTION_END_DX2: decapsulate an L2 frame and forward it to
the specified network interface.
- SEG6_LOCAL_ACTION_END_DX4: decapsulate an IPv4 packet and forward it,
possibly to the specified next-hop.
- SEG6_LOCAL_ACTION_END_DT6: decapsulate an IPv6 packet and forward it
to the next-hop looked up in the specified routing table.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds three helper functions to be used with the seg6local packet
processing actions.
The decap_and_validate() function will be used by the End.D* actions, that
decapsulate an SR-enabled packet.
The advance_nextseg() function applies the fundamental operations to update
an SRH for the next segment.
The lookup_nexthop() function helps select the next-hop for the processed
SR packets. It supports an optional next-hop address to route the packet
specifically through it, and an optional routing table to use.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch ensures that the seg6local lightweight tunnel is used solely
with IPv6 routes and processes only IPv6 packets.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements the L2 frame encapsulation mechanism, referred to
as T.Encaps.L2 in the SRv6 specifications [1].
A new type of SRv6 tunnel mode is added (SEG6_IPTUN_MODE_L2ENCAP). It only
accepts packets with an existing MAC header (i.e., it will not work for
locally generated packets). The resulting packet looks like IPv6 -> SRH ->
Ethernet -> original L3 payload. The next header field of the SRH is set to
NEXTHDR_NONE.
[1] https://tools.ietf.org/html/draft-filsfils-spring-srv6-network-programming-01
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch enables the SRv6 encapsulation mode to carry an IPv4 payload.
All the infrastructure was already present, I just had to add a parameter
to seg6_do_srh_encap() to specify the inner packet protocol, and perform
some additional checks.
Usage example:
ip route add 1.2.3.4 encap seg6 mode encap segs fc00::1,fc00::2 dev eth0
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow our callers to influence the choice of ECMP link by honoring the
hash passed together with the flow info. This allows for special
treatment of ICMP errors which we would like to route over the same path
as the IPv6 datagram that triggered the error.
Also go through rt6_multipath_hash(), in the usual case when we aren't
dealing with an ICMP error, so that there is one central place where
multipath hash is computed.
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 644d0e6569 ("ipv6 Use get_hash_from_flowi6 for rt6 hash") has
turned rt6_info_hash_nhsfn() into a one-liner, so it no longer makes
sense to keep it around. Also remove the accompanying comment that has
become outdated.
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When forwarding or sending out an ICMPv6 error, look at the embedded
packet that triggered the error and compute a flow hash over its
headers.
This let's us route the ICMP error together with the flow it belongs to
when multipath (ECMP) routing is in use, which in turn makes Path MTU
Discovery work in ECMP load-balanced or anycast setups (RFC 7690).
Granted, end-hosts behind the ECMP router (aka servers) need to reflect
the IPv6 Flow Label for PMTUD to work.
The code is organized to be in parallel with ipv4 stack:
ip_multipath_l3_keys -> ip6_multipath_l3_keys
fib_multipath_hash -> rt6_multipath_hash
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reflecting IPv6 Flow Label at server nodes is useful in environments
that employ multipath routing to load balance the requests. As "IPv6
Flow Label Reflection" standard draft [1] points out - ICMPv6 PTB error
messages generated in response to a downstream packets from the server
can be routed by a load balancer back to the original server without
looking at transport headers, if the server applies the flow label
reflection. This enables the Path MTU Discovery past the ECMP router in
load-balance or anycast environments where each server node is reachable
by only one path.
Introduce a sysctl to enable flow label reflection per net namespace for
all newly created sockets. Same could be earlier achieved only per
socket by setting the IPV6_FL_F_REFLECT flag for the IPV6_FLOWLABEL_MGR
socket option.
[1] https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>