bpf_trace_printk() is a helper function used to debug eBPF programs.
Let socket and TC programs use it as well.
Note, it's DEBUG ONLY helper. If it's used in the program,
the kernel will print warning banner to make sure users don't use
it in production.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eBPF programs attached to kprobes need to filter based on
current->pid, uid and other fields, so introduce helper functions:
u64 bpf_get_current_pid_tgid(void)
Return: current->tgid << 32 | current->pid
u64 bpf_get_current_uid_gid(void)
Return: current_gid << 32 | current_uid
bpf_get_current_comm(char *buf, int size_of_buf)
stores current->comm into buf
They can be used from the programs attached to TC as well to classify packets
based on current task fields.
Update tracex2 example to print histogram of write syscalls for each process
instead of aggregated for all.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__skb_header_pointer() returns a pointer that must be checked.
Fixes infinite loop reported by Alexei, and add __must_check to
catch these errors earlier.
Fixes: 6a74fcf426 ("flow_dissector: add support for dst, hop-by-hop and routing ext hdrs")
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If dst, hop-by-hop or routing extension headers are present determine
length of the options and skip over them in flow dissection.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Need to shift after masking to get label value for comparison.
Fixes: b3baa0fbd0 ("mpls: Add MPLS entropy label in flow_keys")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We saw excessive direct memory compaction triggered by skb_page_frag_refill.
This causes performance issues and add latency. Commit 5640f76858
introduces the order-3 allocation. According to the changelog, the order-3
allocation isn't a must-have but to improve performance. But direct memory
compaction has high overhead. The benefit of order-3 allocation can't
compensate the overhead of direct memory compaction.
This patch makes the order-3 page allocation atomic. If there is no memory
pressure and memory isn't fragmented, the alloction will still success, so we
don't sacrifice the order-3 benefit here. If the atomic allocation fails,
direct memory compaction will not be triggered, skb_page_frag_refill will
fallback to order-0 immediately, hence the direct memory compaction overhead is
avoided. In the allocation failure case, kswapd is waken up and doing
compaction, so chances are allocation could success next time.
alloc_skb_with_frags is the same.
The mellanox driver does similar thing, if this is accepted, we must fix
the driver too.
V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric
V2: make the changelog clearer
Cc: Eric Dumazet <edumazet@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Debabrata Banerjee <dbavatar@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add strings array of the current supported tunable options.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 70008aa50e ("skbuff: convert to skb_orphan_frags") replaced
open coded tests of SKBTX_DEV_ZEROCOPY and skb_copy_ubufs with calls
to helper function skb_orphan_frags. Apply that to the last remaining
open coded site.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
allow programs read/write skb->mark, tc_index fields and
((struct qdisc_skb_cb *)cb)->data.
mark and tc_index are generically useful in TC.
cb[0]-cb[4] are primarily used to pass arguments from one
program to another called via bpf_tail_call() which can
be seen in sockex3_kern.c example.
All fields of 'struct __sk_buff' are readable to socket and tc_cls_act progs.
mark, tc_index are writeable from tc_cls_act only.
cb[0]-cb[4] are writeable by both sockets and tc_cls_act.
Add verifier tests and improve sample code.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eBPF programs attached to ingress and egress qdiscs see inconsistent skb->data.
For ingress L2 header is already pulled, whereas for egress it's present.
This is known to program writers which are currently forced to use
BPF_LL_OFF workaround.
Since programs don't change skb internal pointers it is safe to do
pull/push right around invocation of the program and earlier taps and
later pt->func() will not be affected.
Multiple taps via packet_rcv(), tpacket_rcv() are doing the same trick
around run_filter/BPF_PROG_RUN even if skb_shared.
This fix finally allows programs to use optimized LD_ABS/IND instructions
without BPF_LL_OFF for higher performance.
tc ingress + cls_bpf + samples/bpf/tcbpf1_kern.o
w/o JIT w/JIT
before 20.5 23.6 Mpps
after 21.8 26.6 Mpps
Old programs with BPF_LL_OFF will still work as-is.
We can now undo most of the earlier workaround commit:
a166151cbe ("bpf: fix bpf helpers to use skb->mac_header relative offsets")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In flow dissector if an MPLS header contains an entropy label this is
saved in the new keyid field of flow_keys. The entropy label is
then represented in the flow hash function input.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In flow dissector if a GRE header contains a keyid this is saved in the
new keyid field of flow_keys. The GRE keyid is then represented
in the flow hash function input.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In flow_dissector set the flow label in flow_keys for IPv6. This also
removes the shortcircuiting of flow dissection when a non-zero label
is present, the flow label can be considered to provide additional
entropy for a hash.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In flow_dissector set vlan_id in flow_keys when VLAN is found.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't need to return the IPv6 address hash as part of flow keys.
In general, using the IPv6 address hash is risky in a hash value
since the underlying use of xor provides no entropy. If someone
really needs the hash value they can get it from the full IPv6
addresses in flow keys (e.g. from flow_get_u32_src).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds full IPv6 addresses into flow_keys and uses them as
input to the flow hash function. The implementation supports either
IPv4 or IPv6 addresses in a union, and selector is used to determine
how may words to input to jhash2.
We also add flow_get_u32_dst and flow_get_u32_src functions which are
used to get a u32 representation of the source and destination
addresses. For IPv6, ipv6_addr_hash is called. These functions retain
getting the legacy values of src and dst in flow_keys.
With this patch, Ethertype and IP protocol are now included in the
flow hash input.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes flow hashing to use jhash2 over the flow_keys
structure instead just doing jhash_3words over src, dst, and ports.
This method will allow us take more input into the hashing function
so that we can include full IPv6 addresses, VLAN, flow labels etc.
without needing to resort to xor'ing which makes for a poor hash.
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
key_basic is set twice in __skb_flow_dissect which seems unnecessary.
Remove second one.
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do break when we see routing flag or a non-zero version number in GRE
header.
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix build error:
net/core/filter.c: In function 'bpf_clone_redirect':
net/core/filter.c:1429:18: error: 'struct sk_buff' has no member named 'tc_verd'
if (G_TC_AT(skb2->tc_verd) & AT_INGRESS)
Fixes: 3896d655f4 ("bpf: introduce bpf_clone_redirect() helper")
Reported-by: Or Gerlitz <gerlitz.or@gmail.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow eBPF programs attached to classifier/actions to call
bpf_clone_redirect(skb, ifindex, flags) helper which will
mirror or redirect the packet by dynamic ifindex selection
from within the program to a target device either at ingress
or at egress. Can be used for various scenarios, for example,
to load balance skbs into veths, split parts of the traffic
to local taps, etc.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/phy/amd-xgbe-phy.c
drivers/net/wireless/iwlwifi/Kconfig
include/net/mac80211.h
iwlwifi/Kconfig and mac80211.h were both trivial overlapping
changes.
The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and
the bug fix that happened on the 'net' side is already integrated
into the rest of the amd-xgbe driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
When we scan a packet for GRO processing, we want to see the most
common packet types in the front of the offload_base list.
So add a priority field so we can handle this properly.
IPv4/IPv6 get the highest priority with the implicit zero priority
field.
Next comes ethernet with a priority of 10, and then we have the MPLS
types with a priority of 15.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Suggested-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit f96dee13b8.
It isn't right, ethtool is meant to manage one PHY instance
per netdevice at a time, and this is selected by the SET
command. Therefore by definition the GET command must only
return the settings for the configured and selected PHY.
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
As this is already exported from tracing side via commit d9847d310a
("tracing: Allow BPF programs to call bpf_ktime_get_ns()"), we might
as well want to move it to the core, so also networking users can make
use of it, e.g. to measure diffs for certain flows from ingress/egress.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove automatic variable 'err' in register_netevent_notifier() and
return the result of atomic_notifier_chain_register() directly.
Signed-off-by: Wang Long <long.wanglong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
classic BPF already exposes skb->dev->ifindex via SKF_AD_IFINDEX extension.
Allow eBPF program to access it as well. Note that classic aborts execution
of the program if 'skb->dev == NULL' (which is inconvenient for program
writers), whereas eBPF returns zero in such case.
Also expose the 'skb_iif' field, since programs triggered by redirected
packet need to known the original interface index.
Summary:
__skb->ifindex -> skb->dev->ifindex
__skb->ingress_ifindex -> skb->skb_iif
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
By making sure sk->sk_gso_max_segs minimal value is one,
and sysctl_tcp_min_tso_segs minimal value is one as well,
tcp_tso_autosize() will return a non zero value.
We can then revert 843925f33f
("tcp: Do not apply TSO segment limit to non-TSO packets")
and save few cpu cycles in fast path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
make C=2 CF=-D__CHECK_ENDIAN__ net/core/utils.o
...
net/core/utils.c:307:72: warning: incorrect type in argument 2 (different base types)
net/core/utils.c:307:72: expected restricted __wsum [usertype] addend
net/core/utils.c:307:72: got restricted __be32 [usertype] from
net/core/utils.c:308:34: warning: incorrect type in argument 2 (different base types)
net/core/utils.c:308:34: expected restricted __wsum [usertype] addend
net/core/utils.c:308:34: got restricted __be32 [usertype] to
net/core/utils.c:310:70: warning: incorrect type in argument 2 (different base types)
net/core/utils.c:310:70: expected restricted __wsum [usertype] addend
net/core/utils.c:310:70: got restricted __be32 [usertype] from
net/core/utils.c:310:77: warning: incorrect type in argument 2 (different base types)
net/core/utils.c:310:77: expected restricted __wsum [usertype] addend
net/core/utils.c:310:77: got restricted __be32 [usertype] to
net/core/utils.c:312:72: warning: incorrect type in argument 2 (different base types)
net/core/utils.c:312:72: expected restricted __wsum [usertype] addend
net/core/utils.c:312:72: got restricted __be32 [usertype] from
net/core/utils.c:313:35: warning: incorrect type in argument 2 (different base types)
net/core/utils.c:313:35: expected restricted __wsum [usertype] addend
net/core/utils.c:313:35: got restricted __be32 [usertype] to
Note we can use csum_replace4() helper
Fixes: 58e3cac561 ("net: optimise inet_proto_csum_replace4()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
make C=2 CF=-D__CHECK_ENDIAN__ net/core/secure_seq.o
net/core/secure_seq.c:157:50: warning: restricted __be32 degrades to
integer
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/core/pktgen.c:2672:43: warning: incorrect type in assignment (different base types)
net/core/pktgen.c:2672:43: expected unsigned short [unsigned] [short] [usertype] <noident>
net/core/pktgen.c:2672:43: got restricted __be16 [usertype] protocol
Let's use proper struct ethhdr instead of hard coding everything.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
unix_stream_recvmsg is refactored to unix_stream_read_generic in this
patch and enhanced to deal with pipe splicing. The refactoring is
inneglible, we mostly have to deal with a non-existing struct msghdr
argument.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prepare skb_splice_bits to be able to deal with AF_UNIX sockets.
AF_UNIX sockets don't use lock_sock/release_sock and thus we have to
use a callback to make the locking and unlocking configureable.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/cadence/macb.c
drivers/net/phy/phy.c
include/linux/skbuff.h
net/ipv4/tcp.c
net/switchdev/switchdev.c
Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD}
renaming overlapping with net-next changes of various
sorts.
phy.c was a case of two changes, one adding a local
variable to a function whilst the second was removing
one.
tcp.c overlapped a deadlock fix with the addition of new tcp_info
statistic values.
macb.c involved the addition of two zyncq device entries.
skbuff.h involved adding back ipv4_daddr to nf_bridge_info
whilst net-next changes put two other existing members of
that struct into a union.
Signed-off-by: David S. Miller <davem@davemloft.net>
Giving /proc/net/pktgen/pgctrl an invalid command just returns shell
success and prints a warning in dmesg. This is not very useful for
shell scripting, as it can only detect the error by parsing dmesg.
Instead return -EINVAL when the command is unknown, as this provides
userspace shell scripting a way of detecting this.
Also bump version tag to 2.75, because (1) reading /proc/net/pktgen/pgctrl
output this version number which would allow to detect this small
semantic change, and (2) because the pktgen version tag have not been
updated since 2010.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Too many spaces were introduced in commit 63adc6fb8a ("pktgen: cleanup
checkpatch warnings"), thus misaligning "src_min:" to other columns.
Fixes: 63adc6fb8a ("pktgen: cleanup checkpatch warnings")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When trying to configure the settings for PHY1, using commands
like 'ethtool -s eth0 phyad 1 speed 100', the 'ethtool' seems to
modify other settings apart from the speed of the PHY1, in the
above case.
The ethtool seems to query the settings for PHY0, and use this
as the base to apply the new settings to the PHY1. This is
causing the other settings of the PHY 1 to be wrongly
configured.
The issue is caused by the '_ethtool_get_settings()' API, which
gets called because of the 'ETHTOOL_GSET' command, is clearing
the 'cmd' pointer (of type 'struct ethtool_cmd') by calling
memset. This clears all the parameters (if any) passed for the
'ETHTOOL_GSET' cmd. So the driver's callback is always invoked
with 'cmd->phy_address' as '0'.
The '_ethtool_get_settings()' is called from other files in the
'net/core'. So the fix is applied to the 'ethtool_get_settings()'
which is only called in the context of the 'ethtool'.
Signed-off-by: Arun Parameswaran <aparames@broadcom.com>
Reviewed-by: Ray Jui <rjui@broadcom.com>
Reviewed-by: Scott Branden <sbranden@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This restored previous behaviour. If caller does not want ports to be
filled, we should not break.
Fixes: 06635a35d1 ("flow_dissect: use programable dissector in skb_flow_dissect and friends")
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reduce ifdef pollution slightly, no functional change. We can simply
remove the extra alternative definition of handle_ing() and nf_ingress().
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
[1] When entering NUD_PROBE state via neigh_update(), perhaps received
from userspace, correctly (re)initialize the probes count to zero.
This is useful for forcing revalidation of a neighbor (for example
if the host is attempting to do DNA [IPv4 4436, IPv6 6059]).
[2] Notify listeners when a neighbor goes into NUD_PROBE state.
By sending notifications on entry to NUD_PROBE state listeners get
more timely warnings of imminent connectivity issues.
The current notifications on entry to NUD_STALE have somewhat
limited usefulness: NUD_STALE is a perfectly normal state, as is
NUD_DELAY, whereas notifications on entry to NUD_FAILURE come after
a neighbor reachability problem has been confirmed (typically after
three probes).
Signed-off-by: Erik Kline <ek@google.com>
Acked-By: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The spinlock is used to protect netns_ids which is per net,
so there is no need to use a global spinlock.
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: 06635a35d1 ("flow_dissect: use programable dissector in skb_flow_dissect and friends")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_mem_reclaim_partial() goal is to ensure each socket has
one SK_MEM_QUANTUM forward allocation. This is needed both for
performance and better handling of memory pressure situations in
follow up patches.
SK_MEM_QUANTUM is currently a page, but might be reduced to 4096 bytes
as some arches have 64KB pages.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before the patch, the command 'ip link add bond2 type bond mode 802.3ad'
causes the kernel to send a rtnl message for the bond2 interface, with an
ifindex 0.
'ip monitor' shows:
0: bond2: <BROADCAST,MULTICAST,MASTER> mtu 1500 state DOWN group default
link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
9: bond2@NONE: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN group default
link/ether ea:3e:1f:53:92:7b brd ff:ff:ff:ff:ff:ff
[snip]
The patch fixes the spotted bug by checking in bond driver if the interface
is registered before calling the notifier chain.
It also adds a check in rtmsg_ifinfo() to prevent this kind of bug in the
future.
Fixes: d4261e5650 ("bonding: create netlink event when bonding option is changed")
CC: Jiri Pirko <jiri@resnulli.us>
Reported-by: Julien Meunier <julien.meunier@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit d2788d3488 ("net: sched: further simplify handle_ing")
removed the call to qdisc_enqueue_root().
However, after this removal we no longer set qdisc pkt length.
This breaks traffic policing on ingress.
This is the minimum fix: set qdisc pkt length before tc_classify.
Only setting the length does remove support for 'stab' on ingress, but
as Alexei pointed out:
"Though it was allowed to add qdisc_size_table to ingress, it's useless.
Nothing takes advantage of recomputed qdisc_pkt_len".
Jamal suggested to use qdisc_pkt_len_init(), but as Eric mentioned that
would result in qdisc_pkt_len_init to no longer get inlined due to the
additional 2nd call site.
ingress policing is rare and GRO doesn't really work that well with police
on ingress, as we see packets > mtu and drop skbs that -- without
aggregation -- would still have fitted the policier budget.
Thus to have reliable/smooth ingress policing GRO has to be turned off.
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Fixes: d2788d3488 ("net: sched: further simplify handle_ing")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unlock was missing on error path.
Fixes: 95f38411df ("netns: use a spin_lock to protect nsid management")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the Netfilter ingress hook just after the existing tc ingress
hook, that seems to be the consensus solution for this.
Note that the Netfilter hook resides under the global static key that enables
ingress filtering. Nonetheless, Netfilter still also has its own static key for
minimal impact on the existing handle_ing().
* Without this patch:
Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags)
16086246pps 7721Mb/sec (7721398080bps) errors: 100000000
42.46% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
25.92% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
7.81% kpktgend_0 [pktgen] [k] pktgen_thread_worker
5.62% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
2.70% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
2.34% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
1.44% kpktgend_0 [kernel.kallsyms] [k] __build_skb
* With this patch:
Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags)
16090536pps 7723Mb/sec (7723457280bps) errors: 100000000
41.23% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
26.57% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
7.72% kpktgend_0 [pktgen] [k] pktgen_thread_worker
5.55% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
2.78% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
2.06% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
1.43% kpktgend_0 [kernel.kallsyms] [k] __build_skb
* Without this patch + tc ingress:
tc filter add dev eth4 parent ffff: protocol ip prio 1 \
u32 match ip dst 4.3.2.1/32
Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags)
10788648pps 5178Mb/sec (5178551040bps) errors: 100000000
40.99% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
17.50% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
11.77% kpktgend_0 [cls_u32] [k] u32_classify
5.62% kpktgend_0 [kernel.kallsyms] [k] tc_classify_compat
5.18% kpktgend_0 [pktgen] [k] pktgen_thread_worker
3.23% kpktgend_0 [kernel.kallsyms] [k] tc_classify
2.97% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
1.83% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
1.50% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
0.99% kpktgend_0 [kernel.kallsyms] [k] __build_skb
* With this patch + tc ingress:
tc filter add dev eth4 parent ffff: protocol ip prio 1 \
u32 match ip dst 4.3.2.1/32
Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags)
10743194pps 5156Mb/sec (5156733120bps) errors: 100000000
42.01% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
17.78% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
11.70% kpktgend_0 [cls_u32] [k] u32_classify
5.46% kpktgend_0 [kernel.kallsyms] [k] tc_classify_compat
5.16% kpktgend_0 [pktgen] [k] pktgen_thread_worker
2.98% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
2.84% kpktgend_0 [kernel.kallsyms] [k] tc_classify
1.96% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
1.57% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
Note that the results are very similar before and after.
I can see gcc gets the code under the ingress static key out of the hot path.
Then, on that cold branch, it generates the code to accomodate the netfilter
ingress static key. My explanation for this is that this reduces the pressure
on the instruction cache for non-users as the new code is out of the hot path,
and it comes with minimal impact for tc ingress users.
Using gcc version 4.8.4 on:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
[...]
L1d cache: 16K
L1i cache: 64K
L2 cache: 2048K
L3 cache: 8192K
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This new config switch enables the ingress filtering infrastructure that is
controlled through the ingress_needed static key. This prepares the
introduction of the Netfilter ingress hook that resides under this unique
static key.
Note that CONFIG_SCH_INGRESS automatically selects this, that should be no
problem since this also depends on CONFIG_NET_CLS_ACT.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I had inlined __alloc_rx_skb into __netdev_alloc_skb and
__napi_alloc_skb I had overlooked the fact that there was a return in the
__alloc_rx_skb. As a result we weren't reserving headroom or setting the
skb->dev in certain cases. This change corrects that by adding a couple of
jump labels to jump to depending on __alloc_skb either succeeding or failing.
Fixes: 9451980a66 ("net: Use cached copy of pfmemalloc to avoid accessing page")
Reported-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Tested-by: Kevin Hilman <khilman@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
So far, only hashes made out of ipv6 addresses could be dissected. This
patch introduces support for dissection of full ipv6 addresses.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce dissector infrastructure which allows user to specify which
parts of skb he wants to dissect.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
next to its user. No relation to flow_dissector so it makes no sense to
have it in flow_dissector.c
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
__skb_tx_hash function has no relation to flow_dissect so just move it
to dev.c
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since these functions are defined in flow_dissector.c, move header
declarations from skbuff.h into flow_dissector.h
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Four minor merge conflicts:
1) qca_spi.c renamed the local variable used for the SPI device
from spi_device to spi, meanwhile the spi_set_drvdata() call
got moved further up in the probe function.
2) Two changes were both adding new members to codel params
structure, and thus we had overlapping changes to the
initializer function.
3) 'net' was making a fix to sk_release_kernel() which is
completely removed in 'net-next'.
4) In net_namespace.c, the rtnl_net_fill() call for GET operations
had the command value fixed, meanwhile 'net-next' adjusted the
argument signature a bit.
This also matches example merge resolutions posted by Stephen
Rothwell over the past two days.
Signed-off-by: David S. Miller <davem@davemloft.net>
Older gcc versions (e.g. gcc version 4.4.6) don't like anonymous unions
which was causing build issues on the newly added switchdev attr/obj
structs. Fix this by using named union on structs.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pkt_gen->last_ok was not set properly, so after the first burst
pktgen instead of allocating new packet, will reuse old one, advance
eth_type_trans further, which would mean the stack will be seeing very
short bogus packets.
Fixes: 62f64aed62 ("pktgen: introduce xmit_mode '<start_xmit|netif_receive>'")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These functions compile to 60 bytes of machine code each.
With this .config: http://busybox.net/~vda/kernel_config
there are 617 calls of netif_tx_stop_queue()
and 49 calls of netif_tx_stop_all_queues() in vmlinux.
To fix this, remove WARN_ON in netif_tx_stop_queue()
as suggested by davem, and deinline netif_tx_stop_all_queues().
Change in code size is about 20k:
text data bss dec hex filename
82426986 22255416 20627456 125309858 77813a2 vmlinux.before
82406248 22255416 20627456 125289120 777c2a0 vmlinux
gcc-4.7.2 still creates deinlined version of netif_tx_stop_queue
sometimes:
$ nm --size-sort vmlinux | grep netif_tx_stop_queue | wc -l
190
ffffffff81b558a8 <netif_tx_stop_queue>:
ffffffff81b558a8: 55 push %rbp
ffffffff81b558a9: 48 89 e5 mov %rsp,%rbp
ffffffff81b558ac: f0 80 8f e0 01 00 00 lock orb $0x1,0x1e0(%rdi)
ffffffff81b558b3: 01
ffffffff81b558b4: 5d pop %rbp
ffffffff81b558b5: c3 retq
This needs additional fixing.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Alexei Starovoitov <alexei.starovoitov@gmail.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Joe Perches <joe@perches.com>
CC: David S. Miller <davem@davemloft.net>
CC: Jiri Pirko <jpirko@redhat.com>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: netfilter-devel@vger.kernel.org
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Usually, RTM_NEWxxx is returned on a get (same as a dump).
Fixes: 0c7aecd4bd ("netns: add rtnl cmd to add and get peer netns ids")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa said remove the feature flag for this series and she'll work on
bringing it back if needed at a later date.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch ID is just a gettable port attribute. Convert switchdev op
switchdev_parent_id_get to a switchdev attr.
Note: for sysfs and netlink interfaces, SWITCHDEV_ATTR_PORT_PARENT_ID is
called with SWITCHDEV_F_NO_RECUSE to limit switch ID user-visiblity to only
port netdevs. So when a port is stacked under bond/bridge, the user can
only query switch id via the switch ports, but not via the upper devices
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Turned out that "switchdev" sticks. So just unify all related terms to use
this prefix.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change adds a function called skb_free_frag which is meant to
compliment the function netdev_alloc_frag. The general idea is to enable a
more lightweight version of page freeing since we don't actually need all
the overhead of a put_page, and we don't quite fit the model of __free_pages.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change moves the __alloc_page_frag functionality out of the networking
stack and into the page allocation portion of mm. The idea it so help make
this maintainable by placing it with other page allocation functions.
Since we are moving it from skbuff.c to page_alloc.c I have also renamed
the basic defines and structure from netdev_alloc_cache to page_frag_cache
to reflect that this is now part of a different kernel subsystem.
I have also added a simple __free_page_frag function which can handle
freeing the frags based on the skb->head pointer. The model for this is
based off of __free_pages since we don't actually need to deal with all of
the cases that put_page handles. I incorporated the virt_to_head_page call
and compound_order into the function as it actually allows for a signficant
size reduction by reducing code duplication.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change makes it so that we store the virtual address of the page
in the netdev_alloc_cache instead of the page pointer. The idea behind
this is to avoid multiple calls to page_address since the virtual address
is required for every access, but the page pointer is only needed at
allocation or reset of the page.
While I was at it I also reordered the netdev_alloc_cache structure a bit
so that the size is always 16 bytes by dropping size in the case where
PAGE_SIZE is greater than or equal to 32KB.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While testing I found that the testing for pfmemalloc in build_skb was
rather expensive. I found the issue to be two-fold. First we have to get
from the virtual address to the head page and that comes at the cost of
something like 11 cycles. Then there is the cost for reading pfmemalloc out
of the head page which can be cache cold due to the fact that
put_page_testzero is likely invalidating the cache-line on one or more
CPUs as the fragments can be shared.
To avoid this extra expense I have added a pfmemalloc member to the
netdev_alloc_cache. I then pushed pieces of __alloc_rx_skb into
__napi_alloc_skb and __netdev_alloc_skb so that I could rewrite them to
make use of the cached pfmemalloc value. The result is that my perf traces
show a reduction from 9.28% overhead to 3.7% for the code covered by
build_skb, __alloc_rx_skb, and __napi_alloc_skb when performing a test with
the packet being dropped instead of being handed to napi_gro_receive.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ingress qdisc has no other purpose than calling into tc_classify()
that executes attached classifier(s) and action(s).
It has a 1:1 relationship to dev->ingress_queue. After having commit
087c1a601a ("net: sched: run ingress qdisc without locks") removed
the central ingress lock, one major contention point is gone.
The extra indirection layers however, are not necessary for calling
into ingress qdisc. pktgen calling locally into netif_receive_skb()
with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon
E3-1240: before ~21,1 Mpps, after patch ~22,9 Mpps.
We can redirect the private classifier list to the netdev directly,
without changing any classifier API bits (!) and execute on that from
handle_ing() side. The __QDISC_STATE_DEACTIVATE test can be removed,
ingress qdisc doesn't have a queue and thus dev_deactivate_queue()
is also not applicable, ingress_cl_list provides similar behaviour.
In other words, ingress qdisc acts like TCQ_F_BUILTIN qdisc.
One next possible step is the removal of the dev's ingress (dummy)
netdev_queue, and to only have the list member in the netdevice
itself.
Note, the filter chain is RCU protected and individual filter elements
are being kfree'd by sched subsystem after RCU grace period. RCU read
lock is being held by __netif_receive_skb_core().
Joint work with Alexei Starovoitov.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Given quite some code has been removed from ing_filter(), we can just
consolidate that function into handle_ing() and get rid of a few
instructions at the same time.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These functions are no longer needed and no longer used kill them.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that sk_alloc knows when a kernel socket is being allocated modify
it to not reference count the network namespace of kernel sockets.
Keep track of if a socket needs reference counting by adding a flag to
struct sock called sk_net_refcnt.
Update all of the callers of sock_create_kern to stop using
sk_change_net and sk_release_kernel as those hacks are no longer
needed, to avoid reference counting a kernel socket.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for changing how struct net is refcounted
on kernel sockets pass the knowledge that we are creating
a kernel socket from sock_create_kern through to sk_alloc.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce xmit_mode 'netif_receive' for pktgen which generates the
packets using familiar pktgen commands, but feeds them into
netif_receive_skb() instead of ndo_start_xmit().
Default mode is called 'start_xmit'.
It is designed to test netif_receive_skb and ingress qdisc
performace only. Make sure to understand how it works before
using it for other rx benchmarking.
Sample script 'pktgen.sh':
\#!/bin/bash
function pgset() {
local result
echo $1 > $PGDEV
result=`cat $PGDEV | fgrep "Result: OK:"`
if [ "$result" = "" ]; then
cat $PGDEV | fgrep Result:
fi
}
[ -z "$1" ] && echo "Usage: $0 DEV" && exit 1
ETH=$1
PGDEV=/proc/net/pktgen/kpktgend_0
pgset "rem_device_all"
pgset "add_device $ETH"
PGDEV=/proc/net/pktgen/$ETH
pgset "xmit_mode netif_receive"
pgset "pkt_size 60"
pgset "dst 198.18.0.1"
pgset "dst_mac 90:e2:ba:ff:ff:ff"
pgset "count 10000000"
pgset "burst 32"
PGDEV=/proc/net/pktgen/pgctrl
echo "Running... ctrl^C to stop"
pgset "start"
echo "Done"
cat /proc/net/pktgen/$ETH
Usage:
$ sudo ./pktgen.sh eth2
...
Result: OK: 232376(c232372+d3) usec, 10000000 (60byte,0frags)
43033682pps 20656Mb/sec (20656167360bps) errors: 10000000
Raw netif_receive_skb speed should be ~43 million packet
per second on 3.7Ghz x86 and 'perf report' should look like:
37.69% kpktgend_0 [kernel.vmlinux] [k] __netif_receive_skb_core
25.81% kpktgend_0 [kernel.vmlinux] [k] kfree_skb
7.22% kpktgend_0 [kernel.vmlinux] [k] ip_rcv
5.68% kpktgend_0 [pktgen] [k] pktgen_thread_worker
If fib_table_lookup is seen on top, it means skb was processed
by the stack. To benchmark netif_receive_skb only make sure
that 'dst_mac' of your pktgen script is different from
receiving device mac and it will be dropped by ip_rcv
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow flag NO_TIMESTAMP to turn timestamping on again, like other flags,
with a negation of the flag like !NO_TIMESTAMP.
Also document the option flag NO_TIMESTAMP.
Fixes: afb84b6261 ("pktgen: add flag NO_TIMESTAMP to disable timestamping")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
More accurately, listen all netns that have a nsid assigned into the netns
where the netlink socket is opened.
For this purpose, a netlink socket option is added:
NETLINK_LISTEN_ALL_NSID. When this option is set on a netlink socket, this
socket will receive netlink notifications from all netns that have a nsid
assigned into the netns where the socket has been opened. The nsid is sent
to userland via an anscillary data.
With this patch, a daemon needs only one socket to listen many netns. This
is useful when the number of netns is high.
Because 0 is a valid value for a nsid, the field nsid_is_set indicates if
the field nsid is valid or not. skb->cb is initialized to 0 on skb
allocation, thus we are sure that we will never send a nsid 0 by error to
the userland.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this patch, nsid were protected by the rtnl lock. The goal of this
patch is to be able to find a nsid without needing to hold the rtnl lock.
The next patch will introduce a netlink socket option to listen to all
netns that have a nsid assigned into the netns where the socket is opened.
Thus, it's important to call rtnl_net_notifyid() outside the spinlock, to
avoid a recursive lock (nsid are notified via rtnl). This was the main
reason of the previous patch.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no functional change with this patch. It will ease the refactoring
of the locking system that protects nsids and the support of the netlink
socket option NETLINK_LISTEN_ALL_NSID.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a following commit, a new function will be introduced to only lookup for
a nsid (no allocation if the nsid doesn't exist). To avoid confusion, the
existing function is renamed.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The goal of this commit is to prepare the rework of the locking of nsnid
protection.
After this patch, rtnl_net_notifyid() will not call anymore __peernet2id(),
ie no idr_* operation into this function.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
All callers of this function expect a nsid, not an error.
Thus, returns NETNSA_NSID_NOT_ASSIGNED in case of error so that callers
don't have to convert the error to NETNSA_NSID_NOT_ASSIGNED.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under tcp memory pressure, calling epoll_wait() in edge triggered
mode after -EAGAIN, can result in an indefinite hang in epoll_wait(),
even when there is sufficient memory available to continue making
progress. The problem is that when __sk_mem_schedule() returns 0
under memory pressure, we do not set the SOCK_NOSPACE flag in the
tcp write paths (tcp_sendmsg() or do_tcp_sendpages()). Then, since
SOCK_NOSPACE is used to trigger wakeups when incoming acks create
sufficient new space in the write queue, all outstanding packets
are acked, but we never wake up with the the EPOLLOUT that we are
expecting from epoll_wait().
This issue is currently limited to epoll() when used in edge trigger
mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE.
This is sufficient for poll()/select() and epoll() in level trigger
mode. However, in edge trigger mode, epoll() is relying on the write
path to set SOCK_NOSPACE. EPOLL(7) says that in edge-trigger mode we
can only call epoll_wait() after read/write return -EAGAIN. Thus, in
the case of the socket write, we are relying on the fact that
tcp_sendmsg()/network write paths are going to issue a wakeup for
us at some point in the future when we get -EAGAIN.
Normally, epoll() edge trigger works fine when we've exceeded the
sk->sndbuf because in that case we do set SOCK_NOSPACE. However, when
we return -EAGAIN from the write path b/c we are over the tcp memory
limits and not b/c we are over the sndbuf, we are never going to get
another wakeup.
I can reproduce this issue, using SO_SNDBUF, since __sk_mem_schedule()
will return 0, or failure more readily with SO_SNDBUF:
1) create socket and set SO_SNDBUF to N
2) add socket as edge trigger
3) write to socket and block in epoll on -EAGAIN
4) cause tcp mem pressure via: echo "<small val>" > net.ipv4.tcp_mem
The fix here is simply to set SOCK_NOSPACE in sk_stream_wait_memory()
when the socket is non-blocking. Note that SOCK_NOSPACE, in addition
to waking up outstanding waiters is also used to expand the size of
the sk->sndbuf. However, we will not expand it by setting it in this
case because tcp_should_expand_sndbuf(), ensures that no expansion
occurs when we are under tcp memory pressure.
Note that we could still hang if sk->sk_wmem_queue is 0, when we get
the -EAGAIN. In this case the SOCK_NOSPACE bit will not help, since we
are waiting for and event that will never happen. I believe
that this case is harder to hit (and did not hit in my testing),
in that over the tcp 'soft' memory limits, we continue to guarantee a
minimum write buffer size. Perhaps, we could return -ENOSPC in this
case, or maybe we simply issue a wakeup in this case, such that we
keep retrying the write. Note that this case is not specific to
epoll() ET, but rather would affect blocking sockets as well. So I
view this patch as bringing epoll() edge-trigger into sync with the
current poll()/select()/epoll() level trigger and blocking sockets
behavior.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Seccomp has always been a special candidate when it comes to preparation
of its filters in seccomp_prepare_filter(). Due to the extra checks and
filter rewrite it partially duplicates code and has BPF internals exposed.
This patch adds a generic API inside the BPF code code that seccomp can use
and thus keep it's filter preparation code minimal and better maintainable.
The other side-effect is that now classic JITs can add seccomp support as
well by only providing a BPF_LDX | BPF_W | BPF_ABS translation.
Tested with seccomp and BPF test suites.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Nicolas Schichan <nschichan@freebox.fr>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When seccomp BPF was added, it was discussed to add __GFP_NOWARN
flag for their configuration path as f.e. up to 32K allocations are
more prone to fail under stress. As we're going to reuse BPF API,
add __GFP_NOWARN flags where larger kmalloc() and friends allocations
could fail.
It doesn't make much sense to pass around __GFP_NOWARN everywhere as
an extra argument only for seccomp while we just as well could run
into similar issues for socket filters, where it's not desired to
have a user application throw a WARN() due to allocation failure.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Nicolas Schichan <nschichan@freebox.fr>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the calls to bpf_check_classic(), bpf_convert_filter() and
bpf_migrate_runtime() and let bpf_prepare_filter() take care of that
instead.
seccomp_check_filter() is passed to bpf_prepare_filter() so that it
gets called from there, after bpf_check_classic().
We can now remove exposure of two internal classic BPF functions
previously used by seccomp. The export of bpf_check_classic() symbol,
previously known as sk_chk_filter(), was there since pre git times,
and no in-tree module was using it, therefore remove it.
Joint work with Daniel Borkmann.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is in preparation for use by the seccomp code, the rationale is
not to duplicate additional code within the seccomp layer, but instead,
have it abstracted and hidden within the classic BPF API.
As an interim step, this now also makes bpf_prepare_filter() visible
(not as exported symbol though), so that seccomp can reuse that code
path instead of reimplementing it.
Joint work with Daniel Borkmann.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
> net/core/skbuff.c:4108:13: sparse: incorrect type in assignment (different base types)
> net/ipv6/mcast_snoop.c:63 ipv6_mc_check_exthdrs() warn: unsigned 'offset' is never less than zero.
Introduced by 9afd85c9e4
("net: Export IGMP/MLD message validation code")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code in __netdev_upper_dev_link() has an over-stringent
loop detection logic that actually prevents valid configurations
from working correctly.
In particular, the logic returns an error if an upper device
is already in the list of all upper devices for a given dev.
This particular check seems to be a overzealous as it disallows
perfectly valid configurations. For example:
# ip l a link eth0 name eth0.10 type vlan id 10
# ip l a dev br0 typ bridge
# ip l s eth0.10 master br0
# ip l s eth0 master br0 <--- Will fail
If you switch the last two commands (add eth0 first), then both
will succeed. If after that, you remove eth0 and try to re-add
it, it will fail!
It appears to be enough to simply check adj_list to keeps things
safe.
I've tried stacking multiple devices multiple times in all different
combinations, and either rx_handler registration prevented the stacking
of the device linking cought the error.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this patch, the IGMP and MLD message validation functions are moved
from the bridge code to IPv4/IPv6 multicast files. Some small
refactoring was done to enhance readibility and to iron out some
differences in behaviour between the IGMP and MLD parsing code (e.g. the
skb-cloning of MLD messages is now only done if necessary, just like the
IGMP part always did).
Finally, these IGMP and MLD message validation functions are exported so
that not only the bridge can use it but batman-adv later, too.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>