Sasha's report:
> While fuzzing with trinity inside a KVM tools guest running the latest -next
> kernel with the KASAN patchset, I've stumbled on the following spew:
>
> [ 4448.949424] ==================================================================
> [ 4448.951737] AddressSanitizer: user-memory-access on address 0
> [ 4448.952988] Read of size 2 by thread T19638:
> [ 4448.954510] CPU: 28 PID: 19638 Comm: trinity-c76 Not tainted 3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #813
> [ 4448.956823] ffff88046d86ca40 0000000000000000 ffff880082f37e78 ffff880082f37a40
> [ 4448.958233] ffffffffb6e47068 ffff880082f37a68 ffff880082f37a58 ffffffffb242708d
> [ 4448.959552] 0000000000000000 ffff880082f37a88 ffffffffb24255b1 0000000000000000
> [ 4448.961266] Call Trace:
> [ 4448.963158] dump_stack (lib/dump_stack.c:52)
> [ 4448.964244] kasan_report_user_access (mm/kasan/report.c:184)
> [ 4448.965507] __asan_load2 (mm/kasan/kasan.c:352)
> [ 4448.966482] ? netlink_sendmsg (net/netlink/af_netlink.c:2339)
> [ 4448.967541] netlink_sendmsg (net/netlink/af_netlink.c:2339)
> [ 4448.968537] ? get_parent_ip (kernel/sched/core.c:2555)
> [ 4448.970103] sock_sendmsg (net/socket.c:654)
> [ 4448.971584] ? might_fault (mm/memory.c:3741)
> [ 4448.972526] ? might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3740)
> [ 4448.973596] ? verify_iovec (net/core/iovec.c:64)
> [ 4448.974522] ___sys_sendmsg (net/socket.c:2096)
> [ 4448.975797] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [ 4448.977030] ? lock_release_holdtime (kernel/locking/lockdep.c:273)
> [ 4448.978197] ? lock_release_non_nested (kernel/locking/lockdep.c:3434 (discriminator 1))
> [ 4448.979346] ? check_chain_key (kernel/locking/lockdep.c:2188)
> [ 4448.980535] __sys_sendmmsg (net/socket.c:2181)
> [ 4448.981592] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
> [ 4448.982773] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
> [ 4448.984458] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1500 (discriminator 2))
> [ 4448.985621] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
> [ 4448.986754] SyS_sendmmsg (net/socket.c:2201)
> [ 4448.987708] tracesys (arch/x86/kernel/entry_64.S:542)
> [ 4448.988929] ==================================================================
This reports means that we've come to netlink_sendmsg() with msg->msg_name == NULL and msg->msg_namelen > 0.
After this report there was no usual "Unable to handle kernel NULL pointer dereference"
and this gave me a clue that address 0 is mapped and contains valid socket address structure in it.
This bug was introduced in f3d3342602
(net: rework recvmsg handler msg_name and msg_namelen logic).
Commit message states that:
"Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
affect sendto as it would bail out earlier while trying to copy-in the
address."
But in fact this affects sendto when address 0 is mapped and contains
socket address structure in it. In such case copy-in address will succeed,
verify_iovec() function will successfully exit with msg->msg_namelen > 0
and msg->msg_name == NULL.
This patch fixes it by setting msg_namelen to 0 if msg_name == NULL.
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: <stable@vger.kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SO_TIMESTAMPING API defines three types of timestamps: software,
hardware in raw format (hwtstamp) and hardware converted to system
format (syststamp). The last has been deprecated in favor of combining
hwtstamp with a PTP clock driver. There are no active users in the
kernel.
The option was device driver dependent. If set, but without hardware
support, the correct behavior is to return zero in the relevant field
in the SCM_TIMESTAMPING ancillary message. Without device drivers
implementing the option, this field is effectively always zero.
Remove the internal plumbing to dissuage new drivers from implementing
the feature. Keep the SOF_TIMESTAMPING_SYS_HARDWARE flag, however, to
avoid breaking existing applications that request the timestamp.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ndm_type means L3 address type, in neighbour proxy and vxlan, it's RTN_UNICAST.
NDA_DST is for netlink TLV type, hence it's not right value in this context.
Signed-off-by: Jun Zhao <mypopydev@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
"net" is normally for struct net*, pointer to struct net_device
should be named to either "dev" or "ndev" etc.
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eBPF is used by socket filtering, seccomp and soon by tracing and
exposed to userspace, therefore 'sock_filter_int' name is not accurate.
Rename it to 'bpf_insn'
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
BPF is used in several kernel components. This split creates logical boundary
between generic eBPF core and the rest
kernel/bpf/core.c: eBPF interpreter
net/core/filter.c: classic->eBPF converter, classic verifiers, socket filters
This patch only moves functions.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It hasn't been used since commit 0fd7bac(net: relax rcvbuf limits).
Signed-off-by: Sorin Dumitru <sorin@returnze.ro>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/infiniband/hw/cxgb4/device.c
The cxgb4 conflict was simply overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently it's done silently (from the kernel part), and thus it might be
hard to track the renames from logs.
Add a simple netdev_info() to notify the rename, but only in case the
previous name was valid.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Vlad Yasevich <vyasevic@redhat.com>
CC: stephen hemminger <stephen@networkplumber.org>
CC: Jerry Chu <hkchu@google.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This way we'll always know in what status the device is, unless it's
running normally (i.e. NETDEV_REGISTERED).
Also, emit a warning once in case of a bad reg_state.
CC: "David S. Miller" <davem@davemloft.net>
CC: Jason Baron <jbaron@akamai.com>
CC: Eric Dumazet <edumazet@google.com>
CC: Vlad Yasevich <vyasevic@redhat.com>
CC: stephen hemminger <stephen@networkplumber.org>
CC: Jerry Chu <hkchu@google.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: Joe Perches <joe@perches.com>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change cleans up ndo_dflt_fdb_del to drop the ENOTSUPP return value since
that isn't actually returned anywhere in the code. As a result we are able to
drop a few lines by just defaulting this to -EINVAL.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixed a bug that was introduced by my GRE-GRO patch
(bf5a755f5e net-gre-gro: Add GRE
support to the GRO stack) that breaks the forwarding path
because various GSO related fields were not set. The bug will
cause on the egress path either the GSO code to fail, or a
GRE-TSO capable (NETIF_F_GSO_GRE) NICs to choke. The following
fix has been tested for both cases.
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This passes down NET_NAME_USER (or NET_NAME_ENUM) to alloc_netdev(),
for any device created over rtnetlink.
v9: restore reverse-christmas-tree order of local variables
Signed-off-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on a patch from David Herrmann.
This is the only place devices can be renamed.
v9: restore revers-christmas-tree order of local variables
Signed-off-by: Tom Gundersen <teg@jklm.no>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on a patch by David Herrmann.
The name_assign_type attribute gives hints where the interface name of a
given net-device comes from. These values are currently defined:
NET_NAME_ENUM:
The ifname is provided by the kernel with an enumerated
suffix, typically based on order of discovery. Names may
be reused and unpredictable.
NET_NAME_PREDICTABLE:
The ifname has been assigned by the kernel in a predictable way
that is guaranteed to avoid reuse and always be the same for a
given device. Examples include statically created devices like
the loopback device and names deduced from hardware properties
(including being given explicitly by the firmware). Names
depending on the order of discovery, or in any other way on the
existence of other devices, must not be marked as PREDICTABLE.
NET_NAME_USER:
The ifname was provided by user-space during net-device setup.
NET_NAME_RENAMED:
The net-device has been renamed from userspace. Once this type is set,
it cannot change again.
NET_NAME_UNKNOWN:
This is an internal placeholder to indicate that we yet haven't yet
categorized the name. It will not be exposed to userspace, rather
-EINVAL is returned.
The aim of these patches is to improve user-space renaming of interfaces. As
a general rule, userspace must rename interfaces to guarantee that names stay
the same every time a given piece of hardware appears (at boot, or when
attaching it). However, there are several situations where userspace should
not perform the renaming, and that depends on both the policy of the local
admin, but crucially also on the nature of the current interface name.
If an interface was created in repsonse to a userspace request, and userspace
already provided a name, we most probably want to leave that name alone. The
main instance of this is wifi-P2P devices created over nl80211, which currently
have a long-standing bug where they are getting renamed by udev. We label such
names NET_NAME_USER.
If an interface, unbeknown to us, has already been renamed from userspace, we
most probably want to leave also that alone. This will typically happen when
third-party plugins (for instance to udev, but the interface is generic so could
be from anywhere) renames the interface without informing udev about it. A
typical situation is when you switch root from an installer or an initrd to the
real system and the new instance of udev does not know what happened before
the switch. These types of problems have caused repeated issues in the past. To
solve this, once an interface has been renamed, its name is labelled
NET_NAME_RENAMED.
In many cases, the kernel is actually able to name interfaces in such a
way that there is no need for userspace to rename them. This is the case when
the enumeration order of devices, or in fact any other (non-parent) device on
the system, can not influence the name of the interface. Examples include
statically created devices, or any naming schemes based on hardware properties
of the interface. In this case the admin may prefer to use the kernel-provided
names, and to make that possible we label such names NET_NAME_PREDICTABLE.
We want the kernel to have tho possibilty of performing predictable interface
naming itself (and exposing to userspace that it has), as the information
necessary for a proper naming scheme for a certain class of devices may not
be exposed to userspace.
The case where renaming is almost certainly desired, is when the kernel has
given the interface a name using global device enumeration based on order of
discovery (ethX, wlanY, etc). These naming schemes are labelled NET_NAME_ENUM.
Lastly, a fallback is left as NET_NAME_UNKNOWN, to indicate that a driver has
not yet been ported. This is mostly useful as a transitionary measure, allowing
us to label the various naming schemes bit by bit.
v8: minor documentation fixes
v9: move comment to the right commit
Signed-off-by: Tom Gundersen <teg@jklm.no>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Reviewed-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Bluetooth pairing fixes from Johan Hedberg.
2) ieee80211_send_auth() doesn't allocate enough tail room for the SKB,
from Max Stepanov.
3) New iwlwifi chip IDs, from Oren Givon.
4) bnx2x driver reads wrong PCI config space MSI register, from Yijing
Wang.
5) IPV6 MLD Query validation isn't strong enough, from Hangbin Liu.
6) Fix double SKB free in openvswitch, from Andy Zhou.
7) Fix sk_dst_set() being racey with UDP sockets, leading to strange
crashes, from Eric Dumazet.
8) Interpret the NAPI budget correctly in the new systemport driver,
from Florian Fainelli.
9) VLAN code frees percpu stats in the wrong place, leading to crashes
in the get stats handler. From Eric Dumazet.
10) TCP sockets doing a repair can crash with a divide by zero, because
we invoke tcp_push() with an MSS value of zero. Just skip that part
of the sendmsg paths in repair mode. From Christoph Paasch.
11) IRQ affinity bug fixes in mlx4 driver from Amir Vadai.
12) Don't ignore path MTU icmp messages with a zero mtu, machines out
there still spit them out, and all of our per-protocol handlers for
PMTU can cope with it just fine. From Edward Allcutt.
13) Some NETDEV_CHANGE notifier invocations were not passing in the
correct kind of cookie as the argument, from Loic Prylli.
14) Fix crashes in long multicast/broadcast reassembly, from Jon Paul
Maloy.
15) ip_tunnel_lookup() doesn't interpret wildcard keys correctly, fix
from Dmitry Popov.
16) Fix skb->sk assigned without taking a reference to 'sk' in
appletalk, from Andrey Utkin.
17) Fix some info leaks in ULP event signalling to userspace in SCTP,
from Daniel Borkmann.
18) Fix deadlocks in HSO driver, from Olivier Sobrie.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (93 commits)
hso: fix deadlock when receiving bursts of data
hso: remove unused workqueue
net: ppp: don't call sk_chk_filter twice
mlx4: mark napi id for gro_skb
bonding: fix ad_select module param check
net: pppoe: use correct channel MTU when using Multilink PPP
neigh: sysctl - simplify address calculation of gc_* variables
net: sctp: fix information leaks in ulpevent layer
MAINTAINERS: update r8169 maintainer
net: bcmgenet: fix RGMII_MODE_EN bit
tipc: clear 'next'-pointer of message fragments before reassembly
r8152: fix r8152_csum_workaround function
be2net: set EQ DB clear-intr bit in be_open()
GRE: enable offloads for GRE
farsync: fix invalid memory accesses in fst_add_one() and fst_init_card()
igb: do a reset on SR-IOV re-init if device is down
igb: Workaround for i210 Errata 25: Slow System Clock
usbnet: smsc95xx: add reset_resume function with reset operation
dp83640: Always decode received status frames
r8169: disable L23
...
The code in neigh_sysctl_register() relies on a specific layout of
struct neigh_table, namely that the 'gc_*' variables are directly
following the 'parms' member in a specific order. The code, though,
expresses this in the most ugly way.
Get rid of the ugly casts and use the 'tbl' pointer to get a handle to
the table. This way we can refer to the 'gc_*' variables directly.
Similarly seen in the grsecurity patch, written by Brad Spengler.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add const attribute to filter argument to make clear it is no
longer modified.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Actually better than brctl showmacs because we can filter by bridge
port in the kernel.
The current bridge netlink interface doesnt scale when you have many
bridges each with large fdbs or even bridges with many bridge ports
And now for the science non-fiction novel you have all been
waiting for..
//lets see what bridge ports we have
root@moja-1:/configs/may30-iprt/bridge# ./bridge link show
8: eth1 state DOWN : <BROADCAST,MULTICAST> mtu 1500 master br0 state
disabled priority 32 cost 19
17: sw1-p1 state DOWN : <BROADCAST,NOARP> mtu 1500 master br0 state
disabled priority 32 cost 100
// show all..
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show
33:33:00:00:00:01 dev bond0 self permanent
33:33:00:00:00:01 dev dummy0 self permanent
33:33:00:00:00:01 dev ifb0 self permanent
33:33:00:00:00:01 dev ifb1 self permanent
33:33:00:00:00:01 dev eth0 self permanent
01:00:5e:00:00:01 dev eth0 self permanent
33:33:ff:22:01:01 dev eth0 self permanent
02:00:00:12:01:02 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:05 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:07 dev eth1 self permanent
33:33:00:00:00:01 dev eth1 self permanent
33:33:00:00:00:01 dev gretap0 self permanent
da:ac:46:27:d9:53 dev sw1-p1 vlan 0 master br0 permanent
33:33:00:00:00:01 dev sw1-p1 self permanent
//filter by bridge
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show br br0
02:00:00:12:01:02 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:05 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:07 dev eth1 self permanent
33:33:00:00:00:01 dev eth1 self permanent
da:ac:46:27:d9:53 dev sw1-p1 vlan 0 master br0 permanent
33:33:00:00:00:01 dev sw1-p1 self permanent
// bridge sw1 has no ports attached..
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show br sw1
//filter by port
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show brport eth1
02:00:00:12:01:02 vlan 0 master br0 permanent
00:17:42:8a:b4:05 vlan 0 master br0 permanent
00:17:42:8a:b4:07 self permanent
33:33:00:00:00:01 self permanent
// filter by port + bridge
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show br br0 brport
sw1-p1
da:ac:46:27:d9:53 vlan 0 master br0 permanent
33:33:00:00:00:01 self permanent
// for shits and giggles (as they say in New Brunswick), lets
// change the mac that br0 uses
// Note: a magical fdb entry with no brport is added ...
root@moja-1:/configs/may30-iprt/bridge# ip link set dev br0 address
02:00:00:12:01:04
// lets see if we can see the unicorn ..
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show
33:33:00:00:00:01 dev bond0 self permanent
33:33:00:00:00:01 dev dummy0 self permanent
33:33:00:00:00:01 dev ifb0 self permanent
33:33:00:00:00:01 dev ifb1 self permanent
33:33:00:00:00:01 dev eth0 self permanent
01:00:5e:00:00:01 dev eth0 self permanent
33:33:ff:22:01:01 dev eth0 self permanent
02:00:00:12:01:02 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:05 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:07 dev eth1 self permanent
33:33:00:00:00:01 dev eth1 self permanent
33:33:00:00:00:01 dev gretap0 self permanent
02:00:00:12:01:04 dev br0 vlan 0 master br0 permanent <=== there it is
da:ac:46:27:d9:53 dev sw1-p1 vlan 0 master br0 permanent
33:33:00:00:00:01 dev sw1-p1 self permanent
//can we see it if we filter by bridge?
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show br br0
02:00:00:12:01:02 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:05 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:07 dev eth1 self permanent
33:33:00:00:00:01 dev eth1 self permanent
02:00:00:12:01:04 dev br0 vlan 0 master br0 permanent <=== there it is
da:ac:46:27:d9:53 dev sw1-p1 vlan 0 master br0 permanent
33:33:00:00:00:01 dev sw1-p1 self permanent
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dumping a bridge fdb dumps every fdb entry
held. With this change we are going to filter
on selected bridge port.
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After a bonding master reclaims the netpoll info struct, slaves could
still hold a pointer to the reclaimed data. This patch fixes it: as
soon as netpoll_async_cleanup is called for a slave (eg. when
un-enslaved), we make sure that this slave doesn't point to the data.
Signed-off-by: David Decotigny <decot@googlers.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
load_pointer() is already a static inline function.
Let's move it into filter.h so BPF JIT implementations can reuse this
function.
Since we're exporting this function, let's also rename it to
bpf_load_pointer() for clarity.
Signed-off-by: Zi Shen Lim <zlim.lnx@gmail.com>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A bug was introduced in NETDEV_CHANGE notifier sequence causing the
arp table to be sometimes spuriously cleared (including manual arp
entries marked permanent), upon network link carrier changes.
The changed argument for the notifier was applied only to a single
caller of NETDEV_CHANGE, missing among others netdev_state_change().
So upon net_carrier events induced by the network, which are
triggering a call to netdev_state_change(), arp_netdev_event() would
decide whether to clear or not arp cache based on random/junk stack
values (a kind of read buffer overflow).
Fixes: be9efd3653 ("net: pass changed flags along with NETDEV_CHANGE event")
Fixes: 6c8b4e3ff8 ("arp: flush arp cache on IFF_NOARP change")
Signed-off-by: Loic Prylli <loicp@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add sw_hash flag to skbuff to indicate that skb->hash was computed
from flow_dissector. This flag is checked in skb_get_hash to avoid
repeatedly trying to compute the hash (ie. in the case that no L4 hash
can be computed).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements the receive side to support RFC 6438 which is to
use the flow label as an ECMP hash. If an IPv6 flow label is set
in a packet we can use this as input for computing an L4-hash. There
should be no need to parse any transport headers in this case.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Call standard function to get a packet hash instead of taking this from
skb->sk->sk_hash or only using skb->protocol.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the hash computation located in __skb_get_hash to be a separate
function which takes flow_keys as input. This will allow flow hash
computation in other contexts where we only have addresses and ports.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In process_backlog the input_pkt_queue is only checked once for new
packets and quota is artificially reduced to reflect precisely the
number of packets on the input_pkt_queue so that the loop exits
appropriately.
This patches changes the behavior to be more straightforward and
less convoluted. Packets are processed until either the quota
is met or there are no more packets to process.
This patch seems to provide a small, but noticeable performance
improvement. The performance improvement is a result of staying
in the process_backlog loop longer which can reduce number of IPI's.
Performance data using super_netperf TCP_RR with 200 flows:
Before fix:
88.06% CPU utilization
125/190/309 90/95/99% latencies
1.46808e+06 tps
1145382 intrs.sec.
With fix:
87.73% CPU utilization
122/183/296 90/95/99% latencies
1.4921e+06 tps
1021674.30 intrs./sec.
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This extends the ptp bpf to also match ptp over ip over vlan packets. The ptp
classes are changed to orthogonal bitfields representing version, transport
and vlan values to simplify matching.
Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace two switch statements enumerating all valid ptp classes with an if
statement matching for not PTP_CLASS_NONE.
Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The if_lock()/if_unlock() in next_to_run() adds a significant
overhead, because its called for every packet in busy loop of
pktgen_thread_worker(). (Thomas Graf originally pointed me
at this lock problem).
Removing these two "LOCK" operations should in theory save us approx
16ns (8ns x 2), as illustrated below we do save 16ns when removing
the locks and introducing RCU protection.
Performance data with CLONE_SKB==100000, TX-size=512, rx-usecs=30:
(single CPU performance, ixgbe 10Gbit/s, E5-2630)
* Prev : 5684009 pps --> 175.93ns (1/5684009*10^9)
* RCU-fix: 6272204 pps --> 159.43ns (1/6272204*10^9)
* Diff : +588195 pps --> -16.50ns
To understand this RCU patch, I describe the pktgen thread model
below.
In pktgen there is several kernel threads, but there is only one CPU
running each kernel thread. Communication with the kernel threads are
done through some thread control flags. This allow the thread to
change data structures at a know synchronization point, see main
thread func pktgen_thread_worker().
Userspace changes are communicated through proc-file writes. There
are three types of changes, general control changes "pgctrl"
(func:pgctrl_write), thread changes "kpktgend_X"
(func:pktgen_thread_write), and interface config changes "etcX@N"
(func:pktgen_if_write).
Userspace "pgctrl" and "thread" changes are synchronized via the mutex
pktgen_thread_lock, thus only a single userspace instance can run.
The mutex is taken while the packet generator is running, by pgctrl
"start". Thus e.g. "add_device" cannot be invoked when pktgen is
running/started.
All "pgctrl" and all "thread" changes, except thread "add_device",
communicate via the thread control flags. The main problem is the
exception "add_device", that modifies threads "if_list" directly.
Fortunately "add_device" cannot be invoked while pktgen is running.
But there exists a race between "rem_device_all" and "add_device"
(which normally don't occur, because "rem_device_all" waits 125ms
before returning). Background'ing "rem_device_all" and running
"add_device" immediately allow the race to occur.
The race affects the threads (list of devices) "if_list". The if_lock
is used for protecting this "if_list". Other readers are given
lock-free access to the list under RCU read sections.
Note, interface config changes (via proc) can occur while pktgen is
running, which worries me a bit. I'm assuming proc_remove() takes
appropriate locks, to assure no writers exists after proc_remove()
finish.
I've been running a script exercising the race condition (leading me
to fix the proc_remove order), without any issues. The script also
exercises concurrent proc writes, while the interface config is
getting removed.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid calling set_current_state() inside the busy-loop in
pktgen_thread_worker(). In case of pkt_dev->delay, then it is still
used/enabled in pktgen_xmit() via the spin() call.
The set_current_state(TASK_INTERRUPTIBLE) uses a xchg, which implicit
is LOCK prefixed. I've measured the asm LOCK operation to take approx
8ns on this E5-2630 CPU. Performance increase corrolate with this
measurement.
Performance data with CLONE_SKB==100000, rx-usecs=30:
(single CPU performance, ixgbe 10Gbit/s, E5-2630)
* Prev: 5454050 pps --> 183.35ns (1/5454050*10^9)
* Now: 5684009 pps --> 175.93ns (1/5684009*10^9)
* Diff: +229959 pps --> -7.42ns
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So far, it is assumed that ops->setup is filled up. But there might be
case that ops might make sense even without ->setup. In that case,
forbid to newlink and dellink.
This allows to register simple rtnl link ops containing only ->kind.
That allows consistent way of passing device kind (either device-kind or
slave-kind) to userspace.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 371121057607e3127e19b3fa094330181b5b031e("net:
QDISC_STATE_RUNNING dont need atomic bit ops") the
__QDISC_STATE_RUNNING is renamed to __QDISC___STATE_RUNNING,
but the old names existing in comment are not replaced with
the new name completely.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull SCSI target fixes from Nicholas Bellinger:
"Mostly minor fixes this time around. The highlights include:
- iscsi-target CHAP authentication fixes to enforce explicit key
values (Tejas Vaykole + rahul.rane)
- fix a long-standing OOPs in target-core when a alua configfs
attribute is accessed after port symlink has been removed.
(Sebastian Herbszt)
- fix a v3.10.y iscsi-target regression causing the login reject
status class/detail to be ignored (Christoph Vu-Brugier)
- fix a v3.10.y iscsi-target regression to avoid rejecting an
existing ITT during Data-Out when data-direction is wrong (Santosh
Kulkarni + Arshad Hussain)
- fix a iscsi-target related shutdown deadlock on UP kernels (Mikulas
Patocka)
- fix a v3.16-rc1 build issue with vhost-scsi + !CONFIG_NET (MST)"
* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
iscsi-target: fix iscsit_del_np deadlock on unload
iovec: move memcpy_from/toiovecend to lib/iovec.c
iscsi-target: Avoid rejecting incorrect ITT for Data-Out
tcm_loop: Fix memory leak in tcm_loop_submission_work error path
iscsi-target: Explicily clear login response PDU in exception path
target: Fix left-over se_lun->lun_sep pointer OOPs
iscsi-target; Enforce 1024 byte maximum for CHAP_C key value
iscsi-target: Convert chap_server_compute_md5 to use kstrtoul
ERROR: "memcpy_fromiovecend" [drivers/vhost/vhost_scsi.ko] undefined!
commit 9f977ef7b6
vhost-scsi: Include prot_bytes into expected data transfer length
in target-pending makes drivers/vhost/scsi.c call memcpy_fromiovecend().
This function is not available when CONFIG_NET is not enabled.
socket.h already includes uio.h, so no callers need updating.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Dave Jones reported that a crash is occurring in
csum_partial
tcp_gso_segment
inet_gso_segment
? update_dl_migration
skb_mac_gso_segment
__skb_gso_segment
dev_hard_start_xmit
sch_direct_xmit
__dev_queue_xmit
? dev_hard_start_xmit
dev_queue_xmit
ip_finish_output
? ip_output
ip_output
ip_forward_finish
ip_forward
ip_rcv_finish
ip_rcv
__netif_receive_skb_core
? __netif_receive_skb_core
? trace_hardirqs_on
__netif_receive_skb
netif_receive_skb_internal
napi_gro_complete
? napi_gro_complete
dev_gro_receive
? dev_gro_receive
napi_gro_receive
It looks like a likely culprit is that SKB_GSO_CB()->csum_start is
not set correctly when doing non-scatter gather. We are using
offset as opposed to doffset.
Reported-by: Dave Jones <davej@redhat.com>
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 7e2b10c1e5 ("net: Support for multiple checksums with gso")
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When IP route cache had been removed in linux-3.6, we broke assumption
that dst entries were all freed after rcu grace period. DST_NOCACHE
dst were supposed to be freed from dst_release(). But it appears
we want to keep such dst around, either in UDP sockets or tunnels.
In sk_dst_get() we need to make sure dst refcount is not 0
before incrementing it, or else we might end up freeing a dst
twice.
DST_NOCACHE set on a dst does not mean this dst can not be attached
to a socket or a tunnel.
Then, before actual freeing, we need to observe a rcu grace period
to make sure all other cpus can catch the fact the dst is no longer
usable.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dormando <dormando@rydia.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use kcalloc/kmalloc_array to make it clear we're allocating arrays. No
integer overflow can actually happen here, since len/flen is guaranteed
to be less than BPF_MAXINSNS (4096). However, this changed makes sure
we're not going to get one if BPF_MAXINSNS were ever increased.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the order of the parameters to sk_unattached_filter_create() in
the kerneldoc to reflect the order they appear in the actual function.
This fix is only cosmetic, in the generated doc they still appear in the
correct order without the fix.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems overkill to use vmalloc() for typical listeners with less than
2048 hash buckets. Try kmalloc() and fallback to vmalloc() to reduce TLB
pressure.
Use kvfree() helper as it is now available.
Use ilog2() instead of a loop.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_flow_dissect() dissects only transport header type in ip_proto. It dose not
give any information about IPv4 or IPv6.
This patch adds new member, n_proto, to struct flow_keys. Which records the
IP layer type. i.e IPv4 or IPv6.
This can be used in netdev->ndo_rx_flow_steer driver function to dissect flow.
Adding new member to flow_keys increases the struct size by around 4 bytes.
This causes BUILD_BUG_ON(sizeof(qcb->data) < sz); to fail in
qdisc_cb_private_validate()
So increase data size by 4
Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>