Commit Graph

722509 Commits

Author SHA1 Message Date
Girish Moodalbail
5e54b3c120 macvlan: fix memory hole in macvlan_dev
Move 'macaddr_count' from after 'netpoll' to after 'nest_level' to pack
and reduce a memory hole.

Fixes: 88ca59d1aa (macvlan: remove unused fields in struct macvlan_dev)
Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 14:51:46 -05:00
Marc Kleine-Budde
936e5d8bdf slip: sl_alloc(): remove unused parameter "dev_t line"
The first and only parameter of sl_alloc() is unused, so remove it.

Fixes: 5342b77c41 slip: ("Clean up create and destroy")
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 14:41:02 -05:00
David S. Miller
2deeb49550 Merge branch 'cxgb4-collect-hardware-logs-via-ethtool'
Rahul Lakkireddy says:

====================
cxgb4: collect hardware logs via ethtool

Collect more hardware logs via ethtool --get-dump facility.

Patch 1 collects on-chip memory layout information.

Patch 2 collects on-chip MC memory dumps.

Patch 3 collects HMA memory dump.

Patch 4 evaluates and skips TX and RX payload regions in memory dumps.

Patch 5 collects egress and ingress SGE queue contexts.

Patch 6 collects PCIe configuration logs
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 14:31:51 -05:00
Rahul Lakkireddy
6078ab196b cxgb4: collect PCIe configuration logs
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 14:31:50 -05:00
Rahul Lakkireddy
736c3b9447 cxgb4: collect egress and ingress SGE queue contexts
Use meminfo to identify the egress and ingress context regions and
fetch all valid contexts from those regions. Also flush all contexts
before attempting collection to prevent stale information.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 14:31:50 -05:00
Rahul Lakkireddy
c1219653f3 cxgb4: skip TX and RX payload regions in memory dumps
Use meminfo to identify TX and RX payload regions and skip them in
collection of EDC, MC, and HMA.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 14:31:50 -05:00
Rahul Lakkireddy
4db0401f8a cxgb4: collect HMA memory dump
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 14:31:50 -05:00
Rahul Lakkireddy
a1c69520f7 cxgb4: collect MC memory dump
Use meminfo to get base address and size of MC memory.  Also use same
meminfo for EDC memory dumps.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 14:31:50 -05:00
Rahul Lakkireddy
123e25c4a5 cxgb4: collect on-chip memory information
Collect memory layout of various on-chip memory regions.  Move code
for collecting on-chip memory information to cudbg_lib.c and update
cxgb4_debugfs.c to use the common function.  Also include
cudbg_entity.h before cudbg_lib.h to avoid adding cudbg entity
structure forward declarations in cudbg_lib.h.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 14:31:49 -05:00
David S. Miller
62fd8b189e Merge branch 'veth-and-GSO-maximums'
Stephen Hemminger says:

====================
veth and GSO maximums

This is the more general way to solving the issue of GSO limits
not being set correctly for containers on Azure. If a GSO packet
is sent to host that exceeds the limit (reported by NDIS), then
the host is forced to do segmentation in software which has noticeable
performance impact.

The core rtnetlink infrastructure already has the messages and
infrastructure to allow changing gso limits. With an updated iproute2
the following already works:
  # ip li set dev dummy0 gso_max_size 30000

These patches are about making it easier with veth.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 14:23:00 -05:00
Stephen Hemminger
72d24955b4 veth: set peer GSO values
When new veth is created, and GSO values have been configured
on one device, clone those values to the peer.

For example:
   # ip link add dev vm1 gso_max_size 65530 type veth peer name vm2

This should create vm1 <--> vm2 with both having GSO maximum
size set to 65530.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 14:22:59 -05:00
Stephen Hemminger
46e6b992c2 rtnetlink: allow GSO maximums to be set on device creation
Netlink device already allows changing GSO sizes with
ip set command. The part that is missing is allowing overriding
GSO settings on device creation.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 14:22:59 -05:00
Niklas Cassel
5a6a0445d1 net: stmmac: fix broken dma_interrupt handling for multi-queues
There is nothing that says that number of TX queues == number of RX
queues. E.g. the ARTPEC-6 SoC has 2 TX queues and 1 RX queue.

This code is obviously wrong:
for (chan = 0; chan < tx_channel_count; chan++) {
    struct stmmac_rx_queue *rx_q = &priv->rx_queue[chan];

priv->rx_queue has size MTL_MAX_RX_QUEUES, so this will send an
uninitialized napi_struct to __napi_schedule(), causing us to
crash in net_rx_action(), because napi_struct->poll is zero.

[12846.759880] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[12846.768014] pgd = (ptrval)
[12846.770742] [00000000] *pgd=39ec7831, *pte=00000000, *ppte=00000000
[12846.777023] Internal error: Oops: 80000007 [#1] PREEMPT SMP ARM
[12846.782942] Modules linked in:
[12846.785998] CPU: 0 PID: 161 Comm: dropbear Not tainted 4.15.0-rc2-00285-gf5fb5f2f39a7 #36
[12846.794177] Hardware name: Axis ARTPEC-6 Platform
[12846.798879] task: (ptrval) task.stack: (ptrval)
[12846.803407] PC is at 0x0
[12846.805942] LR is at net_rx_action+0x274/0x43c
[12846.810383] pc : [<00000000>]    lr : [<80bff064>]    psr: 200e0113
[12846.816648] sp : b90d9ae8  ip : b90d9ae8  fp : b90d9b44
[12846.821871] r10: 00000008  r9 : 0013250e  r8 : 00000100
[12846.827094] r7 : 0000012c  r6 : 00000000  r5 : 00000001  r4 : bac84900
[12846.833619] r3 : 00000000  r2 : b90d9b08  r1 : 00000000  r0 : bac84900

Since each DMA channel can be used for rx and tx simultaneously,
the current code should probably be rewritten so that napi_struct is
embedded in a new struct stmmac_channel.
That way, stmmac_poll() can call stmmac_tx_clean() on just the tx queue
where we got the IRQ, instead of looping through all tx queues.
This is also how the xgbe driver does it (another driver for this IP).

Fixes: c22a3f48ef ("net: stmmac: adding multiple napi mechanism")
Signed-off-by: Niklas Cassel <niklas.cassel@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 14:18:40 -05:00
Egil Hjelmeland
2e8d243e88 net: dsa: lan9303: Protect ALR operations with mutex
ALR table operations are a sequence of related register operations which
should be protected from concurrent access. The alr_cache should also be
protected. Add alr_mutex doing that.

Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 14:12:33 -05:00
Jiri Pirko
df45bf84e4 net: sched: fix use-after-free in tcf_block_put_ext
Since the block is freed with last chain being put, once we reach the
end of iteration of list_for_each_entry_safe, the block may be
already freed. I'm hitting this only by creating and deleting clsact:

[  202.171952] ==================================================================
[  202.180182] BUG: KASAN: use-after-free in tcf_block_put_ext+0x240/0x390
[  202.187590] Read of size 8 at addr ffff880225539a80 by task tc/796
[  202.194508]
[  202.196185] CPU: 0 PID: 796 Comm: tc Not tainted 4.15.0-rc2jiri+ #5
[  202.203200] Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016
[  202.213613] Call Trace:
[  202.216369]  dump_stack+0xda/0x169
[  202.220192]  ? dma_virt_map_sg+0x147/0x147
[  202.224790]  ? show_regs_print_info+0x54/0x54
[  202.229691]  ? tcf_chain_destroy+0x1dc/0x250
[  202.234494]  print_address_description+0x83/0x3d0
[  202.239781]  ? tcf_block_put_ext+0x240/0x390
[  202.244575]  kasan_report+0x1ba/0x460
[  202.248707]  ? tcf_block_put_ext+0x240/0x390
[  202.253518]  tcf_block_put_ext+0x240/0x390
[  202.258117]  ? tcf_chain_flush+0x290/0x290
[  202.262708]  ? qdisc_hash_del+0x82/0x1a0
[  202.267111]  ? qdisc_hash_add+0x50/0x50
[  202.271411]  ? __lock_is_held+0x5f/0x1a0
[  202.275843]  clsact_destroy+0x3d/0x80 [sch_ingress]
[  202.281323]  qdisc_destroy+0xcb/0x240
[  202.285445]  qdisc_graft+0x216/0x7b0
[  202.289497]  tc_get_qdisc+0x260/0x560

Fix this by holding the block also by chain 0 and put chain 0
explicitly, out of the list_for_each_entry_safe loop at the very
end of tcf_block_put_ext.

Fixes: efbf789739 ("net_sched: get rid of rcu_barrier() in tcf_block_put_ext()")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 14:09:08 -05:00
David S. Miller
fc8b81a598 Merge branch 'lockless-qdisc-series'
John Fastabend says:

====================
lockless qdisc series

This series adds support for building lockless qdiscs. This is
the result of noticing the qdisc lock is a common hot-spot in
perf analysis of the Linux network stack, especially when testing
with high packet per second rates. However, nothing is free and
most qdiscs rely on the qdisc lock for their data structures so
each qdisc must be converted on a case by case basis. In this
series, to kick things off, we make pfifo_fast, mq, and mqprio
lockless. Follow up series can address additional qdiscs as needed.
For example sch_tbf might be useful. To allow this the lockless
design is an opt-in flag. In some future utopia we convert all
qdiscs and we get to drop this case analysis, but in order to
make progress we live in the real-world.

There are also a handful of optimizations I have behind this
series and a few code cleanups that I couldn't figure out how
to fit neatly into this series with out increasing the patch
count. Once this is in additional patches can address this. The
most notable is in skb_dequeue we can push the consumer lock
out a bit and consume multiple skbs off the skb_array in pfifo
fast per iteration. Ideally we could push arrays of packets at
drivers as well but we would need the infrastructure for this.
The other notable improvement is to do less locking in the
overrun cases where bad tx queue list and gso_skb are being
hit. Although, nice in theory in practice this is the error
case and I haven't found a benchmark where this matters yet.

For testing...

My first test case uses multiple containers (via cilium) where
multiple client containers use 'wrk' to benchmark connections with
a server container running lighttpd. Where lighttpd is configured
to use multiple threads, one per core. Additionally this test has
a proxy agent running so all traffic takes an extra hop through a
proxy container. In these cases each TCP packet traverses the egress
qdisc layer at least four times and the ingress qdisc layer an
additional four times. This makes for a good stress test IMO, perf
details below.

The other micro-benchmark I run is injecting packets directly into
qdisc layer using pktgen. This uses the benchmark script,

 ./pktgen_bench_xmit_mode_queue_xmit.sh

Benchmarks taken in two cases, "base" running latest net-next no
changes to qdisc layer and "qdisc" tests run with qdisc lockless
updates. Numbers reported in req/sec. All virtual 'veth' devices
run with pfifo_fast in the qdisc test case.

`wrk -t16 -c $conns -d30 "http://[$SERVER_IP4]:80"`

conns    16      32     64   1024
-----------------------------------------------
base:   18831  20201  21393  29151
qdisc:  19309  21063  23899  29265

notice in all cases we see performance improvement when running
with qdisc case.

Microbenchmarks using pktgen are as follows,

`pktgen_bench_xmit_mode_queue_xmit.sh -t 1 -i eth2 -c 20000000

base(mq):          2.1Mpps
base(pfifo_fast):  2.1Mpps
qdisc(mq):         2.6Mpps
qdisc(pfifo_fast): 2.6Mpps

notice numbers are the same for mq and pfifo_fast because only
testing a single thread here. In both tests we see a nice bump
in performance gain. The key with 'mq' is it is already per
txq ring so contention is minimal in the above cases. Qdiscs
such as tbf or htb which have more contention will likely show
larger gains when/if lockless versions are implemented.

Thanks to everyone who helped with this work especially Daniel
Borkmann, Eric Dumazet and Willem de Bruijn for discussing the
design and reviewing versions of the code.

Changes from the RFC: dropped a couple patches off the end,
fixed a bug with skb_queue_walk_safe not unlinking skb in all
cases, fixed a lockdep splat with pfifo_fast_destroy not calling
*_bh lock variant, addressed _most_ of Willem's comments, there
was a bug in the bulk locking (final patch) of the RFC series.

@Willem, I left out lockdep annotation for a follow on series
to add lockdep more completely, rather than just in code I
touched.

Comments and feedback welcome.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:27 -05:00
John Fastabend
c5ad119fb6 net: sched: pfifo_fast use skb_array
This converts the pfifo_fast qdisc to use the skb_array data structure
and set the lockless qdisc bit. pfifo_fast is the first qdisc to support
the lockless bit that can be a child of a qdisc requiring locking. So
we add logic to clear the lock bit on initialization in these cases when
the qdisc graft operation occurs.

This also removes the logic used to pick the next band to dequeue from
and instead just checks a per priority array for packets from top priority
to lowest. This might need to be a bit more clever but seems to work
for now.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:26 -05:00
John Fastabend
4a86a4cf68 net: skb_array: expose peek API
This adds a peek routine to skb_array.h for use with qdisc.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:26 -05:00
John Fastabend
ce679e8df7 net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio
The sch_mqprio qdisc creates a sub-qdisc per tx queue which are then
called independently for enqueue and dequeue operations. However
statistics are aggregated and pushed up to the "master" qdisc.

This patch adds support for any of the sub-qdiscs to be per cpu
statistic qdiscs. To handle this case add a check when calculating
stats and aggregate the per cpu stats if needed.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:26 -05:00
John Fastabend
b01ac095c7 net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq
The sch_mq qdisc creates a sub-qdisc per tx queue which are then
called independently for enqueue and dequeue operations. However
statistics are aggregated and pushed up to the "master" qdisc.

This patch adds support for any of the sub-qdiscs to be per cpu
statistic qdiscs. To handle this case add a check when calculating
stats and aggregate the per cpu stats if needed.

Also exports __gnet_stats_copy_queue() to use as a helper function.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:26 -05:00
John Fastabend
7e66016f2c net: sched: helpers to sum qlen and qlen for per cpu logic
Add qdisc qlen helper routines for lockless qdiscs to use.

The qdisc qlen is no longer used in the hotpath but it is reported
via stats query on the qdisc so it still needs to be tracked. This
adds the per cpu operations needed along with a helper to return
the summation of per cpu stats.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:26 -05:00
John Fastabend
fd8e8d1a77 net: sched: check for frozen queue before skb_bad_txq check
I can not think of any reason to pull the bad txq skb off the qdisc if
the txq we plan to send this on is still frozen. So check for frozen
queue first and abort before dequeuing either skb_bad_txq skb or
normal qdisc dequeue() skb.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:26 -05:00
John Fastabend
70e57d5e3f net: sched: use skb list for skb_bad_tx
Similar to how gso is handled use skb list for skb_bad_tx this is
required with lockless qdiscs because we may have multiple cores
attempting to push skbs into skb_bad_tx concurrently

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:26 -05:00
John Fastabend
7bbde83b18 net: sched: drop qdisc_reset from dev_graft_qdisc
In qdisc_graft_qdisc a "new" qdisc is attached and the 'qdisc_destroy'
operation is called on the old qdisc. The destroy operation will wait
a rcu grace period and call qdisc_rcu_free(). At which point
gso_cpu_skb is free'd along with all stats so no need to zero stats
and gso_cpu_skb from the graft operation itself.

Further after dropping the qdisc locks we can not continue to call
qdisc_reset before waiting an rcu grace period so that the qdisc is
detached from all cpus. By removing the qdisc_reset() here we get
the correct property of waiting an rcu grace period and letting the
qdisc_destroy operation clean up the qdisc correctly.

Note, a refcnt greater than 1 would cause the destroy operation to
be aborted however if this ever happened the reference to the qdisc
would be lost and we would have a memory leak.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:25 -05:00
John Fastabend
a53851e2c3 net: sched: explicit locking in gso_cpu fallback
This work is preparing the qdisc layer to support egress lockless
qdiscs. If we are running the egress qdisc lockless in the case we
overrun the netdev, for whatever reason, the netdev returns a busy
error code and the skb is parked on the gso_skb pointer. With many
cores all hitting this case at once its possible to have multiple
sk_buffs here so we turn gso_skb into a queue.

This should be the edge case and if we see this frequently then
the netdev/qdisc layer needs to back off.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:25 -05:00
John Fastabend
d59f5ffa59 net: sched: a dflt qdisc may be used with per cpu stats
Enable dflt qdisc support for per cpu stats before this patch a dflt
qdisc was required to use the global statistics qstats and bstats.

This adds a static flags field to qdisc_ops that is propagated
into qdisc->flags in qdisc allocate call. This allows the allocation
block to completely allocate the qdisc object so we don't have
dangling allocations after qdisc init.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:25 -05:00
John Fastabend
40bd036219 net: sched: provide per cpu qstat helpers
The per cpu qstats support was added with per cpu bstat support which
is currently used by the ingress qdisc. This patch adds a set of
helpers needed to make other qdiscs that use qstats per cpu as well.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:25 -05:00
John Fastabend
29b86cdac0 net: sched: remove remaining uses for qdisc_qlen in xmit path
sch_direct_xmit() uses qdisc_qlen as a return value but all call sites
of the routine only check if it is zero or not. Simplify the logic so
that we don't need to return an actual queue length value.

This introduces a case now where sch_direct_xmit would have returned
a qlen of zero but now it returns true. However in this case all
call sites of sch_direct_xmit will implement a dequeue() and get
a null skb and abort. This trades tracking qlen in the hotpath for
an extra dequeue operation. Overall this seems to be good for
performance.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:25 -05:00
John Fastabend
6b3ba9146f net: sched: allow qdiscs to handle locking
This patch adds a flag for queueing disciplines to indicate the stack
does not need to use the qdisc lock to protect operations. This can
be used to build lockless scheduling algorithms and improving
performance.

The flag is checked in the tx path and the qdisc lock is only taken
if it is not set. For now use a conditional if statement. Later we
could be more aggressive if it proves worthwhile and use a static key
or wrap this in a likely().

Also the lockless case drops the TCQ_F_CAN_BYPASS logic. The reason
for this is synchronizing a qlen counter across threads proves to
cost more than doing the enqueue/dequeue operations when tested with
pktgen.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:25 -05:00
John Fastabend
6c148184b5 net: sched: cleanup qdisc_run and __qdisc_run semantics
Currently __qdisc_run calls qdisc_run_end() but does not call
qdisc_run_begin(). This makes it hard to track pairs of
qdisc_run_{begin,end} across function calls.

To simplify reading these code paths this patch moves begin/end calls
into qdisc_run().

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:25 -05:00
Toshiaki Makita
fdaa767aef virtio_net: Disable interrupts if napi_complete_done rescheduled napi
Since commit 39e6c8208d ("net: solve a NAPI race") napi has been able
to be rescheduled within napi_complete_done() even in non-busypoll case,
but virtnet_poll() always enabled interrupts before complete, and when
napi was rescheduled within napi_complete_done() it did not disable
interrupts.
This caused more interrupts when event idx is disabled.

According to commit cbdadbbf0c ("virtio_net: fix race in RX VQ
processing") we cannot place virtqueue_enable_cb_prepare() after
NAPI_STATE_SCHED is cleared, so disable interrupts again if
napi_complete_done() returned false.

Tested with vhost-user of OVS 2.7 on host, which does not have the event
idx feature.

* Before patch:

$ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.150.253 () port 0 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1472   60.00     32763206      0    6430.32
212992           60.00     23384299           4589.56

Interrupts on guest: 9872369
Packets/interrupt:   2.37

* After patch

$ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.150.253 () port 0 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1472   60.00     32794646      0    6436.49
212992           60.00     32793501           6436.27

Interrupts on guest: 4941299
Packets/interrupt:   6.64

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:18:40 -05:00
David S. Miller
62cd277039 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2017-12-07

The following pull-request contains BPF updates for your net-next tree.

The main changes are:

1) Detailed documentation of BPF development process from Daniel.

2) Addition of is_fullsock, snd_cwnd and srtt_us fields to bpf_sock_ops
   from Lawrence.

3) Minor follow up for bpf_skb_set_tunnel_key() from William.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 10:48:25 -05:00
Jason Wang
124da8f611 tuntap: fix possible deadlock when fail to register netdev
Private destructor could be called when register_netdev() fail with
rtnl lock held. This will lead deadlock in tun_free_netdev() who tries
to hold rtnl_lock. Fixing this by switching to use spinlock to
synchronize.

Fixes: 96f8406162 ("tun: add eBPF based queue selection method")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 10:44:27 -05:00
David S. Miller
66c5c5b566 Merge branch 'smc-fixes-next'
Ursula Braun says:

====================
smc: fixes 2017-12-07

here are some smc-patches. The initial 4 patches are cleanups.
Patch 5 gets rid of ib_post_sends in tasklet context to avoid peer drops due
to out-of-order receivals.
Patch 6 makes sure, the Linux SMC code understands variable sized CLC proposal
messages built according to RFC7609.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-07 15:03:13 -05:00
Ursula Braun
e7b7a64a84 smc: support variable CLC proposal messages
According to RFC7609 [1] the CLC proposal message contains an area of
unknown length for future growth. Additionally it may contain up to
8 IPv6 prefixes. The current version of the SMC-code does not
understand CLC proposal messages using these variable length fields and,
thus, is incompatible with SMC implementations in other operating
systems.

This patch makes sure, SMC understands incoming CLC proposals
* with arbitrary length values for future growth
* with up to 8 IPv6 prefixes

[1] SMC-R Informational RFC: http://www.rfc-editor.org/info/rfc7609

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Hans Wippel <hwippel@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-07 15:03:12 -05:00
Ursula Braun
6b5771aa3c smc: no consumer update in tasklet context
The SMC protocol requires to send a separate consumer cursor update,
if it cannot be piggybacked to updates of the producer cursor.
When receiving a blocked signal from the sender, this update is sent
already in tasklet context. In addition consumer cursor updates are
sent after data receival.
Sending of cursor updates is controlled by sequence numbers.
Assuming receiving stray messages the receiver drops updates with older
sequence numbers than an already received cursor update with a higher
sequence number.
Sending consumer cursor updates in tasklet context may result in
wrong order sends and its corresponding drops at the receiver. Since
it is sufficient to send consumer cursor updates once the data is
received, this patch gets rid of the consumer cursor update in tasklet
context to guarantee in-sequence arrival of cursor updates.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-07 15:03:12 -05:00
Ursula Braun
71c125c3f2 smc: cleanup close checking during data receival
When waiting for data to be received it must be checked if the
peer signals shutdown. The SMC code uses two different checks
for this purpose, even though just one check is sufficient.
This patch removes the superfluous test for SOCK_DONE.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-07 15:03:12 -05:00
Ursula Braun
4bd3e7fbfa smc: no update for unused sk_write_pending
The smc code never checks the sk_write_pending sock field.
Thus there is no need to update it.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-07 15:03:12 -05:00
Ursula Braun
0c9f1515aa smc: improve smc_clc_send_decline() error handling
Let smc_clc_send_decline() return with an error, if the amount
sent is smaller than the length of an smc decline message.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-07 15:03:12 -05:00
Ursula Braun
a8ae890b9c smc: make smc_close_active_abort() static
smc_close_active_abort() is used in smc_close.c only.
Make it static.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-07 15:03:12 -05:00
Florian Fainelli
2a93c1a365 net: dsa: Allow compiling out legacy support
Introduce a configuration option: CONFIG_NET_DSA_LEGACY allowing to compile out
support for the old platform device and Device Tree binding registration.
Support for these configurations is scheduled to be removed in 4.17.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-07 14:14:54 -05:00
Michael Chan
a8168b6cee bnxt_en: Don't print "Link speed -1 no longer supported" messages.
On some dual port NICs, the 2 ports have to be configured with compatible
link speeds.  Under some conditions, a port's configured speed may no
longer be supported.  The firmware will send a message to the driver
when this happens.

Improve this logic that prints out the warning by only printing it if
we can determine the link speed that is no longer supported.  If the
speed is unknown or it is in autoneg mode, skip the warning message.

Reported-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Tested-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-07 13:35:56 -05:00
Alexei Starovoitov
47adc5b07c Merge branch 'bpf-devel-doc'
Daniel Borkmann says:

====================
Few BPF doc updates

Two changes, i) add BPF trees into maintainers file, and ii) add
a BPF doc around the development process, similarly as we have
with netdev FAQ, but just describing BPF specifics. Thanks!
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2017-12-06 14:35:04 -08:00
Daniel Borkmann
34f15bf31e bpf, doc: add faq about bpf development process
In the same spirit of netdev FAQ, start a BPF FAQ as a collection
of expectations and/or workflow details in the context of BPF patch
processing.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2017-12-06 14:32:42 -08:00
Daniel Borkmann
5e01929ff0 bpf, doc: add bpf trees and tps to maintainers entry
i) Add the bpf and bpf-next trees to the maintainers entry
   so they can be found easily and picked up by test bots
   etc that would integrate all trees from maintainers file.
   Suggested by Stephen while integrating the trees into
   linux-next.

ii) Add the two headers defining BPF/XDP tracepoints to the
    list of files as well.

Suggested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2017-12-06 14:32:42 -08:00
Gao Feng
24e5992a6b ipvlan: Eliminate duplicated codes with existing function
The recv flow of ipvlan l2 mode performs as same as l3 mode for
non-multicast packet, so use the existing func ipvlan_handle_mode_l3
instead of these duplicated statements in non-multicast case.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-06 15:15:29 -05:00
Jiri Pirko
9454d9307e mlxsw: spectrum: handle NETIF_F_HW_TC changes correctly
Currently, whenever the NETIF_F_HW_TC feature changes, we silently
always allow it, but we actually do not disable the flows in HW
on disable. That breaks user's expectations. So just forbid
the feature disable in case there are any filters offloaded.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-06 15:11:17 -05:00
Willem de Bruijn
cc166427dc tun: avoid unnecessary READ_ONCE in tun_net_xmit
The statement no longer serves a purpose.

Commit fa35864e0b ("tuntap: Fix for a race in accessing numqueues")
added the ACCESS_ONCE to avoid a race condition with skb_queue_len.

Commit 436accebb5 ("tuntap: remove unnecessary sk_receive_queue
length check during xmit") removed the affected skb_queue_len check.

Commit 96f8406162 ("tun: add eBPF based queue selection method")
split the function, reading the field a second time in the callee.
The temp variable is now only read once, so just remove it.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-06 15:09:13 -05:00
Prashant Bhole
d9b8693783 rds: debug: fix null check on static array
t_name cannot be NULL since it is an array field of a struct.
Replacing null check on static array with string length check using
strnlen()

Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-06 15:02:48 -05:00
Cong Wang
4bee00eb25 act_mirred: get rid of mirred_list_lock spinlock
TC actions are no longer freed in RCU callbacks and we should
always have RTNL lock, so this spinlock is no longer needed.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-06 14:50:13 -05:00