Commit Graph

480731 Commits

Author SHA1 Message Date
Anton Blanchard
63af52629a powerpc: Simplify do_sigbus
Exit out early for a kernel fault, avoiding indenting of
most of the function.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-10-02 17:15:09 +10:00
David S. Miller
be34c4ef69 crypto: sha - Handle unaligned input data in generic sha256 and sha512.
Like SHA1, use get_unaligned_be*() on the raw input data.

Reported-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-10-02 14:52:37 +08:00
Alexander Duyck
c9d4994084 fm10k: Correctly set the number of Tx queues
The number of Tx queues was not being updated due to some issues when
generating the patches.  This change makes sure to add the lines necessary
to update the number of Tx queues correctly.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2014-10-01 23:42:55 -07:00
Alexander Duyck
fd33396206 fm10k: Reduce buffer size when pages are larger than 4K
This change reduces the buffer size to 2K for all page sizes.  The basic
idea is that since most frames only have a 1500 MTU supporting a buffer
size larger than this is somewhat wasteful.  As such I have reduced the
size to 2K for all page sizes which will allow for more uses per page.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2014-10-01 23:42:01 -07:00
Mathias Krause
5cfed7b335 Revert "crypto: aesni - disable "by8" AVX CTR optimization"
This reverts commit 7da4b29d49.

Now, that the issue is fixed, we can re-enable the code.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Chandramouli Narayanan <mouli@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-10-02 14:40:28 +08:00
Herbert Xu
9561dccb45 Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Merging the crypto tree for 3.17 to pull in the "by8" AVX CTR revert.
2014-10-02 14:37:20 +08:00
Mathias Krause
e3b3bb5ac1 crypto: aesni - remove unused defines in "by8" variant
The defines for xkey3, xkey6 and xkey9 are not used in the code. They're
probably left overs from merging the three source files for 128, 192 and
256 bit AES. They can safely be removed.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Chandramouli Narayanan <mouli@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-10-02 14:35:03 +08:00
Mathias Krause
80dca4734b crypto: aesni - fix counter overflow handling in "by8" variant
The "by8" CTR AVX implementation fails to propperly handle counter
overflows. That was the reason it got disabled in commit 7da4b29d49
("crypto: aesni - disable "by8" AVX CTR optimization").

Fix the overflow handling by incrementing the counter block as a double
quad word, i.e. a 128 bit, and testing for overflows afterwards. We need
to use VPTEST to do so as VPADD* does not set the flags itself and
silently drops the carry bit.

As this change adds branches to the hot path, minor performance
regressions  might be a side effect. But, OTOH, we now have a conforming
implementation -- the preferable goal.

A tcrypt test on a SandyBridge system (i7-2620M) showed almost identical
numbers for the old and this version with differences within the noise
range. A dm-crypt test with the fixed version gave even slightly better
results for this version. So the performance impact might not be as big
as expected.

Tested-by: Romain Francoise <romain@orebokech.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Chandramouli Narayanan <mouli@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-10-02 14:35:03 +08:00
Sudip Mukherjee
7a1ae9c0ce hwrng: printk replacement
as pr_* macros are more preffered over printk, so printk replaced with corresponding pr_* macros

Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-10-02 14:35:00 +08:00
Anton Blanchard
e35735b9a5 powerpc: Speed up clear_page by unrolling it
Unroll clear_page 8 times. A simple microbenchmark which
allocates and frees a zeroed page:

for (i = 0; i < iterations; i++) {
	unsigned long p = __get_free_page(GFP_KERNEL | __GFP_ZERO);
	free_page(p);
}

improves 20% on POWER8.

This assumes cacheline sizes won't grow beyond 512 bytes or
page sizes wont drop below 1kB, which is unlikely, but we could
add a runtime check during early init if it makes people nervous.

Michael found that some versions of gcc produce quite bad code
(all multiplies), so we give gcc a hand by using shifts and adds.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-10-02 16:04:21 +10:00
Dave Airlie
19318c063b Merge branch 'linux-3.17' of git://anongit.freedesktop.org/git/nouveau/linux-2.6 into drm-fixes
A few regression fixes, the runpm ones dating back to 3.15.  Also a fairly severe TMDS regression that effected a lot of GF8/9/GT2xx users.

* 'linux-3.17' of git://anongit.freedesktop.org/git/nouveau/linux-2.6:
  drm/nouveau: make sure display hardware is reinitialised on runtime resume
  drm/nouveau: punt fbcon resume out to a workqueue
  drm/nouveau: fix regression on original nv50 board
  drm/nv50/disp: fix dpms regression on certain boards
2014-10-02 14:48:20 +10:00
Linus Torvalds
50dddff3cb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Don't halt the firmware in r8152 driver, from Hayes Wang.

 2) Handle full sized 802.1ad frames in bnx2 and tg3 drivers properly,
    from Vlad Yasevich.

 3) Don't sleep while holding tx_clean_lock in netxen driver, fix from
    Manish Chopra.

 4) Certain kinds of ipv6 routes can end up endlessly failing the route
    validation test, causing it to be re-looked up over and over again.
    This particularly kills input route caching in TCP sockets.  Fix
    from Hannes Frederic Sowa.

 5) netvsc_start_xmit() has a use-after-free access to skb->len, fix
    from K Y Srinivasan.

 6) Fix matching of inverted containers in ematch module, from Ignacy
    Gawędzki.

 7) Aggregation of GRO frames via SKB ->frag_list for linear skbs isn't
    handled properly, regression fix from Eric Dumazet.

 8) Don't test return value of ipv4_neigh_lookup(), which returns an
    error pointer, against NULL.  From WANG Cong.

 9) Fix an old regression where we mistakenly allow a double add of the
    same tunnel.  Fixes from Steffen Klassert.

10) macvtap device delete and open can run in parallel and corrupt lists
    etc., fix from Vlad Yasevich.

11) Fix build error with IPV6=m NETFILTER_XT_TARGET_TPROXY=y, from Pablo
    Neira Ayuso.

12) rhashtable_destroy() triggers lockdep splats, fix also from Pablo.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (32 commits)
  bna: Update Maintainer Email
  r8152: disable power cut for RTL8153
  r8152: remove clearing bp
  bnx2: Correctly receive full sized 802.1ad fragmes
  tg3: Allow for recieve of full-size 8021AD frames
  r8152: fix setting RTL8152_UNPLUG
  netxen: Fix bug in Tx completion path.
  netxen: Fix BUG "sleeping function called from invalid context"
  ipv6: remove rt6i_genid
  hyperv: Fix a bug in netvsc_start_xmit()
  net: stmmac: fix stmmac_pci_probe failed when CONFIG_HAVE_CLK is selected
  ematch: Fix matching of inverted containers.
  gro: fix aggregation for skb using frag_list
  neigh: check error pointer instead of NULL for ipv4_neigh_lookup()
  ip6_gre: Return an error when adding an existing tunnel.
  ip6_vti: Return an error when adding an existing tunnel.
  ip6_tunnel: Return an error when adding an existing tunnel.
  ip6gre: add a rtnl link alias for ip6gretap
  net/mlx4_core: Allow not to specify probe_vf in SRIOV IB mode
  r8152: fix the carrier off when autoresuming
  ...
2014-10-01 21:29:06 -07:00
NeilBrown
8e0e99ba64 md/raid5: disable 'DISCARD' by default due to safety concerns.
It has come to my attention (thanks Martin) that 'discard_zeroes_data'
is only a hint.  Some devices in some cases don't do what it
says on the label.

The use of DISCARD in RAID5 depends on reads from discarded regions
being predictably zero.  If a write to a previously discarded region
performs a read-modify-write cycle it assumes that the parity block
was consistent with the data blocks.  If all were zero, this would
be the case.  If some are and some aren't this would not be the case.
This could lead to data corruption after a device failure when
data needs to be reconstructed from the parity.

As we cannot trust 'discard_zeroes_data', ignore it by default
and so disallow DISCARD on all raid4/5/6 arrays.

As many devices are trustworthy, and as there are benefits to using
DISCARD, add a module parameter to over-ride this caution and cause
DISCARD to work if discard_zeroes_data is set.

If a site want to enable DISCARD on some arrays but not on others they
should select DISCARD support at the filesystem level, and set the
raid456 module parameter.
    raid456.devices_handle_discard_safely=Y

As this is a data-safety issue, I believe this patch is suitable for
-stable.
DISCARD support for RAID456 was added in 3.7

Cc: Shaohua Li <shli@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Heinz Mauelshagen <heinzm@redhat.com>
Cc: stable@vger.kernel.org (3.7+)
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Fixes: 620125f2bf
Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-02 13:45:00 +10:00
Ben Skeggs
6fbb702e27 drm/nouveau: make sure display hardware is reinitialised on runtime resume
Linus commit 05c63c2ff2 modified the
runtime suspend/resume paths to skip over display-related tasks to
avoid locking issues on resume.

Unfortunately, this resulted in the display hardware being left in
a partially initialised state, preventing subsequent modesets from
completing.

This commit unifies the (many) suspend/resume paths, bringing back
display (and fbcon) handling in the runtime paths.

Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
2014-10-02 13:32:24 +10:00
Ben Skeggs
634ffcccfb drm/nouveau: punt fbcon resume out to a workqueue
Preparation for some runtime pm fixes.  Currently we skip over fbcon
suspend/resume in the runtime path, which causes issues on resume if
fbcon tries to write to the framebuffer before the BAR subdev has
been resumed to restore the BAR1 VM setup.

As we might be woken up via a sysfs connector, we are unable to call
fb_set_suspend() in the resume path as it could make its way down to
a modeset and cause all sorts of locking hilarity.

To solve this, we'll just delay the fbcon resume to a workqueue.

Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
2014-10-02 13:32:24 +10:00
Ben Skeggs
f2f9a2cbaf drm/nouveau: fix regression on original nv50 board
Xorg (and any non-DRM client really) doesn't have permission to directly
touch VRAM on nv50 and up, which the fence code prior to g84 depends on.

It's less invasive to temporarily grant it premission to do so, as it
previously did, than it is to rework fencenv50 to use the VM.  That
will come later on.

Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
2014-10-02 13:32:24 +10:00
Ben Skeggs
5838ae610f drm/nv50/disp: fix dpms regression on certain boards
Reported in fdo#82527 comment #2.

Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
2014-10-02 13:32:23 +10:00
Rasesh Mody
439e9575e7 bna: Update Maintainer Email
Update the maintainer email for BNA driver.

Signed-off-by: Rasesh Mody <rasesh.mody@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 22:13:41 -04:00
Petri Gynther
d068b02cfd net: phy: add BCM7425 and BCM7429 PHYs
Signed-off-by: Petri Gynther <pgynther@google.com>
Acked-by: Florian Fainelli <f.fainelli@gmai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 22:12:48 -04:00
Petri Gynther
bc23333ba1 net: bcmgenet: fix bcmgenet_put_tx_csum()
bcmgenet_put_tx_csum() needs to return skb pointer back to the caller
because it reallocates a new one in case of lack of skb headroom.

Signed-off-by: Petri Gynther <pgynther@google.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 22:11:49 -04:00
Alexei Starovoitov
38b2cf2982 net: pktgen: packet bursting via skb->xmit_more
This patch demonstrates the effect of delaying update of HW tailptr.
(based on earlier patch by Jesper)

burst=1 is the default. It sends one packet with xmit_more=false
burst=2 sends one packet with xmit_more=true and
        2nd copy of the same packet with xmit_more=false
burst=3 sends two copies of the same packet with xmit_more=true and
        3rd copy with xmit_more=false

Performance with ixgbe (usec 30):
burst=1  tx:9.2 Mpps
burst=2  tx:13.5 Mpps
burst=3  tx:14.5 Mpps full 10G line rate

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 22:08:12 -04:00
Florian Fainelli
775dd692bd net: bridge: add a br_set_state helper function
In preparation for being able to propagate port states to e.g: notifiers
or other kernel parts, do not manipulate the port state directly, but
instead use a helper function which will allow us to do a bit more than
just setting the state.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 22:03:50 -04:00
WANG Cong
a0efb80ce3 net_sched: avoid calling tcf_unbind_filter() in call_rcu callback
This fixes the following crash:

[   63.976822] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[   63.980094] CPU: 1 PID: 15 Comm: ksoftirqd/1 Not tainted 3.17.0-rc6+ #648
[   63.980094] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   63.980094] task: ffff880117dea690 ti: ffff880117dfc000 task.ti: ffff880117dfc000
[   63.980094] RIP: 0010:[<ffffffff817e6d07>]  [<ffffffff817e6d07>] u32_destroy_key+0x27/0x6d
[   63.980094] RSP: 0018:ffff880117dffcc0  EFLAGS: 00010202
[   63.980094] RAX: ffff880117dea690 RBX: ffff8800d02e0820 RCX: 0000000000000000
[   63.980094] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 6b6b6b6b6b6b6b6b
[   63.980094] RBP: ffff880117dffcd0 R08: 0000000000000000 R09: 0000000000000000
[   63.980094] R10: 00006c0900006ba8 R11: 00006ba100006b9d R12: 0000000000000001
[   63.980094] R13: ffff8800d02e0898 R14: ffffffff817e6d4d R15: ffff880117387a30
[   63.980094] FS:  0000000000000000(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
[   63.980094] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   63.980094] CR2: 00007f07e6732fed CR3: 000000011665b000 CR4: 00000000000006e0
[   63.980094] Stack:
[   63.980094]  ffff88011a9cd300 ffffffff82051ac0 ffff880117dffce0 ffffffff817e6d68
[   63.980094]  ffff880117dffd70 ffffffff810cb4c7 ffffffff810cb3cd ffff880117dfffd8
[   63.980094]  ffff880117dea690 ffff880117dea690 ffff880117dfffd8 000000000000000a
[   63.980094] Call Trace:
[   63.980094]  [<ffffffff817e6d68>] u32_delete_key_freepf_rcu+0x1b/0x1d
[   63.980094]  [<ffffffff810cb4c7>] rcu_process_callbacks+0x3bb/0x691
[   63.980094]  [<ffffffff810cb3cd>] ? rcu_process_callbacks+0x2c1/0x691
[   63.980094]  [<ffffffff817e6d4d>] ? u32_destroy_key+0x6d/0x6d
[   63.980094]  [<ffffffff810780a4>] __do_softirq+0x142/0x323
[   63.980094]  [<ffffffff810782a8>] run_ksoftirqd+0x23/0x53
[   63.980094]  [<ffffffff81092126>] smpboot_thread_fn+0x203/0x221
[   63.980094]  [<ffffffff81091f23>] ? smpboot_unpark_thread+0x33/0x33
[   63.980094]  [<ffffffff8108e44d>] kthread+0xc9/0xd1
[   63.980094]  [<ffffffff819e00ea>] ? do_wait_for_common+0xf8/0x125
[   63.980094]  [<ffffffff8108e384>] ? __kthread_parkme+0x61/0x61
[   63.980094]  [<ffffffff819e43ec>] ret_from_fork+0x7c/0xb0
[   63.980094]  [<ffffffff8108e384>] ? __kthread_parkme+0x61/0x61

tp could be freed in call_rcu callback too, the order is not guaranteed.

John Fastabend says:

====================
Its worth noting why this is safe. Any running schedulers will either
read the valid class field or it will be zeroed.

All schedulers today when the class is 0 do a lookup using the
same call used by the tcf_exts_bind(). So even if we have a running
classifier hit the null class pointer it will do a lookup and get
to the same result. This is particularly fragile at the moment because
the only way to verify this is to audit the schedulers call sites.
====================

Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 22:00:42 -04:00
WANG Cong
6e0565697a net_sched: fix another crash in cls_tcindex
This patch fixes the following crash:

[  166.670795] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  166.674230] IP: [<ffffffff814b739f>] __list_del_entry+0x5c/0x98
[  166.674230] PGD d0ea5067 PUD ce7fc067 PMD 0
[  166.674230] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  166.674230] CPU: 1 PID: 775 Comm: tc Not tainted 3.17.0-rc6+ #642
[  166.674230] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  166.674230] task: ffff8800d03c4d20 ti: ffff8800cae7c000 task.ti: ffff8800cae7c000
[  166.674230] RIP: 0010:[<ffffffff814b739f>]  [<ffffffff814b739f>] __list_del_entry+0x5c/0x98
[  166.674230] RSP: 0018:ffff8800cae7f7d0  EFLAGS: 00010207
[  166.674230] RAX: 0000000000000000 RBX: ffff8800cba8d700 RCX: ffff8800cba8d700
[  166.674230] RDX: 0000000000000000 RSI: dead000000200200 RDI: ffff8800cba8d700
[  166.674230] RBP: ffff8800cae7f7d0 R08: 0000000000000001 R09: 0000000000000001
[  166.674230] R10: 0000000000000000 R11: 000000000000859a R12: ffffffffffffffe8
[  166.674230] R13: ffff8800cba8c5b8 R14: 0000000000000001 R15: ffff8800cba8d700
[  166.674230] FS:  00007fdb5f04a740(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
[  166.674230] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  166.674230] CR2: 0000000000000000 CR3: 00000000cf929000 CR4: 00000000000006e0
[  166.674230] Stack:
[  166.674230]  ffff8800cae7f7e8 ffffffff814b73e8 ffff8800cba8d6e8 ffff8800cae7f828
[  166.674230]  ffffffff817caeec 0000000000000046 ffff8800cba8c5b0 ffff8800cba8c5b8
[  166.674230]  0000000000000000 0000000000000001 ffff8800cf8e33e8 ffff8800cae7f848
[  166.674230] Call Trace:
[  166.674230]  [<ffffffff814b73e8>] list_del+0xd/0x2b
[  166.674230]  [<ffffffff817caeec>] tcf_action_destroy+0x4c/0x71
[  166.674230]  [<ffffffff817ca0ce>] tcf_exts_destroy+0x20/0x2d
[  166.674230]  [<ffffffff817ec2b5>] tcindex_delete+0x196/0x1b7

struct list_head can not be simply copied and we should always init it.

Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 22:00:42 -04:00
David S. Miller
25e379c475 Merge branch 'udp_gso'
Tom Herbert says:

====================
udp: Generalize GSO for UDP tunnels

This patch set generalizes the UDP tunnel segmentation functions so
that they can work with various protocol encapsulations. The primary
change is to set the inner_protocol field in the skbuff when creating
the encapsulated packet, and then in skb_udp_tunnel_segment this data
is used to determine the function for segmenting the encapsulated
packet. The inner_protocol field is overloaded to take either an
Ethertype or IP protocol.

The inner_protocol is set on transmit using skb_set_inner_ipproto or
skb_set_inner_protocol functions. VXLAN and IP tunnels (for fou GSO)
were modified to call these.

Notes:
  - GSO for GRE/UDP where GRE checksum is enabled does not work.
    Handling this will require some special case code.
  - Software GSO now supports many varieties of encapsulation with
    SKB_GSO_UDP_TUNNEL{_CSUM}. We still need a mechanism to query
    for device support of particular combinations (I intend to
    add ndo_gso_check for that).
  - MPLS seems to be the only previous user of inner_protocol. I don't
    believe these patches can affect that. For supporting GSO with
    MPLS over UDP, the inner_protocol should be set using the
    helper functions in this patch.
  - GSO for L2TP/UDP should also be straightforward now.

v2:
  - Respin for Eric's restructuring of skbuff.

Tested GRE, IPIP, and SIT over fou as well as VLXAN. This was
done using 200 TCP_STREAMs in netperf.

 GRE
    IPv4, FOU, UDP checksum enabled
      TCP_STREAM TSO enabled on tun interface
        14.04% TX CPU utilization
        13.17% RX CPU utilization
        9211 Mbps
      TCP_STREAM TSO disabled on tun interface
        27.82% TX CPU utilization
        25.41% RX CPU utilization
        9336 Mbps
    IPv4, FOU, UDP checksum disabled
      TCP_STREAM TSO enabled on tun interface
        13.14% TX CPU utilization
        23.18% RX CPU utilization
        9277 Mbps
      TCP_STREAM TSO disabled on tun interface
        30.00% TX CPU utilization
        31.28% RX CPU utilization
        9327 Mbps

  IPIP
    FOU, UDP checksum enabled
      TCP_STREAM TSO enabled on tun interface
        15.28% TX CPU utilization
        13.92% RX CPU utilization
        9342 Mbps
      TCP_STREAM TSO disabled on tun interface
        27.82% TX CPU utilization
        25.41% RX CPU utilization
        9336 Mbps
    FOU, UDP checksum disabled
      TCP_STREAM TSO enabled on tun interface
        15.08% TX CPU utilization
        24.64% RX CPU utilization
        9226 Mbps
      TCP_STREAM TSO disabled on tun interface
        30.00% TX CPU utilization
        31.28% RX CPU utilization
        9327 Mbps

  SIT
    FOU, UDP checksum enabled
      TCP_STREAM TSO enabled on tun interface
        14.47% TX CPU utilization
        14.58% RX CPU utilization
        9106 Mbps
      TCP_STREAM TSO disabled on tun interface
        31.82% TX CPU utilization
        30.82% RX CPU utilization
        9204 Mbps
    FOU, UDP checksum disabled
      TCP_STREAM TSO enabled on tun interface
        15.70% TX CPU utilization
        27.93% RX CPU utilization
        9097 Mbps
      TCP_STREAM TSO disabled on tun interface
        33.48% TX CPU utilization
        37.36% RX CPU utilization
        9197 Mbps

   VXLAN
      TCP_STREAM TSO enabled on tun interface
        16.42% TX CPU utilization
        23.66% RX CPU utilization
        9081 Mbps
      TCP_STREAM TSO disabled on tun interface
        30.32% TX CPU utilization
        30.55% RX CPU utilization
        9185 Mbps

   Baseline (no encp, TSO and LRO enabled)
      TCP_STREAM
        11.85% TX CPU utilization
        15.13% RX CPU utilization
        9452 Mbps
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:35:58 -04:00
Tom Herbert
996c9fd167 vxlan: Set inner protocol before transmit
Call skb_set_inner_protocol to set inner Ethernet protocol to
ETH_P_TEB before transmit. This is needed for GSO with UDP tunnels.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:35:52 -04:00
Tom Herbert
54bc9bac30 gre: Set inner protocol in v4 and v6 GRE transmit
Call skb_set_inner_protocol to set inner Ethernet protocol to
protocol being encapsulation by GRE before tunnel_xmit. This is
needed for GSO if UDP encapsulation (fou) is being done.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:35:51 -04:00
Tom Herbert
077c5a0948 ipip: Set inner IP protocol in ipip
Call skb_set_inner_ipproto to set inner IP protocol to IPPROTO_IPV4
before tunnel_xmit. This is needed if UDP encapsulation (fou) is
being done.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:35:51 -04:00
Tom Herbert
469471cdfc sit: Set inner IP protocol in sit
Call skb_set_inner_ipproto to set inner IP protocol to IPPROTO_IPV6
before tunnel_xmit. This is needed if UDP encapsulation (fou) is
being done.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:35:51 -04:00
Tom Herbert
8bce6d7d0d udp: Generalize skb_udp_segment
skb_udp_segment is the function called from udp4_ufo_fragment to
segment a UDP tunnel packet. This function currently assumes
segmentation is transparent Ethernet bridging (i.e. VXLAN
encapsulation). This patch generalizes the function to
operate on either Ethertype or IP protocol.

The inner_protocol field must be set to the protocol of the inner
header. This can now be either an Ethertype or an IP protocol
(in a union). A new flag in the skbuff indicates which type is
effective. skb_set_inner_protocol and skb_set_inner_ipproto
helper functions were added to set the inner_protocol. These
functions are called from the point where the tunnel encapsulation
is occuring.

When skb_udp_tunnel_segment is called, the function to segment the
inner packet is selected based on the inner IP or Ethertype. In the
case of an IP protocol encapsulation, the function is derived from
inet[6]_offloads. In the case of Ethertype, skb->protocol is
set to the inner_protocol and skb_mac_gso_segment is called. (GRE
currently does this, but it might be possible to lookup the protocol
in offload_base and call the appropriate segmenation function
directly).

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:35:51 -04:00
David S. Miller
f44d61cdd3 Merge branch 'bpf-next'
Alexei Starovoitov says:

====================
bpf: add search pruning optimization and tests

patch #1 commit log explains why eBPF verifier has to examine some
instructions multiple times and describes the search pruning optimization
that improves verification speed for branchy programs and allows more
complex programs to be verified successfully.
This patch completes the core verifier logic.

patch #2 adds more verifier tests related to branches and search pruning

I'm still working on Andy's 'bitmask for stack slots' suggestion. It will be
done on top of this patch.

The current verifier algorithm is brute force depth first search with
state pruning. If anyone can come up with another algorithm that demonstrates
better results, we'll replace the algorithm without affecting user space.

Note verifier doesn't guarantee that all possible valid programs are accepted.
Overly complex programs may still be rejected.
Verifier improvements/optimizations will guarantee that if a program
was passing verification in the past, it will still be passing.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:30:46 -04:00
Alexei Starovoitov
fd10c2ef3e bpf: add tests to verifier testsuite
add 4 extra tests to cover jump verification better

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:30:33 -04:00
Alexei Starovoitov
f1bca824da bpf: add search pruning optimization to verifier
consider C program represented in eBPF:
int filter(int arg)
{
    int a, b, c, *ptr;

    if (arg == 1)
        ptr = &a;
    else if (arg == 2)
        ptr = &b;
    else
        ptr = &c;

    *ptr = 0;
    return 0;
}
eBPF verifier has to follow all possible paths through the program
to recognize that '*ptr = 0' instruction would be safe to execute
in all situations.
It's doing it by picking a path towards the end and observes changes
to registers and stack at every insn until it reaches bpf_exit.
Then it comes back to one of the previous branches and goes towards
the end again with potentially different values in registers.
When program has a lot of branches, the number of possible combinations
of branches is huge, so verifer has a hard limit of walking no more
than 32k instructions. This limit can be reached and complex (but valid)
programs could be rejected. Therefore it's important to recognize equivalent
verifier states to prune this depth first search.

Basic idea can be illustrated by the program (where .. are some eBPF insns):
    1: ..
    2: if (rX == rY) goto 4
    3: ..
    4: ..
    5: ..
    6: bpf_exit
In the first pass towards bpf_exit the verifier will walk insns: 1, 2, 3, 4, 5, 6
Since insn#2 is a branch the verifier will remember its state in verifier stack
to come back to it later.
Since insn#4 is marked as 'branch target', the verifier will remember its state
in explored_states[4] linked list.
Once it reaches insn#6 successfully it will pop the state recorded at insn#2 and
will continue.
Without search pruning optimization verifier would have to walk 4, 5, 6 again,
effectively simulating execution of insns 1, 2, 4, 5, 6
With search pruning it will check whether state at #4 after jumping from #2
is equivalent to one recorded in explored_states[4] during first pass.
If there is an equivalent state, verifier can prune the search at #4 and declare
this path to be safe as well.
In other words two states at #4 are equivalent if execution of 1, 2, 3, 4 insns
and 1, 2, 4 insns produces equivalent registers and stack.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:30:33 -04:00
Nimrod Andy
1b7bde6d65 net: fec: implement rx_copybreak to improve rx performance
- Copy short frames and keep the buffers mapped, re-allocate skb instead of
  memory copy for long frames.
- Add support for setting/getting rx_copybreak using generic ethtool tunable

Changes V3:
* As Eric Dumazet's suggestion that removing the copybreak module parameter
  and only keep the ethtool API support for rx_copybreak.

Changes V2:
* Implements rx_copybreak
* Rx_copybreak provides module parameter to change this value
* Add tunable_ops support for rx_copybreak

Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: Frank Li <Frank.Li@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:28:21 -04:00
Eric Dumazet
ce1a4ea3f1 net: avoid one atomic operation in skb_clone()
Fast clone cloning can actually avoid an atomic_inc(), if we
guarantee prior clone_ref value is 1.

This requires a change kfree_skbmem(), to perform the
atomic_dec_and_test() on clone_ref before setting fclone to
SKB_FCLONE_UNAVAILABLE.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:27:23 -04:00
Brian Foster
da5f10969d xfs: flush the range before zero range conversion
XFS currently discards delalloc blocks within the target range of a
zero range request. Unaligned start and end offsets are zeroed
through the page cache and the internal, aligned blocks are
converted to unwritten extents.

If EOF is page aligned and covered by a delayed allocation extent.
The inode size is not updated until I/O completion. If a zero range
request discards a delalloc range that covers page aligned EOF as
such, the inode size update never occurs. For example:

$ rm -f /mnt/file
$ xfs_io -fc "pwrite 0 64k" -c "zero 60k 4k" /mnt/file
$ stat -c "%s" /mnt/file
65536
$ umount /mnt
$ mount <dev> /mnt
$ stat -c "%s" /mnt/file
61440

Update xfs_zero_file_space() to flush the range rather than discard
delalloc blocks to ensure that inode size updates occur
appropriately.

[dchinner: Note that this is really a workaround to avoid the
underlying problems. More work is needed (and ongoing) to fix those
issues so this fix is being added as a temporary stop-gap measure. ]

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:44:54 +10:00
Brian Foster
07d08681d2 xfs: restore buffer_head unwritten bit on ioend cancel
xfs_vm_writepage() walks each buffer_head on the page, maps to the block
on disk and attaches to a running ioend structure that represents the
I/O submission. A new ioend is created when the type of I/O (unwritten,
delayed allocation or overwrite) required for a particular buffer_head
differs from the previous. If a buffer_head is a delalloc or unwritten
buffer, the associated bits are cleared by xfs_map_at_offset() once the
buffer_head is added to the ioend.

The process of mapping each buffer_head occurs in xfs_map_blocks() and
acquires the ilock in blocking or non-blocking mode, depending on the
type of writeback in progress. If the lock cannot be acquired for
non-blocking writeback, we cancel the ioend, redirty the page and
return. Writeback will revisit the page at some later point.

Note that we acquire the ilock for each buffer on the page. Therefore
during non-blocking writeback, it is possible to add an unwritten buffer
to the ioend, clear the unwritten state, fail to acquire the ilock when
mapping a subsequent buffer and cancel the ioend. If this occurs, the
unwritten status of the buffer sitting in the ioend has been lost. The
page will eventually hit writeback again, but xfs_vm_writepage() submits
overwrite I/O instead of unwritten I/O and does not perform unwritten
extent conversion at I/O completion. This leads to data corruption
because unwritten extents are treated as holes on reads and zeroes are
returned instead of reading from disk.

Modify xfs_cancel_ioend() to restore the buffer unwritten bit for ioends
of type XFS_IO_UNWRITTEN. This ensures that unwritten extent conversion
occurs once the page is eventually written back.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:42:06 +10:00
Eric Sandeen
5cca3f611d xfs: check for null dquot in xfs_quota_calc_throttle()
Coverity spotted this.

Granted, we *just* checked xfs_inod_dquot() in the caller (by
calling xfs_quota_need_throttle). However, this is the only place we
don't check the return value but the check is cheap and future-proof
so add it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:27:09 +10:00
Eric Sandeen
04dd1a0d4b xfs: fix crc field handling in xfs_sb_to/from_disk
I discovered this in userspace, but the same change applies
to the kernel.

If we xfs_mdrestore an image from a non-crc filesystem, lo
and behold the restored image has gained a CRC:

# db/xfs_metadump.sh -o /dev/sdc1 - | xfs_mdrestore - test.img
# xfs_db -c "sb 0" -c "p crc" /dev/sdc1
crc = 0 (correct)
# xfs_db -c "sb 0" -c "p crc" test.img
crc = 0xb6f8d6a0 (correct)

This is because xfs_sb_from_disk doesn't fill in sb_crc,
but xfs_sb_to_disk(XFS_SB_ALL_BITS) does write the in-memory
CRC to disk - so we get uninitialized memory on disk.

Fix this by always initializing sb_crc to 0 when we read
the superblock, and masking out the CRC bit from ALL_BITS
when we write it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:24:11 +10:00
Eric Sandeen
6ee49a20c1 xfs: don't send null bp to xfs_trans_brelse()
In this case, if bp is NULL, error is set, and we send a
NULL bp to xfs_trans_brelse, which will try to dereference it.

Test whether we actually have a buffer before we try to
free it.

Coverity spotted this.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:23:49 +10:00
Brian Foster
ce57bcf6b8 xfs: check for inode size overflow in xfs_new_eof()
If we write to the maximum file offset (2^63-2), XFS fails to log the
inode size update when the page is flushed. For example:

$ xfs_io -fc "pwrite `echo "2^63-1-1" | bc` 1" /mnt/file
wrote 1/1 bytes at offset 9223372036854775806
1.000000 bytes, 1 ops; 0.0000 sec (22.711 KiB/sec and 23255.8140 ops/sec)
$ stat -c %s /mnt/file
9223372036854775807
$ umount /mnt ; mount <dev> /mnt/
$ stat -c %s /mnt/file
0

This occurs because XFS calculates the new file size as io_offset +
io_size, I/O occurs in block sized requests, and the maximum supported
file size is not block aligned. Therefore, a write to the max allowable
offset on a 4k blocksize fs results in a write of size 4k to offset
2^63-4096 (e.g., equivalent to round_down(2^63-1, 4096), or IOW the
offset of the block that contains the max file size). The offset plus
size calculation (2^63 - 4096 + 4096 == 2^63) overflows the signed
64-bit variable which goes negative and causes the > comparison to the
on-disk inode size to fail. This returns 0 from xfs_new_eof() and
results in no change to the inode on-disk.

Update xfs_new_eof() to explicitly detect overflow of the local
calculation and use the VFS inode size in this scenario. The VFS inode
size is capped to the maximum and thus XFS writes the correct inode size
to disk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:21:53 +10:00
Dave Chinner
a872703f34 xfs: only set extent size hint when asked
Currently the extent size hint is set unconditionally in
xfs_ioctl_setattr() when the FSX_EXTSIZE flag is set. Hence we can
set hints when the inode flags indicating the hint should be used
are not set.  Hence only set the extent size hint from userspace
when the inode has the XFS_DIFLAG_EXTSIZE flag set to indicate that
we should have an extent size hint set on the inode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:20:30 +10:00
Dave Chinner
9336e3a765 xfs: project id inheritance is a directory only flag
xfs_set_diflags() allows it to be set on non-directory inodes, and
this flags errors in xfs_repair. Further, inode allocation allows
the same directory-only flag to be inherited to non-directories.
Make sure directory inode flags don't appear on other types of
inodes.

This fixes several xfstests scratch fileystem corruption reports
(e.g. xfs/050) now that xfstests checks scratch filesystems after
test completion.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:18:40 +10:00
Dave Chinner
e076b0f3a5 xfs: kill time.h
The typedef for timespecs and nanotime() are completely unnecessary,
and delay() can be moved to fs/xfs/linux.h, which means this file
can go away.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:18:13 +10:00
Dave Chinner
b1d6cc02f2 xfs: compat_xfs_bstat does not have forkoff
struct compat_xfs_bstat is missing the di_forkoff field and so does
not fully translate the structure correctly. Fix it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:17:58 +10:00
Dave Chinner
75e58ce4c8 Merge branch 'xfs-buf-iosubmit' into for-next 2014-10-02 09:11:14 +10:00
Christoph Hellwig
8c15612546 xfs: simplify xfs_zero_remaining_bytes
xfs_zero_remaining_bytes() open codes a log of buffer manupulations
to do a read forllowed by a write. It can simply be replaced by an
uncached read followed by a xfs_bwrite() call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:05:44 +10:00
Dave Chinner
ba3726742c xfs: check xfs_buf_read_uncached returns correctly
xfs_buf_read_uncached() has two failure modes. If can either return
NULL or bp->b_error != 0 depending on the type of failure, and not
all callers check for both. Fix it so that xfs_buf_read_uncached()
always returns the error status, and the buffer is returned as a
function parameter. The buffer will only be returned on success.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:05:32 +10:00
Dave Chinner
595bff75dc xfs: introduce xfs_buf_submit[_wait]
There is a lot of cookie-cutter code that looks like:

	if (shutdown)
		handle buffer error
	xfs_buf_iorequest(bp)
	error = xfs_buf_iowait(bp)
	if (error)
		handle buffer error

spread through XFS. There's significant complexity now in
xfs_buf_iorequest() to specifically handle this sort of synchronous
IO pattern, but there's all sorts of nasty surprises in different
error handling code dependent on who owns the buffer references and
the locks.

Pull this pattern into a single helper, where we can hide all the
synchronous IO warts and hence make the error handling for all the
callers much saner. This removes the need for a special extra
reference to protect IO completion processing, as we can now hold a
single reference across dispatch and waiting, simplifying the sync
IO smeantics and error handling.

In doing this, also rename xfs_buf_iorequest to xfs_buf_submit and
make it explicitly handle on asynchronous IO. This forces all users
to be switched specifically to one interface or the other and
removes any ambiguity between how the interfaces are to be used. It
also means that xfs_buf_iowait() goes away.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:05:14 +10:00
Dave Chinner
8b131973d1 xfs: kill xfs_bioerror_relse
There is only one caller now - xfs_trans_read_buf_map() - and it has
very well defined call semantics - read, synchronous, and b_iodone
is NULL. Hence it's pretty clear what error handling is necessary
for this case. The bigger problem of untangling
xfs_trans_read_buf_map error handling is left to a future patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:05:05 +10:00