Commit Graph

412626 Commits

Author SHA1 Message Date
Vu Pham
7bb312e4a2 IB/srp: Make transport layer retry count configurable
Allow the InfiniBand RC retry count to be configured by the user as an
option in the target login string.  Reducing this retry count reduces
the path failover time.

Signed-off-by: Vu Pham <vu@mellanox.com>

[ bvanassche: Rewrote patch description / changed default retry count ]

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:15 -08:00
Mike Marciniszyn
2fadd83184 IB/qib: Fix txselect regression
Commit 7fac33014f54 ("IB/qib: checkpatch fixes") was overzealous in
removing a simple_strtoul for a parse routine, setup_txselect().  That
routine is required to handle a multi-value string.

Unwind that aspect of the fix.

Cc: <stable@vger.kernel.org>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:12 -08:00
Mike Marciniszyn
78a5886472 IB/qib: Fix checkpatch __packed warnings
Convert __attribute__ ((packed)) to __packed.

Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:12 -08:00
Jan Kara
603e772992 IB/qib: Convert qib_user_sdma_pin_pages() to use get_user_pages_fast()
qib_user_sdma_queue_pkts() gets called with mmap_sem held for
writing. Except for get_user_pages() deep down in
qib_user_sdma_pin_pages() we don't seem to need mmap_sem at all.  Even
more interestingly the function qib_user_sdma_queue_pkts() (and also
qib_user_sdma_coalesce() called somewhat later) call copy_from_user()
which can hit a page fault and we deadlock on trying to get mmap_sem
when handling that fault.

So just make qib_user_sdma_pin_pages() use get_user_pages_fast() and
leave mmap_sem locking for mm.

This deadlock has actually been observed in the wild when the node
is under memory pressure.

Cc: <stable@vger.kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:11 -08:00
Jan Kara
4adcf7fb67 IB/ipath: Convert ipath_user_sdma_pin_pages() to use get_user_pages_fast()
ipath_user_sdma_queue_pkts() gets called with mmap_sem held for
writing.  Except for get_user_pages() deep down in
ipath_user_sdma_pin_pages() we don't seem to need mmap_sem at all.

Even more interestingly the function ipath_user_sdma_queue_pkts() (and
also ipath_user_sdma_coalesce() called somewhat later) call
copy_from_user() which can hit a page fault and we deadlock on trying
to get mmap_sem when handling that fault.  So just make
ipath_user_sdma_pin_pages() use get_user_pages_fast() and leave
mmap_sem locking for mm.

This deadlock has actually been observed in the wild when the node
is under memory pressure.

Cc: <stable@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>

[ Merged in fix for call to get_user_pages_fast from Tetsuo Handa
  <penguin-kernel@I-love.SAKURA.ne.jp>.  - Roland ]

Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:11 -08:00
Naresh Gottumukkala
d5e3f37833 RDMA/ocrdma: Remove redundant check in ocrdma_build_fr()
Remove the redundant check of comparing if a 32-bit value is greater
than 0xffffffffULL.

Reported by Dan Carpenter.

Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:06 -08:00
Naresh Gottumukkala
1852d1da3b RDMA/ocrdma: Fix a crash in rmmod
1) ocrdma_remove_free() is called from a call_rcu callback function
   context, which can be a bottom-half context. So the code in
   ocrdma_remove_free() should not sleep.

   But ocrdma_cleanup_hw() can sleep, so move it to ocrdma_remove()
   instead of ocrdma_remove_free().

2) Fix a couple of kbuild test robot warnings.

Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:06 -08:00
Dan Carpenter
6ebacdfc07 RDMA/ocrdma: Silence an integer underflow warning
We recently added a cap on "max_wqe_allocated" in 43a6b4025c
('RDMA/ocrdma: Create IRD queue fix').

My static checker complains that the cap has a problem because it
casts large values to negative.  "attrs->cap.max_send_wr" is a u32.
It comes from the user, but it's capped in ocrdma_check_qp_params() so
it can't wrap here.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:05 -08:00
Eli Cohen
1b77d2bd75 mlx5: Use enum to indicate adapter page size
The Connect-IB adapter has an inherent page size which equals 4K.
Define a new enum that equals the page shift and use it instead of
using the value 12 throughout the code.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:01 -08:00
Eli Cohen
c2a3431e61 IB/mlx5: Update opt param mask for RTS2RTS
RTS to RTS transition should allow update of alternate path.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:01 -08:00
Eli Cohen
07c9113fe8 IB/mlx5: Remove "Always false" comparison
mlx5_cur and mlx5_new cannot have negative values so remove the
redundant condition.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:01 -08:00
Eli Cohen
2d036fad94 IB/mlx5: Remove dead code in mr.c
In mlx5_mr_cache_init() the size variable is not used so remove it to
avoid compiler warnings when running with make W=1.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:00 -08:00
Moshe Lazer
4e3d677ba9 mlx5_core: Change optimal_reclaimed_pages for better performance
Change optimal_reclaimed_pages() to increase the output size of each
reclaim pages command. This change significantly reduces the number of
reclaim pages commands issued to FW when the driver is unloaded, which
reduces the overall driver unload time.

Signed-off-by: Moshe Lazer <moshel@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:00 -08:00
Eli Cohen
87b8de492d mlx5: Clear reserved area in set_hca_cap()
Firmware spec requires reserved fields to be cleared when calling
set_hca_cap.  Current code queries and copies to the set area, possibly
resulting in reserved bits not being cleared. This patch copies only
writable fields to the set area.

Also fix a typo: msx => max

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:00 -08:00
Eli Cohen
bf0bf77f65 mlx5: Support communicating arbitrary host page size to firmware
Connect-IB firmware requires 4K pages to be communicated with the
driver. This patch breaks larger pages to 4K units to enable support
for architectures utilizing larger page size, such as PowerPC.  This
patch also fixes several places that referred to PAGE_SHIFT instead of
explicit 12 which is the inherent page shift on Connect-IB.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:00 -08:00
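
Illustration (not part of the commit): the idea of breaking a larger host page into 4K units for the firmware can be sketched as follows. This is a minimal userspace sketch, not the driver code. The names ADAPTER_PAGE_SHIFT and split_host_page() are hypothetical; only the 4K adapter page size (shift 12) comes from the commit message above.

```c
#include <stddef.h>
#include <stdint.h>

/* The adapter's fixed 4K page, expressed as a shift (2^12 = 4096). */
enum { ADAPTER_PAGE_SHIFT = 12 };

/*
 * Hypothetical helper: split one host page starting at host_addr, of size
 * (1 << host_page_shift), into 4K chunk addresses. At most max_out
 * addresses are written to out. Returns the number of 4K units the host
 * page contains, or 0 if the host page is smaller than 4K (not handled
 * in this sketch).
 */
static size_t split_host_page(uint64_t host_addr, unsigned int host_page_shift,
                              uint64_t *out, size_t max_out)
{
    size_t nchunks, i;

    if (host_page_shift < ADAPTER_PAGE_SHIFT)
        return 0;

    nchunks = (size_t)1 << (host_page_shift - ADAPTER_PAGE_SHIFT);
    for (i = 0; i < nchunks && i < max_out; i++)
        out[i] = host_addr + ((uint64_t)i << ADAPTER_PAGE_SHIFT);

    return nchunks;
}
```

With a 64K host page (shift 16), as can be configured on PowerPC, one host page yields sixteen 4K units; with a 4K host page it yields exactly one.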
Eli Cohen
952f5f6e80 mlx5: Fix cleanup flow when DMA mapping fails
If DMA mapping fails, the driver cleared the object that holds the
previously DMA mapped pages. Fix this by allocating a new object for
the command that reports back to firmware that pages can't be
supplied.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:00 -08:00
Moshe Lazer
cfd8f1d49b IB/mlx5: Fix srq free in destroy qp
On destroy QP the driver walks over the relevant CQ and removes CQEs
reported for the destroyed QP.  It also frees the related SRQ entry
without checking that this is actually an SRQ-related CQE.  In case of
a CQ used for both send and receive QP, we could free SRQ entries for
send CQEs.  This patch resolves this issue by verifying that this is a
SRQ related CQE by checking the SRQ number in the CQE is not zero.

Signed-off-by: Moshe Lazer <moshel@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:59 -08:00
Eli Cohen
1faacf82df IB/mlx5: Simplify mlx5_ib_destroy_srq
Make use of destroy_srq_kernel() to clear SRQ resources.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:59 -08:00
Eli Cohen
9641b74ebe IB/mlx5: Fix overflow check in IB_WR_FAST_REG_MR
Make sure not to overflow when reading the page list from struct
ib_fast_reg_page_list.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:59 -08:00
Eli Cohen
746b5583c1 IB/mlx5: Multithreaded create MR
Use asynchronous commands to execute up to eight concurrent create MR
commands. This is to fill memory caches faster so we keep consuming
from there.  Also, increase timeout for shrinking caches to five
minutes.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:59 -08:00
Eli Cohen
51ee86a4af IB/mlx5: Fix check of number of entries in create CQ
Verify that the value is non-negative before rounding up to power of 2.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:58 -08:00
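
Illustration (not part of the commit): a minimal sketch of validating a signed entry count before rounding it up to a power of two. The helper name cq_entries_round_up() and the max_entries upper bound are assumptions made for this example; the commit message only states that the value is checked for being non-negative first.

```c
#include <stdint.h>

/*
 * Hypothetical helper: validate a user-supplied entry count, then round it
 * up to the next power of two. Returns 0 and stores the result in *rounded,
 * or -1 if the count is negative or above the (assumed) limit.
 */
static int cq_entries_round_up(int entries, int max_entries, uint32_t *rounded)
{
    uint32_t n = 1;

    /* Reject bad input before rounding; a negative value converted to an
     * unsigned type would otherwise look like a huge request. */
    if (entries < 0 || entries > max_entries)
        return -1;

    while (n < (uint32_t)entries)
        n <<= 1;

    *rounded = n;
    return 0;
}
```

The ordering matters: once the value has been converted to an unsigned type or rounded, a negative input can no longer be told apart from a large valid one.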
Mathias Krause
5476781bb9 IB/netlink: Remove superfluous RDMA_NL_GET_OP() masking
'op' is the already RDMA_NL_GET_OP() masked 'type'.  No need to mask it again.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: Yann Droneaud <ydroneaud@opteya.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:54 -08:00
Latchesar Ionkov
6b7d103c1b IB/core: Pass imm_data from ib_uverbs_send_wr to ib_send_wr correctly
Currently, we don't copy the immediate data from the userspace struct
to the kernel one when UD messages are being sent.

This patch makes sure that the immediate data is set correctly.

Signed-off-by: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:54 -08:00
Michal Schmidt
7f1a38671c IPoIB: lower NAPI weight
Since commit 82dc3c63c6 ("net: introduce NAPI_POLL_WEIGHT")
netif_napi_add() produces an error message if a NAPI poll weight
greater than 64 is requested.

Use the standard NAPI weight.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:50 -08:00
Erez Shitrit
94232d9ce8 IPoIB: Start multicast join process only on active ports
The driver starts the mcast_join task whenever the netdev interface is
UP without relation to the underlying IB port state.

Until the port state is ACTIVE all the join requests are irrelevant,
and the IB core returns -EINVAL. So the user will see errors such as:
"multicast join failed for ff12:401b:... , status -22".

Instead, have ipoib_mcast_join_task() return when the port is not active.

It will be called again when the port state is changed and the
low-level driver triggers the IB_EVENT_PORT_ACTIVE event or the
IB_EVENT_CLIENT_REREGISTER event.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:49 -08:00
Erez Shitrit
a39c52ab88 IPoIB: Add path query flushing in ipoib_ib_dev_cleanup
The path_rec_completion() callback may be invoked asynchronously even
in the middle of the "driver uninit" process.  This can lead to scheduling
a task that tries to touch members of the priv object that are no
longer valid.  For example, the function cm_create_tx_qp can attempt to
create a QP with no valid priv->pd object.

The following crash is one of the results:
RIP: 0010:[<ffffffffa021bb47>]  [<ffffffffa021bb47>] ipoib_cm_create_tx_qp+0x57/0x90 [ib_ipoib]
Process ipoib (pid: 5916, threadinfo ffff8803786e4000, task ffff8804150e1500)
Stack:
Call Trace:
[<ffffffff81309ef0>] ? get_random_bytes+0x20/0x30
[<ffffffffa021be2a>] ipoib_cm_tx_init+0xca/0x340 [ib_ipoib]
[<ffffffffa021f765>] ipoib_cm_tx_start+0x215/0x3f0 [ib_ipoib]
[<ffffffffa021f550>] ? ipoib_cm_tx_start+0x0/0x3f0 [ib_ipoib]
[<ffffffff8108b2b0>] worker_thread+0x170/0x2a0
[<ffffffff81090bf0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8108b140>] ? worker_thread+0x0/0x2a0
[<ffffffff81090886>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff810907f0>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20

Fix that by flushing all pending path queries at this point.

Signed-off-by: Alex Markuze <markuze@mellanox.com>
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:49 -08:00
Erez Shitrit
a9c8ba5884 IPoIB: Fix usage of uninitialized multicast objects
The driver should avoid calling ib_sa_free_multicast on the mcast->mc
object until it finishes its initialization state.  Otherwise we can
crash when ipoib_mcast_dev_flush() attempts to use the uninitialized
multicast object.

Instead, only call wait_for_completion() for multicast entries that
started the join process, meaning that ib_sa_join_multicast() finished.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:49 -08:00
Erez Shitrit
aede25011f IPoIB: Avoid flushing the driver workqueue on dev_down
The driver should not flush the whole workqueue when only one work (the
pkey poll one) needs to be cancelled.  Use cancel_delayed_work_sync()
instead.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:49 -08:00
Erez Shitrit
f47944cc2d IPoIB: Fix deadlock between dev_change_flags() and __ipoib_dev_flush()
When ipoib interface is going down it takes all of its children with
it, under mutex.

For each child, dev_change_flags() is called.  That function calls
ipoib_stop() via the ndo, and causes flush of the workqueue.
Sometimes in the workqueue an __ipoib_dev_flush work() is waiting and
when invoked tries to get the same mutex, which leads to a deadlock,
as seen below.

The solution is to switch to rw-sem instead of mutex.

The deadlock:
[11028.165303]  [<ffffffff812b0977>] ? vgacon_scroll+0x107/0x2e0
[11028.171844]  [<ffffffff814eaac5>] schedule_timeout+0x215/0x2e0
[11028.178465]  [<ffffffff8105a5c3>] ? perf_event_task_sched_out+0x33/0x80
[11028.185962]  [<ffffffff814ea743>] wait_for_common+0x123/0x180
[11028.192491]  [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
[11028.199504]  [<ffffffff814ea85d>] wait_for_completion+0x1d/0x20
[11028.206224]  [<ffffffff8108b4f1>] flush_cpu_workqueue+0x61/0x90
[11028.212948]  [<ffffffff8108b5a0>] ? wq_barrier_func+0x0/0x20
[11028.219375]  [<ffffffff8108bfc4>] flush_workqueue+0x54/0x80
[11028.225712]  [<ffffffffa05a0576>] ipoib_mcast_stop_thread+0x66/0x90 [ib_ipoib]
[11028.233988]  [<ffffffffa059ccea>] ipoib_ib_dev_down+0x6a/0x100 [ib_ipoib]
[11028.241678]  [<ffffffffa059849a>] ipoib_stop+0x8a/0x140 [ib_ipoib]
[11028.248692]  [<ffffffff8142adf1>] dev_close+0x71/0xc0
[11028.254447]  [<ffffffff8142a631>] dev_change_flags+0xa1/0x1d0
[11028.261062]  [<ffffffffa059851b>] ipoib_stop+0x10b/0x140 [ib_ipoib]
[11028.268172]  [<ffffffff8142adf1>] dev_close+0x71/0xc0
[11028.273922]  [<ffffffff8142a631>] dev_change_flags+0xa1/0x1d0
[11028.280452]  [<ffffffff8148f20b>] devinet_ioctl+0x5eb/0x6a0
[11028.286786]  [<ffffffff814903b8>] inet_ioctl+0x88/0xa0
[11028.292633]  [<ffffffff8141591a>] sock_ioctl+0x7a/0x280
[11028.298576]  [<ffffffff81189012>] vfs_ioctl+0x22/0xa0
[11028.304326]  [<ffffffff81140540>] ? unmap_region+0x110/0x130
[11028.310756]  [<ffffffff811891b4>] do_vfs_ioctl+0x84/0x580
[11028.316897]  [<ffffffff81189731>] sys_ioctl+0x81/0xa0

and

[11028.017533]  [<ffffffff8105a5c3>] ? perf_event_task_sched_out+0x33/0x80
[11028.025030]  [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
[11028.031945]  [<ffffffff814eb2ae>] __mutex_lock_slowpath+0x13e/0x180
[11028.039053]  [<ffffffff814eb14b>] mutex_lock+0x2b/0x50
[11028.044910]  [<ffffffffa059f7e7>] __ipoib_ib_dev_flush+0x37/0x210 [ib_ipoib]
[11028.052894]  [<ffffffffa059fa00>] ? ipoib_ib_dev_flush_light+0x0/0x20 [ib_ipoib]
[11028.061363]  [<ffffffffa059fa17>] ipoib_ib_dev_flush_light+0x17/0x20 [ib_ipoib]
[11028.069738]  [<ffffffff8108b120>] worker_thread+0x170/0x2a0
[11028.076068]  [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
[11028.083374]  [<ffffffff8108afb0>] ? worker_thread+0x0/0x2a0
[11028.089709]  [<ffffffff81090626>] kthread+0x96/0xa0
[11028.095266]  [<ffffffff8100c0ca>] child_rip+0xa/0x20
[11028.100921]  [<ffffffff81090590>] ? kthread+0x0/0xa0
[11028.106573]  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
[11028.112423] INFO: task ifconfig:23640 blocked for more than 120 seconds.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:49 -08:00
Tal Alon
22252b4e09 IPoIB: Change CM skb memory allocation to be non-atomic during init
Change CM skb memory allocation to use GFP_KERNEL when possible.

During device init there's no need to use GFP_ATOMIC when allocating
memory for the CM skbs -- use GFP_KERNEL instead.

Signed-off-by: Tal Alon <talal@mellanox.com>
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:48 -08:00
Erez Shitrit
c2bb5628db IPoIB: Fix crash in dev_open error flow
If napi has never been enabled when calling ipoib_ib_dev_stop, a
kernel crash occurs, because the verbs layer completion handler
(ipoib_ib_completion) calls napi_schedule unconditionally.

If the napi structure passed in the napi_schedule call has not
been initialized, napi will crash.

The cleanest solution is to simply enable napi before calling
ipoib_ib_dev_stop in the dev_open error flow. (dev_stop then
immediately disables napi).

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:48 -08:00
Ben Hutchings
649fb5ec0e IB/cxgb4: Fix formatting of physical address
Physical addresses may be wider than virtual addresses (e.g. on i386
with PAE) and must not be formatted with %p.

Compile-tested only.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:30 -08:00
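
Illustration (not part of the commit): a userspace sketch of why a physical address is formatted as a 64-bit integer rather than with %p. On a 32-bit build, a pointer is 32 bits wide while a physical address (for example with PAE) may need more. The type name phys_addr_sketch_t and the helper format_phys() are hypothetical; the kernel has its own types and format specifiers, which are not shown here.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a physical address type that may be wider
 * than a pointer on the running system. */
typedef uint64_t phys_addr_sketch_t;

/*
 * Format a physical address as hex. The value is cast to unsigned long long
 * and printed with %llx, so no bits are lost even when pointers are 32-bit.
 * Returns the snprintf() result (number of characters that were needed).
 */
static int format_phys(char *buf, size_t len, phys_addr_sketch_t pa)
{
    return snprintf(buf, len, "0x%llx", (unsigned long long)pa);
}
```

A value such as 0x123456789 does not fit in a 32-bit pointer, so passing it through %p on such a system would truncate it or invoke undefined behavior.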
Doug Ledford
be9130cc92 IB/cma: Check for GID on listening device first
As a simple optimization that should speed up the vast majority of
connect attempts on IB devices, when we are searching for the GID of an
incoming connection in the cached GID lists of devices, search the
device that received the incoming connection request first.  If we
don't find it there, then move on to other devices.

This reduces the time to perform 10,000 connections considerably.
Prior to this patch, a bad run of cmtime would look like this:

connect      :    12399.26   12351.10    8609.00    1239.93

With this patch, it looks more like this:

connect      :     5864.86    5799.80    8876.00     586.49

Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:24 -08:00
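
Illustration (not part of the commit): the search order described above, sketched in userspace C. The cached GID lists are modelled as plain arrays; struct dev_cache, dev_has_gid() and find_dev() are hypothetical names for this example and do not correspond to the kernel's data structures.

```c
#include <stddef.h>
#include <string.h>

#define GID_LEN 16

/* Hypothetical model of one device's cached GID list. */
struct dev_cache {
    const unsigned char (*gids)[GID_LEN];
    size_t ngids;
};

/* Return 1 if the device's cached list contains gid, else 0. */
static int dev_has_gid(const struct dev_cache *dev, const unsigned char *gid)
{
    size_t i;

    for (i = 0; i < dev->ngids; i++)
        if (memcmp(dev->gids[i], gid, GID_LEN) == 0)
            return 1;
    return 0;
}

/*
 * Look for gid on the device that received the request (listen_idx) first,
 * and only then fall back to scanning the other devices.
 * Returns the index of the device holding the GID, or -1 if none does.
 */
static int find_dev(const struct dev_cache *devs, size_t ndevs,
                    size_t listen_idx, const unsigned char *gid)
{
    size_t i;

    if (listen_idx < ndevs && dev_has_gid(&devs[listen_idx], gid))
        return (int)listen_idx;

    for (i = 0; i < ndevs; i++) {
        if (i == listen_idx)
            continue; /* already checked */
        if (dev_has_gid(&devs[i], gid))
            return (int)i;
    }
    return -1;
}
```

In the common case the GID belongs to the receiving device, so the fallback loop never runs; the result is the same as a plain scan, only the order of the lookups changes.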
Doug Ledford
29f27e8477 IB/cma: Use cached gids
The cma_acquire_dev function was changed by commit 3c86aa70bf
("RDMA/cm: Add RDMA CM support for IBoE devices") to use find_gid_port()
because multiport devices might have either IB or IBoE formatted gids.
The old function assumed that all ports on the same device used the
same GID format.

However, when it was changed to use find_gid_port(), we inadvertently
lost usage of the GID cache.  This turned out to be a very costly
change.  In our testing, each iteration through each index of the GID
table takes roughly 35us.  When you have multiple devices in a system,
and the GID you are looking for is on one of the later devices, the
code loops through all of the GID indexes on all of the early devices
before it finally succeeds on the target device.  This pathological
search behavior combined with 35us per GID table index retrieval
results in results such as the following from the cmtime application
that's part of the latest librdmacm git repo:

ib1:
step              total ms     max ms     min us  us / conn
create id    :       29.42       0.04       1.00       2.94
bind addr    :   186705.66      19.00   18556.00   18670.57
resolve addr :       41.93       9.68     619.00       4.19
resolve route:      486.93       0.48     101.00      48.69
create qp    :     4021.95       6.18     330.00     402.20
connect      :    68350.39   68588.17   24632.00    6835.04
disconnect   :     1460.43     252.65-1862269.00     146.04
destroy      :       41.16       0.04       2.00       4.12

ib0:
step              total ms     max ms     min us  us / conn
create id    :       28.61       0.68       1.00       2.86
bind addr    :     2178.86       2.95     201.00     217.89
resolve addr :       51.26      16.85     845.00       5.13
resolve route:      620.08       0.43      92.00      62.01
create qp    :     3344.40       6.36     273.00     334.44
connect      :     6435.99    6368.53    7844.00     643.60
disconnect   :     5095.38     321.90     757.00     509.54
destroy      :       37.13       0.02       2.00       3.71

Clearly, both the bind address and connect operations suffer
a huge penalty for being anything other than the default
GID on the first port in the system.

After applying this patch, the numbers now look like this:

ib1:
step              total ms     max ms     min us  us / conn
create id    :       30.15       0.03       1.00       3.01
bind addr    :       80.27       0.04       7.00       8.03
resolve addr :       43.02      13.53     589.00       4.30
resolve route:      482.90       0.45     100.00      48.29
create qp    :     3986.55       5.80     330.00     398.66
connect      :     7141.53    7051.29    5005.00     714.15
disconnect   :     5038.85     193.63     918.00     503.88
destroy      :       37.02       0.04       2.00       3.70

ib0:
step              total ms     max ms     min us  us / conn
create id    :       34.27       0.05       1.00       3.43
bind addr    :       26.45       0.04       1.00       2.64
resolve addr :       38.25      10.54     760.00       3.82
resolve route:      604.79       0.43      97.00      60.48
create qp    :     3314.95       6.34     273.00     331.49
connect      :    12399.26   12351.10    8609.00    1239.93
disconnect   :     5096.76     270.72    1015.00     509.68
destroy      :       37.10       0.03       2.00       3.71

It's worth noting that we still suffer a bit of a penalty on
connect to the wrong device, but the penalty is much less than
it used to be.  Follow on patches deal with this penalty.

Many thanks to Neil Horman for helping to track down the slow
function, which led us to discover that the original patch I
mentioned above backed out cache usage and to identify just how
much that impacted the system.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:24 -08:00
Trond Myklebust
a6b31d18b0 SUNRPC: Fix a data corruption issue when retransmitting RPC calls
The following scenario can cause silent data corruption when doing
NFS writes. It has mainly been observed when doing database writes
using O_DIRECT.

1) The RPC client uses sendpage() to do zero-copy of the page data.
2) Due to networking issues, the reply from the server is delayed,
   and so the RPC client times out.
3) The client issues a second sendpage of the page data as part of
   an RPC call retransmission.
4) The reply to the first transmission arrives from the server
   _before_ the client hardware has emptied the TCP socket send
   buffer.
5) After processing the reply, the RPC state machine rules that
   the call is done, and triggers the completion callbacks.
6) The application notices the RPC call is done, and reuses the
   pages to store something else (e.g. a new write).
7) The client NIC drains the TCP socket send buffer. Since the
   page data has now changed, it reads a corrupted version of the
   initial RPC call, and puts it on the wire.

This patch fixes the problem in the following manner:

The ordering guarantees of TCP ensure that when the server sends a
reply, then we know that the _first_ transmission has completed. Using
zero-copy in that situation is therefore safe.
If a timeout occurs, we then send the retransmission using sendmsg()
(i.e. no zero-copy). We then know that the socket contains a full copy of
the data, and so it will retransmit a faithful reproduction even if the
RPC call completes, and the application reuses the O_DIRECT buffer in
the meantime.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-11-08 17:19:15 -05:00
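
Illustration (not part of the commit): the decision rule described in the fix, reduced to a tiny sketch. The enum and function names are hypothetical; only the rule itself (zero-copy for the first transmission, a copying send for any retransmission) is taken from the commit message.

```c
/* Hypothetical names; only the decision rule comes from the description. */
enum send_method {
    SEND_ZERO_COPY, /* sendpage()-style: the socket references the caller's pages */
    SEND_COPY       /* sendmsg()-style: the socket holds its own copy of the data */
};

/* attempt is 0 for the first transmission and greater than 0 for
 * retransmissions after a timeout. */
static enum send_method choose_send_method(unsigned int attempt)
{
    return attempt == 0 ? SEND_ZERO_COPY : SEND_COPY;
}
```

The first transmission can stay zero-copy because a reply from the server implies that transmission has completed. A retransmission may still be sitting in the send buffer when the call completes, so it needs its own copy of the data.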
Seiji Aguchi
d34603b07c x86, trace: Add page fault tracepoints
This patch introduces page fault tracepoints to x86 architecture
by switching IDT.

  Two events, for user and kernel spaces, are introduced at the beginning
  of page fault handler for tracing.

  - User space event
    There is a request for a page fault event for user space, as below.

    https://lkml.kernel.org/r/1368079520-11015-2-git-send-email-fdeslaur+()+gmail+!+com
    https://lkml.kernel.org/r/1368079520-11015-1-git-send-email-fdeslaur+()+gmail+!+com

  - Kernel space event:
    When we measure an overhead in kernel space for investigating performance
    issues, we can check if it comes from the page fault events.

Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/52716E67.6090705@hds.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-11-08 14:15:49 -08:00
Seiji Aguchi
ac7956e269 x86, trace: Delete __trace_alloc_intr_gate()
Currently, irq vector handlers for tracing are registered in both
set_intr_gate() and __trace_alloc_intr_gate() in alloc_intr_gate().
We don't need to do that twice, so delete __trace_alloc_intr_gate().

Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/52716E1B.7090205@hds.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-11-08 14:15:47 -08:00
Seiji Aguchi
25c74b10ba x86, trace: Register exception handler to trace IDT
This patch registers exception handlers for tracing to a trace IDT.

To implement it in set_intr_gate(), this patch does the following:
 - Register the exception handlers to
   the trace IDT by prepending "trace_" to the handlers' names.
 - Introduce trace_page_fault() so that tracepoints can be added
   in a subsequent patch.

Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/52716DEC.5050204@hds.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-11-08 14:15:45 -08:00
Seiji Aguchi
959c071f09 x86, trace: Remove __alloc_intr_gate()
Prepare to move set_intr_gate() into a macro by removing
__alloc_intr_gate().

The purpose is to avoid failing a kernel build after applying a
subsequent patch which changes set_intr_gate() into a macro.

Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/52716DB8.1080702@hds.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-11-08 14:15:44 -08:00
Nicholas Bellinger
4863e52565 target: Add per device xcopy_lun for copy offload I/O
This patch adds a se_device->xcopy_lun that is used for local
copy offload I/O, instead of allocating + initializing a pseudo
se_lun for each received EXTENDED_COPY operation.

Also, move declaration of struct se_lun + struct se_port_stat_grps
ahead of struct se_device.

Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2013-11-08 13:13:38 -08:00
Stefano Stabellini
ffc555be09 arm,arm64/include/asm/io.h: define struct bio_vec
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-11-08 16:12:28 -05:00
Konrad Rzeszutek Wilk
e1d8f62ad4 Merge remote-tracking branch 'stefano/swiotlb-xen-9.1' into stable/for-linus-3.13
* stefano/swiotlb-xen-9.1:
  swiotlb-xen: fix error code returned by xen_swiotlb_map_sg_attrs
  swiotlb-xen: static inline xen_phys_to_bus, xen_bus_to_phys, xen_virt_to_bus and range_straddles_page_boundary
  grant-table: call set_phys_to_machine after mapping grant refs
  arm,arm64: do not always merge biovec if we are running on Xen
  swiotlb: print a warning when the swiotlb is full
  swiotlb-xen: use xen_dma_map/unmap_page, xen_dma_sync_single_for_cpu/device
  xen: introduce xen_dma_map/unmap_page and xen_dma_sync_single_for_cpu/device
  swiotlb-xen: use xen_alloc/free_coherent_pages
  xen: introduce xen_alloc/free_coherent_pages
  arm64/xen: get_dma_ops: return xen_dma_ops if we are running as xen_initial_domain
  arm/xen: get_dma_ops: return xen_dma_ops if we are running as xen_initial_domain
  swiotlb-xen: introduce xen_swiotlb_set_dma_mask
  xen/arm,arm64: enable SWIOTLB_XEN
  xen: make xen_create_contiguous_region return the dma address
  xen/x86: allow __set_phys_to_machine for autotranslate guests
  arm/xen,arm64/xen: introduce p2m
  arm64: define DMA_ERROR_CODE
  arm: make SWIOTLB available

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Conflicts:
	arch/arm/include/asm/dma-mapping.h
	drivers/xen/swiotlb-xen.c

[Conflicts arose b/c "arm: make SWIOTLB available" v8 was in Stefano's
branch, while I had v9 + Ack from Russel. I also fixed up white-space
issues]
2013-11-08 16:10:48 -05:00
Konrad Rzeszutek Wilk
bad97817de Linux 3.12-rc5
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQEcBAABAgAGBQJSWyGgAAoJEHm+PkMAQRiGA8MH/35upHXImoRCsI5uC1qvHJtI
 QvQAhDFxoEXbFUKeaYgTfcM8q9FgnqnfjhLf8eYa4Q7tDZeqLXOE8bkI807mSZMl
 yECr3jcwlV+zyhV2MP/HdwTjzy25bwxLM3Zy43S7QROrYoMHZYznil/QPfyMATCJ
 XLPuXZC1FtuUen89n4BoDIuL8QaVrIR/zLqFklAQcdTcGpLHSOwFtH8gb2WaRLhv
 +4IikFRFgTNZiMR5tP0GPc6UH6TVTvRb4QKSqqa7J8OmfAIvOzAUdhqWSPOIwWwt
 Z/+JFxFDczAcNmpv4gE6jkgc2vR8CVeHsvh0j61RDSFObBWspwk337CSyUZxYSA=
 =w4VQ
 -----END PGP SIGNATURE-----

Merge tag 'v3.12-rc5' into stable/for-linus-3.13

Linux 3.12-rc5

Because the Stefano branch (for SWIOTLB ARM changes) is based on that.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

* tag 'v3.12-rc5': (550 commits)
  Linux 3.12-rc5
  watchdog: sunxi: Fix section mismatch
  watchdog: kempld_wdt: Fix bit mask definition
  watchdog: ts72xx_wdt: locking bug in ioctl
  ARM: exynos: dts: Update 5250 arch timer node with clock frequency
  parisc: let probe_kernel_read() capture access to page zero
  parisc: optimize variable initialization in do_page_fault
  parisc: fix interruption handler to respect pagefault_disable()
  parisc: mark parisc_terminate() noreturn and cold.
  parisc: remove unused syscall_ipi() function.
  parisc: kill SMP single function call interrupt
  parisc: Export flush_cache_page() (needed by lustre)
  vfs: allow O_PATH file descriptors for fstatfs()
  ext4: fix memory leak in xattr
  ARC: Ignore ptrace SETREGSET request for synthetic register "stop_pc"
  ALSA: hda - Sony VAIO Pro 13 (haswell) now has a working headset jack
  ALSA: hda - Add a headset mic model for ALC269 and friends
  ALSA: hda - Fix microphone for Sony VAIO Pro 13 (Haswell model)
  compiler/gcc4: Add quirk for 'asm goto' miscompilation bug
  Revert "i915: Update VGA arbiter support for newer devices"
  ...
2013-11-08 15:28:05 -05:00
Stefano Stabellini
6fe19278ff swiotlb-xen: missing include dma-direction.h
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-11-08 15:22:10 -05:00
Stefano Stabellini
92c0fd17c0 pci-swiotlb-xen: call pci_request_acs only ifdef CONFIG_PCI
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-11-08 15:21:44 -05:00
John Fastabend
51f3773bde ixgbe: deleting dfwd stations out of order can cause null ptr deref
The number of stations in use is kept in the num_rx_pools counter
in the ixgbe_adapter structure. This is in turn used by the queue
allocation scheme to determine how many queues are needed to support
the number of pools in use with the current feature set.

This works as long as the pools are added and destroyed in order
because (num_rx_pools * queues_per_pool) is equal to the last
queue in use by a pool. But as soon as you delete a pool out of
order this is no longer the case. So the above multiplication
allocates too few queues and a pool may reference a ring that has
not been allocated/initialized.

To resolve this, use the bitmask of in-use pools to determine the final
pool being used, and allocate enough queues so that we don't
inadvertently remove its queues.

# ip link add link eth2 \
	numtxqueues 4 numrxqueues 4 txqueuelen 50 type macvlan
# ip link set dev macvlan0 up
# ip link add link eth2 \
	numtxqueues 4 numrxqueues 4 txqueuelen 50 type macvlan
# ip link set dev macvlan1 up
# for i in {0..100}; do
  ip link set dev macvlan0 down; ip link set dev macvlan0 up;
  done;
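
The queue-count problem described above can be illustrated with a small
standalone sketch. This is not the driver code: the helper names
(last_pool_in_use, queues_by_count, queues_by_mask) and the 64-bit mask
are assumptions made for illustration only. It contrasts sizing the
queue range from a pool counter with sizing it from the highest pool
still set in an in-use bitmask.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the ixgbe driver): return the index of
 * the highest set bit in the in-use pool mask, or -1 if no pool is in use.
 */
int last_pool_in_use(uint64_t pool_mask)
{
	int last = -1;

	for (int i = 0; i < 64; i++)
		if (pool_mask & (UINT64_C(1) << i))
			last = i;
	return last;
}

/*
 * Counter-based sizing: only correct while pools are added and removed
 * in order, because it assumes the in-use pools are 0..num_rx_pools-1.
 */
unsigned int queues_by_count(unsigned int num_rx_pools,
			     unsigned int queues_per_pool)
{
	return num_rx_pools * queues_per_pool;
}

/*
 * Mask-based sizing: cover every queue up to and including the last
 * pool that is still in use, even if lower-numbered pools were deleted.
 */
unsigned int queues_by_mask(uint64_t pool_mask,
			    unsigned int queues_per_pool)
{
	assert(queues_per_pool > 0);
	return (unsigned int)(last_pool_in_use(pool_mask) + 1) * queues_per_pool;
}
```

With pools 0, 1 and 2 in use and 4 queues per pool, both approaches give
12 queues. After deleting pool 1 (mask 0x5, counter 2), the counter-based
result drops to 8 while pool 2 still refers to queues 8..11; the
mask-based result stays at 12.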

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-08 15:21:08 -05:00
John Fastabend
219354d489 ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS
In the recent support for layer 2 hardware acceleration, I added a
few references to real_num_rx_queues and num_rx_queues which are
only available with CONFIG_RPS.

The fix is first to remove unnecessary references to num_rx_queues.
Because the hardware offload case is limited to cases where RX queues
and TX queues are equal we only need a single check. Then wrap the
single case in an ifdef.

The patch that introduced this is:

commit a6cc0cfa72
Author: John Fastabend <john.r.fastabend@intel.com>
Date:   Wed Nov 6 09:54:46 2013 -0800

    net: Add layer 2 hardware acceleration operations for macvlan devices

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-08 15:21:08 -05:00
Stefano Stabellini
fbd989b1d7 arm: make SWIOTLB available
IOMMU_HELPER is needed because SWIOTLB calls iommu_is_span_boundary,
provided by lib/iommu_helper.c.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: will.deacon@arm.com
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>

Changes in v9:
- remove unneeded include asm/cacheflush.h;
- just return 0 if !dev->dma_mask in dma_capable.

Changes in v8:
- use __phys_to_pfn and __pfn_to_phys.

Changes in v7:
- dma_mark_clean: empty implementation;
- in dma_capable use coherent_dma_mask if dma_mask hasn't been
  allocated.

Changes in v6:
- check for dev->dma_mask being NULL in dma_capable.

Changes in v5:
- implement dma_mark_clean using dmac_flush_range.

Changes in v3:
- dma_capable: do not treat dma_mask as a limit;
- remove SWIOTLB dependency on NEED_SG_DMA_LENGTH.
2013-11-08 15:16:07 -05:00
Duan Jiong
f104a567e6 ipv6: use rt6_get_dflt_router to get default router in rt6_route_rcv
As RFC 4191 says, the Router Preference and Lifetime values in a
::/0 Route Information Option should override the preference and lifetime
values in the Router Advertisement header. But when the kernel handles
a ::/0 Route Information Option, rt6_get_route_info() always returns
NULL, so the override never happens, because those default routers
were added without the RTF_ROUTEINFO flag in rt6_add_dflt_router().

In order to deal with that condition, we should call rt6_get_dflt_router
when the prefix length is 0.
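
The lookup choice described above can be sketched in a few lines of
standalone C. This is a simplified model, not the kernel code: struct
route, FLAG_ROUTEINFO, sample_table and the three functions are
hypothetical stand-ins for the real routing structures and for
rt6_get_route_info() / rt6_get_dflt_router().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FLAG_ROUTEINFO 0x1u	/* hypothetical stand-in for RTF_ROUTEINFO */

struct route {
	const char *gateway;
	unsigned int prefix_len;
	unsigned int flags;
};

/* Sample table: a default router added without the route-info flag,
 * plus a /64 route learned from a Route Information Option. */
const struct route sample_table[] = {
	{ "fe80::1", 0, 0 },
	{ "fe80::1", 64, FLAG_ROUTEINFO },
};

/* Model of a route-info lookup: only matches flagged entries. */
const struct route *get_route_info(const struct route *tbl, size_t n,
				   unsigned int prefix_len, const char *gw)
{
	for (size_t i = 0; i < n; i++)
		if ((tbl[i].flags & FLAG_ROUTEINFO) &&
		    tbl[i].prefix_len == prefix_len &&
		    strcmp(tbl[i].gateway, gw) == 0)
			return &tbl[i];
	return NULL;
}

/* Model of a default-router lookup: matches ::/0 entries by gateway,
 * regardless of the route-info flag. */
const struct route *get_dflt_router(const struct route *tbl, size_t n,
				    const char *gw)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].prefix_len == 0 && strcmp(tbl[i].gateway, gw) == 0)
			return &tbl[i];
	return NULL;
}

/* Pick the lookup based on the prefix length of the option. */
const struct route *lookup_for_rio(const struct route *tbl, size_t n,
				   unsigned int prefix_len, const char *gw)
{
	assert(gw != NULL);
	if (prefix_len == 0)
		return get_dflt_router(tbl, n, gw);
	return get_route_info(tbl, n, prefix_len, gw);
}
```

In this model the flagged-only lookup finds nothing for a ::/0 option,
while the prefix-length check routes the query to the default-router
lookup and finds the existing entry.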

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-08 15:16:04 -05:00
Paul Gortmaker
3b284bde70 xen: delete new instances of added __cpuinit
commit 6efa20e49b
("xen: Support 64-bit PV guest receiving NMIs") and
commit cd9151e26d
("xen/balloon: set a mapping for ballooned out pages")
added new instances of __cpuinit usage.

We removed this a couple versions ago; we now want to remove
the compat no-op stubs.  Introducing new users is not what
we want to see at this point in time, as it will break once
the stubs are gone.

Cc: Konrad Rzeszutek Wilk <konrad@kernel.org>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-11-08 15:13:16 -05:00