Now that we can reuse bound ports after a close, we never really want to
clear the transport's source port after it has been set. Doing so really
messes up the NFSv3 DRC on the server.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Now that we're setting SO_REUSEPORT, we still need to handle the
case where a connect() is attempted, but the old socket is still
lingering.
Essentially, all we want to do here is handle the error by waiting
a few seconds and then retrying.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When using TCP, we need the ability to reuse port numbers after
a disconnection, so that the NFSv3 server knows that we're the same
client. Currently we use a hack to work around the TCP socket's
TIME_WAIT: we send an RST instead of closing, which doesn't
always work...
The SO_REUSEPORT option added in Linux 3.9 allows us to bind multiple
TCP connections to the same source address+port combination, and thus
to use ordinary TCP close() instead of the current hack.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This patch fixes a sparse warning in the initial submission.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJU09WdAAoJENfLVL+wpUDr2HMP/2rvjkncrlbumSUlkMMUPenZ
PUM6dDAnvvhkRj/YrhmJ94v9SPb4BsR0ay4apn5aqmoEHqEy/LfAsoy1TlcspRO6
DwIKD8wHQ9RkpIVaEt1IGhApWlWK3R5FqoDTSMW9VLq72QuCsnbX3EXWN9B+hgT7
UVmcsGe0/5zd9v8RXzWJjyabOsx7rx7gNuuAyULO3RudNMzEpldMf7oHq0RwndOt
akxSSvRSfMFyfJscfI3F2IJsbXBST+ZzzJ0oRXXWn0RfiWR+Z438Nk9HDuOfGBFv
LACoELeJxtBEFAjBv6GnBILdSxRi6dTZPpAEzvNR45SJAk4c1gGDqgUbyIYXqeYM
k8alEMCS+eKvDe7DL6N1Nlxq6oc5pAt7BYYpRzCz+2dd7gfHIOwurBnZwFgxN/rK
GhtXCaBqqVr4lmu/lbgWjVFN/Jt6TGZ6+FqQhDM70CmJGBWbOEJNE6H50R7SiFpv
7UapXg9iknCpjexkZS3cyAB0Qu31qvIK64ZzsPhVafmSYh9ed5hlT2WSwphfDki5
ZQOktuPFtVkOI55Rz+2EpNNYe9nRuL5wBqD4Na3uUiEWgbnDLRmx+sDyzcfPujTN
trFTwUO082SFK0yZVusZ04vlqjfzxSd4aYSeC5Zzgt0CYJCZd7z0tFrdLpyO/fDn
PLfvcJyFRBBbaPgiwBHg
=3fRY
-----END PGP SIGNATURE-----
Merge tag 'nfs-rdma-for-3.20-part-2' of git://git.linux-nfs.org/projects/anna/nfs-rdma
NFS: RDMA Client Sparse Fixes
This patch fixes a sparse warning in the initial submission.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* tag 'nfs-rdma-for-3.20-part-2' of git://git.linux-nfs.org/projects/anna/nfs-rdma:
xprtrdma: Address sparse complaint in rpcr_to_rdmar()
With "make ARCH=x86_64 allmodconfig make C=1 CF=-D__CHECK_ENDIAN__":
linux-2.6/net/sunrpc/xprtrdma/xprt_rdma.h:273:30: warning: incorrect
type in initializer (different base types)
linux-2.6/net/sunrpc/xprtrdma/xprt_rdma.h:273:30: expected restricted
__be32 [usertype] *buffer
linux-2.6/net/sunrpc/xprtrdma/xprt_rdma.h:273:30: got unsigned int
[usertype] *rq_buffer
As far as I can tell this is a false positive.
Reported-by: kbuild-all@01.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Fix an Oopsable condition when nsm_mon_unmon is called as part of the
namespace cleanup, which now apparently happens after the utsname
has been freed.
Link: http://lkml.kernel.org/r/20150125220604.090121ae@neptune.home
Reported-by: Bruno Prémont <bonbons@linux-vserver.org>
Cc: stable@vger.kernel.org # 3.18
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* flexfiles: (53 commits)
pnfs: lookup new lseg at lseg boundary
nfs41: .init_read and .init_write can be called with valid pg_lseg
pnfs: Update documentation on the Layout Drivers
pnfs/flexfiles: Add the FlexFile Layout Driver
nfs: count DIO good bytes correctly with mirroring
nfs41: wait for LAYOUTRETURN before retrying LAYOUTGET
nfs: add a helper to set NFS_ODIRECT_RESCHED_WRITES to direct writes
nfs41: add NFS_LAYOUT_RETRY_LAYOUTGET to layout header flags
nfs/flexfiles: send layoutreturn before freeing lseg
nfs41: introduce NFS_LAYOUT_RETURN_BEFORE_CLOSE
nfs41: allow async version layoutreturn
nfs41: add range to layoutreturn args
pnfs: allow LD to ask to resend read through pnfs
nfs: add nfs_pgio_current_mirror helper
nfs: only reset desc->pg_mirror_idx when mirroring is supported
nfs41: add a debug warning if we destroy an unempty layout
pnfs: fail comparison when bucket verifier not set
nfs: mirroring support for direct io
nfs: add mirroring support to pgio layer
pnfs: pass ds_commit_idx through the commit path
...
Conflicts:
fs/nfs/pnfs.c
fs/nfs/pnfs.h
Add a call to tally stats for a task under a different statsidx than
what's contained in the task structure.
This is needed to properly account for pnfs reads/writes when the
DS nfs version != the MDS version.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
These patches improve the scalability of the NFSoRDMA client and take large
variables off of the stack. Additionally, the GFP_* flags are updated to
match what TCP uses.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJUy/21AAoJENfLVL+wpUDrDR4QAJ+aZ1LqO5muJG5LDIReh0p+
IyvAO3ekWKaaXZd5e/wOviHRRSBCs3oRBm1sNyjOOZYf7nl67RrndjGq2Ccp6tGw
6V6ligIlhu3QAApFiQmLfpDXQeDBysMfhEQMXDj8kVhAvn3SWwbT6q1Bb8tMZvnu
lwSPYCuLAdtPzDs3j3Pt3jSrOD7gS06Vtumco1yS82YiLK86E7JiOLb9w5Ltk0Wo
DT9hZ7JXtflBy3trBsNYQ7GTjoYEp792Ca9DT/nie8/Tuuy3DhooLJKdwVErq/0/
ycI033mw//RneY+/eF2wLeS4PlzaiSq7DmXKpE5MnkpVuNAdPbHbCuVzuXa4/LLD
NhKyw/NTAE9ucFJ8e3apj3m42O00bMV0rx0rOUvMMdbNPBKzXk3rrb4uRYfT4k+D
yss7aEZg0EcDbe9KqfuDdwXVJIZnvAvXD+V1iL4X3QmVM0JyJZouNZPoG7fCwA1g
uoqGJK9zu7osr59YMomf2okWQFUQYCea5ombXYOo3ZtANw+OYaHDcMKTXaiCtMdA
CZUEOND8mB/EFtgTMA2TZmDzch8iOOuD7wmtifb8h1DqN/3GxgecBcT5XMCx5zDR
bMpt9vbCfe4YP6idpl5XEnVbcPG9CZF2W2Hg2gI7axX8R+ZMvwQQlWJjOqxoR6+F
bpD/jjWFYaS6U7NGyylM
=YCZe
-----END PGP SIGNATURE-----
Merge tag 'nfs-rdma-for-3.20' of git://git.linux-nfs.org/projects/anna/nfs-rdma
NFS: Client side changes for RDMA
These patches improve the scalability of the NFSoRDMA client and take large
variables off of the stack. Additionally, the GFP_* flags are updated to
match what TCP uses.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* tag 'nfs-rdma-for-3.20' of git://git.linux-nfs.org/projects/anna/nfs-rdma: (21 commits)
xprtrdma: Update the GFP flags used in xprt_rdma_allocate()
xprtrdma: Clean up after adding regbuf management
xprtrdma: Allocate zero pad separately from rpcrdma_buffer
xprtrdma: Allocate RPC/RDMA receive buffer separately from struct rpcrdma_rep
xprtrdma: Allocate RPC/RDMA send buffer separately from struct rpcrdma_req
xprtrdma: Allocate RPC send buffer separately from struct rpcrdma_req
xprtrdma: Add struct rpcrdma_regbuf and helpers
xprtrdma: Refactor rpcrdma_buffer_create() and rpcrdma_buffer_destroy()
xprtrdma: Simplify synopsis of rpcrdma_buffer_create()
xprtrdma: Take struct ib_qp_attr and ib_qp_init_attr off the stack
xprtrdma: Take struct ib_device_attr off the stack
xprtrdma: Free the pd if ib_query_qp() fails
xprtrdma: Remove rpcrdma_ep::rep_func and ::rep_xprt
xprtrdma: Move credit update to RPC reply handler
xprtrdma: Remove rl_mr field, and the mr_chunk union
xprtrdma: Remove rpcrdma_ep::rep_ia
xprtrdma: Rename "xprt" and "rdma_connect" fields in struct rpcrdma_xprt
xprtrdma: Clean up hdrlen
xprtrdma: Display XIDs in host byte order
xprtrdma: Modernize htonl and ntohl
...
Reflect the more conservative approach used in the socket transport's
version of this transport method. An RPC buffer allocation should
avoid forcing not just FS activity, but any I/O.
In particular, two recent changes missed updating xprtrdma:
- Commit c6c8fe79a8 ("net, sunrpc: suppress allocation warning ...")
- Commit a564b8f039 ("nfs: enable swap on NFS")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
rpcrdma_{de}register_internal() are used only in verbs.c now.
MAX_RPCRDMAHDR is no longer used and can be removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Use the new rpcrdma_alloc_regbuf() API to shrink the amount of
contiguous memory needed for a buffer pool by moving the zero
pad buffer into a regbuf.
This is for consistency with the other uses of internally
registered memory.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The rr_base field is currently the buffer where RPC replies land.
An RPC/RDMA reply header lands in this buffer. In some cases an RPC
reply header also lands in this buffer, just after the RPC/RDMA
header.
The inline threshold is an agreed-on size limit for RDMA SEND
operations that pass from server and client. The sum of the
RPC/RDMA reply header size and the RPC reply header size must be
less than this threshold.
The largest RDMA RECV that the client should have to handle is the
size of the inline threshold. The receive buffer should thus be the
size of the inline threshold, and not related to RPCRDMA_MAX_SEGS.
RPC replies received via RDMA WRITE (long replies) are caught in
rq_rcv_buf, which is the second half of the RPC send buffer. Ie,
such replies are not involved in any way with rr_base.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The rl_base field is currently the buffer where each RPC/RDMA call
header is built.
The inline threshold is an agreed-on size limit to for RDMA SEND
operations that pass between client and server. The sum of the
RPC/RDMA header size and the RPC header size must be less than or
equal to this threshold.
Increasing the r/wsize maximum will require MAX_SEGS to grow
significantly, but the inline threshold size won't change (both
sides agree on it). The server's inline threshold doesn't change.
Since an RPC/RDMA header can never be larger than the inline
threshold, make all RPC/RDMA header buffers the size of the
inline threshold.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Because internal memory registration is an expensive and synchronous
operation, xprtrdma pre-registers send and receive buffers at mount
time, and then re-uses them for each RPC.
A "hardway" allocation is a memory allocation and registration that
replaces a send buffer during the processing of an RPC. Hardway must
be done if the RPC send buffer is too small to accommodate an RPC's
call and reply headers.
For xprtrdma, each RPC send buffer is currently part of struct
rpcrdma_req so that xprt_rdma_free(), which is passed nothing but
the address of an RPC send buffer, can find its matching struct
rpcrdma_req and rpcrdma_rep quickly via container_of / offsetof.
That means that hardway currently has to replace a whole rpcrmda_req
when it replaces an RPC send buffer. This is often a fairly hefty
chunk of contiguous memory due to the size of the rl_segments array
and the fact that both the send and receive buffers are part of
struct rpcrdma_req.
Some obscure re-use of fields in rpcrdma_req is done so that
xprt_rdma_free() can detect replaced rpcrdma_req structs, and
restore the original.
This commit breaks apart the RPC send buffer and struct rpcrdma_req
so that increasing the size of the rl_segments array does not change
the alignment of each RPC send buffer. (Increasing rl_segments is
needed to bump up the maximum r/wsize for NFS/RDMA).
This change opens up some interesting possibilities for improving
the design of xprt_rdma_allocate().
xprt_rdma_allocate() is now the one place where RPC send buffers
are allocated or re-allocated, and they are now always left in place
by xprt_rdma_free().
A large re-allocation that includes both the rl_segments array and
the RPC send buffer is no longer needed. Send buffer re-allocation
becomes quite rare. Good send buffer alignment is guaranteed no
matter what the size of the rl_segments array is.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
There are several spots that allocate a buffer via kmalloc (usually
contiguously with another data structure) and then register that
buffer internally. I'd like to split the buffers out of these data
structures to allow the data structures to scale.
Start by adding functions that can kmalloc and register a buffer,
and can manage/preserve the buffer's associated ib_sge and ib_mr
fields.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Move the details of how to create and destroy rpcrdma_req and
rpcrdma_rep structures into helper functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: There is one call site for rpcrdma_buffer_create(). All of
the arguments there are fields of an rpcrdma_xprt.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reduce stack footprint of the connection upcall handler function.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Device attributes are large, and are used in more than one place.
Stash a copy in dynamically allocated memory.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If ib_query_qp() fails or the memory registration mode isn't
supported, don't leak the PD. An orphaned IB/core resource will
cause IB module removal to hang.
Fixes: bd7ed1d133 ("RPC/RDMA: check selected memory registration ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: The rep_func field always refers to rpcrdma_conn_func().
rep_func should have been removed by commit b45ccfd25d ("xprtrdma:
Remove MEMWINDOWS registration modes").
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reduce work in the receive CQ handler, which can be run at hardware
interrupt level, by moving the RPC/RDMA credit update logic to the
RPC reply handler.
This has some additional benefits: More header sanity checking is
done before trusting the incoming credit value, and the receive CQ
handler no longer touches the RPC/RDMA header (the CPU stalls while
waiting for the header contents to be brought into the cache).
This further extends work begun by commit e7ce710a88 ("xprtrdma:
Avoid deadlock when credit window is reset").
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Since commit 0ac531c183 ("xprtrdma: Remove REGISTER
memory registration mode"), the rl_mr pointer is no longer used
anywhere.
After removal, there's only a single member of the mr_chunk union,
so mr_chunk can be removed as well, in favor of a single pointer
field.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: This field is not used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Use consistent field names in struct rpcrdma_xprt.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Replace naked integers with a documenting macro.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
xprtsock.c and the backchannel code display XIDs in host byte order.
Follow suit in xprtrdma.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Replace htonl and ntohl with the be32 equivalents.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Make it easier to grep the system log for specific error conditions.
The wc.opcode field is not included because opcode numbers are
sparse, and because wc.opcode is not necessarily valid when
completion reports an error.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Increase the concurrency level for rpciod threads to allow for allocations
etc that happen in the RPCSEC_GSS layer. Also note that the NFSv4 byte range
locks may now need to allocate memory from inside rpciod.
Add the WQ_HIGHPRI flag to improve latency guarantees while we're at it.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull networking fixes from David Miller:
1) Don't use uninitialized data in IPVS, from Dan Carpenter.
2) conntrack race fixes from Pablo Neira Ayuso.
3) Fix TX hangs with i40e, from Jesse Brandeburg.
4) Fix budget return from poll calls in dnet and alx, from Eric
Dumazet.
5) Fix bugus "if (unlikely(x) < 0)" test in AF_PACKET, from Christoph
Jaeger.
6) Fix bug introduced by conversion to list_head in TIPC retransmit
code, from Jon Paul Maloy.
7) Don't use GFP_NOIO under spinlock in USB kaweth driver, from Alexey
Khoroshilov.
8) Fix bridge build with INET disabled, from Arnd Bergmann.
9) Fix netlink array overrun for PROBE attributes in openvswitch, from
Thomas Graf.
10) Don't hold spinlock across synchronize_irq() in tg3 driver, from
Prashant Sreedharan.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
tg3: Release tp->lock before invoking synchronize_irq()
tg3: tg3_reset_task() needs to use rtnl_lock to synchronize
tg3: tg3_timer() should grab tp->lock before checking for tp->irq_sync
team: avoid possible underflow of count_pending value for notify_peers and mcast_rejoin
openvswitch: packet messages need their own probe attribtue
i40e: adds FCoE configure option
cxgb4vf: Fix queue allocation for 40G adapter
netdevice: Add missing parentheses in macro
bridge: only provide proxy ARP when CONFIG_INET is enabled
neighbour: fix base_reachable_time(_ms) not effective immediatly when changed
net: fec: fix MDIO bus assignement for dual fec SoC's
xen-netfront: use different locks for Rx and Tx stats
drivers: net: cpsw: fix multicast flush in dual emac mode
cxgb4vf: Initialize mdio_addr before using it
net: Corrected the comment describing the ndo operations to reflect the actual prototype for couple of operations
usb/kaweth: use GFP_ATOMIC under spin_lock in usb_start_wait_urb()
MAINTAINERS: add me as ibmveth maintainer
tipc: fix bug in broadcast retransmit code
update ip-sysctl.txt documentation (v2)
net/at91_ether: prepare and unprepare clock
...
User space is currently sending a OVS_FLOW_ATTR_PROBE for both flow
and packet messages. This leads to an out-of-bounds access in
ovs_packet_cmd_execute() because OVS_FLOW_ATTR_PROBE >
OVS_PACKET_ATTR_MAX.
Introduce a new OVS_PACKET_ATTR_PROBE with the same numeric value
as OVS_FLOW_ATTR_PROBE to grow the range of accepted packet attributes
while maintaining to be binary compatible with existing OVS binaries.
Fixes: 05da589 ("openvswitch: Add support for OVS_FLOW_ATTR_PROBE.")
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Tracked-down-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When IPV4 support is disabled, we cannot call arp_send from
the bridge code, which would result in a kernel link error:
net/built-in.o: In function `br_handle_frame_finish':
:(.text+0x59914): undefined reference to `arp_send'
:(.text+0x59a50): undefined reference to `arp_tbl'
This makes the newly added proxy ARP support in the bridge
code depend on the CONFIG_INET symbol and lets the compiler
optimize the code out to avoid the link error.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 958501163d ("bridge: Add support for IEEE 802.11 Proxy ARP")
Cc: Kyeyoon Park <kyeyoonp@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When setting base_reachable_time or base_reachable_time_ms on a
specific interface through sysctl or netlink, the reachable_time
value is not updated.
This means that neighbour entries will continue to be updated using the
old value until it is recomputed in neigh_period_work (which
recomputes the value every 300*HZ).
On systems with HZ equal to 1000 for instance, it means 5mins before
the change is effective.
This patch changes this behavior by recomputing reachable_time after
each set on base_reachable_time or base_reachable_time_ms.
The new value will become effective the next time the neighbour's timer
is triggered.
Changes are made in two places: the netlink code for set and the sysctl
handling code. For sysctl, I use a proc_handler. The ipv6 network
code does provide its own handler but it already refreshes
reachable_time correctly so it's not an issue.
Any other user of neighbour which provide its own handlers must
refresh reachable_time.
Signed-off-by: Jean-Francois Remy <jeff@melix.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 58dc55f256 ("tipc: use generic
SKB list APIs to manage link transmission queue") we replace all list
traversal loops with the macros skb_queue_walk() or
skb_queue_walk_safe(). While the previous loops were based on the
assumption that the list was NULL-terminated, the standard macros
stop when the iterator reaches the list head, which is non-NULL.
In the function bclink_retransmit_pkt() this macro replacement has
lead to a bug. When we receive a BCAST STATE_MSG we unconditionally
call the function bclink_retransmit_pkt(), whether there really is
anything to retransmit or not, assuming that the sequence number
comparisons will lead to the correct behavior. However, if the
transmission queue is empty, or if there are no eligible buffers in
the transmission queue, we will by mistake pass the list head pointer
to the function tipc_link_retransmit(). Since the list head is not a
valid sk_buff, this leads to a crash.
In this commit we fix this by only calling tipc_link_retransmit()
if we actually found eligible buffers in the transmission queue.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
netfilter/ipvs fixes for net
The following patchset contains netfilter/ipvs fixes, they are:
1) Small fix for the FTP helper in IPVS, a diff variable may be left
unset when CONFIG_IP_VS_IPV6 is set. Patch from Dan Carpenter.
2) Fix nf_tables port NAT in little endian archs, patch from leroy
christophe.
3) Fix race condition between conntrack confirmation and flush from
userspace. This is the second reincarnation to resolve this problem.
4) Make sure inner messages in the batch come with the nfnetlink header.
5) Relax strict check from nfnetlink_bind() that may break old userspace
applications using all 1s group mask.
6) Schedule removal of chains once no sets and rules refer to them in
the new nf_tables ruleset flush command. Reported by Asbjoern Sloth
Toennesen.
Note that this batch comes later than usual because of the short
winter holidays.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to a misplaced parenthesis, the expression
(unlikely(offset) < 0),
which expands to
(__builtin_expect(!!(offset), 0) < 0),
never evaluates to true. Therefore, when sending packets with
PF_PACKET/SOCK_DGRAM, packet_snd() does not abort as intended
if the creation of the layer 2 header fails.
Spotted by Coverity - CID 1259975 ("Operands don't affect result").
Fixes: 9c7077622d ("packet: make packet_snd fail on len smaller than l2 header")
Signed-off-by: Christoph Jaeger <cj@linux.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull two nfsd bugfixes from Bruce Fields.
* 'for-3.19' of git://linux-nfs.org/~bfields/linux:
rpc: fix xdr_truncate_encode to handle buffer ending on page boundary
nfsd: fix fi_delegees leak when fi_had_conflict returns true
Pull two Ceph fixes from Sage Weil:
"These are both pretty trivial: a sparse warning fix and size_t printk
thing"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: fix sparse endianness warnings
ceph: use %zu for len in ceph_fill_inline_data()
A struct xdr_stream at a page boundary might point to the end of one
page or the beginning of the next, but xdr_truncate_encode isn't
prepared to handle the former.
This can cause corruption of NFSv4 READDIR replies in the case that a
readdir entry that would have exceeded the client's dircount/maxcount
limit would have ended exactly on a 4k page boundary. You're more
likely to hit this case on large directories.
Other xdr_truncate_encode callers are probably also affected.
Reported-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Fixes: 3e19ce762b "rpc: xdr_truncate_encode"
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Pull networking fixes from David Miller:
"Just a pile of random fixes, including:
1) Do not apply TSO limits to non-TSO packets, fix from Herbert Xu.
2) MDI{,X} eeprom check in e100 driver is reversed, from John W.
Linville.
3) Missing error return assignments in several ethernet drivers, from
Julia Lawall.
4) Altera TSE device doesn't come back up after ifconfig down/up
sequence, fix from Kostya Belezko.
5) Add more cases to the check for whether the qmi_wwan device has a
bogus MAC address and needs to be assigned a random one. From
Kristian Evensen.
6) Fix interrupt hangs in CPSW, from Felipe Balbi.
7) Implement ndo_features_check in r8152 so that the stack doesn't
feed GSO packets which are outside of the chip's capabilities.
From Hayes Wang"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
qla3xxx: don't allow never end busy loop
xen-netback: fixing the propagation of the transmit shaper timeout
r8152: support ndo_features_check
batman-adv: fix potential TT client + orig-node memory leak
batman-adv: fix multicast counter when purging originators
batman-adv: fix counter for multicast supporting nodes
batman-adv: fix lock class for decoding hash in network-coding.c
batman-adv: fix delayed foreign originator recognition
batman-adv: fix and simplify condition when bonding should be used
Revert "mac80211: Fix accounting of the tailroom-needed counter"
net: ethernet: cpsw: fix hangs with interrupts
enic: free all rq buffs when allocation fails
qmi_wwan: Set random MAC on devices with buggy fw
openvswitch: Consistently include VLAN header in flow and port stats.
tcp: Do not apply TSO segment limit to non-TSO packets
Altera TSE: Add missing phydev
net/mlx4_core: Fix error flow in mlx4_init_hca()
net/mlx4_core: Correcly update the mtt's offset in the MR re-reg flow
qlcnic: Fix return value in qlcnic_probe()
net: axienet: fix error return code
...
Relax the checking that was introduced in 97840cb ("netfilter:
nfnetlink: fix insufficient validation in nfnetlink_bind") when the
subscription bitmask is used. Existing userspace code code may request
to listen to all of the existing netlink groups by setting an all to one
subscription group bitmask. Netlink already validates subscription via
setsockopt() for us.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Make sure there is enough room for the nfnetlink header in the
netlink messages that are part of the batch. There is a similar
check in netlink_rcv_skb().
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit 5195c14c8b ("netfilter: conntrack: fix race in
__nf_conntrack_confirm against get_next_corpse") aimed to resolve the
race condition between the confirmation (packet path) and the flush
command (from control plane). However, it introduced a crash when
several packets race to add a new conntrack, which seems easier to
reproduce when nf_queue is in place.
Fix this race, in __nf_conntrack_confirm(), by removing the CT
from unconfirmed list before checking the DYING bit. In case
race occured, re-add the CT to the dying list
This patch also changes the verdict from NF_ACCEPT to NF_DROP when
we lose race. Basically, the confirmation happens for the first packet
that we see in a flow. If you just invoked conntrack -F once (which
should be the common case), then this is likely to be the first packet
of the flow (unless you already called flush anytime soon in the past).
This should be hard to trigger, but better drop this packet, otherwise
we leave things in inconsistent state since the destination will likely
reply to this packet, but it will find no conntrack, unless the origin
retransmits.
The change of the verdict has been discussed in:
https://www.marc.info/?l=linux-netdev&m=141588039530056&w=2
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
- ensure bonding is used (if enabled) for packets coming in the soft
interface
- fix race condition to avoid orig_nodes to be deleted right after
being added
- avoid false positive lockdep splats by assigning lockclass to
the proper hashtable lock objects
- avoid miscounting of multicast 'disabled' nodes in the network
- fix memory leak in the Global Translation Table in case of
originator interval change
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJUq7PtAAoJEJgn97Bh2u9ejD0P/AqFX7KAqlzEZo7SSIwduX/e
OLUWJyplATljKAwmTM+gFxyTC6HOUhqG/0UZpUu83OA0uC5HmT2ATJXUal1vG8uN
ewfxqhsnz8UMxmjENcjsExGdhjW0CDeD+lRdxPGbbq5vEbqe9dXXr4HVAlJFGhnX
fzR65WB2om9BTYMgIWmZutRjP9VTPArgOCKtCgLclHut19VXmZDALPWZ5S6kPfUG
O9fSPi0uzr7WIy6p+5XCi69pw+rlHv23A3soXK9G4Ze+ET2QcQSgkL6Q3VJ2+uMV
wBzTahRN7sJgJ8GYBDrhXgw3yRl2FkjLNPav3+gOnbK0+PycrnGiC1YPbX8w/cpr
EHyFB/kZCeBusPaRQAYP91YI3cYJ3Onq3kjbUPVV+ASMoJ/vdVORRjDHY8ALf12w
uGp/+9aAEJYqAOZhVZKTofTgfT3ULQycd+Ffi/0LC8OLaKQeqdEcmGRrIKNJmsP5
a+gsKEMe7QCbHkWAVnKLGKthLH4IRU98gVApS1nld9OE2yrPyywrA5ac353iFB1K
Y/56RFAqU1nywv+hhArcCZymiOPxKyF+lK3VAHvnlFgMMJs5gNZzoGX7V7TODkiA
yiZCuzOGsB4hPSdzf1KfCVR3LkFfaI2jGr7X+38OdK9WGMG2mcPEB0tyXr7uFuky
V/VPJDENwQwmgfdR+D+3
=30sl
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge
Included changes:
- ensure bonding is used (if enabled) for packets coming in the soft
interface
- fix race condition to avoid orig_nodes to be deleted right after
being added
- avoid false positive lockdep splats by assigning lockclass to
the proper hashtable lock objects
- avoid miscounting of multicast 'disabled' nodes in the network
- fix memory leak in the Global Translation Table in case of
originator interval change
Signed-off-by: David S. Miller <davem@davemloft.net>
p54 and cw2100 drivers (arguably due to bad assumptions there.)
Since this affects kernels since 3.17, I decided to revert for
now and we'll revisit this optimisation properly for -next.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJUq9EvAAoJEDBSmw7B7bqr7mMQAJRbdepVVqK5IFH1BH0NC7vi
LkBE2lp8ZAz1Crg+OAQNUdlZUHtGyfYoXSfzezmrMG51i5xjHyOYQQikW7aJ2SQ0
XsjJJ5TcqKe83NwoakXUrMpE7KmCt/LnbjKNXDsZIvLlUkqa7ksXaS7btK195aXy
WlVrmUE+BqT9a16VjFLZ6wRjI43+3bGxhtFL+g1eXw6nZ4a2o4EbIXdc9SN+/bT4
tAhWJfdAQqQc34jhesWGbMIvkXWhzy2R6Js+9gMIBNsmlAiYbFa4QZ/9tI3nBI/O
yHSiDc7JnPNjkkC+3wTJxMl7mEd6fEKnAS1ryZ5L4XhPrQpV39iZuWSPvPGw6LLW
kB6+wXkIyQdCSoyrQZxY75ibqOUKYYxhhkSYfMePXRKTYY6MlHYqiH8wPWFpPoqO
iumLqx8/CtRW1q1t2EBAG6rZLRF8HqmfqtB+ptT0DWcAP8E81q8BImPoPFr+P9S2
XfuuSw97xKCcilOcYJ0uYSBe4XNNhy1dtC/zJ8cA9nV4WNkofALga5Z/t8ARhDsM
wvP1D2uIX3U9My17bXq+Xn/fSSS7yhpLZjEHj/JNRvpDCWGf/tQl6A3ydMy//Oqe
lRSKfmiAGysqhXnmK12+YhfO+4ioTz8dA88tHs1AO8qasfQwx45eRsUPemWeExiL
9Lntb0U6MhYvgiTdWqt6
=CnbJ
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2015-01-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Here's just a single fix - a revert of a patch that broke the
p54 and cw2100 drivers (arguably due to bad assumptions there.)
Since this affects kernels since 3.17, I decided to revert for
now and we'll revisit this optimisation properly for -next.
Signed-off-by: David S. Miller <davem@davemloft.net>