Commit Graph

30427 Commits

Author SHA1 Message Date
Ilya Dryomov
917edad5d1 crush: CHOOSE_LEAF -> CHOOSELEAF throughout
This aligns the internal identifier names with the user-visible names in
the decompiled crush map language.

Reflects ceph.git commit caa0e22e15e4226c3671318ba1f61314bf6da2a6.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:24 +02:00
Ilya Dryomov
cc10df4a3a crush: add SET_CHOOSE_TRIES rule step
Since we can specify the recursive retries in a rule, we may as well also
specify the non-recursive tries too for completeness.

Reflects ceph.git commit d1b97462cffccc871914859eaee562f2786abfd1.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:23 +02:00
Ilya Dryomov
f18650ace3 crush: apply chooseleaf_tries to firstn mode too
Parameterize the attempts for the _firstn choose method, and apply the
rule-specified tries count to firstn mode as well.  Note that we have
slightly different behavior here than with indep:

 If the firstn value is not specified for firstn, we pass through the
 normal attempt count.  This maintains compatibility with legacy behavior.
 Note that this is usually *not* actually N^2 work, though, because of the
 descend_once tunable.  However, descend_once is unfortunately *not* the
 same thing as 1 chooseleaf try because it is only checked on a reject but
 not on a collision.  Sigh.

 In contrast, for indep, if tries is not specified we default to 1
 recursive attempt, because that is simply more sane, and we have the
 option to do so.  The descend_once tunable has no effect for indep.

Reflects ceph.git commit 64aeded50d80942d66a5ec7b604ff2fcbf5d7b63.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:22 +02:00
Ilya Dryomov
be3226acc5 crush: new SET_CHOOSE_LEAF_TRIES command
Explicitly control the number of sample attempts, and allow the number of
tries in the recursive call to be explicitly controlled via the rule. This
is important because the amount of time we want to spend looking for a
solution may be rule dependent (e.g., higher for the wide indep pool than
the rep pools).

(We should do the same for the other tunables, by the way!)

Reflects ceph.git commit c43c893be872f709c787bc57f46c0e97876ff681.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:21 +02:00
Ilya Dryomov
4158608139 crush: pass parent r value for indep call
Pass down the parent's 'r' value so that we will sample different values in
the recursive call when the parent tries multiple times.  This avoids doing
useless work (calling multiple times and trying the same values).

Reflects ceph.git commit 2731d3030d7a3e80922b7f1b7756f9a4a124bac5.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:20 +02:00
Ilya Dryomov
ab4ce2b5bd crush: clarify numrep vs endpos
Pass numrep (the width of the result) separately from the number of results
we want *this* iteration.  This makes things less awkward when we do a
recursive call (for chooseleaf) and want only one item.

Reflects ceph.git commit 1b567ee08972f268c11b43fc881e57b5984dd08b.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:19 +02:00
Ilya Dryomov
9fe0718282 crush: strip firstn conditionals out of crush_choose, rename
Now that indep is handled by crush_choose_indep, rename crush_choose to
crush_choose_firstn and remove all the conditionals.  This ends up
stripping out *lots* of code.

Note that it *also* makes it obvious that the shenanigans we were playing
with r' for uniform buckets were broken for firstn mode.  This appears to
have happened waaaay back in commit dae8bec9 (or earlier)... 2007.

Reflects ceph.git commit 94350996cb2035850bcbece6a77a9b0394177ec9.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:18 +02:00
Ilya Dryomov
3102b0a5b4 crush: add note about r in recursive choose
Reflects ceph.git commit 4551fee9ad89d0427ed865d766d0d44004d3e3e1.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:17 +02:00
Ilya Dryomov
9a3b490a20 crush: use breadth-first search for indep mode
Reflects ceph.git commit 86e978036a4ecbac4c875e7c00f6c5bbe37282d3.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:16 +02:00
Ilya Dryomov
c6d98a603a crush: return CRUSH_ITEM_UNDEF for failed placements with indep
For firstn mode, if we fail to make a valid placement choice, we just
continue and return a short result to the caller.  For indep mode, however,
we need to make the position stable, and return an undefined value on
failed placements to avoid shifting later results to the left.

Reflects ceph.git commit b1d4dd4eb044875874a1d01c01c7d766db5d0a80.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:15 +02:00
Ilya Dryomov
e8ef19c4ad crush: eliminate CRUSH_MAX_SET result size limitation
This is only present to size the temporary scratch arrays that we put on
the stack.  Let the caller allocate them as they wish and remove the
limitation.

Reflects ceph.git commit 1cfe140bf2dab99517589a82a916f4c75b9492d1.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:14 +02:00
Ilya Dryomov
2a4ba74ef6 crush: fix some comments
Reflects ceph.git commit 3cef755428761f2481b1dd0e0fbd0464ac483fc5.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:13 +02:00
Ilya Dryomov
8f99c85b7a crush: reduce scope of some local variables
Reflects ceph.git commit e7d47827f0333c96ad43d257607fb92ed4176550.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:12 +02:00
Ilya Dryomov
bfb16d7d69 crush: factor out (trivial) crush_destroy_rule()
Reflects ceph.git commit 43a01c9973c4b83f2eaa98be87429941a227ddde.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:11 +02:00
Ilya Dryomov
b3b33b0e43 crush: pass weight vector size to map function
Pass the size of the weight vector into crush_do_rule() to ensure that we
don't access values past the end.  This can happen if the caller misbehaves
and passes a weight vector that is smaller than max_devices.

Currently the monitor tries to prevent that from happening, but this will
gracefully tolerate previous bad osdmaps that got into this state.  It's
also a bit more defensive.

Reflects ceph.git commit 5922e2c2b8335b5e46c9504349c3a55b7434c01a.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:10 +02:00
Ilya Dryomov
2b3e0c905a libceph: update ceph_features.h
This updates ceph_features.h so that it has all feature bits defined in
ceph.git.  In the interim since the last update, ceph.git crossed the
"32 feature bits" point, and, the addition of the 33rd bit wasn't
handled correctly.  The work-around is squashed into this commit and
reflects ceph.git commit 053659d05e0349053ef703b414f44965f368b9f0.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:09 +02:00
Ilya Dryomov
12b4629a9f libceph: all features fields must be u64
In preparation for ceph_features.h update, change all features fields
from unsigned int/u32 to u64.  (ceph.git has ~40 feature bits at this
point.)

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:08 +02:00
Josh Durgin
9a1ea2dbff libceph: resend all writes after the osdmap loses the full flag
With the current full handling, there is a race between osds and
clients getting the first map marked full. If the osd wins, it will
return -ENOSPC to any writes, but the client may already have writes
in flight. This results in the client getting the error and
propagating it up the stack. For rbd, the block layer turns this into
EIO, which can cause corruption in filesystems above it.

To avoid this race, osds are being changed to drop writes that came
from clients with an osdmap older than the last osdmap marked full.
In order for this to work, clients must resend all writes after they
encounter a full -> not full transition in the osdmap. osds will wait
for an updated map instead of processing a request from a client with
a newer map, so resent writes will not be dropped by the osd unless
there is another not full -> full transition.

This approach requires both osds and clients to be fixed to avoid the
race. Old clients talking to osds with this fix may hang instead of
returning EIO and potentially corrupting an fs. New clients talking to
old osds have the same behavior as before if they encounter this race.

Fixes: http://tracker.ceph.com/issues/6938

Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-13 23:04:28 +02:00
Josh Durgin
d29adb34a9 libceph: block I/O when PAUSE or FULL osd map flags are set
The PAUSEWR and PAUSERD flags are meant to stop the cluster from
processing writes and reads, respectively. The FULL flag is set when
the cluster determines that it is out of space, and will no longer
process writes.  PAUSEWR and PAUSERD are purely client-side settings
already implemented in userspace clients. The osd does nothing special
with these flags.

When the FULL flag is set, however, the osd responds to all writes
with -ENOSPC. For cephfs, this makes sense, but for rbd the block
layer translates this into EIO.  If a cluster goes from full to
non-full quickly, a filesystem on top of rbd will not behave well,
since some writes succeed while others get EIO.

Fix this by blocking any writes when the FULL flag is set in the osd
client. This is the same strategy used by userspace, so apply it by
default.  A follow-on patch makes this configurable.

__map_request() is called to re-target osd requests in case the
available osds changed.  Add a paused field to a ceph_osd_request, and
set it whenever an appropriate osd map flag is set.  Avoid queueing
paused requests in __map_request(), but force them to be resent if
they become unpaused.

Also subscribe to the next osd map from the monitor if any of these
flags are set, so paused requests can be unblocked as soon as
possible.

Fixes: http://tracker.ceph.com/issues/6079

Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-13 09:13:29 -08:00
Li Wang
37c89bde5d ceph: Add necessary clean up if invalid reply received in handle_reply()
Wake up possible waiters, invoke the call back if any, unregister the request

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-13 09:13:29 -08:00
Linus Torvalds
29be6345bb NFS client bugfixes
- Stable fix for a NFSv4.1 delegation and state recovery deadlock
 - Stable fix for a loop on irrecoverable errors when returning delegations
 - Fix a 3-way deadlock between layoutreturn, open, and state recovery
 - Update the MAINTAINERS file with contact information for Trond Myklebust
 - Close needs to handle NFS4ERR_ADMIN_REVOKED
 - Enabling v4.2 should not recompile nfsd and lockd
 - Fix a couple of compile warnings
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIcBAABAgAGBQJSoLTpAAoJEGcL54qWCgDy2dgQAIKkKAXccg3OG2b1SxJmiaja
 PcrovNmgg3HvYQ7clUMqtrMByiXEpSybl6tAeXYUWE3sS1DISSBVEwO3MoOiASiM
 951Ssx+CoyhsHYo5aH83sUIiWFl/YsRhpKmSr2cdQd13DQTFbPq896k64Inf6L2/
 9fngoqOD7FunQHn8AiVPoDOQzObB0OuKhYCwuwLt47oPiwgmm12JQNCDxU1i4sxb
 lkGUBLkPMs6D5IyI8XHaMyX3+8MvmPiIsjIKaNJRdhkuX/k7ollucTJXyvyEQKK0
 PhBIWyUULmKcAXYwCfHf9UoyGZFvmj47YggyKcBd26OZUEFekcWrULfym46F1xak
 EcO6D4mlTy5i5W0RBqYCj1oGud57rixZBmhLTbeq6sSJaiqBfGEs225Q17H7rsEB
 YIghHiEFNnBmVWELhHxbJHQoY6HOugmZOuc0dxopaikN/7to8gnYoVyTIVlMfe/t
 UNXZoer6GOOohJGtZ7s7v4Al7EzvwnVnBCBklEAKFJ7Ca2LEmq+b58oQW3nJ1mPn
 y4TnihxYXsSEbqy+Lds9rumRhJLG1oVTpwficAm7N3HdK3abzCIPEt6iOHoCmXQz
 J1B4gmwOKsDqVlCSpBsnc3ZiBlSJGOn6MmVQUCNFpzv/DetWn/BxEUPE8cNm8DaI
 WioD0grC0/9bR8oD1m+w
 =UZ51
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 - Stable fix for a NFSv4.1 delegation and state recovery deadlock
 - Stable fix for a loop on irrecoverable errors when returning
   delegations
 - Fix a 3-way deadlock between layoutreturn, open, and state recovery
 - Update the MAINTAINERS file with contact information for Trond
   Myklebust
 - Close needs to handle NFS4ERR_ADMIN_REVOKED
 - Enabling v4.2 should not recompile nfsd and lockd
 - Fix a couple of compile warnings

* tag 'nfs-for-3.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: fix do_div() warning by instead using sector_div()
  MAINTAINERS: Update contact information for Trond Myklebust
  NFSv4.1: Prevent a 3-way deadlock between layoutreturn, open and state recovery
  SUNRPC: do not fail gss proc NULL calls with EACCES
  NFSv4: close needs to handle NFS4ERR_ADMIN_REVOKED
  NFSv4: Update list of irrecoverable errors on DELEGRETURN
  NFSv4 wait on recovery for async session errors
  NFS: Fix a warning in nfs_setsecurity
  NFS: Enabling v4.2 should not recompile nfsd and lockd
2013-12-05 13:05:48 -08:00
Linus Torvalds
5fc92de3c7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking updates from David Miller:
 "Here is a pile of bug fixes that accumulated while I was in Europe"

 1) In fixing kernel leaks to userspace during copying of socket
    addresses, we broke a case that used to work, namely the user
    providing a buffer larger than the in-kernel generic socket address
    structure.  This broke Ruby amongst other things.  Fix from Dan
    Carpenter.

 2) Fix regression added by byte queue limit support in 8139cp driver,
    from Yang Yingliang.

 3) The addition of MSG_SENDPAGE_NOTLAST buggered up a few sendpage
    implementations, they should just treat it the same as MSG_MORE.
    Fix from Richard Weinberger and Shawn Landden.

 4) Handle icmpv4 errors received on ipv6 SIT tunnels correctly, from
    Oussama Ghorbel.  In particular we should send an ICMPv6 unreachable
    in such situations.

 5) Fix some regressions in the recent genetlink fixes, in particular
    get the pmcraid driver to use the new safer interfaces correctly.
    From Johannes Berg.

 6) macvtap was converted to use a per-cpu set of statistics, but some
    code was still bumping tx_dropped elsewhere.  From Jason Wang.

 7) Fix build failure of xen-netback due to missing include on some
    architectures, from Andy Whitecroft.

 8) macvtap double counts received packets in statistics, fix from Vlad
    Yasevich.

 9) Fix various cases of using *_STATS_BH() when *_STATS() is more
    appropriate.  From Eric Dumazet and Hannes Frederic Sowa.

10) Pktgen ipsec mode doesn't update the ipv4 header length and checksum
    properly after encapsulation.  Fix from Fan Du.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
  net/mlx4_en: Remove selftest TX queues empty condition
  {pktgen, xfrm} Update IPv4 header total len and checksum after tranformation
  virtio_net: make all RX paths handle erors consistently
  virtio_net: fix error handling for mergeable buffers
  virtio_net: Fixed a trivial typo (fitler --> filter)
  netem: fix gemodel loss generator
  netem: fix loss 4 state model
  netem: missing break in ge loss generator
  net/hsr: Support iproute print_opt ('ip -details ...')
  net/hsr: Very small fix of comment style.
  MAINTAINERS: Added net/hsr/ maintainer
  ipv6: fix possible seqlock deadlock in ip6_finish_output2
  ixgbe: Make ixgbe_identify_qsfp_module_generic static
  ixgbe: turn NETIF_F_HW_L2FW_DOFFLOAD off by default
  ixgbe: ixgbe_fwd_ring_down needs to be static
  e1000: fix possible reset_task running after adapter down
  e1000: fix lockdep warning in e1000_reset_task
  e1000: prevent oops when adapter is being closed and reset simultaneously
  igb: Fixed Wake On LAN support
  inet: fix possible seqlock deadlocks
  ...
2013-12-02 10:09:07 -08:00
fan.du
3868204d6b {pktgen, xfrm} Update IPv4 header total len and checksum after tranformation
commit a553e4a631 ("[PKTGEN]: IPSEC support")
tried to support IPsec ESP transport transformation for pktgen, but acctually
this doesn't work at all for two reasons(The orignal transformed packet has
bad IPv4 checksum value, as well as wrong auth value, reported by wireshark)

- After transpormation, IPv4 header total length needs update,
  because encrypted payload's length is NOT same as that of plain text.

- After transformation, IPv4 checksum needs re-caculate because of payload
  has been changed.

With this patch, armmed pktgen with below cofiguration, Wireshark is able to
decrypted ESP packet generated by pktgen without any IPv4 checksum error or
auth value error.

pgset "flag IPSEC"
pgset "flows 1"

Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-01 20:33:52 -05:00
stephen hemminger
eff7979f00 netem: fix gemodel loss generator
Patch from developers of the alternative loss models, downloaded from:
   http://netgroup.uniroma2.it/twiki/bin/view.cgi/Main/NetemCLG

 "in case 2, of the switch we change the direction of the inequality to
  net_random()>clg->a3, because clg->a3 is h in the GE model and when h
  is 0 all packets will be lost."

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-30 12:49:29 -05:00
stephen hemminger
ab6c27be81 netem: fix loss 4 state model
Patch from developers of the alternative loss models, downloaded from:
   http://netgroup.uniroma2.it/twiki/bin/view.cgi/Main/NetemCLG

 "In the case 1 of the switch statement in the if conditions we
   need to add clg->a4 to clg->a1, according to the model."

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-30 12:49:28 -05:00
stephen hemminger
7c2781fa92 netem: missing break in ge loss generator
There is a missing break statement in the Gilbert Elliot loss model
generator which makes state machine behave incorrectly.

Reported-by: Martin Burri <martin.burri@ch.abb.com
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-30 12:49:28 -05:00
Arvid Brodin
98bf836222 net/hsr: Support iproute print_opt ('ip -details ...')
This implements the rtnl_link_ops fill_info routine for HSR.

Signed-off-by: Arvid Brodin <arvid.brodin@alten.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-30 12:48:14 -05:00
Arvid Brodin
213e3bc723 net/hsr: Very small fix of comment style.
Signed-off-by: Arvid Brodin <arvid.brodin@alten.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-30 12:48:13 -05:00
Hannes Frederic Sowa
7f88c6b23a ipv6: fix possible seqlock deadlock in ip6_finish_output2
IPv6 stats are 64 bits and thus are protected with a seqlock. By not
disabling bottom-half we could deadlock here if we don't disable bh and
a softirq reentrantly updates the same mib.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-30 12:48:13 -05:00
Eric Dumazet
f1d8cba61c inet: fix possible seqlock deadlocks
In commit c9e9042994 ("ipv4: fix possible seqlock deadlock") I left
another places where IP_INC_STATS_BH() were improperly used.

udp_sendmsg(), ping_v4_sendmsg() and tcp_v4_connect() are called from
process context, not from softirq context.

This was detected by lockdep seqlock support.

Reported-by: jongman heo <jongman.heo@samsung.com>
Fixes: 584bdf8cbd ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP")
Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-29 16:37:36 -05:00
Shawn Landden
d3f7d56a7a net: update consumers of MSG_MORE to recognize MSG_SENDPAGE_NOTLAST
Commit 35f9c09fe (tcp: tcp_sendpages() should call tcp_push() once)
added an internal flag MSG_SENDPAGE_NOTLAST, similar to
MSG_MORE.

algif_hash, algif_skcipher, and udp used MSG_MORE from tcp_sendpages()
and need to see the new flag as identical to MSG_MORE.

This fixes sendfile() on AF_ALG.

v3: also fix udp

Cc: Tom Herbert <therbert@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org> # 3.4.x + 3.2.x
Reported-and-tested-by: Shawn Landden <shawnlandden@gmail.com>
Original-patch: Richard Weinberger <richard@nod.at>
Signed-off-by: Shawn Landden <shawn@churchofgit.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-29 16:32:54 -05:00
Dan Carpenter
db31c55a6f net: clamp ->msg_namelen instead of returning an error
If kmsg->msg_namelen > sizeof(struct sockaddr_storage) then in the
original code that would lead to memory corruption in the kernel if you
had audit configured.  If you didn't have audit configured it was
harmless.

There are some programs such as beta versions of Ruby which use too
large of a buffer and returning an error code breaks them.  We should
clamp the ->msg_namelen value instead.

Fixes: 1661bf364a ("net: heap overflow in __audit_sockaddr()")
Reported-by: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Tested-by: Eric Wong <normalperson@yhbt.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-29 16:12:52 -05:00
Veaceslav Falico
ec6f809ff6 af_packet: block BH in prb_shutdown_retire_blk_timer()
Currently we're using plain spin_lock() in prb_shutdown_retire_blk_timer(),
however the timer might fire right in the middle and thus try to re-aquire
the same spinlock, leaving us in a endless loop.

To fix that, use the spin_lock_bh() to block it.

Fixes: f6fb8f100b ("af-packet: TPACKET_V3 flexible buffer implementation.")
CC: "David S. Miller" <davem@davemloft.net>
CC: Daniel Borkmann <dborkman@redhat.com>
CC: Willem de Bruijn <willemb@google.com>
CC: Phil Sutter <phil@nwl.cc>
CC: Eric Dumazet <edumazet@google.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-29 16:11:08 -05:00
Baker Zhang
39af0c409e net: remove outdated comment for ipv4 and ipv6 protocol handler
since f9242b6b28
inet: Sanitize inet{,6} protocol demux.

there are not pretended hash tables for ipv4 or
ipv6 protocol handler.

Signed-off-by: Baker Zhang <Baker.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-28 18:47:51 -05:00
Gao feng
66028310ae sit: use kfree_skb to replace dev_kfree_skb
In failure case, we should use kfree_skb not
dev_kfree_skb to free skbuff, dev_kfree_skb
is defined as consume_skb.

Trace takes advantage of this point.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-28 18:34:13 -05:00
Xufeng Zhang
6eabca54d6 sctp: Restore 'resent' bit to avoid retransmitted chunks for RTT measurements
Currently retransmitted DATA chunks could also be used for
RTT measurements since there are no flag to identify whether
the transmitted DATA chunk is a new one or a retransmitted one.
This problem is introduced by commit ae19c5486 ("sctp: remove
'resent' bit from the chunk") which inappropriately removed the
'resent' bit completely, instead of doing this, we should set
the resent bit only for the retransmitted DATA chunks.

Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-28 18:29:58 -05:00
Johannes Berg
5e53e689b7 genetlink/pmcraid: use proper genetlink multicast API
The pmcraid driver is abusing the genetlink API and is using its
family ID as the multicast group ID, which is invalid and may
belong to somebody else (and likely will.)

Make it use the correct API, but since this may already be used
as-is by userspace, reserve a family ID for this code and also
reserve that group ID to not break userspace assumptions.

My previous patch broke event delivery in the driver as I missed
that it wasn't using the right API and forgot to update it later
in my series.

While changing this, I noticed that the genetlink code could use
the static group ID instead of a strcmp(), so also do that for
the VFS_DQUOT family.

Cc: Anil Ravindranath <anil_ravindranath@pmc-sierra.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-28 18:26:30 -05:00
Geert Uytterhoeven
0f0e2159c0 genetlink: Fix uninitialized variable in genl_validate_assign_mc_groups()
net/netlink/genetlink.c: In function ‘genl_validate_assign_mc_groups’:
net/netlink/genetlink.c:217: warning: ‘err’ may be used uninitialized in this
function

Commit 2a94fe48f3 ("genetlink: make multicast
groups const, prevent abuse") split genl_register_mc_group() in multiple
functions, but dropped the initialization of err.

Initialize err to zero to fix this.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-28 18:24:07 -05:00
Andy Adamson
c297c8b99b SUNRPC: do not fail gss proc NULL calls with EACCES
Otherwise RPCSEC_GSS_DESTROY messages are not sent.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-11-26 11:41:23 -05:00
Eric Dumazet
4d0820cf6a sch_tbf: handle too small burst
If a too small burst is inadvertently set on TBF, we might trigger
a bug in tbf_segment(), as 'skb' instead of 'segs' was used in a
qdisc_reshape_fail() call.

tc qdisc add dev eth0 root handle 1: tbf latency 50ms burst 1KB rate
50mbit

Fix the bug, and add a warning, as such configuration is not
going to work anyway for non GSO packets.

(For some reason, one has to use a burst >= 1520 to get a working
configuration, even with old kernels. This is a probable iproute2/tc
bug)

Based on a report and initial patch from Yang Yingliang

Fixes: e43ac79a4b ("sch_tbf: segment too big GSO packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-23 14:46:25 -08:00
Hannes Frederic Sowa
1fa4c710b6 ipv6: fix leaking uninitialized port number of offender sockaddr
Offenders don't have port numbers, so set it to 0.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-23 14:46:23 -08:00
Hannes Frederic Sowa
85fbaa7503 inet: fix addr_len/msg->msg_namelen assignment in recv_error and rxpmtu functions
Commit bceaa90240 ("inet: prevent leakage
of uninitialized memory to user in recv syscalls") conditionally updated
addr_len if the msg_name is written to. The recv_error and rxpmtu
functions relied on the recvmsg functions to set up addr_len before.

As this does not happen any more we have to pass addr_len to those
functions as well and set it to the size of the corresponding sockaddr
length.

This broke traceroute and such.

Fixes: bceaa90240 ("inet: prevent leakage of uninitialized memory to user in recv syscalls")
Reported-by: Brad Spengler <spender@grsecurity.net>
Reported-by: Tom Labanowski
Cc: mpb <mpb.mail@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-23 14:46:23 -08:00
Oussama Ghorbel
ca15a078bd sit: generate icmpv6 error when receiving icmpv4 error
Send icmpv6 error with type "destination unreachable" and code
"address unreachable" when receiving icmpv4 error and sufficient
data bytes are available
This patch enhances the compliance of sit tunnel with section 3.4 of
rfc 4213

Signed-off-by: Oussama Ghorbel <ghorbel@pivasoftware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-23 14:46:22 -08:00
Gao feng
fb10f802b0 tcp_memcg: remove useless var old_lim
nobody needs it. remove.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-23 14:46:21 -08:00
Herbert Xu
b8ee93ba80 gro: Clean up tcpX_gro_receive checksum verification
This patch simplifies the checksum verification in tcpX_gro_receive
by reusing the CHECKSUM_COMPLETE code for CHECKSUM_NONE.  All it
does for CHECKSUM_NONE is compute the partial checksum and then
treat it as if it came from the hardware (CHECKSUM_COMPLETE).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-23 14:46:19 -08:00
Herbert Xu
cc5c00bbb4 gro: Only verify TCP checksums for candidates
In some cases we may receive IP packets that are longer than
their stated lengths.  Such packets are never merged in GRO.
However, we may end up computing their checksums incorrectly
and end up allowing packets with a bogus checksum enter our
stack with the checksum status set as verified.

Since such packets are rare and not performance-critical, this
patch simply skips the checksum verification for them.

Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>

Thanks,
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-23 14:46:19 -08:00
Chang Xiangzhong
d6c4161485 net: sctp: find the correct highest_new_tsn in sack
Function sctp_check_transmitted(transport t, ...) would iterate all of
transport->transmitted queue and looking for the highest __newly__ acked tsn.
The original algorithm would depend on the order of the assoc->transport_list
(in function sctp_outq_sack line 1215 - 1226). The result might not be the
expected due to the order of the tranport_list.

Solution: checking if the exising is smaller than the new one before assigning

Signed-off-by: Chang Xiangzhong <changxiangzhong@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-23 14:46:18 -08:00
Linus Torvalds
d2c2ad54c4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix memory leaks and other issues in mwifiex driver, from Amitkumar
    Karwar.

 2) skb_segment() can choke on packets using frag lists, fix from
    Herbert Xu with help from Eric Dumazet and others.

 3) IPv4 output cached route instantiation properly handles races
    involving two threads trying to install the same route, but we
    forgot to propagate this logic to input routes as well.  Fix from
    Alexei Starovoitov.

 4) Put protections in place to make sure that recvmsg() paths never
    accidently copy uninitialized memory back into userspace and also
    make sure that we never try to use more that sockaddr_storage for
    building the on-kernel-stack copy of a sockaddr.  Fixes from Hannes
    Frederic Sowa.

 5) R8152 driver transmit flow bug fixes from Hayes Wang.

 6) Fix some minor fallouts from genetlink changes, from Johannes Berg
    and Michael Opdenacker.

 7) AF_PACKET sendmsg path can race with netdevice unregister notifier,
    fix by using RCU to make sure the network device doesn't go away
    from under us.  Fix from Daniel Borkmann.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
  gso: handle new frag_list of frags GRO packets
  genetlink: fix genl_set_err() group ID
  genetlink: fix genlmsg_multicast() bug
  packet: fix use after free race in send path when dev is released
  xen-netback: stop the VIF thread before unbinding IRQs
  wimax: remove dead code
  net/phy: Add the autocross feature for forced links on VSC82x4
  net/phy: Add VSC8662 support
  net/phy: Add VSC8574 support
  net/phy: Add VSC8234 support
  net: add BUG_ON if kernel advertises msg_namelen > sizeof(struct sockaddr_storage)
  net: rework recvmsg handler msg_name and msg_namelen logic
  bridge: flush br's address entry in fdb when remove the
  net: core: Always propagate flag changes to interfaces
  ipv4: fix race in concurrent ip_route_input_slow()
  r8152: fix incorrect type in assignment
  r8152: support stopping/waking tx queue
  r8152: modify the tx flow
  r8152: fix tx/rx memory overflow
  netfilter: ebt_ip6: fix source and destination matching
  ...
2013-11-22 09:57:35 -08:00
Yuanhan Liu
044c8d4b15 kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS cleanly
Remove CONFIG_USE_GENERIC_SMP_HELPERS left by commit 0a06ff068f
("kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS").

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-21 16:42:27 -08:00
Herbert Xu
9d8506cc2d gso: handle new frag_list of frags GRO packets
Recently GRO started generating packets with frag_lists of frags.
This was not handled by GSO, thus leading to a crash.

Thankfully these packets are of a regular form and are easy to
handle.  This patch handles them in two ways.  For completely
non-linear frag_list entries, we simply continue to iterate over
the frag_list frags once we exhaust the normal frags.  For frag_list
entries with linear parts, we call pskb_trim on the first part
of the frag_list skb, and then process the rest of the frags in
the usual way.

This patch also kills a chunk of dead frag_list code that has
obviously never ever been run since it ends up generating a bogus
GSO-segmented packet with a frag_list entry.

Future work is planned to split super big packets into TSO
ones.

Fixes: 8a29111c7c ("net: gro: allow to build full sized skb")
Reported-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Reported-by: Jerry Chu <hkchu@google.com>
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Tested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-21 14:11:50 -05:00