linux/net
Jim Schutt 1604f488ac libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed
Add libceph support for a new CRUSH tunable recently added to Ceph servers.

Consider the CRUSH rule
  step chooseleaf firstn 0 type <node_type>

This rule means that <n> replicas will be chosen in a manner such that
each chosen leaf's branch will contain a unique instance of <node_type>.

When an object is re-replicated after a leaf failure, if the CRUSH map uses
a chooseleaf rule the remapped replica ends up under the <node_type> bucket
that held the failed leaf.  This causes uneven data distribution across the
storage cluster, to the point that when all the leaves but one fail under a
particular <node_type> bucket, that remaining leaf holds all the data from
its failed peers.

This behavior also limits the number of peers that can participate in the
re-replication of the data held by the failed leaf, which increases the
time required to re-replicate after a failure.

For a chooseleaf CRUSH rule, the tree descent has two steps: call them the
inner and outer descents.

If the tree descent down to <node_type> is the outer descent, and the descent
from <node_type> down to a leaf is the inner descent, the issue is that a
down leaf is detected on the inner descent, so only the inner descent is
retried.

In order to disperse re-replicated data as widely as possible across a
storage cluster after a failure, we want to retry the outer descent. So,
fix up crush_choose() to allow the inner descent to return immediately on
choosing a failed leaf.  Wire this up as a new CRUSH tunable.

Note that after this change, for a chooseleaf rule, if the primary OSD
in a placement group has failed, choosing a replacement may result in
one of the other OSDs in the PG colliding with the new primary.  This
requires that OSD's data for that PG to need moving as well.  This
seems unavoidable but should be relatively rare.

This corresponds to ceph.git commit 88f218181a9e6d2292e2697fc93797d0f6d6e5dc.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-17 12:42:39 -06:00
..
9p net: Fix (nearly-)kernel-doc comments for various functions 2012-07-10 23:13:45 -07:00
802
8021q vlan: clean up vlan_dev_hard_start_xmit() 2012-08-14 14:33:32 -07:00
appletalk net: Fix (nearly-)kernel-doc comments for various functions 2012-07-10 23:13:45 -07:00
atm atm: fix info leak via getsockname() 2012-08-15 21:36:30 -07:00
ax25 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-07-19 11:17:30 -07:00
batman-adv batman-adv: Fix symmetry check / route flapping in multi interface setups 2012-09-23 23:12:49 +02:00
bluetooth Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem 2012-09-22 12:19:22 -04:00
bridge netfilter: log: Fix log-level processing 2012-09-12 17:17:35 +02:00
caif caif: move the dereference below the NULL test 2012-09-10 16:13:31 -04:00
can can: gw: Remove pointless casts 2012-07-10 22:36:17 +02:00
ceph libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed 2013-01-17 12:42:39 -06:00
core net: guard tcp_set_keepalive() to tcp sockets 2012-09-24 16:51:53 -04:00
dcb net: Fix non-kernel-doc comments with kernel-doc start marker 2012-07-10 23:13:45 -07:00
dccp dccp: fix info leak via getsockopt(DCCP_SOCKOPT_CCID_TX_INFO) 2012-08-15 21:36:31 -07:00
decnet ipv4: Restore old dst_free() behavior. 2012-07-31 14:41:38 -07:00
dns_resolver
dsa
ethernet ipx: move peII functions 2012-07-19 10:48:00 -07:00
ieee802154 6lowpan: Change byte order when storing/accessing to len field 2012-07-16 22:52:02 -07:00
ipv4 inetpeer: fix token initialization 2012-09-27 19:27:39 -04:00
ipv6 ipv6: mip6: fix mip6_mh_filter() 2012-09-25 16:04:44 -04:00
ipx ipx: move peII functions 2012-07-19 10:48:00 -07:00
irda irda: Fix typo in irda 2012-07-16 23:23:52 -07:00
iucv
key
l2tp l2tp: fix return value check 2012-09-27 13:18:19 -04:00
lapb
llc llc: fix info leak via getsockname() 2012-08-15 21:36:31 -07:00
mac80211 Merge branch 'for-john' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 2012-09-05 14:48:15 -04:00
mac802154 mac802154: sparse warnings: make symbols static 2012-07-12 07:54:45 -07:00
netfilter netfilter: xt_limit: have r->cost != 0 case work 2012-09-26 01:33:16 +02:00
netlabel
netlink netlink: fix possible spoofing from non-root processes 2012-08-24 13:36:09 -04:00
netrom net: change return values from -EACCES to -EPERM 2012-09-21 13:58:08 -04:00
nfc Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem 2012-07-20 12:30:48 -04:00
openvswitch openvswitch: Fix FLOW_BUFSIZE definition. 2012-09-03 19:06:27 -07:00
packet af_packet: match_fanout_group() can be static 2012-08-23 09:27:12 -07:00
phonet
rds rds: set correct msg_namelen 2012-07-23 01:01:44 -07:00
rfkill
rose
rxrpc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-07-10 23:56:33 -07:00
sched pkt_sched: fix virtual-start-time update in QFQ 2012-09-19 16:23:53 -04:00
sctp sctp: Don't charge for data in sndbuf again when transmitting packet 2012-09-03 13:24:13 -04:00
sunrpc NFS client bugfixes for Linux 3.6 2012-09-13 09:04:13 +08:00
tipc tipc: remove print_buf and deprecated log buffer code 2012-07-13 19:34:43 -04:00
unix af_netlink: force credentials passing [CVE-2012-3520] 2012-08-21 14:53:01 -07:00
wanrouter wanmain: comparing array with NULL 2012-07-24 13:55:21 -07:00
wimax
wireless cfg80211: fix possible circular lock on reg_regdb_search() 2012-09-18 20:43:23 -04:00
x25 net: Fix (nearly-)kernel-doc comments for various functions 2012-07-10 23:13:45 -07:00
xfrm xfrm_user: don't copy esn replay window twice for new states 2012-09-20 18:08:40 -04:00
compat.c net: Fix references to out-of-scope variables in put_cmsg_compat() 2012-07-22 17:50:49 -07:00
Kconfig
Makefile
nonet.c
socket.c Fix order of arguments to compat_put_time[spec|val] 2012-09-05 18:34:13 -07:00
sysctl_net.c