This adds support for a configuring the minimum number of links that
must be active before asserting carrier. It is similar to the Cisco
EtherChannel min-links feature. This allows setting the minimum number
of member ports that must be up (link-up state) before marking the
bond device as up (carrier on). This is useful for situations where
higher level services such as clustering want to ensure a minimum
number of low bandwidth links are active before switchover.
See:
http://bugzilla.vyatta.com/show_bug.cgi?id=7196
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are enough instances of this:
iph->frag_off & htons(IP_MF | IP_OFFSET)
that a helper function is probably warranted.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise we will not see the name of the slave dev in error
message:
[ 388.469446] (null): doesn't support polling, aborting.
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter found that there was a dereference before a check,
added in 56d00c677de0(bonding:delete lacp_fast from ad_bond_info).
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@conan.davemloft.net>
1) the setting of NETIF_F_VLAN_CHALLENGED in bond_del_vlan() is
useless since commit b2a103e6 because bond_fix_features() now
sets NETIF_F_VLAN_CHALLENGED whenever the last slave is being
removed.
2) the code never triggers anyway as vlan_list is never empty
since ad1afb00.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now all received packets are handled by bond_handle_frame,
and arp_mon_pt isn't used any more.
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we use agg_select_timer and ad_work.
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bond_params->ad_select and ad_bond_info->agg_select_mode have the same
meaning, they are duplicate and need extra synchronization.
__get_agg_selection_mode() get ad_select from bond_params directly.
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These is also a bug, that if you modify lacp_rate via sysfs,
and add new slaves in bonding, new slaves won't use the latest lacp_rate,
since ad_bond_info->lacp_fast is initialized only once,
in bond_3ad_initialize().
Since both struct bond_params and ad_bond_info have lacp_fast,
they are duplicate and need extra synchronization.
bond_3ad_bind_slave() can use bond_params->lacp_fast to initialize port.
So we can just remove lacp_fast from struct ad_bond_info.
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is bug that when you modify lacp_rate via sysfs,
802.3ad won't use the new value of lacp_rate to transmit packets.
This is because port->actor_oper_port_state isn't changed.
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bonding driver is multiqueue enabled, in which each queue represents a slave
to enable optional steering of output frames to given slaves against the default
output policy. However, it needs to reset the skb->queue_mapping prior to
queuing to the physical device or the physical slave (if it is multiqueue) could
wind up transmitting on an unintended tx queue
Change Notes:
v2) Based on first pass review, updated the patch to restore the origional queue
mapping that was found in bond_select_queue, rather than simply resetting to
zero. This preserves the value of queue_mapping when it was set on receive in
the forwarding case which is desireable.
v3) Fixed spelling an casting error in skb->cb
v4) fixed to store raw queue_mapping to avoid double decrement
v5) Eric D requested that ->cb access be wrapped in a macro.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to check for 10, 100, 1000, 10000 explicitly. Just make this
generic and check for invalid values only (similar check is in ethtool
userspace app). This enables correct speed handling for slave devices
with "nonstandard" speeds.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Weiping Pan noticed that the module option description for
xmit_hash_policy was incorrect and was nice enough to post a patch to
fix it. The text was correct, but created a line over 80 characters and
I would rather not add those. I realized I could take a few minutes and
clean up all the descriptions and things would look much better. This
is the result.
Based on patch from Weiping Pan <panweiping3@gmail.com>.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
CC: Weiping Pan <panweiping3@gmail.com>
Reviewed-by: Weiping Pan <panweiping3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Improves the documentation about how IGMP resend parameter
works, fix two missing checks and coding style issues.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Acked-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This soft lockup was recently reported:
[root@dell-per715-01 ~]# echo +bond5 > /sys/class/net/bonding_masters
[root@dell-per715-01 ~]# echo +eth1 > /sys/class/net/bond5/bonding/slaves
bonding: bond5: doing slave updates when interface is down.
bonding bond5: master_dev is not up in bond_enslave
[root@dell-per715-01 ~]# echo -eth1 > /sys/class/net/bond5/bonding/slaves
bonding: bond5: doing slave updates when interface is down.
BUG: soft lockup - CPU#12 stuck for 60s! [bash:6444]
CPU 12:
Modules linked in: bonding autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc
be2d
Pid: 6444, comm: bash Not tainted 2.6.18-262.el5 #1
RIP: 0010:[<ffffffff80064bf0>] [<ffffffff80064bf0>]
.text.lock.spinlock+0x26/00
RSP: 0018:ffff810113167da8 EFLAGS: 00000286
RAX: ffff810113167fd8 RBX: ffff810123a47800 RCX: 0000000000ff1025
RDX: 0000000000000000 RSI: ffff810123a47800 RDI: ffff81021b57f6f8
RBP: ffff81021b57f500 R08: 0000000000000000 R09: 000000000000000c
R10: 00000000ffffffff R11: ffff81011d41c000 R12: ffff81021b57f000
R13: 0000000000000000 R14: 0000000000000282 R15: 0000000000000282
FS: 00002b3b41ef3f50(0000) GS:ffff810123b27940(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b3b456dd000 CR3: 000000031fc60000 CR4: 00000000000006e0
Call Trace:
[<ffffffff80064af9>] _spin_lock_bh+0x9/0x14
[<ffffffff886937d7>] :bonding:tlb_clear_slave+0x22/0xa1
[<ffffffff8869423c>] :bonding:bond_alb_deinit_slave+0xba/0xf0
[<ffffffff8868dda6>] :bonding:bond_release+0x1b4/0x450
[<ffffffff8006457b>] __down_write_nested+0x12/0x92
[<ffffffff88696ae4>] :bonding:bonding_store_slaves+0x25c/0x2f7
[<ffffffff801106f7>] sysfs_write_file+0xb9/0xe8
[<ffffffff80016b87>] vfs_write+0xce/0x174
[<ffffffff80017450>] sys_write+0x45/0x6e
[<ffffffff8005d28d>] tracesys+0xd5/0xe0
It occurs because we are able to change the slave configuarion of a bond while
the bond interface is down. The bonding driver initializes some data structures
only after its ndo_open routine is called. Among them is the initalization of
the alb tx and rx hash locks. So if we add or remove a slave without first
opening the bond master device, we run the risk of trying to lock/unlock a
spinlock that has garbage for data in it, which results in our above softlock.
Note that sometimes this works, because in many cases an unlocked spinlock has
the raw_lock parameter initialized to zero (meaning that the kzalloc of the
net_device private data is equivalent to calling spin_lock_init), but thats not
true in all cases, and we aren't guaranteed that condition, so we need to pass
the relevant spinlocks through the spin_lock_init function.
Fix it by moving the spin_lock_init calls for the tx and rx hashtable locks to
the ndo_init path, so they are ready for use by the bond_store_slaves path.
Change notes:
v2) Based on conversation with Jay and Nicolas it seems that the ability to
enslave devices while the bond master is down should be safe to do. As such
this is an outlier bug, and so instead we'll just initalize the errant spinlocks
in the init path rather than the open path, solving the problem. We'll also
remove the warnings about the bond being down during enslave operations, since
it should be safe
v3) Fix spelling error
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: jtluka@redhat.com
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: nicolas.2p.debian@gmail.com
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
s/NETDEV_BONDING_DESLAVE/NETDEV_RELEASE/ as Andy suggested.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Neil Horman <nhorman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
V3: rename NETDEV_ENSLAVE to NETDEV_JOIN
Currently we do nothing when we enslave a net device which is running netconsole.
Neil pointed out that we may get weird results in such case, so let's disable
netpoll on the device being enslaved. I think it is too harsh to prevent
the device being ensalved if it is running netconsole.
By the way, this patch also removes the NETDEV_GOING_DOWN from netconsole
netdev notifier, because netpoll will check if the device is running or not
and we don't handle NETDEV_PRE_UP neither.
This patch is based on net-next-2.6.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Neil Horman <nhorman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With some combinations of arch/compiler (e.g. arm-linux-gcc) the sizeof
operator on structure returns value greater than expected. In cases when the
structure is used for mapping PDU fields it may lead to unexpected results
(such as holes and alignment problems in skb data). __packed prevents this
undesired behavior.
Signed-off-by: Vitalii Demianets <vitas@nppfactor.kiev.ua>
Signed-off-by: David S. Miller <davem@davemloft.net>
This should also fix updating of vlan_features and propagating changes to
VLAN devices on the bond.
Side effect: it allows user to force-disable some offloads on the bond
interface.
Note: NETIF_F_VLAN_CHALLENGED is managed by bond_fix_features() now.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull read_lock(&bond->lock) and BOND_IS_OK() to bond_start_xmit() from
mode-dependent xmit functions.
netif_running() is always true in hard_start_xmit.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Force dev_alloc_name() to be called from register_netdevice() by
dev_get_valid_name(). That allows to remove multiple explicit
dev_alloc_name() calls.
The possibility to call dev_alloc_name in advance remains.
This also fixes veth creation regresion caused by
84c49d8c3e
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For backward compatibility, we should retain the module parameters and
sysfs attributes to control the number of peer notifications
(gratuitous ARPs and unsolicited NAs) sent after bonding failover.
Also, it is possible for failover to take place even though the new
active slave does not have link up, and in that case the peer
notification should be deferred until it does.
Change ipv4 and ipv6 so they do not automatically send peer
notifications on bonding failover.
Change the bonding driver to send separate NETDEV_NOTIFY_PEERS
notifications when the link is up, as many times as requested. Since
it does not directly control which protocols send notifications, make
num_grat_arp and num_unsol_na aliases for a single parameter. Bump
the bonding version number and update its documentation.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Acked-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Resolved logic conflicts causing a build failure due to
drivers/net/r8169.c changes using a patch from Stephen Rothwell.
Signed-off-by: David S. Miller <davem@davemloft.net>
Since now when bonding uses rx_handler, all traffic going into bond
device goes thru bond_handle_frame. So there's no need to go back into
bonding code later via ptype handlers. This patch converts
original ptype handlers into "bonding receive probes". These functions
are called from bond_handle_frame and they are registered per-mode.
Note that vlan packets are also handled because they are always untagged
thanks to vlan_untag()
Note that this also allows arpmon for eth-bond-bridge-vlan topology.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The slave member of struct aggregator does not necessarily point
to a slave which is part of the aggregator. It points to the
slave structure containing the aggregator structure, while
completely different slaves (or no slaves at all) may be part of
the aggregator.
The agg_device_up() function wrongly uses agg->slave to find the state
of the aggregator. Use agg->lag_ports->slave instead. The bug has
been introduced by commit 4cd6fe1c64
("bonding: fix link down handling in 802.3ad mode").
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is undesirable for the bonding driver to be poking into higher
level protocols, and notifiers provide a way to avoid that. This does
mean removing the ability to configure reptitition of gratuitous ARPs
and unsolicited NAs.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This updates the bonding driver to support v2.6.27-rc3 enhancements
(b11f8d8c aka. "ethtool: Expand ethtool_cmd.speed to 32 bits") which
allow to encode the Mbps link speed on 32-bits (Max 4 Pbps) instead of
16 (Max 65536 Mbps).
This patch also attempts to compact struct slave by reordering its
fields.
Signed-off-by: David Decotigny <decot@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The __get_link_speed() function returns a u16 value which was stored
in a u32 local variable. This patch uses the return value directly,
thus fixing that minor type consistency.
The 'duplex' field in struct slave being encoded on 8 bits, to be more
consistent we use a u8 integer (instead of u16) whenever we copy it to
local variables.
Signed-off-by: David Decotigny <decot@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This gets rid of minor sparse complaints:
drivers/net/bonding/bond_main.c:4361:4: warning: do-while statement is not a compound statement
drivers/net/bonding/bond_main.c:243:12: warning: symbol 'bond_mode_name' was not declared. Should it be static?
Signed-off-by: David Decotigny <decot@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
replace relpy with reply.
replace premanent with permanent.
Signed-off-by: Weiping Pan(潘卫平) <panweiping3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
replace tranmitted with transmitted.
replace tranmitting with transmitting.
Signed-off-by: Weiping Pan(潘卫平) <panweiping3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now, alb_bond_info uses rx_ntt,rlb_update_delay_counter and
rlb_update_retry_counter to decide when to call rlb_update_rx_clients().
Signed-off-by: Weiping Pan(潘卫平) <panweiping3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now bonding-alb uses delayed_work instead of timer_list.
Signed-off-by: Weiping Pan(潘卫平) <panweiping3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is unnecessary to set save_load to 1 here,
as the tx_hashtbl is just kzalloced.
Signed-off-by: Weiping Pan(潘卫平) <panweiping3@gmail.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch uses __copy_from_user_nocache on transmit to bypass data
cache for a performance improvement. skb_add_data_nocache and
skb_copy_to_page_nocache can be called by sendmsg functions to use
this feature, initial support is in tcp_sendmsg. This functionality is
configurable per device using ethtool.
Presumably, this feature would only be useful when the driver does
not touch the data. The feature is turned on by default if a device
indicates that it does some form of checksum offload; it is off by
default for devices that do no checksum offload or indicate no checksum
is necessary. For the former case copy-checksum is probably done
anyway, in the latter case the device is likely loopback in which case
the no cache copy is probably not beneficial.
This patch was tested using 200 instances of netperf TCP_RR with
1400 byte request and one byte reply. Platform is 16 core AMD x86.
No-cache copy disabled:
672703 tps, 97.13% utilization
50/90/99% latency:244.31 484.205 1028.41
No-cache copy enabled:
702113 tps, 96.16% utilization,
50/90/99% latency 238.56 467.56 956.955
Using 14000 byte request and response sizes demonstrate the
effects more dramatically:
No-cache copy disabled:
79571 tps, 34.34 %utlization
50/90/95% latency 1584.46 2319.59 5001.76
No-cache copy enabled:
83856 tps, 34.81% utilization
50/90/95% latency 2508.42 2622.62 2735.88
Note especially the effect on latency tail (95th percentile).
This seems to provide a nice performance improvement and is
consistent in the tests I ran. Presumably, this would provide
the greatest benfits in the presence of an application workload
stressing the cache and a lot of transmit data happening.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This prevents possible race between bond_enslave and bond_handle_frame
as reported by Nicolas by moving rx_handler register/unregister.
slave->bond is added to hold pointer to master bonding sructure. That
way dev->master is no longer used in bond_handler_frame.
Also, this removes "BUG: scheduling while atomic" message
Reported-by: Nicolas de Pesloüan <nicolas.2p.debian@gmail.com>
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Tested-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only slaves that are up should transmit netpoll frames, so there is no
need to check to see if a slave is up before enabling netpoll on it.
This resolves a reported failure on active-backup bonds where a slave
interface is down when netpoll was enabled.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Tested-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows rx_handlers to better signalize what to do next to
it's caller. That makes skb->deliver_no_wcard no longer needed.
kernel-doc for rx_handler_result is taken from Nicolas' patch.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since bond-related code was moved from net/core/dev.c into bonding,
IFF_SLAVE_INACTIVE is no longer needed. Replace is with flag "inactive"
stored in slave structure
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
transfers slave->state into slave->backup (that it's going to transfer
into bitfield. Introduce wrapper inlines to do the work with it.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when bond-related code is moved from net/core/dev.c into bonding
code, multiple priv_flags are not needed anymore. So let them rot.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Register slave pointer as rx_handler data. That would eventually prevent
need to loop over slave devices to find the right slave.
Use synchronize_net to ensure that bond_handle_frame does not get slave
structure freed when working with that.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the bonding module is loaded, it creates bond0 by default.
Then, when attempting to create bond0, the following messages
are printed to syslog:
kernel: bonding: bond0 is being created...
kernel: bonding: Bond creation failed.
Which seems to indicate a problem, when in reality there is no
problem. Since the actual error code is passed down from bond_create,
make use of it to print a bit less ominous message:
kernel: bonding: bond0 is being created...
kernel: bond0 already exists.
Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bringing up a bond interface with all network cables disconnected
does not properly set the interface as DOWN because the call to
netif_carrier_off occurs too early in bond_init. The call needs
to occur after register_netdevice has set dev->reg_state to
NETREG_REGISTERED, so that netif_carrier_off will trigger the
call to linkwatch_fire_event.
Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When packets come in from a device with >= 16 receive queues
headed out a bonding interface, syslog gets filled with this:
kernel: bond0 selects TX queue 16, but real number of TX queues is 16
because queue_mapping is offset by 1. Adjust return value
to account for the offset.
This is a revision of my earlier patch (which did not use the
skb_rx_queue_* helpers - thanks to Ben for the suggestion).
Andy submitted a similar patch which emits a pr_warning on
invalid queue selection, but I believe the log spew is
not useful. We can revisit that question in the future,
but in the interim I believe fixing the core problem is
worthwhile.
Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The idea here is this minimizes the number of places one has to edit
in order to make changes to how flows are defined and used.
Signed-off-by: David S. Miller <davem@davemloft.net>
V2: Move #ifdef CONFIG_PROC_FS into bonding.h, as suggested by David.
bond_main.c is bloating, separate the procfs code out,
move them to bond_procfs.c
Signed-off-by: WANG Cong <amwang@redhat.com>
Reviewed-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename the rx_machine_lock to state_machine_lock as this makes more
sense in light of it now protecting all the state machines against
concurrency.
Signed-off-by: Nils Carlson <nils.carlson@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changes since v1:
* Clarify an unclear comment
* Move a (possible) name change to a separate patch
The ad_rx_machine, ad_periodic_machine and ad_port_selection_logic
functions all inspect and alter common fields within the port structure.
Previous to this patch, only the ad_rx_machines were mutexed, and the
periodic and port_selection could run unmutexed against an ad_rx_machine
trigged by an arriving LACPDU.
This patch remedies the situation by protecting all the state machines
from concurrency. This is accomplished by locking around all the state
machines for a given port, which are executed at regular intervals; and
the ad_rx_machine when handling an incoming LACPDU.
Signed-off-by: Nils Carlson <nils.carlson@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When there is a ptype handler holding a clone of this skb, whose
destination MAC addresse is overwritten, the owner of this handler may
get a corrupted packet.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These two functions are only used when net poll controller is enabled.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clearly it should be the size of ->ip_dst here.
Although this is harmless, but it still reads odd.
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix use of zero where NULL expected. And wrap long line.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts bonding to use rx_handler. Results in cleaner
__netif_receive_skb() with much less exceptions needed. Also
bond-specific work is moved into bond code.
Did performance test using pktgen and counting incoming packets by
iptables. No regression noted.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
V4: rebase to net-next-2.6
This patch removes the flag IFF_IN_NETPOLL, we don't need it any more since
we have netpoll_tx_running() now.
Signed-off-by: WANG Cong <amwang@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
V4: rebase to net-next-2.6
V3: remove an useless #ifdef.
This patch unifies the netpoll code in bonding with netpoll code in bridge,
thanks to Herbert that code is much cleaner now.
Signed-off-by: WANG Cong <amwang@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev->master is now tightly connected to bonding driver. This patch makes
this pointer more general and ready to be used by others.
- netdev_set_master() - bond specifics moved to new function
netdev_set_bond_master()
- introduced netif_is_bond_slave() to check if device is a bonding slave
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
count is incorrectly returned even in case of fail. Return ret instead.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reduce printk() levels to KERN_INFO in netdev_fix_features() as this will
be used by ethtool and might spam dmesg unnecessarily.
This converts the function to use netdev_info() instead of plain printk().
As a side effect, bonding and bridge devices will now log dropped features
on every slave device change.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quoting Ben Hutchings: we presumably won't be defining features that
can only be enabled on 64-bit architectures.
Occurences found by `grep -r` on net/, drivers/net, include/
[ Move features and vlan_features next to each other in
struct netdev, as per Eric Dumazet's suggestion -DaveM ]
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove kobject.h from files which don't need it, notably,
sched.h and fs.h.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
MAINTAINERS
arch/arm/mach-omap2/pm24xx.c
drivers/scsi/bfa/bfa_fcpim.c
Needed to update to apply fixes for which the old branch was too
outdated.
This patch provices the debugfs interface to see RLB hash table
like the following:
# cat /sys/kernel/debug/bonding/bond0/rlb_hash_table
SourceIP DestinationIP Destination MAC DEV
10.124.196.205 10.124.196.205 ff:ff:ff:ff:ff:ff eth4
10.124.196.205 10.124.196.81 00:19:99:XX:XX:XX eth3
10.124.196.205 10.124.196.1 00:21:d8:XX:XX:XX eth0
This is helpful to check if the receive load balancing works as expected.
Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch simply migrates some macros from bond_alb.c to bond_alb.h.
Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bond_na_send() attempts to insert a VLAN tag in between building and
sending packets of the respective formats. If the slave does not
implement hardware VLAN tag insertion then vlan_put_tag() will mangle
the network-layer header because the Ethernet header is not present at
this point (unlike in bond_arp_send()).
Fix this by adding the tag out-of-line and relying on
dev_hard_start_xmit() to insert it inline if necessary.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bond_change_active_slave() may be called when a slave is added, even
if the bond has not been brought up yet. It may then attempt to send
packets, and further it may use mcast_work which is uninitialised
before the bond is brought up. Add the necessary checks for
netif_running(bond->dev).
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A bond may have a mixture of slave devices with and without hardware
VLAN tag insertion capability. Therefore it always claims this
capability and performs software VLAN tag insertion if the slave does
not.
Since commit 7b9c609037, this has
also been done by dev_hard_start_xmit(). The result is that VLAN-
tagged skbs are now double-tagged when transmitted through slave
devices without hardware VLAN tag insertion!
Remove the now-redundant logic from bond_dev_queue_xmit().
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The returned slave is incorrect, if the net device under check is not
charged yet by the master.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch provides the debugfs facility to the bonding driver.
The "bonding" directory is created in the debugfs root and directories of
each bonding interface (like bond0, bond1...) are created in that.
# mount -t debugfs none /sys/kernel/debug
# ls /sys/kernel/debug/bonding
bond0 bond1
Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A while back I made some changes to enable netpoll in the bonding driver. Among
them was a per-cpu flag that indicated we were in a path that held locks which
could cause the netpoll path to block in during tx, and as such the tx path
should queue the frame for later use. This appears to have given rise to a
regression. If one of those paths on which we hold the per-cpu flag yields the
cpu, its possible for us to come back on a different cpu, leading to us clearing
a different flag than we set. This results in odd netpoll drops, and BUG
backtraces appearing in the log, as we check to make sure that we only clear set
bits, and only set clear bits. I had though briefly about changing the
offending paths so that they wouldn't sleep, but looking at my origional work
more closely, it doesn't appear that a per-cpu flag is warranted. We alrady
gate the checking of this flag on IFF_IN_NETPOLL, so we don't hit this in the
normal tx case anyway. And practically speaking, the normal use case for
netpoll is to only have one client anyway, so we're not going to erroneously
queue netpoll frames when its actually safe to do so. As such, lets just
convert that per-cpu flag to an atomic counter. It fixes the rescheduling bugs,
is equivalent from a performance perspective and actually eliminates some code
in the process.
Tested by the reporter and myself, successfully
Reported-by: Liang Zheng <lzheng@redhat.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Restore the check for an unassigned mac address before adopting the
first slaves as it's own. The change in behavior was introduced by:
commit c20811a79e
Author: Jiri Pirko <jpirko@redhat.com>
bonding: move dev_addr cpy to bond_enslave
Signed-off-by: David Strand <dpstrand@gmail.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of iterating in_dev->mc_list from bonding driver, its better
to call a helper function provided by igmp.c
Details of implementation (locking) are private to igmp code.
ip_mc_rejoin_group(struct ip_mc_list *im) becomes
ip_mc_rejoin_groups(struct in_device *in_dev);
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RCU conversion in IGMP code done in net-next-2.6 raised a race in
__bond_resend_igmp_join_requests().
It iterates in_dev->mc_list without appropriate protection (RTNL, or
read_lock on in_dev->mc_list_lock).
Another cpu might delete an entry while we use it and trigger a fault.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bond_info_seq_start() uses a read_lock(&dev_base_lock) to make sure
device doesn’t disappear. Same goal can be achieved using RCU.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
checkpatch.pl cleanup : Remove braces from single statement
blocks.
Signed-off-by: Bandan Das <bandan.das@stratus.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
checkpatch.pl cleanup: Added spaces around operators at various places.
Also fixed some c99 style comments that I came across.
Signed-off-by: Bandan Das <bandan.das@stratus.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only used in main file.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Flavio Leitner <fleitner@redhat.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some recent testing in netpoll with bonding showed this backtrace
------------[ cut here ]------------
kernel BUG at drivers/net/bonding/bonding.h:134!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:1d.2/usb7/devnum
CPU 0
Pid: 1876, comm: rmmod Not tainted 2.6.36-rc3+ #10 D26928/
RIP: 0010:[<ffffffffa0514ba4>] [<ffffffffa0514ba4>] bond_uninit+0x6f4/0x7a0
RSP: 0018:ffff88003b1b5d58 EFLAGS: 00010296
RAX: ffff88003b9b6200 RBX: ffff8800373e8e00 RCX: 00000000000f4240
RDX: 00000000ffffffff RSI: 0000000000000286 RDI: 0000000000000286
RBP: ffff88003b1b5dc8 R08: 0000000000000000 R09: 00000001af7de920
R10: 0000000000000000 R11: ffff880002495e98 R12: ffff880037922700
R13: ffff880038c31000 R14: ffff880037922730 R15: 0000000000000286
FS: 00007f90e6d72700(0000) GS:ffff880002400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000346f0d9ad0 CR3: 000000003b263000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process rmmod (pid: 1876, threadinfo ffff88003b1b4000, task ffff88003b36aa80)
Stack:
00000000ffffffff ffff88003b1b5d7a ffff8800379221e8 ffff880037922000
<0> ffff88003b1b5dc8 ffffffff813eb5fb ffff88003b1b5da8 0000000031b177a3
<0> ffff88003b1b5da8 ffff880037922000 ffff88003b1b5e48 ffff88003b1b5e48
Call Trace:
[<ffffffff813eb5fb>] ? rtmsg_ifinfo+0xcb/0xf0
[<ffffffff813daad8>] rollback_registered_many+0x168/0x280
[<ffffffff813dac09>] unregister_netdevice_many+0x19/0x80
[<ffffffff813e97b3>] __rtnl_kill_links+0x63/0x90
[<ffffffff813e980b>] __rtnl_link_unregister+0x2b/0x60
[<ffffffff813e9bde>] rtnl_link_unregister+0x1e/0x30
[<ffffffffa052124b>] bonding_exit+0x37/0x51 [bonding]
[<ffffffff81098b2e>] sys_delete_module+0x19e/0x270
[<ffffffff810bb2b2>] ? audit_syscall_entry+0x252/0x280
[<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
RIP [<ffffffffa0514ba4>] bond_uninit+0x6f4/0x7a0 [bonding]
RSP <ffff88003b1b5d58>
---[ end trace 1395ad691cea24d1 ]---
It occurs because of my recent netpoll blocking patches, which I added to avoid
recursive deadlock in the bonding driver. It relies on some per cpu bits, but
the shutdown path forces some rescheduling as we cancel workqueues for the
driver and wait for some device refcounts. If after the forced reschedule, we
wind up on a different cpu we trigger the bughalt in unblock_netpoll_tx.
The fix is to remove the netpoll block/unblock calls from bond_release_all.
This is safe to do because bond_uninit, which is called via ndo_uninit in
rollback_registered_many, doesn't occur until we send a NETDEV_UNREGISTER event,
which triggers netconsole to remove us as a netpoll client, so we are guaranteed
not to recurse into our own tx path here.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the inclusion of previous fixup patches, netpoll over bonding apears to
work reliably with failover conditions. This reverts Gospos previous commit
c22d7ac844, and allows access again to the netpoll
functionality in the bonding driver.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The monitoring paths in the bonding driver take write locks that are shared by
the tx path. If netconsole is in use, these paths can call printk which puts us
in the netpoll tx path, which, if netconsole is attached to the bonding driver,
result in deadlock (the xmit_lock guards are useless in netpoll_send_skb, as the
monitor paths in the bonding driver don't claim the xmit_lock, nor should they).
The solution is to use a per cpu flag internal to the driver to indicate when a
cpu is holding the lock in a path that might recusrse into the tx path for the
driver via netconsole. By checking this flag on transmit, we can defer the
sending of the netconsole frames until a later time using the retransmit feature
of netpoll_send_skb that is triggered on the return code NETDEV_TX_BUSY. I've
tested this and am able to transmit via netconsole while causing failover
conditions on the bond slave links.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bonding driver currently modifies the netpoll structure in its xmit path
while sending frames from netpoll. This is racy, as other cpus can access the
netpoll structure in parallel. Since the bonding driver points np->dev to a
slave device, other cpus can inadvertently attempt to send data directly to
slave devices, leading to improper locking with the bonding master, lost frames,
and deadlocks. This patch fixes that up.
This patch also removes the real_dev pointer from the netpoll structure as that
data is really only used by bonding in the poll_controller, and we can emulate
its behavior by check each slave for IS_UP.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Effect:
Slave Interface: eth5
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: XX:XX:XX:XX:XX:XX
Slave queue ID: 0
Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an interface was enslaved when it was down, bonding thinks
it has speed -1 even after it goes up. This leads into selecting
a wrong active interface in active/backup mode on mixed 10G/1G or
1G/100M environment.
before:
bonding: bond0: link status definitely up for interface eth5, 100 Mbps full duplex.
bonding: bond0: link status definitely up for interface eth0, 100 Mbps full duplex.
after:
bonding: bond0: link status definitely up for interface eth5, 10000 Mbps full duplex.
bonding: bond0: link status definitely up for interface eth0, 1000 Mbps full duplex.
Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
before:
bonding: bond0: link status definitely up for interface eth5
bonding: bond0: link status definitely up for interface eth0
after:
bonding: bond0: link status definitely up for interface eth5, 100 Mbps full duplex.
bonding: bond0: link status definitely up for interface eth0, 100 Mbps full duplex.
Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow sysadmins to configure the number of multicast
membership report sent on a link failure event.
Signed-off-by: Flavio Leitner <fleitner@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During a failover, the IGMP membership is sent to update
the switch restoring the traffic, but it misses groups added
to VLAN devices running on top of bonding devices.
This patch changes it to iterate over all VLAN devices
on top of it sending IGMP memberships too.
Signed-off-by: Flavio Leitner <fleitner@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a WARN_ON failure in bond_masters sysfs file
Got a report of this warning recently
bonding: bond0 is being created...
------------[ cut here ]------------
WARNING: at fs/proc/generic.c:590 proc_register+0x14d/0x185()
Hardware name: ProLiant BL465c G1
proc_dir_entry 'bonding/bond0' already registered
Modules linked in: bonding ipv6 tg3 bnx2 shpchp amd64_edac_mod edac_core
ipmi_si
ipmi_msghandler serio_raw i2c_piix4 k8temp edac_mce_amd hpwdt microcode hpsa
cc
iss radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded:
scsi_wai
t_scan]
Pid: 935, comm: ifup-eth Not tainted 2.6.33.5-124.fc13.x86_64 #1
Call Trace:
[<ffffffff8104b54c>] warn_slowpath_common+0x77/0x8f
[<ffffffff8104b5b1>] warn_slowpath_fmt+0x3c/0x3e
[<ffffffff8114bf0b>] proc_register+0x14d/0x185
[<ffffffff8114c20c>] proc_create_data+0x87/0xa1
[<ffffffffa0211e9b>] bond_create_proc_entry+0x55/0x95 [bonding]
[<ffffffffa0215e5d>] bond_init+0x95/0xd0 [bonding]
[<ffffffff8138cd97>] register_netdevice+0xdd/0x29e
[<ffffffffa021240b>] bond_create+0x8e/0xb8 [bonding]
[<ffffffffa021c4be>] bonding_store_bonds+0xb3/0x1c1 [bonding]
[<ffffffff812aec85>] class_attr_store+0x27/0x29
[<ffffffff8115423d>] sysfs_write_file+0x10f/0x14b
[<ffffffff81101acf>] vfs_write+0xa9/0x106
[<ffffffff81101be2>] sys_write+0x45/0x69
[<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
---[ end trace a677c3f7f8b16b1e ]---
bonding: Bond creation failed.
It happens because a user space writer to bond_master can try to
register an already existing bond interface name. Fix it by teaching
bond_create to check for the existance of devices with that name first
in cases where a non-NULL name parameter has been passed in
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change "return (EXPR);" to "return EXPR;"
return is not a function, parentheses are not required.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gro can be enabled by default on bonding devices.
Actual support depends on the lower devices.
One can still use ethtool to switch off GRO if needed.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was recently brought to my attention that 802.3ad mode bonds would no
longer form when using some network hardware after a driver update.
After snooping around I realized that the particular hardware was using
page-based skbs and found that skb->data did not contain a valid LACPDU
as it was not stored there. That explained the inability to form an
802.3ad-based bond. For balance-alb mode bonds this was also an issue
as ARPs would not be properly processed.
This patch fixes the issue in my tests and should be applied to 2.6.36
and as far back as anyone cares to add it to stable.
Thanks to Alexander Duyck <alexander.h.duyck@intel.com> and Jesse
Brandeburg <jesse.brandeburg@intel.com> for the suggestions on this one.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Jesse Brandeburg <jesse.brandeburg@intel.com>
CC: stable@kerne.org
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The time_before_eq()/time_after_eq() functions operate on unsigned
long and only work if the difference between the two compared values
is smaller than half the range of unsigned long (31 bits on i386).
Some of the variables (slave->jiffies, dev->trans_start, dev->last_rx)
used by bonding store a copy of jiffies and may not be updated for a
long time. With HZ=1000, time_before_eq()/time_after_eq() will start
giving bad results after ~25 days.
jiffies will never be before slave->jiffies, dev->trans_start,
dev->last_rx by more than possibly a couple ticks caused by preemption
of this code. This allows us to detect/prevent these overflows by
replacing time_before_eq()/time_after_eq() with time_in_range().
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using module options arp monitoring and balance-alb/balance-tlb
are mutually exclusive options. Anytime balance-alb/balance-tlb are
enabled mii monitoring is forced to 100ms if not set. When configuring
via sysfs no checking is currently done.
Handling these cases with sysfs has to be done a bit differently because
we do not have all configuration information available at once. This
patch will not allow a mode change to balance-alb/balance-tlb if
arp_interval is already non-zero. It will also not allow the user to
set a non-zero arp_interval value if the mode is already set to
balance-alb/balance-tlb. They are still mutually exclusive on a
first-come, first serve basis.
Tested with initscripts on Fedora and manual setting via sysfs.
Signed-off-by: Andy Gospodarek <gospo@redhat.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After:
commit 6146b1a4da
Author: Jay Vosburgh <fubar@us.ibm.com>
Date: Tue Nov 4 17:51:15 2008 -0800
bonding: Fix ALB mode to balance traffic on VLANs
the dev field in the RLB ARP packet handler was set to NULL to wildcard
and accommodate balancing VLANs on top of bonds.
This has the side-effect of the packet handler being called against
other, non RLB-enabled bonds, and a kernel oops results when it tries to
dereference rx_hashtbl in rlb_update_entry_from_arp(), which won't be
set for those bonds, e.g. active-backup.
With the __netif_receive_skb() changes from:
commit 1f3c8804ac
Author: Andy Gospodarek <andy@greyhouse.net>
Date: Mon Dec 14 10:48:58 2009 +0000
bonding: allow arp_ip_targets on separate vlans to use arp validation
frames received on VLANs correctly make their way to the bond's handler,
so we no longer need to wildcard the device.
The oops can be reproduced by:
modprobe bonding
echo active-backup > /sys/class/net/bond0/bonding/mode
echo 100 > /sys/class/net/bond0/bonding/miimon
ifconfig bond0 xxx.xxx.xxx.xxx netmask xxx.xxx.xxx.xxx
echo +eth0 > /sys/class/net/bond0/bonding/slaves
echo +eth1 > /sys/class/net/bond0/bonding/slaves
echo +bond1 > /sys/class/net/bonding_masters
echo balance-alb > /sys/class/net/bond1/bonding/mode
echo 100 > /sys/class/net/bond1/bonding/miimon
ifconfig bond1 xxx.xxx.xxx.xxx netmask xxx.xxx.xxx.xxx
echo +eth2 > /sys/class/net/bond1/bonding/slaves
echo +eth3 > /sys/class/net/bond1/bonding/slaves
Pass some traffic on bond0. Boom.
[ Tested, behaves as advertised. I do not believe a test of the bonding
mode is necessary, as there is no race between the packet handler and
the bonding mode changing (the mode can only change when the device is
closed). Also updated the log message to include the reproduction and
full commit ids. -J ]
Signed-off-by: Greg Edwards <greg.edwards@hp.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Acked-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When copying VLAN information to or removing from a slave
during slave addition or removal, the bonding code currently holds
the bond->lock for write to prevent concurrent modification of the
vlan_list / vlgrp.
This is unnecessary, as all of these operations occur under
RTNL. Holding the bond->lock also caused might_sleep issues for
some drivers' ndo_vlan_* functions. This patch removes the extra
locking.
Problem reported by Michael Chan <mchan@broadcom.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit ad1afb0039
("vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)")
it is now regular practice for a VLAN "add vid" for VLAN 0 to
arrive prior to any VLAN registration or creation of a vlan_group.
This patch updates the bonding code that tests for the presence
of VLANs configured above bonding. The new logic tests for bond->vlgrp
to determine if a registration has occured, instead of testing that
bonding's internal vlan_list is empty.
The old code would panic when vlan_list was not empty, but
vlgrp was still NULL (because only an "add vid" for VLAN 0 had occured).
Bonding still adds VLAN 0 to its internal list so that 802.1p
frames are handled correctly on transmit when non-VLAN accelerated
slaves are members of the bond. The test against bond->vlan_list
remains in bond_dev_queue_xmit for this reason.
Modification to the bond->vlgrp now occurs under lock (in
addition to RTNL), because not all inspections of it occur under RTNL.
Additionally, because 8021q will never issue a "kill vid" for
VLAN 0, there is now logic in bond_uninit to release any remaining
entries from vlan_list.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Cc: Pedro Garcia <pedro.netdev@dondevamos.com>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
drivers/net/bonding/bond_main.c:179:12: warning: ‘disable_netpoll’
defined but not used
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit ad1afb0039 (vlan_dev: VLAN 0 should be treated
as "no vlan tag" (802.1p packet)),
bond_inet6addr_event() might be called with a NULL bond->vlgrp pointer, and
a non empty bond->vlan_list. vlan_group_get_device() is dereferencing a NULL pointer.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The test for buffer overflow ensures we have room for 6 more bytes.
sprintf, called with %s:%d, slave->dev->name, slave->queue_id may yield
far more than 6 bytes.
The correct test is res > (PAGE_SIZE - IFNAMSIZ - 6) .
Signed-off-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a small possibility that a reader gets incorrect values on 32
bit arches. SNMP applications could catch incorrect counters when a
32bit high part is changed by another stats consumer/provider.
One way to solve this is to add a rtnl_link_stats64 param to all
ndo_get_stats64() methods, and also add such a parameter to
dev_get_stats().
Rule is that we are not allowed to use dev->stats64 as a temporary
storage for 64bit stats, but a caller provided area (usually on stack)
Old drivers (only providing get_stats() method) need no changes.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When two systems using bonding devices in adaptive load
balancing (ALB) communicates with each other, an endless
ping-pong of ARP replies starts between these two systems.
What happens? In the ALB mode, bonding driver keeps track
of each client connected in a hash table, so it can do the
receive load balancing (RLB). This hash table is updated
when an ARP reply is received, then it scans for the client
entry, updates its MAC address and flag it to be announced
later. Therefore, two seconds later, the alb monitor runs
and send for each updated client entry two ARP replies
updating this specific client. The same process happens on
the receiving system, causing the endless ping-pong of arp
replies.
See more information including the relevant functions below:
System 1 System 2
bond0 bond0
ping <system2>
ARP request --------->
<--------- ARP reply
+->rlb_arp_recv <---------------------+ <--- loop begins
| rlb_update_entry_from_arp |
| client_info->ntt = 1; |
| bond_info->rx_ntt = 1; |
| |
| <communication succeed> |
| |
| bond_alb_monitor |
| rlb_update_rx_clients |
| rlb_update_client |
| arp_create(ARPOP_REPLY) |
| send ARP reply --------------> V
| send ARP reply -------------->
| rlb_arp_recv
| rlb_update_entry_from_arp
| client_info->ntt = 1;
| bond_info->rx_ntt = 1;
| < snipped, same as in system 1>
+------- <-------------- send ARP reply
<-------------- send ARP reply
Besides the unneeded networking traffic, this loop breaks
a cluster because a backup system can't take over the IP
address. There is always one system sending an ARP reply
poisoning the network.
This patch fixes the problem adding a check for the MAC
address before updating it. Thus, if the MAC address didn't
change, there is no need to update neither to announce it later.
Signed-off-by: Flavio Leitner <fleitner@redhat.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support for netpoll over bonded interfaces was added here:
commit f6dc31a85c
Author: WANG Cong <amwang@redhat.com>
Date: Thu May 6 00:48:51 2010 -0700
bonding: make bonding support netpoll
but it is bad enough that we should probably just disable netpoll over
bonding until some of the locking logic in the bonding driver is changed
or converted completely to RCU. Simple actions like changing the active
slave in active-backup mode will hang the box if a high enough printk
debugging level is enabled.
Keeping the old code around will be good for anyone that wants to work
on it (and for after the RCU conversion), so I propose this small patch
rather than ripping it all out.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use struct rtnl_link_stats64 as the statistics structure.
On 32-bit architectures, insert 32 bits of padding after/before each
field of struct net_device_stats to make its layout compatible with
struct rtnl_link_stats64. Add an anonymous union in net_device; move
stats into the union and add struct rtnl_link_stats64 stats64.
Add net_device_ops::ndo_get_stats64, implementations of which will
return a pointer to struct rtnl_link_stats64. Drivers that implement
this operation must not update the structure asynchronously.
Change dev_get_stats() to call ndo_get_stats64 if available, and to
return a pointer to struct rtnl_link_stats64. Change callers of
dev_get_stats() accordingly.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
remove useless union keyword in rtable, rt6_info and dn_route.
Since there is only one member in a union, the union keyword isn't useful.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
v2: changed bonding module version, modified to apply on top of changes
from previous patch in series, and updated documentation to elaborate on
multiqueue awareness that now exists in bonding driver.
This patch give the user the ability to control the output slave for
round-robin and active-backup bonding. Similar functionality was
discussed in the past, but Jay Vosburgh indicated he would rather see a
feature like this added to existing modes rather than creating a
completely new mode. Jay's thoughts as well as Neil's input surrounding
some of the issues with the first implementation pushed us toward a
design that relied on the queue_mapping rather than skb marks.
Round-robin and active-backup modes were chosen as the first users of
this slave selection as they seemed like the most logical choices when
considering a multi-switch environment.
Round-robin mode works without any modification, but active-backup does
require inclusion of the first patch in this series and setting
the 'all_slaves_active' flag. This will allow reception of unicast traffic on
any of the backup interfaces.
This was tested with IPv4-based filters as well as VLAN-based filters
with good results.
More information as well as a configuration example is available in the
patch to Documentation/networking/bonding.txt.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
v2: changed parameter name from 'keep_all' to 'all_slaves_active' and
skipped setting slaves to inactive rather than creating a new flag at
Jay's suggestion.
In an effort to suppress duplicate frames on certain bonding modes
(specifically the modes that do not require additional configuration on
the switch or switches connected to the host), code was added in the
generic receive patch in 2.6.16. The current behavior works quite well
for most users, but there are some times it would be nice to restore old
functionality and allow all frames to make their way up the stack.
This patch adds support for a new module option and sysfs file called
'all_slaves_active' that will restore pre-2.6.16 functionality if the
user desires. The default value is '0' and retains existing behavior,
but the user can set it to '1' and allow all frames up if desired.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the worst case, when the first loop breaks an the end of the slave list,
the slave list is iterated through twice. This patch reduces this
function only to one loop. Also makes it simpler.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is stored but never restored. So remove this as it is useless.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the code that copies slave's mac address in case that's the first slave into
bond_enslave. Ifenslave app does this also but that's not a problem. This is
something that should be done in bond_enslave, and it shound not matter from
where is it called.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes bonding_store_slaves function nicer and easier to understand.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(it's actually the same as v1)
Remove checks that duplicates similar checks in bond_enslave.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
V1->V2: corrected res/ret use
For some reason, MTU handling (storing, and restoring) is taking place in
bond_sysfs. The correct place for this code is in bond_enslave, bond_release.
So move it there.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on Andy's work, but I modified a lot.
Similar to the patch for bridge, this patch does:
1) implement the 2 methods to support netpoll for bonding;
2) modify netpoll during forwarding packets via bonding;
3) disable netpoll support of bonding when a netpoll-unabled device
is added to bonding;
4) enable netpoll support when all underlying devices support netpoll.
Cc: Andy Gospodarek <gospo@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Converts the list and the core manipulating with it to be the same as uc_list.
+uses two functions for adding/removing mc address (normal and "global"
variant) instead of a function parameter.
+removes dev_mcast.c completely.
+exposes netdev_hw_addr_list_* macros along with __hw_addr_* functions for
manipulation with lists on a sandbox (used in bonding and 80211 drivers)
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
+little renaming of unicast functions to be smooth with multicast ones
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bond_uninit() is invoked with rtnl_lock held, when it does destroy_workqueue()
which will potentially flush all works in this workqueue, if we hold rtnl_lock
again in the work function, it will deadlock.
So move destroy_workqueue() to destructor where rtnl_lock is not held any more,
suggested by Eric.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Jiri Pirko <jpirko@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit a2fd940f (bonding: fix broken multicast with round-robin mode)
added a problem on litle endian machines.
drivers/net/bonding/bond_main.c:4159: warning: comparison is always
false due to limited range of data type
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Round-robin (mode 0) does nothing to ensure that any multicast traffic
originally destined for the host will continue to arrive at the host when
the link that sent the IGMP join or membership report goes down. One of
the benefits of absolute round-robin transmit.
Keeping track of subscribed multicast groups for each slave did not seem
like a good use of resources, so I decided to simply send on the
curr_active slave of the bond (typically the first enslaved device that
is up). This makes failover management simple as IGMP membership
reports only need to be sent when the curr_active_slave changes. I
tested this patch and it appears to work as expected.
Originally reported by Lon Hohberger <lhh@redhat.com>.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
CC: Lon Hohberger <lhh@redhat.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the type change, addresses in unicast and multicast lists wouldn't make
sense, not to mention possible different lenghts. So flush both lists here.
Note "dev_addr_discard" will be very soon replaced by "dev_mc_flush" (once
mc_list conversion will be done).
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert to list macro's for the list of addresses per interface
in IPv6.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the possibility to refuse the bonding type change for
other subsystems (such as for example bridge, vlan, etc.)
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since generally there could be more netdevices changing type other
than bonding, making this event type name "bonding-unrelated"
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>