linux/drivers
Shlomo Pongratz b63b70d877 IPoIB: Use a private hash table for path lookup in xmit path
Dave Miller <davem@davemloft.net> provided a detailed description of
why the way IPoIB is using neighbours for its own ipoib_neigh struct
is buggy:

    Any time an ipoib_neigh is changed, a sequence like the following is made:

    			spin_lock_irqsave(&priv->lock, flags);
    			/*
    			 * It's safe to call ipoib_put_ah() inside
    			 * priv->lock here, because we know that
    			 * path->ah will always hold one more reference,
    			 * so ipoib_put_ah() will never do more than
    			 * decrement the ref count.
    			 */
    			if (neigh->ah)
    				ipoib_put_ah(neigh->ah);
    			list_del(&neigh->list);
    			ipoib_neigh_free(dev, neigh);
    			spin_unlock_irqrestore(&priv->lock, flags);
    			ipoib_path_lookup(skb, n, dev);

    This doesn't work, because you're leaving a stale pointer to the freed up
    ipoib_neigh in the special neigh->ha pointer cookie.  Yes, it even fails
    with all the locking done to protect _changes_ to *ipoib_neigh(n), and
    with the code in ipoib_neigh_free() that NULLs out the pointer.

    The core issue is that read side calls to *to_ipoib_neigh(n) are not
    being synchronized at all, they are performed without any locking.  So
    whether we hold the lock or not when making changes to *ipoib_neigh(n)
    you still can have threads see references to freed up ipoib_neigh
    objects.

    	cpu 1			cpu 2
    	n = *ipoib_neigh()
    				*ipoib_neigh() = NULL
    				kfree(n)
    	n->foo == OOPS

    [..]

    Perhaps the ipoib code can have a private path database it manages
    entirely itself, which holds all the necessary information and is
    looked up by some generic key which is available easily at transmit
    time and does not involve generic neighbour entries.

See <http://marc.info/?l=linux-rdma&m=132812793105624&w=2> and
<http://marc.info/?l=linux-rdma&w=2&r=1&s=allows+references+to+freed+memory&q=b>
for the full discussion.

This patch aims to solve the race conditions found in the IPoIB driver.

The patch removes the connection between the core networking neighbour
structure and the ipoib_neigh structure.  In addition to avoiding the
race described above, it allows us to handle SKBs carrying IP packets
that don't have any associated neighbour.

We add an ipoib_neigh hash table with N buckets where the key is the
destination hardware address.  The ipoib_neigh is now fetched from the
hash table instead of from the stashed location in the neighbour
structure.  The hash table uses both RCU and reference counting to
guarantee that no ipoib_neigh instance is ever deleted while in use.
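
As an illustration only, the lookup-with-reference idea can be sketched
in userspace C.  This is not the driver code: every name below
(neigh_entry, neigh_get, neigh_put, neigh_alloc, neigh_remove,
N_BUCKETS) is hypothetical, the hash function is an arbitrary choice,
and RCU is not modelled at all, so the sketch is single-threaded and
shows only the reference-counting half of the scheme.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HW_ADDR_LEN 20  /* length of an IPoIB hardware address */
#define N_BUCKETS   16  /* hypothetical value for N */

/* Hypothetical stand-in for ipoib_neigh. */
struct neigh_entry {
	uint8_t daddr[HW_ADDR_LEN];  /* key: destination hardware address */
	int refcnt;                  /* table reference + active users */
	struct neigh_entry *next;    /* bucket chain */
};

static struct neigh_entry *buckets[N_BUCKETS];

/* FNV-1a over the address; an arbitrary choice for this sketch. */
static unsigned int hash_addr(const uint8_t *daddr)
{
	uint32_t h = 2166136261u;

	for (int i = 0; i < HW_ADDR_LEN; i++) {
		h ^= daddr[i];
		h *= 16777619u;
	}
	return h % N_BUCKETS;
}

/* Look up an entry; on success the caller owns one extra reference. */
static struct neigh_entry *neigh_get(const uint8_t *daddr)
{
	struct neigh_entry *n;

	for (n = buckets[hash_addr(daddr)]; n; n = n->next) {
		if (memcmp(n->daddr, daddr, HW_ADDR_LEN) == 0) {
			n->refcnt++;
			return n;
		}
	}
	return NULL;
}

/* Drop one reference; the entry is freed only when the last one goes. */
static void neigh_put(struct neigh_entry *n)
{
	assert(n->refcnt > 0);
	if (--n->refcnt == 0)
		free(n);
}

/* Create and insert an entry: one reference for the table, one for the caller. */
static struct neigh_entry *neigh_alloc(const uint8_t *daddr)
{
	struct neigh_entry *n = calloc(1, sizeof(*n));
	unsigned int b = hash_addr(daddr);

	if (!n)
		return NULL;
	memcpy(n->daddr, daddr, HW_ADDR_LEN);
	n->refcnt = 2;
	n->next = buckets[b];
	buckets[b] = n;
	return n;
}

/* Unlink from the table and drop the table's reference. */
static void neigh_remove(struct neigh_entry *n)
{
	struct neigh_entry **pp = &buckets[hash_addr(n->daddr)];

	for (; *pp; pp = &(*pp)->next) {
		if (*pp == n) {
			*pp = n->next;
			neigh_put(n);
			return;
		}
	}
}
```

In this sketch a transmit path would call neigh_get(), use the entry,
and then call neigh_put().  A concurrent neigh_remove() only unlinks
the entry and drops the table's reference, so the memory stays valid
until the last user is done with it.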

Fetching the ipoib_neigh structure instance from the hash also makes
the special code in ipoib_start_xmit that handles remote and local
bonding failover redundant.

Aged ipoib_neigh instances are deleted by a garbage collection task
that runs every M seconds and deletes every ipoib_neigh instance that
has been idle for at least 2*M seconds.  The deletion is safe since the
ipoib_neigh instances are protected by both RCU and reference
counting.
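
The aging rule can be sketched in the same spirit.  Again this is a
userspace illustration with hypothetical names (aged_entry, aged_add,
gc_sweep, GC_INTERVAL), not the driver's garbage collector: time is a
plain number of seconds passed in by the caller, the table is reduced
to a single list, and entries are freed directly rather than through
RCU and reference counting.

```c
#include <assert.h>
#include <stdlib.h>

#define GC_INTERVAL 30  /* hypothetical value for M, in seconds */

/* Hypothetical entry carrying only what the aging rule needs. */
struct aged_entry {
	long last_used;            /* time of last use, in seconds */
	struct aged_entry *next;
};

/* Push a new entry stamped with the current time onto the list. */
static struct aged_entry *aged_add(struct aged_entry **head, long now)
{
	struct aged_entry *e = malloc(sizeof(*e));

	if (!e)
		return NULL;
	e->last_used = now;
	e->next = *head;
	*head = e;
	return e;
}

/*
 * One GC pass: unlink and free every entry that has been idle for at
 * least 2 * GC_INTERVAL seconds.  Returns the number of entries removed.
 */
static int gc_sweep(struct aged_entry **head, long now)
{
	struct aged_entry **pp = head;
	int removed = 0;

	assert(head != NULL);
	while (*pp) {
		struct aged_entry *e = *pp;

		if (now - e->last_used >= 2 * GC_INTERVAL) {
			*pp = e->next;   /* unlink, pp stays on the same slot */
			free(e);
			removed++;
		} else {
			pp = &e->next;
		}
	}
	return removed;
}
```

With GC_INTERVAL set to 30, an entry last used at time 0 survives a
sweep at time 59 and is removed by a sweep at time 60.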

The number of buckets (N) and the interval at which the GC task runs (M)
are taken from the exported arp_tbl.

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-30 07:46:50 -07:00
accessibility
acpi Merge branch 'pm-acpi' 2012-07-19 00:03:35 +02:00
amba
ata
atm
auxdisplay
base Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-07-23 12:27:27 -07:00
bcma bcma: add place for flash memory support 2012-07-17 15:11:40 -04:00
block Power management updates for 3.6 2012-07-22 13:36:52 -07:00
bluetooth Bluetooth: Introduce a flags variable to Three-wire UART state 2012-07-17 14:49:24 -03:00
cdrom
char Merge git://www.linux-watchdog.org/linux-watchdog 2012-07-24 13:26:08 -07:00
clk arm-soc: new SoC support 2012-07-23 16:31:31 -07:00
clocksource arm-soc: new SoC support 2012-07-23 16:31:31 -07:00
connector drivers: connector: fixed coding style issues 2012-07-16 23:23:52 -07:00
cpufreq Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2012-07-24 13:34:56 -07:00
cpuidle Merge branch 'pm-domains' 2012-07-19 00:03:17 +02:00
crypto arm-soc: clk changes 2012-07-23 17:51:03 -07:00
dca
devfreq
dio
dma Merge branch 'imx/sparse-irq' of git://git.linaro.org/people/shawnguo/linux-2.6 into next/irq 2012-07-02 23:18:19 +02:00
edac
eisa
extcon
firewire
firmware
gpio arm-soc: sparse IRQ conversion 2012-07-23 17:36:02 -07:00
gpu gma500,cdv: Fix the brightness base 2012-07-16 09:20:33 -07:00
hid Merge branch 'uhid' into for-linus 2012-07-24 13:40:06 +02:00
hsi
hv
hwmon Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2012-07-24 13:34:56 -07:00
hwspinlock hwspinlock/core: use global ID to register hwspinlocks on multiple devices 2012-07-07 22:35:30 +03:00
i2c i2c-omap: Add support for I2C_M_STOP message flag 2012-07-24 14:13:59 +02:00
ide
idle ACPI: intel_idle : break dependency between modules 2012-07-05 22:37:47 +02:00
ieee802154 drivers/ieee802154/at86rf230: rework irq handler 2012-07-12 07:54:45 -07:00
iio
infiniband IPoIB: Use a private hash table for path lookup in xmit path 2012-07-30 07:46:50 -07:00
input Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2012-07-24 13:34:56 -07:00
iommu Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2012-07-24 13:34:56 -07:00
isdn ISDN: Add check for usb_alloc_urb() result 2012-07-18 09:40:54 -07:00
leds leds: heartbeat: fix bug on panic 2012-07-04 15:55:19 +08:00
lguest
macintosh
md Three fixes for device-mapper discard processing: 2012-07-20 11:51:22 -07:00
media Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2012-07-24 13:34:56 -07:00
memory
memstick
message
mfd Linux 3.5-rc7 2012-07-15 21:49:21 +01:00
misc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-07-19 11:17:30 -07:00
mmc arm-soc: clk changes 2012-07-23 17:51:03 -07:00
mtd Change the default amount of eraseblocks which UBI reserves for bad block 2012-07-23 15:53:06 -07:00
net InfiniBand/RDMA changes for the 3.6 merge window: 2012-07-24 13:56:26 -07:00
nfc NFC: Add ISO 14443 type B protocol 2012-07-09 16:42:24 -04:00
nubus
of Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2012-07-24 10:01:50 -07:00
oprofile
parisc
parport
pci Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2012-07-24 13:34:56 -07:00
pcmcia
pinctrl arm-soc: pincontrol drivers 2012-07-23 17:36:53 -07:00
platform Power management updates for 3.6 2012-07-22 13:36:52 -07:00
pnp
power Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2012-07-24 13:34:56 -07:00
pps
ps3
ptp
rapidio
regulator Merge branch 'regulator-drivers' into regulator-next 2012-07-22 19:32:00 +01:00
remoteproc remoteproc: fix missing CONFIG_FW_LOADER configurations 2012-07-04 11:01:12 +03:00
rpmsg rpmsg: fix dependency on initialization order 2012-07-17 13:10:38 +03:00
rtc arm-soc: device tree description updates 2012-07-23 16:17:43 -07:00
s390 KVM updates for the 3.6 merge window 2012-07-24 12:01:20 -07:00
sbus
scsi Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2012-07-24 13:34:56 -07:00
sfi
sh Merge branch 'common/pinctrl' into sh-latest 2012-07-20 16:42:59 +09:00
sn
spi arm-soc: clk changes 2012-07-23 17:51:03 -07:00
ssb
staging Merge branch 'i2c-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging 2012-07-24 13:44:40 -07:00
target iscsi-target: Drop bogus struct file usage for iSCSI/SCTP 2012-07-21 02:44:13 -07:00
tc
thermal
tty Features: 2012-07-24 13:14:03 -07:00
uio
usb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2012-07-24 13:34:56 -07:00
uwb
vhost vhost: make vhost work queue visible 2012-07-22 01:22:23 +03:00
video Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2012-07-24 13:34:56 -07:00
virt
virtio virtio-balloon: fix add/get API use 2012-07-09 09:07:22 +09:30
vlynq
vme
w1
watchdog Merge git://www.linux-watchdog.org/linux-watchdog 2012-07-24 13:26:08 -07:00
xen xen PVonHVM: move shared_info to MMIO before kexec 2012-07-19 15:52:05 -04:00
zorro
Kconfig
Makefile