linux

Author	SHA1	Message	Date
Kurt Van Dijck	6d3a9a6854	net: fix validate_link_af in rtnetlink core I'm testing an API that uses IFLA_AF_SPEC attribute. In the rtnetlink core , the set_link_af() member of the rtnl_af_ops struct receives the nested attribute (as I expected), but the validate_link_af() member receives the parent attribute. IMO, this patch fixes this. Signed-off-by: Kurt Van Dijck <kurt.van.dijck@eia.be> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-27 14:39:21 -08:00
David S. Miller	62fa8a846d	net: Implement read-only protection and COW'ing of metrics. Routing metrics are now copy-on-write. Initially a route entry points it's metrics at a read-only location. If a routing table entry exists, it will point there. Else it will point at the all zero metric place-holder called 'dst_default_metrics'. The writeability state of the metrics is stored in the low bits of the metrics pointer, we have two bits left to spare if we want to store more states. For the initial implementation, COW is implemented simply via kmalloc. However future enhancements will change this to place the writable metrics somewhere else, in order to increase sharing. Very likely this "somewhere else" will be the inetpeer cache. Note also that this means that metrics updates may transiently fail if we cannot COW the metrics successfully. But even by itself, this patch should decrease memory usage and increase cache locality especially for routing workloads. In those cases the read-only metric copies stay in place and never get written to. TCP workloads where metrics get updated, and those rare cases where PMTU triggers occur, will take a very slight performance hit. But that hit will be alleviated when the long-term writable metrics move to a more sharable location. Since the metrics storage went from a u32 array of RTAX_MAX entries to what is essentially a pointer, some retooling of the dst_entry layout was necessary. Most importantly, we need to preserve the alignment of the reference count so that it doesn't share cache lines with the read-mostly state, as per Eric Dumazet's alignment assertion checks. The only non-trivial bit here is the move of the 'flags' member into the writeable cacheline. This is OK since we are always accessing the flags around the same moment when we made a modification to the reference count. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-26 20:51:05 -08:00
David S. Miller	b4e69ac670	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2011-01-26 13:49:30 -08:00
Eric Dumazet	26ad787962	pktgen: speedup fragmented skbs We spend lot of time clearing pages in pktgen. (Or not clearing them on ipv6 and leaking kernel memory) Since we dont modify them, we can use one zeroed page, and get references on it. This page can use NUMA affinity as well. Define pktgen_finalize_skb() helper, used both in ipv4 and ipv6 Results using skbs with one frag : Before patch : Result: OK: 608980458(c608978520+d1938) nsec, 1000000000 (100byte,1frags) 1642088pps 1313Mb/sec (1313670400bps) errors: 0 After patch : Result: OK: 345285014(c345283891+d1123) nsec, 1000000000 (100byte,1frags) 2896158pps 2316Mb/sec (2316926400bps) errors: 0 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-25 13:26:05 -08:00
Vlad Dogaru	a512b92b3a	net: add sysfs entry for device group The group of a network device can be queried or changed from userspace using sysfs. For example, considering sysfs mounted in /sys, one can change the group that interface lo belongs to: echo 1 > /sys/class/net/lo/group Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 23:23:28 -08:00
Eugene Teo	b7c7d01aae	net: clear heap allocation for ethtool_get_regs() There is a conflict between commit `b00916b1` and `a77f5db3`. This patch resolves the conflict by clearing the heap allocation in ethtool_get_regs(). Cc: stable@kernel.org Signed-off-by: Eugene Teo <eugeneteo@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 21:05:17 -08:00
Michał Mirosław	acd1130e87	net: reduce and unify printk level in netdev_fix_features() Reduce printk() levels to KERN_INFO in netdev_fix_features() as this will be used by ethtool and might spam dmesg unnecessarily. This converts the function to use netdev_info() instead of plain printk(). As a side effect, bonding and bridge devices will now log dropped features on every slave device change. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 15:45:15 -08:00
Michał Mirosław	04ed3e741d	net: change netdev->features to u32 Quoting Ben Hutchings: we presumably won't be defining features that can only be enabled on 64-bit architectures. Occurences found by `grep -r` on net/, drivers/net, include/ [ Move features and vlan_features next to each other in struct netdev, as per Eric Dumazet's suggestion -DaveM ] Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 15:32:47 -08:00
Michał Mirosław	57422dc530	net: Move check of checksum features to netdev_fix_features() Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 15:29:11 -08:00
Ben Hutchings	c445477d74	net: RPS: Enable hardware acceleration of RFS Allow drivers for multiqueue hardware with flow filter tables to accelerate RFS. The driver must: 1. Set net_device::rx_cpu_rmap to a cpu_rmap of the RX completion IRQs (in queue order). This will provide a mapping from CPUs to the queues for which completions are handled nearest to them. 2. Implement net_device_ops::ndo_rx_flow_steer. This operation adds or replaces a filter steering the given flow to the given RX queue, if possible. 3. Periodically remove filters for which rps_may_expire_flow() returns true. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 14:53:01 -08:00
Michal Schmidt	d1dc7abf2f	GRO: fix merging a paged skb after non-paged skbs Suppose that several linear skbs of the same flow were received by GRO. They were thus merged into one skb with a frag_list. Then a new skb of the same flow arrives, but it is a paged skb with data starting in its frags[]. Before adding the skb to the frag_list skb_gro_receive() will of course adjust the skb to throw away the headers. It correctly modifies the page_offset and size of the frag, but it leaves incorrect information in the skb: ->data_len is not decreased at all. ->len is decreased only by headlen, as if no change were done to the frag. Later in a receiving process this causes skb_copy_datagram_iovec() to return -EFAULT and this is seen in userspace as the result of the recv() syscall. In practice the bug can be reproduced with the sfc driver. By default the driver uses an adaptive scheme when it switches between using napi_gro_receive() (with skbs) and napi_gro_frags() (with pages). The bug is reproduced when under rx load with enough successful GRO merging the driver decides to switch from the former to the latter. Manual control is also possible, so reproducing this is easy with netcat: - on machine1 (with sfc): nc -l 12345 > /dev/null - on machine2: nc machine1 12345 < /dev/zero - on machine1: echo 1 > /sys/module/sfc/parameters/rx_alloc_method # use skbs echo 2 > /sys/module/sfc/parameters/rx_alloc_method # use pages - See that nc has quit suddenly. [v2: Modified by Eric Dumazet to avoid advancing skb->data past the end and to use a temporary variable.] Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 14:27:18 -08:00
David S. Miller	5bdc22a565	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/sched/sch_hfsc.c net/sched/sch_htb.c net/sched/sch_tbf.c	2011-01-24 14:09:35 -08:00
David S. Miller	e92427b289	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6	2011-01-24 13:17:06 -08:00
Eric Dumazet	c506653d35	net: arp_ioctl() must hold RTNL Commit `941666c2e3` "net: RCU conversion of dev_getbyhwaddr() and arp_ioctl()" introduced a regression, reported by Jamie Heilman. "arp -Ds 192.168.2.41 eth0 pub" triggered the ASSERT_RTNL() assert in pneigh_lookup() Removing RTNL requirement from arp_ioctl() was a mistake, just revert that part. Reported-by: Jamie Heilman <jamie@audible.transient.net> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 13:16:16 -08:00
Eric Dumazet	bb134d2298	net: netif_setup_tc() is static Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-21 13:08:27 -08:00
Patrick McHardy	ffa934f192	rtnetlink: fix link attribute validation with IFLA_GROUP rtnl_group_changelink() is invoked by rtnl_newlink() before the link attributes have been validated. Additionally the group changes are performed even if NLM_F_CREATE is specified and a new link is created, while more reasonable semantics would be to set the group value on the newly created link. Fix both problems by moving the rtnl_group_changelink() invocation down to the handling of non-existant links without NLM_F_CREATE() and add a dev_set_group() call to rtnl_create_link(). Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Vlad Dogaru <ddvlad@rosedu.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-20 23:28:54 -08:00
Eric Dumazet	6193d2be29	neigh: __rcu annotations fix some minor issues and sparse (__rcu) warnings Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-20 16:59:34 -08:00
Eric Dumazet	a2da570d62	net_sched: RCU conversion of stab This patch converts stab qdisc management to RCU, so that we can perform the qdisc_calculate_pkt_len() call before getting qdisc lock. This shortens the lock's held time in __dev_xmit_skb(). This permits more qdiscs to get TCQ_F_CAN_BYPASS status, avoiding lot of cache misses and so reducing latencies. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McHardy <kaber@trash.net> CC: Jesper Dangaard Brouer <hawk@diku.dk> CC: Jarek Poplawski <jarkao2@gmail.com> CC: Jamal Hadi Salim <hadi@cyberus.ca> CC: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-20 16:59:32 -08:00
Eric Dumazet	3fbd8758b0	net: dev_close_many() is static Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Octavian Purdila <opurdila@ixiacom.com> Reviewed-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-20 16:55:30 -08:00
John Fastabend	4f57c087de	net: implement mechanism for HW based QOS This patch provides a mechanism for lower layer devices to steer traffic using skb->priority to tx queues. This allows for hardware based QOS schemes to use the default qdisc without incurring the penalties related to global state and the qdisc lock. While reliably receiving skbs on the correct tx ring to avoid head of line blocking resulting from shuffling in the LLD. Finally, all the goodness from txq caching and xps/rps can still be leveraged. Many drivers and hardware exist with the ability to implement QOS schemes in the hardware but currently these drivers tend to rely on firmware to reroute specific traffic, a driver specific select_queue or the queue_mapping action in the qdisc. By using select_queue for this drivers need to be updated for each and every traffic type and we lose the goodness of much of the upstream work. Firmware solutions are inherently inflexible. And finally if admins are expected to build a qdisc and filter rules to steer traffic this requires knowledge of how the hardware is currently configured. The number of tx queues and the queue offsets may change depending on resources. Also this approach incurs all the overhead of a qdisc with filters. With the mechanism in this patch users can set skb priority using expected methods ie setsockopt() or the stack can set the priority directly. Then the skb will be steered to the correct tx queues aligned with hardware QOS traffic classes. In the normal case with single traffic class and all queues in this class everything works as is until the LLD enables multiple tcs. To steer the skb we mask out the lower 4 bits of the priority and allow the hardware to configure upto 15 distinct classes of traffic. This is expected to be sufficient for most applications at any rate it is more then the 8021Q spec designates and is equal to the number of prio bands currently implemented in the default qdisc. This in conjunction with a userspace application such as lldpad can be used to implement 8021Q transmission selection algorithms one of these algorithms being the extended transmission selection algorithm currently being used for DCB. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-19 23:31:10 -08:00
Vlad Dogaru	e7ed828f10	netlink: support setting devgroup parameters If a rtnetlink request specifies a negative or zero ifindex and has no interface name attribute, but has a group attribute, then the chenges are made to all the interfaces belonging to the specified group. Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org> Acked-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-19 23:31:10 -08:00
Vlad Dogaru	cbda10fa97	net_device: add support for network device groups Net devices can now be grouped, enabling simpler manipulation from userspace. This patch adds a group field to the net_device structure, as well as rtnetlink support to query and modify it. Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org> Acked-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-19 23:31:09 -08:00
Linus Torvalds	1268afe676	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (41 commits) sctp: user perfect name for Delayed SACK Timer option net: fix can_checksum_protocol() arguments swap Revert "netlink: test for all flags of the NLM_F_DUMP composite" gianfar: Fix misleading indentation in startup_gfar() net/irda/sh_irda: return to RX mode when TX error net offloading: Do not mask out NETIF_F_HW_VLAN_TX for vlan. USB CDC NCM: tx_fixup() race condition fix ns83820: Avoid bad pointer deref in ns83820_init_one(). ipv6: Silence privacy extensions initialization bnx2x: Update bnx2x version to 1.62.00-4 bnx2x: Fix AER setting for BCM57712 bnx2x: Fix BCM84823 LED behavior bnx2x: Mark full duplex on some external PHYs bnx2x: Fix BCM8073/BCM8727 microcode loading bnx2x: LED fix for BCM8727 over BCM57712 bnx2x: Common init will be executed only once after POR bnx2x: Swap BCM8073 PHY polarity if required iwlwifi: fix valid chain reading from EEPROM ath5k: fix locking in tx_complete_poll_work ath9k_hw: do PA offset calibration only on longcal interval ...	2011-01-19 20:25:45 -08:00
Eric Dumazet	d402786ea4	net: fix can_checksum_protocol() arguments swap commit `0363466866` (net offloading: Convert checksums to use centrally computed features.) mistakenly swapped can_checksum_protocol() arguments. This broke IPv6 on bnx2 for instance, on NIC without TCPv6 checksum offloads. Reported-by: Hans de Bruin <jmdebruin@xmsnet.nl> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-19 14:15:21 -08:00
David S. Miller	b8f3ab4290	Revert "netlink: test for all flags of the NLM_F_DUMP composite" This reverts commit `0ab03c2b14`. It breaks several things including the avahi daemon. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-19 13:34:20 -08:00
Eric Dumazet	80f8f1027b	net: filter: dont block softirqs in sk_run_filter() Packet filter (BPF) doesnt need to disable softirqs, being fully re-entrant and lock-less. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-18 21:33:05 -08:00
David S. Miller	a5db219f4c	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2011-01-18 16:28:31 -08:00
Jesse Gross	6ee400aafb	net offloading: Do not mask out NETIF_F_HW_VLAN_TX for vlan. In netif_skb_features() we return only the features that are valid for vlans if we have a vlan packet. However, we should not mask out NETIF_F_HW_VLAN_TX since it enables transmission of vlan tags and is obviously valid. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Jesse Gross <jesse@nicira.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-18 16:13:50 -08:00
Linus Torvalds	d018b6f4f1	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits) GRETH: resolve SMP issues and other problems GRETH: handle frame error interrupts GRETH: avoid writing bad speed/duplex when setting transfer mode GRETH: fixed skb buffer memory leak on frame errors GRETH: GBit transmit descriptor handling optimization GRETH: fix opening/closing GRETH: added raw AMBA vendor/device number to match against. cassini: Fix build bustage on x86. e1000e: consistent use of Rx/Tx vs. RX/TX/rx/tx in comments/logs e1000e: update Copyright for 2011 e1000: Avoid unhandled IRQ r8169: keep firmware in memory. netdev: tilepro: Use is_unicast_ether_addr helper etherdevice.h: Add is_unicast_ether_addr function ks8695net: Use default implementation of ethtool_ops::get_link ks8695net: Disable non-working ethtool operations USB CDC NCM: Don't deref NULL in cdc_ncm_rx_fixup() and don't use uninitialized variable. vxge: Remember to release firmware after upgrading firmware netdev: bfin_mac: Remove is_multicast_ether_addr use in netdev_for_each_mc_addr ipsec: update MAX_AH_AUTH_LEN to support sha512 ...	2011-01-14 13:25:30 -08:00
Eric Dumazet	1ac9ad1394	net: remove dev_txq_stats_fold() After recent changes, (percpu stats on vlan/tunnels...), we dont need anymore per struct netdev_queue tx_bytes/tx_packets/tx_dropped counters. Only remaining users are ixgbe, sch_teql, gianfar & macvlan : 1) ixgbe can be converted to use existing tx_ring counters. 2) macvlan incremented txq->tx_dropped, it can use the dev->stats.tx_dropped counter. 3) sch_teql : almost revert `ab35cd4b8f` (Use net_device internal stats) Now we have ndo_get_stats64(), use it, even for "unsigned long" fields (No need to bring back a struct net_device_stats) 4) gianfar adds a stats structure per tx queue to hold tx_bytes/tx_packets This removes a lockdep warning (and possible lockup) in rndis gadget, calling dev_get_stats() from hard IRQ context. Ref: http://www.spinics.net/lists/netdev/msg149202.html Reported-by: Neil Jones <neiljay@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Jarek Poplawski <jarkao2@gmail.com> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com> CC: Sandeep Gopalpet <sandeep.kumar@freescale.com> CC: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-13 21:44:34 -08:00
Linus Torvalds	27d189c02b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (46 commits) hwrng: via_rng - Fix memory scribbling on some CPUs crypto: padlock - Move padlock.h into include/crypto hwrng: via_rng - Fix asm constraints crypto: n2 - use __devexit not __exit in n2_unregister_algs crypto: mark crypto workqueues CPU_INTENSIVE crypto: mv_cesa - dont return PTR_ERR() of wrong pointer crypto: ripemd - Set module author and update email address crypto: omap-sham - backlog handling fix crypto: gf128mul - Remove experimental tag crypto: af_alg - fix af_alg memory_allocated data type crypto: aesni-intel - Fixed build with binutils 2.16 crypto: af_alg - Make sure sk_security is initialized on accept()ed sockets net: Add missing lockdep class names for af_alg include: Install linux/if_alg.h for user-space crypto API crypto: omap-aes - checkpatch --file warning fixes crypto: omap-aes - initialize aes module once per request crypto: omap-aes - unnecessary code removed crypto: omap-aes - error handling implementation improved crypto: omap-aes - redundant locking is removed crypto: omap-aes - DMA initialization fixes for OMAP off mode ...	2011-01-13 10:25:58 -08:00
Linus Torvalds	008d23e485	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits) Documentation/trace/events.txt: Remove obsolete sched_signal_send. writeback: fix global_dirty_limits comment runtime -> real-time ppc: fix comment typo singal -> signal drivers: fix comment typo diable -> disable. m68k: fix comment typo diable -> disable. wireless: comment typo fix diable -> disable. media: comment typo fix diable -> disable. remove doc for obsolete dynamic-printk kernel-parameter remove extraneous 'is' from Documentation/iostats.txt Fix spelling milisec -> ms in snd_ps3 module parameter description Fix spelling mistakes in comments Revert conflicting V4L changes i7core_edac: fix typos in comments mm/rmap.c: fix comment sound, ca0106: Fix assignment to 'channel'. hrtimer: fix a typo in comment init/Kconfig: fix typo anon_inodes: fix wrong function name in comment fix comment typos concerning "consistent" poll: fix a typo in comment ... Fix up trivial conflicts in: - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c) - fs/ext4/ext4.h Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.	2011-01-13 10:05:56 -08:00
David S. Miller	464143c911	Merge branch 'master' of git://1984.lsi.us.es/net-2.6	2011-01-12 18:58:40 -08:00
KOVACS Krisztian	2fc72c7b84	netfilter: fix compilation when conntrack is disabled but tproxy is enabled The IPv6 tproxy patches split IPv6 defragmentation off of conntrack, but failed to update the #ifdef stanzas guarding the defragmentation related fields and code in skbuff and conntrack related code in nf_defrag_ipv6.c. This patch adds the required #ifdefs so that IPv6 tproxy can truly be used without connection tracking. Original report: http://marc.info/?l=linux-netdev&m=129010118516341&w=2 Reported-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: KOVACS Krisztian <hidden@balabit.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2011-01-12 20:25:08 +01:00
Eric Dumazet	bfe0d0298f	net_sched: factorize qdisc stats handling HTB takes into account skb is segmented in stats updates. Generalize this to all schedulers. They should use qdisc_bstats_update() helper instead of manipulating bstats.bytes and bstats.packets Add bstats_update() helper too for classes that use gnet_stats_basic_packed fields. Note : Right now, TCQ_F_CAN_BYPASS shortcurt can be taken only if no stab is setup on qdisc. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-10 16:07:54 -08:00
Tom Herbert	36909ea438	net: Add alloc_netdev_mqs function Added alloc_netdev_mqs function which allows the number of transmit and receive queues to be specified independenty. alloc_netdev_mq was changed to a macro to call the new function. Also added alloc_etherdev_mqs with same purpose. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-10 16:05:30 -08:00
Jesse Gross	0363466866	net offloading: Convert checksums to use centrally computed features. In order to compute the features for other offloads (primarily scatter/gather), we need to first check the ability of the NIC to offload the checksum for the packet. Since we have already computed this, we can directly use the result instead of figuring it out again. Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-09 23:35:35 -08:00
Jesse Gross	02932ce9e2	net offloading: Convert skb_need_linearize() to use precomputed features. This switches skb_need_linearize() to use the features that have been centrally computed. In doing so, this fixes a problem where scatter/gather should not be used because the card does not support checksum offloading on that type of packet. On device registration we only check that some form of checksum offloading is available if scatter/gatther is enabled but we must also check at transmission time. Examples of this include IPv6 or vlan packets on a NIC that only supports IPv4 offloading. Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-09 23:35:35 -08:00
Jesse Gross	91ecb63c07	net offloading: Convert dev_gso_segment() to use precomputed features. This switches dev_gso_segment() to use the device features computed by the centralized routine. In doing so, it fixes a problem where it would always use dev->features, instead of those appropriate to the number of vlan tags if any are present. Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-09 23:35:34 -08:00
Jesse Gross	fc741216db	net offloading: Pass features into netif_needs_gso(). Now that there is a single function that can compute the device features relevant to a packet, we don't want to run it for each offload. This converts netif_needs_gso() to take the features of the device, rather than computing them itself. Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-09 23:35:34 -08:00
Jesse Gross	f01a5236bd	net offloading: Generalize netif_get_vlan_features(). netif_get_vlan_features() is currently only used by netif_needs_gso(), so it only concerns itself with GSO features. However, several other places also should take into account the contents of the packet when deciding whether to offload to hardware. This generalizes the function to return features about all of the various forms of offloading. Since offloads tend to be linked together, this avoids duplicating the logic in each location (i.e. the scatter/gather code also needs the checksum logic). Suggested-by: Michał Mirosław <mirqus@gmail.com> Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-09 23:35:33 -08:00
Jesse Gross	9497a0518e	net offloading: Accept NETIF_F_HW_CSUM for all protocols. We currently only have software fallback for one type of checksum: the TCP/UDP one's complement. This means that a protocol that uses hardware offloading for a different type of checksum (FCoE, SCTP) must directly check the device's features and do the right thing ahead of time. By the time we get to dev_can_checksum(), we're only deciding whether to apply the one algorithm in software or hardware. NETIF_F_HW_CSUM has the same capabilities as the software version, so we should always use it if present. The primary advantage of this is multiply tagged vlans can use hardware checksumming. Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-09 23:35:33 -08:00
Randy Dunlap	697d0e338c	net: fix kernel-doc warning in core/filter.c Fix new kernel-doc notation warning in net/core/filter.c: Warning(net/core/filter.c:172): No description found for parameter 'fentry' Warning(net/core/filter.c:172): Excess function parameter 'filter' description in 'sk_run_filter' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-09 16:26:51 -08:00
Jan Engelhardt	0ab03c2b14	netlink: test for all flags of the NLM_F_DUMP composite Due to NLM_F_DUMP is composed of two bits, NLM_F_ROOT \| NLM_F_MATCH, when doing "if (x & NLM_F_DUMP)", it tests for _either_ of the bits being set. Because NLM_F_MATCH's value overlaps with NLM_F_EXCL, non-dump requests with NLM_F_EXCL set are mistaken as dump requests. Substitute the condition to test for _all_ bits being set. Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-09 16:25:03 -08:00
Linus Torvalds	23d69b09b7	Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (33 commits) usb: don't use flush_scheduled_work() speedtch: don't abuse struct delayed_work media/video: don't use flush_scheduled_work() media/video: explicitly flush request_module work ioc4: use static work_struct for ioc4_load_modules() init: don't call flush_scheduled_work() from do_initcalls() s390: don't use flush_scheduled_work() rtc: don't use flush_scheduled_work() mmc: update workqueue usages mfd: update workqueue usages dvb: don't use flush_scheduled_work() leds-wm8350: don't use flush_scheduled_work() mISDN: don't use flush_scheduled_work() macintosh/ams: don't use flush_scheduled_work() vmwgfx: don't use flush_scheduled_work() tpm: don't use flush_scheduled_work() sonypi: don't use flush_scheduled_work() hvsi: don't use flush_scheduled_work() xen: don't use flush_scheduled_work() gdrom: don't use flush_scheduled_work() ... Fixed up trivial conflict in drivers/media/video/bt8xx/bttv-input.c as per Tejun.	2011-01-07 16:58:04 -08:00
Eric Dumazet	2c6607c611	net: add POLLPRI to sock_def_readable() Leonardo Chiquitto found poll() could block forever on tcp sockets and Urgent data was received, if the event flag only contains POLLPRI. He did a bisection and found commit `4938d7e023` (poll: avoid extra wakeups in select/poll) was the source of the problem. Problem is TCP sockets use standard sock_def_readable() function for their sk_data_ready() handler, and sock_def_readable() doesnt signal POLLPRI. Only TCP is affected by the problem. Adding POLLPRI to the list of flags might trigger unnecessary schedules, but URGENT handling is such a seldom used feature this seems a good compromise. Thanks a lot to Leonardo for providing the bisection result and a test program as well. Reference : http://www.spinics.net/lists/netdev/msg151793.html Reported-and-bisected-by: Leonardo Chiquitto <leonardo.lists@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-06 10:54:29 -08:00
David S. Miller	17f7f4d9fc	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/ipv4/fib_frontend.c	2010-12-26 22:37:05 -08:00
David S. Miller	e058464990	Revert "ipv4: Allow configuring subnets as local addresses" This reverts commit `4465b46900`. Conflicts: net/ipv4/fib_frontend.c As reported by Ben Greear, this causes regressions: > Change `4465b46900` caused rules > to stop matching the input device properly because the > FLOWI_FLAG_MATCH_ANY_IIF is always defined in ip_dev_find(). > > This breaks rules such as: > > ip rule add pref 512 lookup local > ip rule del pref 0 lookup local > ip link set eth2 up > ip -4 addr add 172.16.0.102/24 broadcast 172.16.0.255 dev eth2 > ip rule add to 172.16.0.102 iif eth2 lookup local pref 10 > ip rule add iif eth2 lookup 10001 pref 20 > ip route add 172.16.0.0/24 dev eth2 table 10001 > ip route add unreachable 0/0 table 10001 > > If you had a second interface 'eth0' that was on a different > subnet, pinging a system on that interface would fail: > > [root@ct503-60 ~]# ping 192.168.100.1 > connect: Invalid argument Reported-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-23 12:03:57 -08:00
Jiri Kosina	4b7bd36470	Merge branch 'master' into for-next Conflicts: MAINTAINERS arch/arm/mach-omap2/pm24xx.c drivers/scsi/bfa/bfa_fcpim.c Needed to update to apply fixes for which the old branch was too outdated.	2010-12-22 18:57:02 +01:00
Eric Dumazet	12b16dadbc	filter: optimize accesses to ancillary data We can translate pseudo load instructions at filter check time to dedicated instructions to speed up filtering and avoid one switch(). libpcap currently uses SKF_AD_PROTOCOL, but custom filters probably use other ancillary accesses. Note : I made the assertion that ancillary data was always accessed with BPF_LD\|BPF_?\|BPF_ABS instructions, not with BPF_LD\|BPF_?\|BPF_IND ones (offset given by K constant, not by K + X register) On x86_64, this saves a few bytes of text : # size net/core/filter.o.* text data bss dec hex filename 4864 0 0 4864 1300 net/core/filter.o.new 4944 0 0 4944 1350 net/core/filter.o.old Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-21 12:30:12 -08:00
Eric Dumazet	70978182d4	net: timestamp cloned packet in dev_queue_xmit_nit Le vendredi 17 décembre 2010 à 10:26 +0100, Eric Dumazet a écrit : > > I think we can add this after latest Changli patch : > > He does one skb_clone() before calling the sniffers. > We could set timestamp on this clone, instead of original skb. > > Problem solved. > [PATCH net-next-2.6] net: timestamp cloned packet in dev_queue_xmit_nit Now we do one clone of skb if at least one sniffer might take packet, we also can do the skb timestamping on the clone and let original packet unchanged. This is a generalization of commit `8caf153974` (net: sch_netem: Fix an inconsistency in ingress netem timestamps.) This way, we can have a good idea when packets are delivered to our stack (tcpdump -i ifb0), while a tcpdump on original device gives timestamps right before ingressing. This also speedup our stack, avoiding taking timestamps if not needed. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Changli Gao <xiaosuo@gmail.com> Cc: Patrick McHardy <kaber@trash.net> Cc: Jarek Poplawski <jarkao2@gmail.com> Acked-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-21 10:50:38 -08:00
Joe Perches	58231186c4	pktgen: Remove unnecessary prefix from pr_<level> Remove "pktgen: " prefix string from one pr_info. pr_fmt adds it, so this is a duplicate. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-20 10:30:06 -08:00
Shan Wei	4c306a9291	net: kill unused macros These macros never be used, so remove them. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-19 21:59:35 -08:00
Changli Gao	71d9dec24d	net: increase skb->users instead of skb_clone() In dev_queue_xmit_nit(), we have to clone skbs as we need to mangle skbs, however, we don't need to clone skbs for all the packet_types. Except for the first packet_type, we increase skb->users instead of skb_clone(). Signed-off-by: Changli Gao <xiaosuo@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-19 21:50:20 -08:00
David S. Miller	b4aa9e05a6	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/bnx2x/bnx2x.h drivers/net/wireless/iwlwifi/iwl-1000.c drivers/net/wireless/iwlwifi/iwl-6000.c drivers/net/wireless/iwlwifi/iwl-core.h drivers/vhost/vhost.c	2010-12-17 12:27:22 -08:00
Michał Mirosław	55508d601d	net: Use skb_checksum_start_offset() Replace skb->csum_start - skb_headroom(skb) with skb_checksum_start_offset(). Note for usb/smsc95xx: skb->data - skb->head == skb_headroom(skb). Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-16 14:43:14 -08:00
Octavian Purdila	fcbdf09d96	net: fix nulls list corruptions in sk_prot_alloc Special care is taken inside sk_port_alloc to avoid overwriting skc_node/skc_nulls_node. We should also avoid overwriting skc_bind_node/skc_portaddr_node. The patch fixes the following crash: BUG: unable to handle kernel paging request at fffffffffffffff0 IP: [<ffffffff812ec6dd>] udp4_lib_lookup2+0xad/0x370 [<ffffffff812ecc22>] __udp4_lib_lookup+0x282/0x360 [<ffffffff812ed63e>] __udp4_lib_rcv+0x31e/0x700 [<ffffffff812bba45>] ? ip_local_deliver_finish+0x65/0x190 [<ffffffff812bbbf8>] ? ip_local_deliver+0x88/0xa0 [<ffffffff812eda35>] udp_rcv+0x15/0x20 [<ffffffff812bba45>] ip_local_deliver_finish+0x65/0x190 [<ffffffff812bbbf8>] ip_local_deliver+0x88/0xa0 [<ffffffff812bb2cd>] ip_rcv_finish+0x32d/0x6f0 [<ffffffff8128c14c>] ? netif_receive_skb+0x99c/0x11c0 [<ffffffff812bb94b>] ip_rcv+0x2bb/0x350 [<ffffffff8128c14c>] netif_receive_skb+0x99c/0x11c0 Signed-off-by: Leonard Crestez <lcrestez@ixiacom.com> Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-16 14:26:56 -08:00
Octavian Purdila	443457242b	net: factorize sync-rcu call in unregister_netdevice_many Add dev_close_many and dev_deactivate_many to factorize another sync-rcu operation on the netdevice unregister path. $ modprobe dummy numdummies=10000 $ ip link set dev dummy* up $ time rmmod dummy Without the patch With the patch real 0m 24.63s real 0m 5.15s user 0m 0.00s user 0m 0.00s sys 0m 6.05s sys 0m 5.14s Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-16 14:04:44 -08:00
Changli Gao	b236da6931	net: use NUMA_NO_NODE instead of the magic number -1 Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-16 13:16:06 -08:00
Vladislav Zolotarov	a3d22a68d7	bnx2x: Take the distribution range definition out of skb_tx_hash() Move the calcualation of the Tx hash for a given hash range into a separate function and define the skb_tx_hash(), which calculates a Tx hash for a [0; dev->real_num_tx_queues - 1] hash values range, using this function (__skb_tx_hash()). Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-16 13:15:53 -08:00
Tejun Heo	afe2c511fb	workqueue: convert cancel_rearming_delayed_work[queue]() users to cancel_delayed_work_sync() cancel_rearming_delayed_work[queue]() has been superceded by cancel_delayed_work_sync() quite some time ago. Convert all the in-kernel users. The conversions are completely equivalent and trivial. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Evgeniy Polyakov <zbr@ioremap.net> Cc: Jeff Garzik <jgarzik@pobox.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: netdev@vger.kernel.org Cc: Anton Vorontsov <cbou@mail.ru> Cc: David Woodhouse <dwmw2@infradead.org> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Neil Brown <neilb@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: xfs-masters@oss.sgi.com Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: netfilter-devel@vger.kernel.org Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: linux-nfs@vger.kernel.org	2010-12-15 10:56:11 +01:00
Eric Dumazet	a19faf0250	net: fix skb_defer_rx_timestamp() After commit `c1f19b51d1` (net: support time stamping in phy devices.), kernel might crash if CONFIG_NETWORK_PHY_TIMESTAMPING=y and skb_defer_rx_timestamp() handles a packet without an ethernet header. Fixes kernel bugzilla #24102 Reference: https://bugzilla.kernel.org/show_bug.cgi?id=24102 Reported-and-tested-by: Andrew Watts <akwatts@ymail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-10 16:20:56 -08:00
Ben Hutchings	e596e6e4d5	ethtool: Report link-down while interface is down While an interface is down, many implementations of ethtool_ops::get_link, including the default, ethtool_op_get_link(), will report the last link state seen while the interface was up. In general the current physical link state is not available if the interface is down. Define ETHTOOL_GLINK to reflect whether the interface and any physical port have a working link, and consistently return 0 when the interface is down. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-10 15:55:23 -08:00
Junchang Wang	a8d764b983	pktgen: adding prefetchw() call We know for sure pktgen is going to write skb->data right after *_alloc_skb, causing unnecessary cache misses. Idea is to add a prefetchw() call to prefetch the first cache line indicated by skb->data. On systems with Adjacent Cache Line Prefetch, it's probably two cache lines are prefetched. With this patch, pktgen on Intel SR1625 server with two E5530 quad-core processors and a single ixgbe-based NIC went from 8.63Mpps to 9.03Mpps, with 4.6% improvement. Signed-off-by: Junchang Wang <junchangwang@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-10 15:36:52 -08:00
Eric Dumazet	4bc65dd8d8	filter: use size of fetched data in __load_pointer() __load_pointer() checks data we fetch from skb is included in head portion, but assumes we fetch one byte, instead of up to four. This wont crash because we have extra bytes (struct skb_shared_info) after head, but this can read uninitialized bytes. Fix this using size of the data (1, 2, 4 bytes) in the test. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-09 20:47:04 -08:00
Eric Dumazet	68835aba4d	net: optimize INET input path further Followup of commit `b178bb3dfc` (net: reorder struct sock fields) Optimize INET input path a bit further, by : 1) moving sk_refcnt close to sk_lock. This reduces number of dirtied cache lines by one on 64bit arches (and 64 bytes cache line size). 2) moving inet_daddr & inet_rcv_saddr at the beginning of sk (same cache line than hash / family / bound_dev_if / nulls_node) This reduces number of accessed cache lines in lookups by one, and dont increase size of inet and timewait socks. inet and tw sockets now share same place-holder for these fields. Before patch : offsetof(struct sock, sk_refcnt) = 0x10 offsetof(struct sock, sk_lock) = 0x40 offsetof(struct sock, sk_receive_queue) = 0x60 offsetof(struct inet_sock, inet_daddr) = 0x270 offsetof(struct inet_sock, inet_rcv_saddr) = 0x274 After patch : offsetof(struct sock, sk_refcnt) = 0x44 offsetof(struct sock, sk_lock) = 0x48 offsetof(struct sock, sk_receive_queue) = 0x68 offsetof(struct inet_sock, inet_daddr) = 0x0 offsetof(struct inet_sock, inet_rcv_saddr) = 0x4 compute_score() (udp or tcp) now use a single cache line per ignored item, instead of two. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-09 20:05:58 -08:00
David S. Miller	fe6c791570	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/ath/ath9k/ar9003_eeprom.c net/llc/af_llc.c	2010-12-08 13:47:38 -08:00
Eric Dumazet	15c2d75f49	net: call dev_queue_xmit_nit() after skb_dst_drop() Avoid some atomic ops on dst refcount, calling dev_queue_xmit_nit() after skb_dst_drop() in dev_hard_start_xmit(). When queueing a packet into af_packet socket, we drop dst anyway. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-08 10:39:54 -08:00
Eric Dumazet	62ab081213	filter: constify sk_run_filter() sk_run_filter() doesnt write on skb, change its prototype to reflect this. Fix two af_packet comments. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-08 10:30:34 -08:00
Eric Dumazet	941666c2e3	net: RCU conversion of dev_getbyhwaddr() and arp_ioctl() Le dimanche 05 décembre 2010 à 09:19 +0100, Eric Dumazet a écrit : > Hmm.. > > If somebody can explain why RTNL is held in arp_ioctl() (and therefore > in arp_req_delete()), we might first remove RTNL use in arp_ioctl() so > that your patch can be applied. > > Right now it is not good, because RTNL wont be necessarly held when you > are going to call arp_invalidate() ? While doing this analysis, I found a refcount bug in llc, I'll send a patch for net-2.6 Meanwhile, here is the patch for net-next-2.6 Your patch then can be applied after mine. Thanks [PATCH] net: RCU conversion of dev_getbyhwaddr() and arp_ioctl() dev_getbyhwaddr() was called under RTNL. Rename it to dev_getbyhwaddr_rcu() and change all its caller to now use RCU locking instead of RTNL. Change arp_ioctl() to use RCU instead of RTNL locking. Note: this fix a dev refcount bug in llc Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-08 10:07:24 -08:00
Changli Gao	aa94210411	net: init ingress queue The dev field of ingress queue is forgot to initialized, then NULL pointer dereference happens in qdisc_alloc(). Move inits of tx queues to netif_alloc_netdev_queues(). Signed-off-by: Changli Gao <xiaosuo@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-08 09:43:27 -08:00
Miloslav Trmač	6f107b5861	net: Add missing lockdep class names for af_alg Signed-off-by: Miloslav Trmač <mitr@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-12-08 14:35:34 +08:00
David Shwatrz	8917a3c0b7	Fix a typo in datagram.c and sctp/socket.c. Hi, This patch fixes a typo in net/core/datagram.c and in net/sctp/socket.c Regards, David Shwartz Signed-off-by: David Shwartz <dshwatrz@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-06 13:10:11 -08:00
Eric Dumazet	2d5311e4e8	filter: add a security check at install time We added some security checks in commit `57fe93b374` (filter: make sure filters dont read uninitialized memory) to close a potential leak of kernel information to user. This added a potential extra cost at run time, while we can perform a check of the filter itself, to make sure a malicious user doesnt try to abuse us. This patch adds a check_loads() function, whole unique purpose is to make this check, allocating a temporary array of mask. We scan the filter and propagate a bitmask information, telling us if a load M(K) is allowed because a previous store M(K) is guaranteed. (So that sk_run_filter() can possibly not read unitialized memory) Note: this can uncover application bug, denying a filter attach, previously allowed. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dan Rosenberg <drosenberg@vsecurity.com> Cc: Changli Gao <xiaosuo@gmail.com> Acked-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-06 12:59:09 -08:00
Eric Dumazet	da2033c282	filter: add SKF_AD_RXHASH and SKF_AD_CPU Add SKF_AD_RXHASH and SKF_AD_CPU to filter ancillary mechanism, to be able to build advanced filters. This can help spreading packets on several sockets with a fast selection, after RPS dispatch to N cpus for example, or to catch a percentage of flows in one queue. tcpdump -s 500 "cpu = 1" : [0] ld CPU [1] jeq #1 jt 2 jf 3 [2] ret #500 [3] ret #0 # take 12.5 % of flows (average) tcpdump -s 1000 "rxhash & 7 = 2" : [0] ld RXHASH [1] and #7 [2] jeq #2 jt 3 jf 4 [3] ret #1000 [4] ret #0 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Rui <wirelesser@gmail.com> Acked-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-06 12:59:05 -08:00
Michał Mirosław	7903264402	net: Fix too optimistic NETIF_F_HW_CSUM features NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM+NETIF_F_IPV6_CSUM, but some drivers miss the difference. Fix this and also fix UFO dependency on checksumming offload as it makes the same mistake in assumptions. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Acked-by: Jon Mason <jon.mason@exar.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-06 12:59:04 -08:00
Eric Dumazet	46bcf14f44	filter: fix sk_filter rcu handling Pavel Emelyanov tried to fix a race between sk_filter_(de\|at)tach and sk_clone() in commit `47e958eac2` Problem is we can have several clones sharing a common sk_filter, and these clones might want to sk_filter_attach() their own filters at the same time, and can overwrite old_filter->rcu, corrupting RCU queues. We can not use filter->rcu without being sure no other thread could do the same thing. Switch code to a more conventional ref-counting technique : Do the atomic decrement immediately and queue one rcu call back when last reference is released. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-06 09:29:43 -08:00
Changli Gao	ca44ac3861	net: don't reallocate skb->head unless the current one hasn't the needed extra size or is shared skb head being allocated by kmalloc(), it might be larger than what actually requested because of discrete kmem caches sizes. Before reallocating a new skb head, check if the current one has the needed extra size. Do this check only if skb head is not shared. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-03 10:59:47 -08:00
David S. Miller	493f377d6d	tcp: Add timewait recycling bits to ipv6 connect code. This will also improve handling of ipv6 tcp socket request backlog when syncookies are not enabled. When backlog becomes very deep, last quarter of backlog is limited to validated destinations. Previously only ipv4 implemented this logic, but now ipv6 does too. Now we are only one step away from enabling timewait recycling for ipv6, and that step is simply filling in the implementation of tcp_v6_get_peer() and tcp_v6_tw_get_peer(). Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-02 12:14:29 -08:00
Eric Dumazet	f2cd2d3e9b	net sched: use xps information for qdisc NUMA affinity Allocate qdisc memory according to NUMA properties of cpus included in xps map. To be effective, qdisc should be (re)setup after changes of /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus I added a numa_node field in struct netdev_queue, containing NUMA node if all cpus included in xps_cpus share same node, else -1. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-01 12:47:42 -08:00
Eric Dumazet	a417786948	xps: add __rcu annotations Avoid sparse warnings : add __rcu annotations and use rcu_dereference_protected() where necessary. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-29 09:43:13 -08:00
Eric Dumazet	b02038a17b	xps: NUMA allocations for per cpu data store_xps_map() allocates maps that are used by single cpu, it makes sense to use NUMA allocations. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-29 09:43:13 -08:00
Tom Herbert	bf26414510	xps: Add CONFIG_XPS This patch adds XPS_CONFIG option to enable and disable XPS. This is done in the same manner as RPS_CONFIG. This is also fixes build failure in XPS code when SMP is not enabled. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-28 18:24:14 -08:00
Eric Dumazet	5a0d2268d2	net: add netif_tx_queue_frozen_or_stopped When testing struct netdev_queue state against FROZEN bit, we also test XOFF bit. We can test both bits at once and save some cycles. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-28 10:47:18 -08:00
Thomas Graf	cf7afbfeb8	rtnl: make link af-specific updates atomic As David pointed out correctly, updates to af-specific attributes are currently not atomic. If multiple changes are requested and one of them fails, previous updates may have been applied already leaving the link behind in a undefined state. This patch splits the function parse_link_af() into two functions validate_link_af() and set_link_at(). validate_link_af() is placed to validate_linkmsg() check for errors as early as possible before any changes to the link have been made. set_link_af() is called to commit the changes later. This method is not fail proof, while it is currently sufficient to make set_link_af() inerrable and thus 100% atomic, the validation function method will not be able to detect all error scenarios in the future, there will likely always be errors depending on states which are f.e. not protected by rtnl_mutex and thus may change between validation and setting. Also, instead of silently ignoring unknown address families and config blocks for address families which did not register a set function the errors EAFNOSUPPORT respectively EOPNOSUPPORT are returned to avoid comitting 4 out of 5 update requests without notifying the user. Signed-off-by: Thomas Graf <tgraf@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-27 22:56:08 -08:00
Tom Herbert	1d24eb4815	xps: Transmit Packet Steering This patch implements transmit packet steering (XPS) for multiqueue devices. XPS selects a transmit queue during packet transmission based on configuration. This is done by mapping the CPU transmitting the packet to a queue. This is the transmit side analogue to RPS-- where RPS is selecting a CPU based on receive queue, XPS selects a queue based on the CPU (previously there was an XPS patch from Eric Dumazet, but that might more appropriately be called transmit completion steering). Each transmit queue can be associated with a number of CPUs which will use the queue to send packets. This is configured as a CPU mask on a per queue basis in: /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus The mappings are stored per device in an inverted data structure that maps CPUs to queues. In the netdevice structure this is an array of num_possible_cpu structures where each structure holds and array of queue_indexes for queues which that CPU can use. The benefits of XPS are improved locality in the per queue data structures. Also, transmit completions are more likely to be done nearer to the sending thread, so this should promote locality back to the socket on free (e.g. UDP). The benefits of XPS are dependent on cache hierarchy, application load, and other factors. XPS would nominally be configured so that a queue would only be shared by CPUs which are sharing a cache, the degenerative configuration woud be that each CPU has it's own queue. Below are some benchmark results which show the potential benfit of this patch. The netperf test has 500 instances of netperf TCP_RR test with 1 byte req. and resp. bnx2x on 16 core AMD XPS (16 queues, 1 TX queue per CPU) 1234K at 100% CPU No XPS (16 queues) 996K at 100% CPU Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-24 11:44:20 -08:00
Tom Herbert	3853b5841c	xps: Improvements in TX queue selection In dev_pick_tx, don't do work in calculating queue index or setting the index in the sock unless the device has more than one queue. This allows the sock to be set only with a queue index of a multi-queue device which is desirable if device are stacked like in a tunnel. We also allow the mapping of a socket to queue to be changed. To maintain in order packet transmission a flag (ooo_okay) has been added to the sk_buff structure. If a transport layer sets this flag on a packet, the transmit queue can be changed for the socket. Presumably, the transport would set this if there was no possbility of creating OOO packets (for instance, there are no packets in flight for the socket). This patch includes the modification in TCP output for setting this flag. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-24 11:44:19 -08:00
Eric Dumazet	bba14de987	scm: lower SCM_MAX_FD Lower SCM_MAX_FD from 255 to 253 so that allocations for scm_fp_list are halved. (commit `f8d570a4` added two pointers in this structure) scm_fp_dup() should not copy whole structure (and trigger kmemcheck warnings), but only the used part. While we are at it, only allocate needed size. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-24 11:16:43 -08:00
Eric Dumazet	551eaff1b3	pktgen: allow faster module unload Unloading pktgen module needs ~6 seconds on a 64 cpus machine, to stop 64 kthreads. Add a pktgen_exiting variable to let kernel threads die faster, so that kthread_stop() doesnt have to wait too long for them. This variable is not tested in fast path. Note : Before exiting from pktgen_thread_worker(), we must make sure kthread_stop() is waiting for this thread to be stopped, like its done in kernel/softirq.c Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-21 10:26:44 -08:00
Eric Dumazet	7a1c8e5ab1	net: allow GFP_HIGHMEM in __vmalloc() We forgot to use __GFP_HIGHMEM in several __vmalloc() calls. In ceph, add the missing flag. In fib_trie.c, xfrm_hash.c and request_sock.c, using vzalloc() is cleaner and allows using HIGHMEM pages as well. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-21 10:04:04 -08:00
David S. Miller	24912420e9	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/bonding/bond_main.c net/core/net-sysfs.c net/ipv6/addrconf.c	2010-11-19 13:13:47 -08:00
Eric Dumazet	c26aed40f4	filter: use reciprocal divide At compile time, we can replace the DIV_K instruction (divide by a constant value) by a reciprocal divide. At exec time, the expensive divide is replaced by a multiply, a less expensive operation on most processors. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-19 10:06:54 -08:00
Eric Dumazet	8c1592d68b	filter: cleanup codes[] init Starting the translated instruction to 1 instead of 0 allows us to remove one descrement at check time and makes codes[] array init cleaner. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-19 10:04:23 -08:00
Eric Dumazet	93aaae2e01	filter: optimize sk_run_filter Remove pc variable to avoid arithmetic to compute fentry at each filter instruction. Jumps directly manipulate fentry pointer. As the last instruction of filter[] is guaranteed to be a RETURN, and all jumps are before the last instruction, we dont need to check filter bounds (number of instructions in filter array) at each iteration, so we remove it from sk_run_filter() params. On x86_32 remove f_k var introduced in commit `57fe93b374` (filter: make sure filters dont read uninitialized memory) Note : We could use a CONFIG_ARCH_HAS_{FEW\|MANY}_REGISTERS in order to avoid too many ifdefs in this code. This helps compiler to use cpu registers to hold fentry and A accumulator. On x86_32, this saves 401 bytes, and more important, sk_run_filter() runs much faster because less register pressure (One less conditional branch per BPF instruction) # size net/core/filter.o net/core/filter_pre.o text data bss dec hex filename 2948 0 0 2948 b84 net/core/filter.o 3349 0 0 3349 d15 net/core/filter_pre.o on x86_64 : # size net/core/filter.o net/core/filter_pre.o text data bss dec hex filename 5173 0 0 5173 1435 net/core/filter.o 5224 0 0 5224 1468 net/core/filter_pre.o Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-19 09:49:59 -08:00
Randy Dunlap	0302b8622c	net: fix kernel-doc for sk_filter_rcu_release Fix kernel-doc warning for sk_filter_rcu_release(): Warning(net/core/filter.c:586): missing initial short description on line: * sk_filter_rcu_release: Release a socket filter by rcu_head Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-19 09:27:15 -08:00
Changli Gao	4c3710afbc	net: move definitions of BPF_S_* to net/core/filter.c BPF_S_* are used internally, should not be exposed to the others. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-18 10:59:51 -08:00
Tetsuo Handa	cba328fc5e	filter: Optimize instruction revalidation code. Since repeating u16 value to u8 value conversion using switch() clause's case statement is wasteful, this patch introduces u16 to u8 mapping table and removes most of case statements. As a result, the size of net/core/filter.o is reduced by about 29% on x86. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-18 10:58:35 -08:00
John Fastabend	9e50e3ac5a	net: add priority field to pktgen Add option to set skb priority to pktgen. Useful for testing QOS features. Also by running pktgen on the vlan device the qdisc on the real device can be tested. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-18 10:43:07 -08:00
John Fastabend	7d8e76bf9a	net: zero kobject in rx_queue_release netif_set_real_num_rx_queues() can decrement and increment the number of rx queues. For example ixgbe does this as features and offloads are toggled. Presumably this could also happen across down/up on most devices if the available resources changed (cpu offlined). The kobject needs to be zero'd in this case so that the state is not preserved across kobject_put()/kobject_init_and_add(). This resolves the following error report. ixgbe 0000:03:00.0: eth2: NIC Link is Up 10 Gbps, Flow Control: RX/TX kobject (ffff880324b83210): tried to init an initialized object, something is seriously wrong. Pid: 1972, comm: lldpad Not tainted 2.6.37-rc18021qaz+ #169 Call Trace: [<ffffffff8121c940>] kobject_init+0x3a/0x83 [<ffffffff8121cf77>] kobject_init_and_add+0x23/0x57 [<ffffffff8107b800>] ? mark_lock+0x21/0x267 [<ffffffff813c6d11>] net_rx_queue_update_kobjects+0x63/0xc6 [<ffffffff813b5e0e>] netif_set_real_num_rx_queues+0x5f/0x78 [<ffffffffa0261d49>] ixgbe_set_num_queues+0x1c6/0x1ca [ixgbe] [<ffffffffa0262509>] ixgbe_init_interrupt_scheme+0x1e/0x79c [ixgbe] [<ffffffffa0274596>] ixgbe_dcbnl_set_state+0x167/0x189 [ixgbe] Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-18 09:41:40 -08:00
John Fastabend	9ea19481db	net: zero kobject in rx_queue_release netif_set_real_num_rx_queues() can decrement and increment the number of rx queues. For example ixgbe does this as features and offloads are toggled. Presumably this could also happen across down/up on most devices if the available resources changed (cpu offlined). The kobject needs to be zero'd in this case so that the state is not preserved across kobject_put()/kobject_init_and_add(). This resolves the following error report. ixgbe 0000:03:00.0: eth2: NIC Link is Up 10 Gbps, Flow Control: RX/TX kobject (ffff880324b83210): tried to init an initialized object, something is seriously wrong. Pid: 1972, comm: lldpad Not tainted 2.6.37-rc18021qaz+ #169 Call Trace: [<ffffffff8121c940>] kobject_init+0x3a/0x83 [<ffffffff8121cf77>] kobject_init_and_add+0x23/0x57 [<ffffffff8107b800>] ? mark_lock+0x21/0x267 [<ffffffff813c6d11>] net_rx_queue_update_kobjects+0x63/0xc6 [<ffffffff813b5e0e>] netif_set_real_num_rx_queues+0x5f/0x78 [<ffffffffa0261d49>] ixgbe_set_num_queues+0x1c6/0x1ca [ixgbe] [<ffffffffa0262509>] ixgbe_init_interrupt_scheme+0x1e/0x79c [ixgbe] [<ffffffffa0274596>] ixgbe_dcbnl_set_state+0x167/0x189 [ixgbe] Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-17 12:27:46 -08:00
Thomas Graf	f8ff182c71	rtnetlink: Link address family API Each net_device contains address family specific data such as per device settings and statistics. We already expose this data via procfs/sysfs and partially netlink. The netlink method requires the requester to send one RTM_GETLINK request for each address family it wishes to receive data of and then merge this data itself. This patch implements a new API which combines all address family specific link data in a new netlink attribute IFLA_AF_SPEC. IFLA_AF_SPEC contains a sequence of nested attributes, one for each address family which in turn defines the structure of its own attribute. Example: [IFLA_AF_SPEC] = { [AF_INET] = { [IFLA_INET_CONF] = ..., }, [AF_INET6] = { [IFLA_INET6_FLAGS] = ..., [IFLA_INET6_CONF] = ..., } } The API also allows for address families to implement a function which parses the IFLA_AF_SPEC attribute sent by userspace to implement address family specific link options. Signed-off-by: Thomas Graf <tgraf@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-17 11:28:24 -08:00
David S. Miller	6b35308850	net: Export netif_get_vlan_features(). ERROR: "netif_get_vlan_features" [drivers/net/xen-netfront.ko] undefined! Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-15 20:15:03 -08:00
Tom Herbert	fe8222406c	net: Simplify RX queue allocation This patch move RX queue allocation to alloc_netdev_mq and freeing of the queues to free_netdev (symmetric to TX queue allocation). Each kobject RX queue takes a reference to the queue's device so that the device can't be freed before all the kobjects have been released-- this obviates the need for reference counts specific to RX queues. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-15 10:57:28 -08:00
Tom Herbert	ed9af2e839	net: Move TX queue allocation to alloc_netdev_mq TX queues are now allocated in alloc_netdev_mq and freed in free_netdev. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-15 10:56:54 -08:00
Jesse Gross	58e998c6d2	offloading: Force software GSO for multiple vlan tags. We currently use vlan_features to check for TSO support if there is a vlan tag. However, it's quite likely that the NIC is not able to do TSO when there is an arbitrary number of tags. Therefore if there is more than one tag (in-band or out-of-band), fall back to software emulation. Signed-off-by: Jesse Gross <jesse@nicira.com> CC: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-15 09:22:53 -08:00
Jesse Gross	c8d5bcd1af	offloading: Support multiple vlan tags in GSO. We assume that hardware TSO can't support multiple levels of vlan tags but we allow it to be done. Therefore, enable GSO to parse these tags so we can fallback to software. Signed-off-by: Jesse Gross <jesse@nicira.com> CC: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-15 09:22:53 -08:00
Jesse Gross	e1e78db628	offloading: Make scatter/gather more tolerant of vlans. When checking if it is necessary to linearize a packet, we currently use vlan_features if the packet contains either an in-band or out- of-band vlan tag. However, in-band tags aren't special in any way for scatter/gather since they are part of the packet buffer and are simply more data to DMA. Therefore, only use vlan_features for out- of-band tags, which could potentially have some interaction with scatter/gather. Signed-off-by: Jesse Gross <jesse@nicira.com> CC: Ben Hutchings <bhutchings@solarflare.com> Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-15 09:22:52 -08:00
David S. Miller	c25ecd0a21	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2010-11-14 11:57:05 -08:00
Thomas Graf	369cf77a6a	rtnetlink: Fix message size calculation for link messages nlmsg_total_size() calculates the length of a netlink message including header and alignment. nla_total_size() calculates the space an individual attribute consumes which was meant to be used in this context. Also, ensure to account for the attribute header for the IFLA_INFO_XSTATS attribute as implementations of get_xstats_size() seem to assume that we do so. The addition of two message headers minus the missing attribute header resulted in a calculated message size that was larger than required. Therefore we never risked running out of skb tailroom. Signed-off-by: Thomas Graf <tgraf@infradead.org> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-12 10:53:09 -08:00
Eric Dumazet	8d987e5c75	net: avoid limits overflow Robin Holt tried to boot a 16TB machine and found some limits were reached : sysctl_tcp_mem[2], sysctl_udp_mem[2] We can switch infrastructure to use long "instead" of "int", now atomic_long_t primitives are available for free. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Reported-by: Robin Holt <holt@sgi.com> Reviewed-by: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-10 12:12:00 -08:00
David S. Miller	57fe93b374	filter: make sure filters dont read uninitialized memory There is a possibility malicious users can get limited information about uninitialized stack mem array. Even if sk_run_filter() result is bound to packet length (0 .. 65535), we could imagine this can be used by hostile user. Initializing mem[] array, like Dan Rosenberg suggested in his patch is expensive since most filters dont even use this array. Its hard to make the filter validation in sk_chk_filter(), because of the jumps. This might be done later. In this patch, I use a bitmap (a single long var) so that only filters using mem[] loads/stores pay the price of added security checks. For other filters, additional cost is a single instruction. [ Since we access fentry->k a lot now, cache it in a local variable and mark filter entry pointer as const. -DaveM ] Reported-by: Dan Rosenberg <drosenberg@vsecurity.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-10 10:38:24 -08:00
Eric Dumazet	332dd96f7a	net/dst: dst_dev_event() called after other notifiers Followup of commit `ef885afbf8` (net: use rcu_barrier() in rollback_registered_many) dst_dev_event() scans a garbage dst list that might be feeded by various network notifiers at device dismantle time. Its important to call dst_dev_event() after other notifiers, or we might enter the infamous msleep(250) in netdev_wait_allrefs(), and wait one second before calling again call_netdevice_notifiers(NETDEV_UNREGISTER, dev) to properly remove last device references. Use priority -10 to let dst_dev_notifier be called after other network notifiers (they have the default 0 priority) Reported-by: Ben Greear <greearb@candelatech.com> Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reported-by: Octavian Purdila <opurdila@ixiacom.com> Reported-by: Benjamin LaHaise <bcrl@kvack.org> Tested-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-09 12:17:16 -08:00
Joe Perches	b194a3674f	net/core/dev.c: Update WARN uses Coalesce long formats. Add missing newlines. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-09 09:22:31 -08:00
Junchang Wang	eb589063ed	pktgen: correct uninitialized queue_map This fix a bug reported by backyes. Right the first time pktgen's using queue_map that's not been initialized by set_cur_queue_map(pkt_dev); Signed-off-by: Junchang Wang <junchangwang@gmail.com> Signed-off-by: Backyes <backyes@mail.ustc.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-08 12:17:07 -08:00
Dmitry Torokhov	86c2c0a8a4	NET: pktgen - fix compile warning This should fix the following warning: net/core/pktgen.c: In function ‘pktgen_if_write’: net/core/pktgen.c:890: warning: comparison of distinct pointer types lacks a cast Signed-off-by: Dmitry Torokhov <dtor@mail.ru> Reviewed-by: Nelson Elhage <nelhage@ksplice.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-07 05:28:01 -08:00
Tom Herbert	df32cc193a	net: check queue_index from sock is valid for device In dev_pick_tx recompute the queue index if the value stored in the socket is greater than or equal to the number of real queues for the device. The saved index in the sock structure is not guaranteed to be appropriate for the egress device (this could happen on a route change or in presence of tunnelling). The result of the queue index being bad would be to return a bogus queue (crash could prersumably follow). Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-01 12:55:52 -07:00
Uwe Kleine-König	b595076a18	tree-wide: fix comment/printk typos "gadget", "through", "command", "maintain", "maintain", "controller", "address", "between", "initiali[zs]e", "instead", "function", "select", "already", "equal", "access", "management", "hierarchy", "registration", "interest", "relative", "memory", "offset", "already", Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-11-01 15:38:34 -04:00
Nelson Elhage	448d7b5daf	pktgen: Limit how much data we copy onto the stack. A program that accidentally writes too much data to the pktgen file can overflow the kernel stack and oops the machine. This is only triggerable by root, so there's no security issue, but it's still an unfortunate bug. printk() won't print more than 1024 bytes in a single call, anyways, so let's just never copy more than that much data. We're on a fairly shallow stack, so that should be safe even with CONFIG_4KSTACKS. Signed-off-by: Nelson Elhage <nelhage@ksplice.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-28 11:47:53 -07:00
David S. Miller	8acfe468b0	net: Limit socket I/O iovec total length to INT_MAX. This helps protect us from overflow issues down in the individual protocol sendmsg/recvmsg handlers. Once we hit INT_MAX we truncate out the rest of the iovec by setting the iov_len members to zero. This works because: 1) For SOCK_STREAM and SOCK_SEQPACKET sockets, partial writes are allowed and the application will just continue with another write to send the rest of the data. 2) For datagram oriented sockets, where there must be a one-to-one correspondance between write() calls and packets on the wire, INT_MAX is going to be far larger than the packet size limit the protocol is going to check for and signal with -EMSGSIZE. Based upon a patch by Linus Torvalds. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-28 11:47:52 -07:00
Eric Dumazet	7a2b03c517	fib_rules: __rcu annotates ctarget Adds __rcu annotation to (struct fib_rule)->ctarget Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-27 11:37:32 -07:00
Ben Hutchings	66c68bcc48	net: NETIF_F_HW_CSUM does not imply FCoE CRC offload NETIF_F_HW_CSUM indicates the ability to update an TCP/IP-style 16-bit checksum with the checksum of an arbitrary part of the packet data, whereas the FCoE CRC is something entirely different. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Cc: stable@kernel.org [2.6.32+] Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-27 11:37:29 -07:00
Ben Hutchings	af1905dbec	net: Fix some corner cases in dev_can_checksum() dev_can_checksum() incorrectly returns true in these cases: 1. The skb has both out-of-band and in-band VLAN tags and the device supports checksum offload for the encapsulated protocol but only with one layer of encapsulation. 2. The skb has a VLAN tag and the device supports generic checksumming but not in conjunction with VLAN encapsulation. Rearrange the VLAN tag checks to avoid these. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-27 11:37:29 -07:00
Eric Dumazet	ebb9fed2de	fib: fix fib_nl_newrule() Some panic reports in fib_rules_lookup() show a rule could have a NULL pointer as a next pointer in the rules_list. This can actually happen because of a bug in fib_nl_newrule() : It checks if current rule is the destination of unresolved gotos. (Other rules have gotos to this about to be inserted rule) Problem is it does the resolution of the gotos before the rule is inserted in the rules_list (and has a valid next pointer) Fix this by moving the rules_list insertion before the changes on gotos. A lockless reader can not any more follow a ctarget pointer, unless destination is ready (has a valid next pointer) Reported-by: Oleg A. Arkhangelsky <sysoleg@yandex.ru> Reported-by: Joe Buehler <aspam@cox.net> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-26 11:42:38 -07:00
Eric Dumazet	0d7da9ddd9	net: add __rcu annotation to sk_filter Add __rcu annotation to : (struct sock)->sk_filter And use appropriate rcu primitives to reduce sparse warnings if CONFIG_SPARSE_RCU_POINTER=y Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-25 14:18:28 -07:00
Eric Dumazet	1c87733d06	net_ns: add __rcu annotations add __rcu annotation to (struct net)->gen, and use rcu_dereference_protected() in net_assign_generic() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-25 14:18:27 -07:00
Eric Dumazet	6e3f7faf3e	rps: add __rcu annotations Add __rcu annotations to : (struct netdev_rx_queue)->rps_map (struct netdev_rx_queue)->rps_flow_table struct rps_sock_flow_table *rps_sock_flow_table; And use appropriate rcu primitives. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-25 14:18:27 -07:00
Eric Dumazet	198caeca3e	ipv6: ip6_ptr rcu annotations (struct net_device)->ip6_ptr is rcu protected : add __rcu annotation and proper rcu primitives. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-25 13:09:43 -07:00
David S. Miller	11a766ce91	net: Increase xmit RECURSION_LIMIT to 10. Three is definitely too low, and we know from reports that GRE tunnels stacked as deeply as 37 levels cause stack overflows, so pick some reasonable value between those two. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-25 12:51:55 -07:00
Paul Gortmaker	d618222352	pktgen: clean up handling of local/transient counter vars The temporary variable "i" is needlessly initialized to zero in two distinct cases in this file: 1) where it is set to zero and then used as an argument in an addition before being assigned a non-zero value. 2) where it is only used in a standard/typical loop counter For (1), simply delete assignment to zero and usages while still zero; for (2) simply make the loop start at zero as per standard practice as seen everywhere else in the same file. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-24 15:23:35 -07:00
Linus Torvalds	5f05647dd8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1699 commits) bnx2/bnx2x: Unsupported Ethtool operations should return -EINVAL. vlan: Calling vlan_hwaccel_do_receive() is always valid. tproxy: use the interface primary IP address as a default value for --on-ip tproxy: added IPv6 support to the socket match cxgb3: function namespace cleanup tproxy: added IPv6 support to the TPROXY target tproxy: added IPv6 socket lookup function to nf_tproxy_core be2net: Changes to use only priority codes allowed by f/w tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled tproxy: added tproxy sockopt interface in the IPV6 layer tproxy: added udp6_lib_lookup function tproxy: added const specifiers to udp lookup functions tproxy: split off ipv6 defragmentation to a separate module l2tp: small cleanup nf_nat: restrict ICMP translation for embedded header can: mcp251x: fix generation of error frames can: mcp251x: fix endless loop in interrupt handler if CANINTF_MERRF is set can-raw: add msg_flags to distinguish local traffic 9p: client code cleanup rds: make local functions/variables static ... Fix up conflicts in net/core/dev.c, drivers/net/pcmcia/smc91c92_cs.c and drivers/net/wireless/ath/ath9k/debug.c as per David	2010-10-23 11:47:02 -07:00
Linus Torvalds	5d70f79b5e	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (163 commits) tracing: Fix compile issue for trace_sched_wakeup.c [S390] hardirq: remove pointless header file includes [IA64] Move local_softirq_pending() definition perf, powerpc: Fix power_pmu_event_init to not use event->ctx ftrace: Remove recursion between recordmcount and scripts/mod/empty jump_label: Add COND_STMT(), reducer wrappery perf: Optimize sw events perf: Use jump_labels to optimize the scheduler hooks jump_label: Add atomic_t interface jump_label: Use more consistent naming perf, hw_breakpoint: Fix crash in hw_breakpoint creation perf: Find task before event alloc perf: Fix task refcount bugs perf: Fix group moving irq_work: Add generic hardirq context callbacks perf_events: Fix transaction recovery in group_sched_in() perf_events: Fix bogus AMD64 generic TLB events perf_events: Fix bogus context time tracking tracing: Remove parent recording in latency tracer graph options tracing: Use one prologue for the preempt irqs off tracer function tracers ...	2010-10-21 12:54:49 -07:00
Linus Torvalds	888a6f77e0	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (52 commits) sched: fix RCU lockdep splat from task_group() rcu: using ACCESS_ONCE() to observe the jiffies_stall/rnp->qsmask value sched: suppress RCU lockdep splat in task_fork_fair net: suppress RCU lockdep false positive in sock_update_classid rcu: move check from rcu_dereference_bh to rcu_read_lock_bh_held rcu: Add advice to PROVE_RCU_REPEATEDLY kernel config parameter rcu: Add tracing data to support queueing models rcu: fix sparse errors in rcutorture.c rcu: only one evaluation of arg in rcu_dereference_check() unless sparse kernel: Remove undead ifdef CONFIG_DEBUG_LOCK_ALLOC rcu: fix _oddness handling of verbose stall warnings rcu: performance fixes to TINY_PREEMPT_RCU callback checking rcu: upgrade stallwarn.txt documentation for CPU-bound RT processes vhost: add __rcu annotations rcu: add comment stating that list_empty() applies to RCU-protected lists rcu: apply TINY_PREEMPT_RCU read-side speedup to TREE_PREEMPT_RCU rcu: combine duplicate code, courtesy of CONFIG_PREEMPT_RCU rcu: Upgrade srcu_read_lock() docbook about SRCU grace periods rcu: document ways of stalling updates in low-memory situations rcu: repair code-duplication FIXMEs ...	2010-10-21 12:54:12 -07:00
David S. Miller	2198a10b50	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/core/dev.c	2010-10-21 08:43:05 -07:00
stephen hemminger	d0c2b0d265	napi: unexport napi_reuse_skb The function napi_reuse_skb is only used inside core. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-21 04:26:38 -07:00
Tejun Heo	a5c30b349b	net/neighbour: cancel_delayed_work() + flush_scheduled_work() -> cancel_delayed_work_sync() flush_scheduled_work() is going away. Prepare for it. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-21 04:25:48 -07:00
Ben Greear	d2ed817766	net/core: Allow tagged VLAN packets to flow through VETH devices. When there are VLANs on a VETH device, the packets being transmitted through the VETH device may be 4 bytes bigger than MTU. A check in dev_forward_skb did not take this into account and so dropped these packets. This patch is needed at least as far back as 2.6.34.7 and should be considered for -stable. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-21 04:06:29 -07:00
stephen hemminger	8d8a0b1cc2	rtnetlink: remove rtnl_kill_links The function rtnl_kill_links is defined but never used. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-21 03:09:45 -07:00
Jesse Gross	d5dbda2380	ethtool: Add support for vlan accleration. Now that vlan acceleration is handled consistently regardless of usage, it is possible to enable and disable it at will. This adds support for Ethtool operations that change the offloading status for debugging purposes, similar to other forms of hardware acceleration. Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-21 01:26:54 -07:00
Jesse Gross	3701e51382	vlan: Centralize handling of hardware acceleration. Currently each driver that is capable of vlan hardware acceleration must be aware of the vlan groups that are configured and then pass the stripped tag to a specialized receive function. This is different from other types of hardware offload in that it places a significant amount of knowledge in the driver itself rather keeping it in the networking core. This makes vlan offloading function more similarly to other forms of offloading (such as checksum offloading or TSO) by doing the following: * On receive, stripped vlans are passed directly to the network core, without attempting to check for vlan groups or reconstructing the header if no group * vlans are made less special by folding the logic into the main receive routines * On transmit, the device layer will add the vlan header in software if the hardware doesn't support it, instead of spreading that logic out in upper layers, such as bonding. There are a number of advantages to this: * Fixes all bugs with drivers incorrectly dropping vlan headers at once. * Avoids having to disable VLAN acceleration when in promiscuous mode (good for bridging since it always puts devices in promiscuous mode). * Keeps VLAN tag separate until given to ultimate consumer, which avoids needing to do header reconstruction as in tg3 unless absolutely necessary. * Consolidates common code in core networking. Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-21 01:26:53 -07:00
Jesse Gross	7b9c609037	vlan: Enable software emulation for vlan accleration. Currently users of hardware vlan accleration need to know whether the device supports it before generating packets. However, vlan acceleration will soon be available in a more flexible manner so knowing ahead of time becomes much more difficult. This adds a software fallback path for vlan packets on devices without the necessary offloading support, similar to other types of hardware accleration. Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-21 01:26:52 -07:00
Eric Dumazet	27b75c95f1	net: avoid RCU for NOCACHE dst There is no point using RCU for dst we allocate for a very short time (used once). Change dst_release() to take DST_NOCACHE into account, but also change skb_dst_set_noref() to force a refcount increment for such dst. This is a _huge_ gain, because we dont waste memory to store xx thousand of dsts. Instead of queueing them to RCU, we can free them instantly. CPU caches can stay hot, re-using same memory blocks to hold temporary dsts. Note : remove unneeded smp_mb__before_atomic_dec(); in dst_release(), since atomic_dec_return() implies a full memory barrier. Stress test, 160.000.000 udp frames sent, IP route cache disabled (DDOS). Before: real 0m38.091s user 0m13.189s sys 7m53.018s After: real 0m29.946s user 0m12.157s sys 7m40.605s For reference, if IP route cache was enabled : real 0m32.030s user 0m10.521s sys 8m15.243s Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-20 03:02:23 -07:00
Tom Herbert	e6484930d7	net: allocate tx queues in register_netdevice This patch introduces netif_alloc_netdev_queues which is called from register_device instead of alloc_netdev_mq. This makes TX queue allocation symmetric with RX allocation. Also, queue locks allocation is done in netdev_init_one_queue. Change set_real_num_tx_queues to fail if requested number < 1 or greater than number of allocated queues. Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-20 02:27:59 -07:00
Tom Herbert	bd25fa7ba5	net: cleanups in RX queue allocation Clean up in RX queue allocation. In netif_set_real_num_rx_queues return error on attempt to set zero queues, or requested number is greater than number of allocated queues. In netif_alloc_rx_queues, do BUG_ON if queue_count is zero. Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-20 02:27:59 -07:00
Tom Herbert	55513fb428	net: fail alloc_netdev_mq if queue count < 1 In alloc_netdev_mq fail if requested queue_count < 1. Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-20 02:27:58 -07:00
Neil Horman	f13d493d9c	netpoll: Revert napi_poll fix for bonding driver In an erlier patch I modified napi_poll so that devices with IFF_MASTER polled the per_cpu list instead of the device list for napi. I did this because the bonding driver has no napi instances to poll, it instead expects to check the slave devices napi instances, which napi_poll was unaware of. Looking at this more closely however, I now see this isn't strictly needed. As the bond driver poll_controller calls the slaves poll_controller via netpoll_poll_dev, which recursively calls poll_napi on each slave, allowing those napi instances to get serviced. The earlier patch isn't at all harmfull, its just not needed, so lets revert it to make the code cleaner. Sorry for the noise, Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reviewed-by: WANG Cong <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-20 01:44:30 -07:00
Neil Horman	990c3d6f9c	bonding: Fix napi poll for bonding driver Usually the netpoll path, when preforming a napi poll can get away with just polling all the napi instances of the configured device. Thats not the case for the bonding driver however, as the napi instances which may wind up getting flagged as needing polling after the poll_controller call don't belong to the bonded device, but rather to the slave devices. Fix this by checking the device in question for the IFF_MASTER flag, if set, we know we need to check the full poll list for this cpu, rather than just the devices napi instance list. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-18 08:32:08 -07:00
Neil Horman	c2355e1ab9	bonding: Fix bonding drivers improper modification of netpoll structure The bonding driver currently modifies the netpoll structure in its xmit path while sending frames from netpoll. This is racy, as other cpus can access the netpoll structure in parallel. Since the bonding driver points np->dev to a slave device, other cpus can inadvertently attempt to send data directly to slave devices, leading to improper locking with the bonding master, lost frames, and deadlocks. This patch fixes that up. This patch also removes the real_dev pointer from the netpoll structure as that data is really only used by bonding in the poll_controller, and we can emulate its behavior by check each slave for IS_UP. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-18 08:32:07 -07:00
Eric Dumazet	a0a4a85a15	fib: remove a useless synchronize_rcu() call fib_nl_delrule() calls synchronize_rcu() for no apparent reason, while rtnl is held. I suspect it was done to avoid an atomic_inc_not_zero() in fib_rules_lookup(), which commit `7fa7cb7109` added anyway. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-16 11:13:22 -07:00
Eric Dumazet	564824b0c5	net: allocate skbs on local node commit `b30973f877` (node-aware skb allocation) spread a wrong habit of allocating net drivers skbs on a given memory node : The one closest to the NIC hardware. This is wrong because as soon as we try to scale network stack, we need to use many cpus to handle traffic and hit slub/slab management on cross-node allocations/frees when these cpus have to alloc/free skbs bound to a central node. skb allocated in RX path are ephemeral, they have a very short lifetime : Extra cost to maintain NUMA affinity is too expensive. What appeared as a nice idea four years ago is in fact a bad one. In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load, and two 10Gb NIC might deliver more than 28 million packets per second, needing all the available cpus. Cost of cross-node handling in network and vm stacks outperforms the small benefit hardware had when doing its DMA transfert in its 'local' memory node at RX time. Even trying to differentiate the two allocations done for one skb (the sk_buff on local node, the data part on NIC hardware node) is not enough to bring good performance. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-16 11:13:19 -07:00
Eric Dumazet	29b4433d99	net: percpu net_device refcount We tried very hard to remove all possible dev_hold()/dev_put() pairs in network stack, using RCU conversions. There is still an unavoidable device refcount change for every dst we create/destroy, and this can slow down some workloads (routers or some app servers, mmap af_packet) We can switch to a percpu refcount implementation, now dynamic per_cpu infrastructure is mature. On a 64 cpus machine, this consumes 256 bytes per device. On x86, dev_hold(dev) code : before lock incl 0x280(%ebx) after: movl 0x260(%ebx),%eax incl fs:(%eax) Stress bench : (Sending 160.000.000 UDP frames, IP route cache disabled, dual E5540 @2.53GHz, 32bit kernel, FIB_TRIE) Before: real 1m1.662s user 0m14.373s sys 12m55.960s After: real 0m51.179s user 0m15.329s sys 10m15.942s Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-12 12:35:25 -07:00
Eric Dumazet	fc66f95c68	net dst: use a percpu_counter to track entries struct dst_ops tracks number of allocated dst in an atomic_t field, subject to high cache line contention in stress workload. Switch to a percpu_counter, to reduce number of time we need to dirty a central location. Place it on a separate cache line to avoid dirtying read only fields. Stress test : (Sending 160.000.000 UDP frames, IP route cache disabled, dual E5540 @2.53GHz, 32bit kernel, FIB_TRIE, SLUB/NUMA) Before: real 0m51.179s user 0m15.329s sys 10m15.942s After: real 0m45.570s user 0m15.525s sys 9m56.669s With a small reordering of struct neighbour fields, subject of a following patch, (to separate refcnt from other read mostly fields) real 0m41.841s user 0m15.261s sys 8m45.949s Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-11 13:06:53 -07:00
Eric Dumazet	0ed8ddf404	neigh: Protect neigh->ha[] with a seqlock Add a seqlock in struct neighbour to protect neigh->ha[], and avoid dirtying neighbour in stress situation (many different flows / dsts) Dirtying takes place because of read_lock(&n->lock) and n->used writes. Switching to a seqlock, and writing n->used only on jiffies changes permits less dirtying. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-11 12:54:04 -07:00
David S. Miller	d122179a3c	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/core/ethtool.c	2010-10-11 12:30:34 -07:00
Kees Cook	b00916b189	net: clear heap allocations for privileged ethtool actions Several other ethtool functions leave heap uncleared (potentially) by drivers. Some interfaces appear safe (eeprom, etc), in that the sizes are well controlled. In some situations (e.g. unchecked error conditions), the heap will remain unchanged in areas before copying back to userspace. Note that these are less of an issue since these all require CAP_NET_ADMIN. Cc: stable@kernel.org Signed-off-by: Kees Cook <kees.cook@canonical.com> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-11 12:23:25 -07:00
Eric Dumazet	34d101dd62	neigh: speedup neigh_hh_init() When a new dst is used to send a frame, neigh_resolve_output() tries to associate an struct hh_cache to this dst, calling neigh_hh_init() with the neigh rwlock write locked. Most of the time, hh_cache is already known and linked into neighbour, so we find it and increment its refcount. This patch changes the logic so that we call neigh_hh_init() with neighbour lock read locked only, so that fast path can be run in parallel by concurrent cpus. This brings part of the speedup we got with commit `c7d4426a98` (introduce DST_NOCACHE flag) for non cached dsts, even for cached ones, removing one of the contention point that routers hit on multiqueue enabled machines. Further improvements would need to use a seqlock instead of an rwlock to protect neigh->ha[], to not dirty neigh too often and remove two atomic ops. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-11 09:16:57 -07:00
Tom Herbert	4315d834c1	net: Fix rxq ref counting The rx->count reference is used to track reference counts to the number of rx-queue kobjects created for the device. This patch eliminates initialization of the counter in netif_alloc_rx_queues and instead increments the counter each time a kobject is created. This is now symmetric with the decrement that is done when an object is released. Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-08 14:34:32 -07:00
Kees Cook	ae6df5f96a	net: clear heap allocation for ETHTOOL_GRXCLSRLALL Calling ETHTOOL_GRXCLSRLALL with a large rule_cnt will allocate kernel heap without clearing it. For the one driver (niu) that implements it, it will leave the unused portion of heap unchanged and copy the full contents back to userspace. Signed-off-by: Kees Cook <kees.cook@canonical.com> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-08 10:48:28 -07:00
Ben Hutchings	4e7f79511e	net: Update kernel-doc for netif_set_real_num_rx_queues() Synchronise the comment with the preceding implementation change. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-08 10:33:39 -07:00
Ingo Molnar	7cd2541cf2	Merge commit 'v2.6.36-rc7' into perf/core Conflicts: arch/x86/kernel/module.c Merge reason: Resolve the conflict, pick up fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-08 10:46:27 +02:00
Paul E. McKenney	1144182a87	net: suppress RCU lockdep false positive in sock_update_classid > =================================================== > [ INFO: suspicious rcu_dereference_check() usage. ] > --------------------------------------------------- > include/linux/cgroup.h:542 invoked rcu_dereference_check() without protection! > > other info that might help us debug this: > > > rcu_scheduler_active = 1, debug_locks = 0 > 1 lock held by swapper/1: > #0: (net_mutex){+.+.+.}, at: [<ffffffff813e9010>] > register_pernet_subsys+0x1f/0x47 > > stack backtrace: > Pid: 1, comm: swapper Not tainted 2.6.35.4-28.fc14.x86_64 #1 > Call Trace: > [<ffffffff8107bd3a>] lockdep_rcu_dereference+0xaa/0xb3 > [<ffffffff813e04b9>] sock_update_classid+0x7c/0xa2 > [<ffffffff813e054a>] sk_alloc+0x6b/0x77 > [<ffffffff8140b281>] __netlink_create+0x37/0xab > [<ffffffff813f941c>] ? rtnetlink_rcv+0x0/0x2d > [<ffffffff8140cee1>] netlink_kernel_create+0x74/0x19d > [<ffffffff8149c3ca>] ? __mutex_lock_common+0x339/0x35b > [<ffffffff813f7e9c>] rtnetlink_net_init+0x2e/0x48 > [<ffffffff813e8d7a>] ops_init+0xe9/0xff > [<ffffffff813e8f0d>] register_pernet_operations+0xab/0x130 > [<ffffffff813e901f>] register_pernet_subsys+0x2e/0x47 > [<ffffffff81db7bca>] rtnetlink_init+0x53/0x102 > [<ffffffff81db835c>] netlink_proto_init+0x126/0x143 > [<ffffffff81db8236>] ? netlink_proto_init+0x0/0x143 > [<ffffffff810021b8>] do_one_initcall+0x72/0x186 > [<ffffffff81d78ebc>] kernel_init+0x23b/0x2c9 > [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10 > [<ffffffff8149e2d0>] ? restore_args+0x0/0x30 > [<ffffffff81d78c81>] ? kernel_init+0x0/0x2c9 > [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10 The sock_update_classid() function calls task_cls_classid(current), but the calling task cannot go away, so there is no danger of the associated structures disappearing. Insert an RCU read-side critical section to suppress the false positive. Reported-by: Subrata Modak <subrata@linux.vnet.ibm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-10-07 10:02:28 -07:00
John Fastabend	3d3211ef5c	net: netif_set_real_num_rx_queues may cap num_rx_queues at init time Do not set num_rx_queues in netif_set_real_num_rx_queues() some drivers will increase the real_num_rx_queues later due to a feature changes or available interrupts increasing. By setting num_rx_queues here this ends up creating a cap on the number of rx queues available. For example the ixgbe driver sets the max number of queues it intends to use ever then sets the current number in use with the netif_set_num_{rx\|tx}_queues calls. With the current implementation the number of rx queues gets limited so when a feature such as DCB or FCoE is enabled the queues are no longer available. kobjects will only be allocated for real_num_rx_queues so the waste in memory is minimal. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-06 23:35:15 -07:00
Eric Dumazet	767e97e1e0	neigh: RCU conversion of struct neighbour This is the second step for neighbour RCU conversion. (first was commit `d6bf7817` : RCU conversion of neigh hash table) neigh_lookup() becomes lockless, but still take a reference on found neighbour. (no more read_lock()/read_unlock() on tbl->lock) struct neighbour gets an additional rcu_head field and is freed after an RCU grace period. Future work would need to eventually not take a reference on neighbour for temporary dst (DST_NOCACHE), but this would need dst->_neighbour to use a noref bit like we did for skb->_dst. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-06 18:01:33 -07:00
Eric Dumazet	ebc0ffae5d	fib: RCU conversion of fib_lookup() fib_lookup() converted to be called in RCU protected context, no reference taken and released on a contended cache line (fib_clntref) fib_table_lookup() and fib_semantic_match() get an additional parameter. struct fib_info gets an rcu_head field, and is freed after an rcu grace period. Stress test : (Sending 160.000.000 UDP frames on same neighbour, IP route cache disabled, dual E5540 @2.53GHz, 32bit kernel, FIB_HASH) (about same results for FIB_TRIE) Before patch : real 1m31.199s user 0m13.761s sys 23m24.780s After patch: real 1m5.375s user 0m14.997s sys 15m50.115s Before patch Profile : 13044.00 15.4% __ip_route_output_key vmlinux 8438.00 10.0% dst_destroy vmlinux 5983.00 7.1% fib_semantic_match vmlinux 5410.00 6.4% fib_rules_lookup vmlinux 4803.00 5.7% neigh_lookup vmlinux 4420.00 5.2% _raw_spin_lock vmlinux 3883.00 4.6% rt_set_nexthop vmlinux 3261.00 3.9% _raw_read_lock vmlinux 2794.00 3.3% fib_table_lookup vmlinux 2374.00 2.8% neigh_resolve_output vmlinux 2153.00 2.5% dst_alloc vmlinux 1502.00 1.8% _raw_read_lock_bh vmlinux 1484.00 1.8% kmem_cache_alloc vmlinux 1407.00 1.7% eth_header vmlinux 1406.00 1.7% ipv4_dst_destroy vmlinux 1298.00 1.5% __copy_from_user_ll vmlinux 1174.00 1.4% dev_queue_xmit vmlinux 1000.00 1.2% ip_output vmlinux After patch Profile : 13712.00 15.8% dst_destroy vmlinux 8548.00 9.9% __ip_route_output_key vmlinux 7017.00 8.1% neigh_lookup vmlinux 4554.00 5.3% fib_semantic_match vmlinux 4067.00 4.7% _raw_read_lock vmlinux 3491.00 4.0% dst_alloc vmlinux 3186.00 3.7% neigh_resolve_output vmlinux 3103.00 3.6% fib_table_lookup vmlinux 2098.00 2.4% _raw_read_lock_bh vmlinux 2081.00 2.4% kmem_cache_alloc vmlinux 2013.00 2.3% _raw_spin_lock vmlinux 1763.00 2.0% __copy_from_user_ll vmlinux 1763.00 2.0% ip_output vmlinux 1761.00 2.0% ipv4_dst_destroy vmlinux 1631.00 1.9% eth_header vmlinux 1440.00 1.7% _raw_read_unlock_bh vmlinux Reference results, if IP route cache is enabled : real 0m29.718s user 0m10.845s sys 7m37.341s 25213.00 29.5% __ip_route_output_key vmlinux 9011.00 10.5% dst_release vmlinux 4817.00 5.6% ip_push_pending_frames vmlinux 4232.00 5.0% ip_finish_output vmlinux 3940.00 4.6% udp_sendmsg vmlinux 3730.00 4.4% __copy_from_user_ll vmlinux 3716.00 4.4% ip_route_output_flow vmlinux 2451.00 2.9% __xfrm_lookup vmlinux 2221.00 2.6% ip_append_data vmlinux 1718.00 2.0% _raw_spin_lock_bh vmlinux 1655.00 1.9% __alloc_skb vmlinux 1572.00 1.8% sock_wfree vmlinux 1345.00 1.6% kfree vmlinux Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-05 20:39:38 -07:00
Eric Dumazet	d6bf781712	net neigh: RCU conversion of neigh hash table David This is the first step for RCU conversion of neigh code. Next patches will convert hash_buckets[] and "struct neighbour" to RCU protected objects. Thanks [PATCH net-next] net neigh: RCU conversion of neigh hash table Instead of storing hash_buckets, hash_mask and hash_rnd in "struct neigh_table", a new structure is defined : struct neigh_hash_table { struct neighbour *hash_buckets; unsigned int hash_mask; __u32 hash_rnd; struct rcu_head rcu; }; And "struct neigh_table" has an RCU protected pointer to such a neigh_hash_table. This means the signature of (hash)() function changed: We need to add a third parameter with the actual hash_rnd value, since this is not anymore a neigh_table field. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-05 14:54:36 -07:00
Eric Dumazet	110b249937	net neigh: neigh_delete() and neigh_add() changes neigh_delete() and neigh_add() dont need to touch device refcount, we hold RTNL when calling them, so device cannot disappear under us. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-05 14:54:35 -07:00
Eric Dumazet	caf586e5f2	net: add a core netdev->rx_dropped counter In various situations, a device provides a packet to our stack and we drop it before it enters protocol stack : - softnet backlog full (accounted in /proc/net/softnet_stat) - bad vlan tag (not accounted) - unknown/unregistered protocol (not accounted) We can handle a per-device counter of such dropped frames at core level, and automatically adds it to the device provided stats (rx_dropped), so that standard tools can be used (ifconfig, ip link, cat /proc/net/dev) This is a generalization of commit `8990f468a` (net: rx_dropped accounting), thus reverting it. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-05 14:47:55 -07:00
stephen hemminger	1df9916e46	fib: fib_rules_cleanup can be static fib_rules_cleanup_ups is only defined and used in one place. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-05 00:47:39 -07:00
Eric Dumazet	24824a09e3	net: dynamic ingress_queue allocation ingress being not used very much, and net_device->ingress_queue being quite a big object (128 or 256 bytes), use a dynamic allocation if needed (tc qdisc add dev eth0 ingress ...) dev_ingress_queue(dev) helper should be used only with RTNL taken. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-05 00:23:44 -07:00
David S. Miller	21a180cda0	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/ipv4/Kconfig net/ipv4/tcp_timer.c	2010-10-04 11:56:38 -07:00
Eric Dumazet	c7d4426a98	net: introduce DST_NOCACHE flag While doing stress tests with IP route cache disabled, and multi queue devices, I noticed a very high contention on one rwlock used in neighbour code. When many cpus are trying to send frames (possibly using a high performance multiqueue device) to the same neighbour, they fight for the neigh->lock rwlock in order to call neigh_hh_init(), and fight on hh->hh_refcnt (a pair of atomic_inc/atomic_dec_and_test()) But we dont need to call neigh_hh_init() for dst that are used only once. It costs four atomic operations at least, on two contended cache lines, plus the high contention on neigh->lock rwlock. Introduce a new dst flag, DST_NOCACHE, that is set when dst was not inserted in route cache. With the stress test bench, sending 160000000 frames on one neighbour, results are : Before patch: real 2m28.406s user 0m11.781s sys 36m17.964s After patch: real 1m26.532s user 0m12.185s sys 20m3.903s Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-03 22:17:54 -07:00
Nagendra Tomar	482964e56e	net: Fix the condition passed to sk_wait_event() This patch fixes the condition (3rd arg) passed to sk_wait_event() in sk_stream_wait_memory(). The incorrect check in sk_stream_wait_memory() causes the following soft lockup in tcp_sendmsg() when the global tcp memory pool has exhausted. >>> snip <<< localhost kernel: BUG: soft lockup - CPU#3 stuck for 11s! [sshd:6429] localhost kernel: CPU 3: localhost kernel: RIP: 0010:[sk_stream_wait_memory+0xcd/0x200] [sk_stream_wait_memory+0xcd/0x200] sk_stream_wait_memory+0xcd/0x200 localhost kernel: localhost kernel: Call Trace: localhost kernel: [sk_stream_wait_memory+0x1b1/0x200] sk_stream_wait_memory+0x1b1/0x200 localhost kernel: [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40 localhost kernel: [ipv6:tcp_sendmsg+0x6e6/0xe90] tcp_sendmsg+0x6e6/0xce0 localhost kernel: [sock_aio_write+0x126/0x140] sock_aio_write+0x126/0x140 localhost kernel: [xfs:do_sync_write+0xf1/0x130] do_sync_write+0xf1/0x130 localhost kernel: [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40 localhost kernel: [hrtimer_start+0xe3/0x170] hrtimer_start+0xe3/0x170 localhost kernel: [vfs_write+0x185/0x190] vfs_write+0x185/0x190 localhost kernel: [sys_write+0x50/0x90] sys_write+0x50/0x90 localhost kernel: [system_call+0x7e/0x83] system_call+0x7e/0x83 >>> snip <<< What is happening is, that the sk_wait_event() condition passed from sk_stream_wait_memory() evaluates to true for the case of tcp global memory exhaustion. This is because both sk_stream_memory_free() and vm_wait are true which causes sk_wait_event() to not call schedule_timeout(). Hence sk_stream_wait_memory() returns immediately to the caller w/o sleeping. This causes the caller to again try allocation, which again fails and again calls sk_stream_wait_memory(), and so on. [ Bug introduced by commit `c1cbe4b7ad` ("[NET]: Avoid atomic xchg() for non-error case") -DaveM ] Signed-off-by: Nagendra Singh Tomar <tomer_iisc@yahoo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-03 20:41:32 -07:00
Eric Dumazet	bfa5ae63b8	net: rename netdev rx_queue to ingress_queue There is some confusion with rx_queue name after RPS, and net drivers private rx_queue fields. I suggest to rename "struct net_device"->rx_queue to ingress_queue. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-29 13:25:53 -07:00
Eric Dumazet	745e20f1b6	net: add a recursion limit in xmit path As tunnel devices are going to be lockless, we need to make sure a misconfigured machine wont enter an infinite loop. Add a percpu variable, and limit to three the number of stacked xmits. Reported-by: Jesse Gross <jesse@nicira.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-29 13:23:09 -07:00
Tom Herbert	4465b46900	ipv4: Allow configuring subnets as local addresses This patch allows a host to be configured to respond to any address in a specified range as if it were local, without actually needing to configure the address on an interface. This is done through routing table configuration. For instance, to configure a host to respond to any address in 10.1/16 received on eth0 as a local address we can do: ip rule add from all iif eth0 lookup 200 ip route add local 10.1/16 dev lo proto kernel scope host src 127.0.0.1 table 200 This host is now reachable by any 10.1/16 address (route lookup on input for packets received on eth0 can find the route). On output, the rule will not be matched so that this host can still send packets to 10.1/16 (not sent on loopback). Presumably, external routing can be configured to make sense out of this. To make this work, we needed to modify the logic in finding the interface which is assigned a given source address for output (dev_ip_find). We perform a normal fib_lookup instead of just a lookup on the local table, and in the lookup we ignore the input interface for matching. This patch is useful to implement IP-anycast for subnets of virtual addresses. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-28 23:38:15 -07:00
Ben Hutchings	62fe0b40ab	net: Allow changing number of RX queues after device allocation For RPS, we create a kobject for each RX queue based on the number of queues passed to alloc_netdev_mq(). However, drivers generally do not determine the numbers of hardware queues to use until much later, so this usually represents the maximum number the driver may use and not the actual number in use. For TX queues, drivers can update the actual number using netif_set_real_num_tx_queues(). Add a corresponding function for RX queues, netif_set_real_num_rx_queues(). Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-27 22:09:49 -07:00
Eric Dumazet	f91ff5b9ff	net: sk_{detach\|attach}_filter() rcu fixes sk_attach_filter() and sk_detach_filter() are run with socket locked. Use the appropriate rcu_dereference_protected() instead of blocking BH, and rcu_dereference_bh(). There is no point adding BH prevention and memory barrier. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-27 21:30:44 -07:00
Eric Dumazet	7fa7cb7109	fib: use atomic_inc_not_zero() in fib_rules_lookup It seems we dont use appropriate refcount increment in an rcu_read_lock() protected section. fib_rule_get() might increment a null refcount and bad things could happen. While fib_nl_delrule() respects an rcu grace period before calling fib_rule_put(), fib_rules_cleanup_ops() calls fib_rule_put() without a grace period. Note : after this patch, we might avoid the synchronize_rcu() call done in fib_nl_delrule() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-27 21:30:44 -07:00
David S. Miller	01db403cf9	tcp: Fix >4GB writes on 64-bit. Fixes kernel bugzilla #16603 tcp_sendmsg() truncates iov_len to an 'int' which a 4GB write to write zero bytes, for example. There is also the problem higher up of how verify_iovec() works. It wants to prevent the total length from looking like an error return value. However it does this using 'int', but syscalls return 'long' (and thus signed 64-bit on 64-bit machines). So it could trigger false-positives on 64-bit as written. So fix it to use 'long'. Reported-by: Olaf Bonorden <bono@onlinehome.de> Reported-by: Daniel Büse <dbuese@gmx.de> Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-27 20:24:54 -07:00
David S. Miller	e40051d134	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/qlcnic/qlcnic_init.c net/ipv4/ip_output.c	2010-09-27 01:03:03 -07:00
Eric Dumazet	1b4bf461f0	rps: allocate rx queues in register_netdevice only Instead of having two places were we allocate dev->_rx, introduce netif_alloc_rx_queues() helper and call it only from register_netdevice(), not from alloc_netdev_mq() Goal is to let drivers change dev->num_rx_queues after allocating netdev and before registering it. This also removes a lot of ifdefs in net/core/dev.c Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-26 19:04:07 -07:00
Eric Dumazet	c5256c5123	net: propagate NETIF_F_HIGHDMA to vlans Automatically allows vlans to get NETIF_F_HIGHDMA if underlying device supports it. On 32bit arches (and more precisely if CONFIG_HIGHMEM is enabled), it can help to reduce cost of illegal_highdma() and __skb_linearize() calls. Tested on tg3 , bnx2, bonding, this worked very well. This is a generalization of a patch provided by Yi Zou & Jeff Kirsher. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-26 18:27:15 -07:00
Eric Dumazet	f064af1e50	net: fix a lockdep splat We have for each socket : One spinlock (sk_slock.slock) One rwlock (sk_callback_lock) Possible scenarios are : (A) (this is used in net/sunrpc/xprtsock.c) read_lock(&sk->sk_callback_lock) (without blocking BH) <BH> spin_lock(&sk->sk_slock.slock); ... read_lock(&sk->sk_callback_lock); ... (B) write_lock_bh(&sk->sk_callback_lock) stuff write_unlock_bh(&sk->sk_callback_lock) (C) spin_lock_bh(&sk->sk_slock) ... write_lock_bh(&sk->sk_callback_lock) stuff write_unlock_bh(&sk->sk_callback_lock) spin_unlock_bh(&sk->sk_slock) This (C) case conflicts with (A) : CPU1 [A] CPU2 [C] read_lock(callback_lock) <BH> spin_lock_bh(slock) <wait to spin_lock(slock)> <wait to write_lock_bh(callback_lock)> We have one problematic (C) use case in inet_csk_listen_stop() : local_bh_disable(); bh_lock_sock(child); // spin_lock_bh(&sk->sk_slock) WARN_ON(sock_owned_by_user(child)); ... sock_orphan(child); // write_lock_bh(&sk->sk_callback_lock) lockdep is not happy with this, as reported by Tetsuo Handa It seems only way to deal with this is to use read_lock_bh(callbacklock) everywhere. Thanks to Jarek for pointing a bug in my first attempt and suggesting this solution. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Jarek Poplawski <jarkao2@gmail.com> Tested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-24 22:26:10 -07:00
Eric Dumazet	a02cec2155	net: return operator cleanup Change "return (EXPR);" to "return EXPR;" return is not a function, parentheses are not required. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-23 14:33:39 -07:00
Andy Shevchenko	82fd5b5d1e	net: core: use kernel's converter from hex to bin Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-21 18:04:45 -07:00
David S. Miller	73da16c28e	ethtool: Fix build due to lack of ethtool.h include. net/core/ethtool.c: In function 'ethtool_get_regs': net/core/ethtool.c:818:2: error: implicit declaration of function 'vmalloc' net/core/ethtool.c:818:9: warning: assignment makes pointer from integer without a cast net/core/ethtool.c:833:2: error: implicit declaration of function 'vfree' Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-21 16:12:11 -07:00
Ben Hutchings	a77f5db361	ethtool: Allocate register dump buffer with vmalloc() Some NICs have huge register files which exceed the maximum heap allocation size. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-21 14:57:59 -07:00
Ingo Molnar	7ed569206e	Merge commit 'v2.6.36-rc5' into perf/core Merge reason: Pick up the latest fixes in -rc5. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-21 13:55:11 +02:00
Ben Hutchings	be2902daee	ethtool, ixgbe: Move RX n-tuple mask fixup to ethtool The ethtool utility does not set masks for flow parameters that are not specified, so if both value and mask are 0 then this must be treated as equivalent to a mask with all bits set. Currently that is done in the only driver that implements RX n-tuple filtering, ixgbe. Move it to the ethtool core. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-17 16:53:23 -07:00
David Lamparter	3b27e10555	netns: keep vlan slaves on master netns move previously, if a vlan master device was moved from one network namespace to another, all 802.1q and macvlan slaves were deleted. we can use dev->reg_state to figure out whether dev_change_net_namespace is happening, since that won't set dev->reg_state NETREG_UNREGISTERING. so, this changes 8021q and macvlan to ignore NETDEV_UNREGISTER when reg_state is not NETREG_UNREGISTERING. Signed-off-by: David Lamparter <equinox@diac24.net> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-17 16:46:04 -07:00
Eric Dumazet	67c9660831	ethtool: change ethtool_set_gro() to use ethtool_op_get_rx_csum To be able to switch on GRO on a device, ethtool_set_gro() checks this device provides a get_rx_csum() method. Some devices dont provide this method, while they do support RX checksumming. This patch allows bonding to support GRO : ethtool -K bond0 gro on Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-17 11:56:18 -07:00
Stephen Rothwell	caeda9b926	net: include inetdevice.h for rcu_dereference_raw api change rcu_dereference_raw() now needs to know the type of its argument. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-16 21:39:16 -07:00
Brandon Philips	16c3ea785f	net: enable GRO by default for vlan devices Currently vlan devices don't have GRO by default as none of the Ethernet drivers add NETIF_F_GRO to their vlan_features. As GRO is a software feature add GRO to dev->vlan_features in register_netdevice() and let vlan_dev_init() take care that it gets enabled only when dev->features has NETIF_F_GRO too. Signed-off-by: Brandon Philips <bphilips@suse.de> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-15 22:32:29 -07:00
Eric Dumazet	95ae6b228f	ipv4: ip_ptr cleanups dev->ip_ptr is protected by rtnl and rcu. Yet some places dont use appropriate primitives and/or locking rules. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-15 22:06:05 -07:00
Ben Hutchings	e0de7c93b9	ethtool: Remove unimplemented flow specification types struct ethtool_rawip4_spec and struct ethtool_ether_spec are neither commented nor used by any driver, so remove them. Adjust padding in the user-visible unions that included these structures. Fix references to struct ethtool_rawip4_spec in ethtool_get_rx_ntuple(), which should use struct ethtool_usrip4_spec. struct ethtool_usrip4_spec cannot hold IPv6 host addresses and there is no separate structure that can, so remove ETH_RX_NFC_IP6 and the reference to it in niu. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-15 14:42:13 -07:00
Ingo Molnar	3aabae7d9d	Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core	2010-09-15 10:27:31 +02:00
Eric Dumazet	ef885afbf8	net: use rcu_barrier() in rollback_registered_many netdev_wait_allrefs() waits that all references to a device vanishes. It currently uses a _very_ pessimistic 250 ms delay between each probe. Some users reported that no more than 4 devices can be dismantled per second, this is a pretty serious problem for some setups. Most of the time, a refcount is about to be released by an RCU callback, that is still in flight because rollback_registered_many() uses a synchronize_rcu() call instead of rcu_barrier(). Problem is visible if number of online cpus is one, because synchronize_rcu() is then a no op. time to remove 50 ipip tunnels on a UP machine : before patch : real 11.910s after patch : real 1.250s Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reported-by: Octavian Purdila <opurdila@ixiacom.com> Reported-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-14 14:27:29 -07:00
Eric Dumazet	83b6b1f5d1	flow: better memory management Allocate hash tables for every online cpus, not every possible ones. NUMA aware allocations. Dont use a full page on arches where PAGE_SIZE > 1024sizeof(void ) misc: __percpu , __read_mostly, __cpuinit annotations flow_compare_t is just an "unsigned long" Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-13 20:02:50 -07:00
stephen hemminger	9ca7f87622	pkt_sched: remov unnecessary bh_disable Now that est_tree_lock is acquired with BH protection, the other call is unnecessary. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-10 12:47:59 -07:00
David S. Miller	e548833df8	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/mac80211/main.c	2010-09-09 22:27:33 -07:00
Namhyung Kim	f39234d606	net/core: add lock context change annotations in net/core/sock.c __lock_sock() and __release_sock() releases and regrabs lock but were missing proper annotations. Add it. This removes following warning from sparse. (Currently __lock_sock() does not emit any warning about it but I think it is better to add also.) net/core/sock.c:1580:17: warning: context imbalance in '__release_sock' - unexpected unlock Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-09 15:02:39 -07:00
Namhyung Kim	a700d8be73	net/core: remove address space warnings on verify_iovec() move_addr_to_kernel() and copy_from_user() requires their argument as __user pointer but were missing proper markups. Add it. This removes following warnings from sparse. net/core/iovec.c:44:52: warning: incorrect type in argument 1 (different address spaces) net/core/iovec.c:44:52: expected void [noderef] <asn:1>uaddr net/core/iovec.c:44:52: got void msg_name net/core/iovec.c:55:34: warning: incorrect type in argument 2 (different address spaces) net/core/iovec.c:55:34: expected void const [noderef] <asn:1>from net/core/iovec.c:55:34: got struct iovec msg_iov Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-09 15:02:38 -07:00
Changli Gao	6febfca98f	net: rps: add the shortcut for one rps_cpus When there is only one rps_cpus, skb_get_rxhash() can be eliminated. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-08 13:10:53 -07:00
Jarek Poplawski	64289c8e68	gro: Re-fix different skb headrooms The patch: "gro: fix different skb headrooms" in its part: "2) allocate a minimal skb for head of frag_list" is buggy. The copied skb has p->data set at the ip header at the moment, and skb_gro_offset is the length of ip + tcp headers. So, after the change the length of mac header is skipped. Later skb_set_mac_header() sets it into the NET_SKB_PAD area (if it's long enough) and ip header is misaligned at NET_SKB_PAD + NET_IP_ALIGN offset. There is no reason to assume the original skb was wrongly allocated, so let's copy it as it was. bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=16626 fixes commit: `3d3be4333f` Reported-by: Plamen Petrov <pvp-lsts@fs.uni-ruse.bg> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> CC: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Plamen Petrov <pvp-lsts@fs.uni-ruse.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-08 10:32:15 -07:00
Helmut Schaa	deabc772f3	net: fix tx queue selection for bridged devices implementing select_queue When a net device is implementing the select_queue callback and is part of a bridge, frames coming from the bridge already have a tx queue associated to the socket (introduced in commit `a4ee3ce329`, "net: Use sk_tx_queue_mapping for connected sockets"). The call to sk_tx_queue_get will then return the tx queue used by the bridge instead of calling the select_queue callback. In case of mac80211 this broke QoS which is implemented by using the select_queue callback. Furthermore it introduced problems with rt2x00 because frames with the same TID and RA sometimes appeared on different tx queues which the hw cannot handle correctly. Fix this by always calling select_queue first if it is available and only afterwards use the socket tx queue mapping. Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-07 13:57:20 -07:00
Koki Sanagi	07dc22e729	skb: Add tracepoints to freeing skb This patch adds tracepoint to consume_skb and add trace_kfree_skb before __kfree_skb in skb_free_datagram_locked and net_tx_action. Combinating with tracepoint on dev_hard_start_xmit, we can check how long it takes to free transmitted packets. And using it, we can calculate how many packets driver had at that time. It is useful when a drop of transmitted packet is a problem. sshd-6828 [000] 112689.258154: consume_skb: skbaddr=f2d99bb8 Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Neil Horman <nhorman@tuxdriver.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com> Cc: Izumo Taku <izumi.taku@jp.fujitsu.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Scott Mcmillan <scott.a.mcmillan@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> LKML-Reference: <4C724364.50903@jp.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-09-07 17:51:53 +02:00
Koki Sanagi	cf66ba58b5	netdev: Add tracepoints to netdev layer This patch adds tracepoint to dev_queue_xmit, dev_hard_start_xmit, netif_rx and netif_receive_skb. These tracepoints help you to monitor network driver's input/output. <idle>-0 [001] 112447.902030: netif_rx: dev=eth1 skbaddr=f3ef0900 len=84 <idle>-0 [001] 112447.902039: netif_receive_skb: dev=eth1 skbaddr=f3ef0900 len=84 sshd-6828 [000] 112447.903257: net_dev_queue: dev=eth4 skbaddr=f3fca538 len=226 sshd-6828 [000] 112447.903260: net_dev_xmit: dev=eth4 skbaddr=f3fca538 len=226 rc=0 Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Neil Horman <nhorman@tuxdriver.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com> Cc: Izumo Taku <izumi.taku@jp.fujitsu.com> Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Scott Mcmillan <scott.a.mcmillan@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> LKML-Reference: <4C72431E.3000901@jp.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-09-07 17:51:33 +02:00
Eric Dumazet	db40980fcd	net: poll() optimizations No need to test twice sk->sk_shutdown & RCV_SHUTDOWN Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-06 18:48:45 -07:00
Eric Dumazet	1fd63041c4	net: pskb_expand_head() optimization pskb_expand_head() blindly takes references on fragments before calling skb_release_data(), potentially releasing these references. We can add a fast path, avoiding these atomic operations, if we own the last reference on skb->head. Based on a previous patch from David Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-06 18:24:59 -07:00
Eric Dumazet	52ee7a04a0	net: remove two kmemcheck annotations __alloc_skb() uses a memset() to clear all the beginning of skb, including bitfields contained in 'flags1' & 'flags2'. We dont need any more to use kmemcheck_annotate_bitfield() on these fields. However, we still need it for the clone part, which is not cleared. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-03 09:44:51 -07:00
Jarek Poplawski	0b5d404e34	pkt_sched: Fix lockdep warning on est_tree_lock in gen_estimator This patch fixes a lockdep warning: [ 516.287584] ========================================================= [ 516.288386] [ INFO: possible irq lock inversion dependency detected ] [ 516.288386] 2.6.35b #7 [ 516.288386] --------------------------------------------------------- [ 516.288386] swapper/0 just changed the state of lock: [ 516.288386] (&qdisc_tx_lock){+.-...}, at: [<c12eacda>] est_timer+0x62/0x1b4 [ 516.288386] but this lock took another, SOFTIRQ-unsafe lock in the past: [ 516.288386] (est_tree_lock){+.+...} [ 516.288386] [ 516.288386] and interrupts could create inverse lock ordering between them. ... So, est_tree_lock needs BH protection because it's taken by qdisc_tx_lock, which is used both in BH and process contexts. (Full warning with this patch at netdev, 02 Sep 2010.) Fixes commit: `ae638c47dc` ("pkt_sched: gen_estimator: add a new lock") Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-02 13:22:11 -07:00
Eric Dumazet	c07b68e841	net: dev_add_pack() & __dev_remove_pack() changes Add a small helper ptype_head() to get the head to manipulate dev_add_pack() & __dev_remove_pack() can use a spinlock without blocking BH, since softirq use RCU, and these functions are run from process context only. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-02 10:12:06 -07:00
Eric Dumazet	3d3be4333f	gro: fix different skb headrooms Packets entering GRO might have different headrooms, even for a given flow (because of implementation details in drivers, like copybreak). We cant force drivers to deliver packets with a fixed headroom. 1) fix skb_segment() skb_segment() makes the false assumption headrooms of fragments are same than the head. When CHECKSUM_PARTIAL is used, this can give csum_start errors, and crash later in skb_copy_and_csum_dev() 2) allocate a minimal skb for head of frag_list skb_gro_receive() uses netdev_alloc_skb(headroom + skb_gro_offset(p)) to allocate a fresh skb. This adds NET_SKB_PAD to a padding already provided by netdevice, depending on various things, like copybreak. Use alloc_skb() to allocate an exact padding, to reduce cache line needs: NET_SKB_PAD + NET_IP_ALIGN bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=16626 Many thanks to Plamen Petrov, testing many debugging patches ! With help of Jarek Poplawski. Reported-by: Plamen Petrov <pvp-lsts@fs.uni-ruse.bg> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-01 19:17:35 -07:00
stephen hemminger	fa50d64576	net: make rx_queue sysfs_ops const Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-01 18:12:20 -07:00
Eric Dumazet	6602cebb5b	net: skbuff.c cleanup (skb->data - skb->head) can be changed by skb_headroom(skb) Remove some uses of NET_SKBUFF_DATA_USES_OFFSET, using (skb_end_pointer(skb) - skb->head) or (skb_tail_pointer(skb) - skb->head) : compiler does the right thing, and this is more readable for us ;) (struct skb_shared_info *) casts in pskb_expand_head() to help memcpy() to use aligned moves. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-01 10:57:55 -07:00
Eric Dumazet	86cac58b71	skge: add GRO support - napi_gro_flush() is exported from net/core/dev.c, to avoid an irq_save/irq_restore in the packet receive path. - use napi_gro_receive() instead of netif_receive_skb() - use napi_gro_flush() before calling __napi_complete() - turn on NETIF_F_GRO by default - Tested on a Marvell 88E8001 Gigabit NIC Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-01 10:57:55 -07:00
Eric Dumazet	ba4fd9d828	pktgen: remove non used variable remove non used variable "queue" in pg_cleanup Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-31 13:37:06 -07:00
Eric Dumazet	40d0802b3e	gro: __napi_gro_receive() optimizations compare_ether_header() can have a special implementation on 64 bit arches if CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is defined. __napi_gro_receive() and vlan_gro_common() can avoid a conditional branch to perform device match. On x86_64, __napi_gro_receive() has now 38 instructions instead of 53 As gcc-4.4.3 still choose to not inline it, add inline keyword to this performance critical function. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-26 22:03:08 -07:00
John W. Linville	e569aa78ba	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6 into for-davem Conflicts: drivers/net/wireless/libertas/if_sdio.c	2010-08-25 14:51:42 -04:00
stephen hemminger	0fdc100bdc	ethtool: allow non-netadmin to query settings The SNMP daemon uses ethtool to determine the speed of network interfaces. This fails on Debian (and probably elsewhere) because for security SNMP daemon runs as non-root user (snmp). Note: A similar patch was rejected previously because of a concern about the possibility that on some hardware querying the ethtool settings requires access to the PHY and could slow the machine down. But the security risk of requiring SNMP daemon (and related services) to run as root far out weighs the risk of denial-of-service. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-23 20:43:16 -07:00
Eric Dumazet	afdcba371f	net: copy_rtnl_link_stats64() simplification No need to use a temporary struct rtnl_link_stats64 variable, just copy the source to skb buffer. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-23 20:43:16 -07:00
David S. Miller	21dc330157	net: Rename skb_has_frags to skb_has_frag_list SKBs can be "fragmented" in two ways, via a page array (called skb_shinfo(skb)->frags[]) and via a list of SKBs (called skb_shinfo(skb)->frag_list). Since skb_has_frags() tests the latter, it's name is confusing since it sounds more like it's testing the former. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-23 00:13:46 -07:00
Changli Gao	05532121da	net: 802.1q: make vlan_hwaccel_do_receive() return void vlan_hwaccel_do_receive() always returns 0, so make it return void. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-22 21:03:33 -07:00
David S. Miller	d3c6e7ad09	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2010-08-21 23:32:24 -07:00
Changli Gao	1003489e06	net: rps: fix the wrong network header pointer __skb_get_rxhash() was broken after the commit: commit `bfb564e739` Author: Krishna Kumar <krkumar2@in.ibm.com> Date: Wed Aug 4 06:15:52 2010 +0000 core: Factor out flow calculation from get_rps_cpu Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-21 22:54:49 -07:00
Changli Gao	12fcdefb36	net: rps: use proto_ports_offset() to handle the AH message correctly The SPI isn't at the beginning of an AH message. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-19 17:16:23 -07:00
Changli Gao	dbe5775bbc	net: rps: skip fragment when computing rxhash Fragmented IP packets may have no transfer header, so when computing rxhash, we should skip them. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-19 17:10:38 -07:00
Changli Gao	2d47b45951	net: rps: reset network header before calling skb_get_rxhash() skb_get_rxhash() assumes the network header pointer of the skb is set properly after the commit: commit `bfb564e739` Author: Krishna Kumar <krkumar2@in.ibm.com> Date: Wed Aug 4 06:15:52 2010 +0000 core: Factor out flow calculation from get_rps_cpu Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-19 17:08:37 -07:00
Oliver Hartkopp	2244d07bfa	net: simplify flags for tx timestamping This patch removes the abstraction introduced by the union skb_shared_tx in the shared skb data. The access of the different union elements at several places led to some confusion about accessing the shared tx_flags e.g. in skb_orphan_try(). http://marc.info/?l=linux-netdev&m=128084897415886&w=2 Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-19 00:08:30 -07:00
Jarek Poplawski	e5093aec2e	net: Fix a memmove bug in dev_gro_receive() >Xin Xiaohui wrote: > I looked into the code dev_gro_receive(), found the code here: > if the frags[0] is pulled to 0, then the page will be released, > and memmove() frags left. > Is that right? I'm not sure if memmove do right or not, but > frags[0].size is never set after memove at least. what I think > a simple way is not to do anything if we found frags[0].size == 0. > The patch is as followed. ... This version of the patch fixes the bug directly in memmove. Reported-by: "Xin, Xiaohui" <xiaohui.xin@intel.com> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-17 17:37:28 -07:00
Ben Hutchings	0141480205	ethtool: Provide a default implementation of ethtool_ops::get_drvinfo The driver name and bus address for a net_device can normally be found through the driver model now. Instead of requiring drivers to provide this information redundantly through the ethtool_ops::get_drvinfo operation, use the driver model to do so if the driver does not define the operation. Since ETHTOOL_GDRVINFO no longer requires the driver to implement any operations, do not require net_device::ethtool_ops to be set either. Remove implementations of get_drvinfo and ethtool_ops that provide only this information. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-17 02:31:15 -07:00
Krishna Kumar	bfb564e739	core: Factor out flow calculation from get_rps_cpu Factor out flow calculation code from get_rps_cpu, since other functions can use the same code. Revisions: v2 (Ben): Separate flow calcuation out and use in select queue. v3 (Arnd): Don't re-implement MIN. v4 (Changli): skb->data points to ethernet header in macvtap, and make a fast path. Tested macvtap with this patch. v5 (Changli): - Cache skb->rxhash in skb_get_rxhash - macvtap may not have pow(2) queues, so change code for queue selection. (Arnd): - Use first available queue if all fails. Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-16 21:06:24 -07:00
Johannes Berg	0460079495	cfg80211: support sysfs namespaces Enable using network namespaces with wireless devices even when sysfs is enabled using the same infrastructure that was built for netdevs. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2010-08-16 15:26:40 -04:00
Changli Gao	cece1945bf	net: disable preemption before call smp_processor_id() Although netif_rx() isn't expected to be called in process context with preemption enabled, it'd better handle this case. And this is why get_cpu() is used in the non-RPS #ifdef branch. If tree RCU is selected, rcu_read_lock() won't disable preemption, so preempt_disable() should be called explictly. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-07 20:35:43 -07:00
Jarek Poplawski	ce9e76c845	net: Fix napi_gro_frags vs netpoll path The netpoll_rx_on() check in __napi_gro_receive() skips part of the "common" GRO_NORMAL path, especially "pull:" in dev_gro_receive(), where at least eth header should be copied for entirely paged skbs. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-05 13:21:25 -07:00
David S. Miller	3578b0c8ab	Revert "net: remove zap_completion_queue" This reverts commit `15e83ed788`. As explained by Johannes Berg, the optimization made here is invalid. Or, at best, incomplete. Not only destructor invocation, but conntract entry releasing must be executed outside of hw IRQ context. So just checking "skb->destructor" is insufficient. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-03 00:24:04 -07:00
Changli Gao	a427615e04	net: cleanup inclusion Commit `ab95bfe01f` replaces bridge and macvlan hooks in __netif_receive_skb(), so dev.c doesn't need to include their headers. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-02 22:45:49 -07:00
Stephen Hemminger	de38483010	net: ingress filter message limit If user misconfigures ingress and causes a redirection loop, don't overwhelm the log. This is also a error case so make it unlikely. Found by inspection, luckily not in real system. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-01 00:33:23 -07:00
David S. Miller	bb7e95c8fd	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/bnx2x_main.c Merge bnx2x bug fixes in by hand... :-/ Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-27 21:01:35 -07:00
Changli Gao	a256be70c5	drop_monitor: use genl_register_family_with_ops() [ Fix unused local variable build warnings. -DaveM ] Signed-off-by: Changli Gao <xiaosuo@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-26 20:59:42 -07:00
Ben Greear	c736eefadb	net: dev_forward_skb should call nf_reset With conn-track zones and probably with different network namespaces, the netfilter logic needs to be re-calculated on packet receive. If the netfilter logic is not reset, it will not be recalculated properly. This patch adds the nf_reset logic to dev_forward_skb. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-25 21:58:46 -07:00
Eric Dumazet	fed66381d6	net: pskb_expand_head() optimization Move frags[] at the end of struct skb_shared_info, and make pskb_expand_head() copy only the used part of it instead of whole array. This should avoid kmemcheck warnings and speedup pskb_expand_head() as well, avoiding a lot of cache misses. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-24 21:05:57 -07:00
Stefan Assmann	c1f79426e2	sysfs: add attribute to indicate hw address assignment type Add addr_assign_type to struct net_device and expose it via sysfs. This new attribute has the purpose of giving user-space the ability to distinguish between different assignment types of MAC addresses. For example user-space can treat NICs with randomly generated MAC addresses differently than NICs that have permanent (locally assigned) MAC addresses. For the former udev could write a persistent net rule by matching the device path instead of the MAC address. There's also the case of devices that 'steal' MAC addresses from slave devices. In which it is also be beneficial for user-space to be aware of the fact. This patch also introduces a helper function to assist adoption of drivers that generate MAC addresses randomly. Signed-off-by: Stefan Assmann <sassmann@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-24 20:49:29 -07:00
Andy Shevchenko	451e07a264	net: core: don't use own hex_to_bin() method Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-23 12:50:51 -07:00
David S. Miller	be2b6e6235	net: Fix skb_copy_expand() handling of ->csum_start It should only be adjusted if ip_summed == CHECKSUM_PARTIAL. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-22 13:27:09 -07:00
Andrea Shepard	00c5a9834b	net: Fix corruption of skb csum field in pskb_expand_head() of net/core/skbuff.c Make pskb_expand_head() check ip_summed to make sure csum_start is really csum_start and not csum before adjusting it. This fixes a bug I encountered using a Sun Quad-Fast Ethernet card and VLANs. On my configuration, the sunhme driver produces skbs with differing amounts of headroom on receive depending on the packet size. See line 2030 of drivers/net/sunhme.c; packets smaller than RX_COPY_THRESHOLD have 52 bytes of headroom but packets larger than that cutoff have only 20 bytes. When these packets reach the VLAN driver, vlan_check_reorder_header() calls skb_cow(), which, if the packet has less than NET_SKB_PAD (== 32) bytes of headroom, uses pskb_expand_head() to make more. Then, pskb_expand_head() needs to adjust a lot of offsets into the skb, including csum_start. Since csum_start is a union with csum, if the packet has a valid csum value this will corrupt it, which was the effect I observed. The sunhme hardware computes receive checksums, so the skbs would be created by the driver with ip_summed == CHECKSUM_COMPLETE and a valid csum field, and then pskb_expand_head() would corrupt the csum field, leading to an "hw csum error" message later on, for example in icmp_rcv() for pings larger than the sunhme RX_COPY_THRESHOLD. On the basis of the comment at the beginning of include/linux/skbuff.h, I believe that the csum_start skb field is only meaningful if ip_csummed is CSUM_PARTIAL, so this patch makes pskb_expand_head() adjust it only in that case to avoid corrupting a valid csum value. Please see my more in-depth disucssion of tracking down this bug for more details if you like: http://puellavulnerata.livejournal.com/112186.html http://puellavulnerata.livejournal.com/112567.html http://puellavulnerata.livejournal.com/112891.html http://puellavulnerata.livejournal.com/113096.html http://puellavulnerata.livejournal.com/113591.html I am not subscribed to this list, so please CC me on replies. Signed-off-by: Andrea Shepard <andrea@persephoneslair.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-22 13:25:18 -07:00
David S. Miller	11fe883936	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/vhost/net.c net/bridge/br_device.c Fix merge conflict in drivers/vhost/net.c with guidance from Stephen Rothwell. Revert the effects of net-2.6 commit `573201f36f` since net-next-2.6 has fixes that make bridge netpoll work properly thus we don't need it disabled. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-20 18:25:24 -07:00
Neil Horman	4b706372f1	drop_monitor: Add error code to detect duplicate state changes Patch to add -EAGAIN error to dropwatch netlink message handling code. -EAGAIN will be returned anytime userspace attempts to transition the state of the drop monitor service to a state that its already in. That allows user space to detect this condition, so it doesn't wait for a success ACK that will never arrive. Tested successfully by me Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-20 13:28:04 -07:00
Nicolas Dichtel	d79d991379	__dst_free(): put EXPORT_SYMBOLS after the fct Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-20 13:28:03 -07:00
Eric Dumazet	d6d9ca0fec	net: this_cpu_xxx conversions Use modern this_cpu_xxx() api, saving few bytes on x86 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-19 15:12:51 -07:00
Eric Dumazet	bd27290a59	net: 64bit stats for netdev_queue Since struct netdev_queue tx_bytes/tx_packets/tx_dropped are already protected by _xmit_lock, its easy to convert these fields to u64 instead of unsigned long. This completes 64bit stats for devices using them (vlan, macvlan, ...) Strictly, we could avoid the locking in dev_txq_stats_fold() on 64bit arches, but its slow path and we prefer keep it simple. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-19 09:35:40 -07:00

... 3 4 5 6 7 ...

2228 Commits