Commit Graph

200211 Commits

Author SHA1 Message Date
Changli Gao
210d6de78c act_mirred: don't clone skb when skb isn't shared
don't clone skb when skb isn't shared

When the tcf_action is TC_ACT_STOLEN, and the skb isn't shared, we don't need
to clone a new skb. As the skb will be freed after this function returns, we
can use it freely once we get a reference to it.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
 include/net/sch_generic.h |   11 +++++++++--
 net/sched/act_mirred.c    |    6 +++---
 2 files changed, 12 insertions(+), 5 deletions(-)
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-28 23:24:32 -07:00
Eric Dumazet
c4ead4c595 tcp: tso_fragment() might avoid GFP_ATOMIC
We can pass a gfp argument to tso_fragment() and avoid GFP_ATOMIC
allocations sometimes.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-28 23:24:31 -07:00
Eric Dumazet
9618e2ffd7 vlan: 64 bit rx counters
Use u64_stats_sync infrastructure to implement 64bit rx stats.

(tx stats are addressed later)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-28 23:24:31 -07:00
Eric Dumazet
bc66154efe macvlan: 64 bit rx counters
Use u64_stats_sync infrastructure to implement 64bit stats.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-28 23:24:30 -07:00
Eric Dumazet
33d91f00c7 net: u64_stats_fetch_begin_bh() and u64_stats_fetch_retry_bh()
- Must disable preemption in case of 32bit UP in u64_stats_fetch_begin()
and u64_stats_fetch_retry()

- Add new u64_stats_fetch_begin_bh() and u64_stats_fetch_retry_bh() for
network usage, disabling BH on 32bit UP only.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-28 23:24:30 -07:00
Eric Dumazet
7a9b2d5950 net: use this_cpu_ptr()
use this_cpu_ptr(p) instead of per_cpu_ptr(p, smp_processor_id())

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-28 23:24:29 -07:00
Eric Dumazet
b6b3ecc71a net: u64_stats_sync improvements
- Add a comment about interrupts:

6) If counter might be written by an interrupt, readers should block
interrupts.

- Fix a typo in sample of use.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-28 23:24:29 -07:00
Ben Hutchings
a095cfc40e 3c59x: Specify window explicitly for access to windowed registers
Currently much of the code assumes that a specific window has been
selected, while a few functions save and restore the window.  This
makes it impossible to introduce fine-grained locking.

Make those assumptions explicit by introducing wrapper functions
to set the window and read/write a register.  Use these everywhere
except vortex_interrupt(), vortex_start_xmit() and vortex_rx().
These set the window just once, or not at all in the case of
vortex_rx() as it should always be called from vortex_interrupt().

Cache the current window in struct vortex_private to avoid
unnecessary hardware writes.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Tested-by: Arne Nordmark <nordmark@mech.kth.se> [against 2.6.32]
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-28 23:19:18 -07:00
Amerigo Wang
d2ef859034 mlx4: add dynamic LRO disable support
This patch adds dynamic LRO diable support for mlx4 net driver.
It also fixes a bug of mlx4, which checks NETIF_F_LRO flag in rx
path without rtnl lock.

(I don't have mlx4 card, so only did compiling test. Anyone who wants
to test this is more than welcome.)

This is based on Neil's initial work too, and heavily modified based
on Stanislaw's suggestions.

Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: Neil Horman <nhorman@redhat.com>
Acked-by: Neil Horman <nhorman@redhat.com>
Reviewed-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-28 23:04:10 -07:00
Jon Mason
958de1931c s2io: add dynamic LRO disable support
This patch adds dynamic LRO disable support for s2io net driver,
enables LRO by default, increases the driver version number, and
corrects the name of the LRO modparm.

This is mostly Wang's patch based on Neil's initial work, heavily
modified based on Ramkrishna's suggestions.  This has been tested on
a Neterion Xframe adapter and verified via adapter LRO statistics.

Signed-off-by: Jon Mason <jon.mason@exar.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: Neil Horman <nhorman@redhat.com>
Acked-by: Neil Horman <nhorman@redhat.com>
Reviewed-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Ramkrishna Vepa <Ramkrishna.Vepa@exar.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-28 23:04:10 -07:00
Florian Westphal
172d69e63c syncookies: add support for ECN
Allows use of ECN when syncookies are in effect by encoding ecn_ok
into the syn-ack tcp timestamp.

While at it, remove a uneeded #ifdef CONFIG_SYN_COOKIES.
With CONFIG_SYN_COOKIES=nm want_cookie is ifdef'd to 0 and gcc
removes the "if (0)".

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-26 22:00:03 -07:00
Florian Westphal
734f614bc1 syncookies: do not store rcv_wscale in tcp timestamp
As pointed out by Fernando Gont there is no need to encode rcv_wscale
into the cookie.

We did not use the restored rcv_wscale anyway; it is recomputed
via tcp_select_initial_window().

Thus we can save 4 bits in the ts option space by removing rcv_wscale.
In case window scaling was not supported, we set the (invalid) wscale
value 0xf.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-26 22:00:03 -07:00
Eric Dumazet
9587c6ddd4 ipv6: remove ipv6_statistics
commit 9261e53701 (ipv6: making ip and icmp statistics per/namespace)
forgot to remove ipv6_statistics variable.

commit bc417d99bf (ipv6: remove stale MIB definitions) took care of
icmpv6_statistics & icmpv6msg_statistics

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Denis V. Lunev <den@openvz.org>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 21:33:17 -07:00
Eric Dumazet
1823e4c80e snmp: add align parameter to snmp_mib_init()
In preparation for 64bit snmp counters for some mibs,
add an 'align' parameter to snmp_mib_init(), instead
of assuming mibs only contain 'unsigned long' fields.

Callers can use __alignof__(type) to provide correct
alignment.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 21:33:17 -07:00
Eric Dumazet
5eaa0bd81f loopback: use u64_stats_sync infrastructure
Commit 6b10de38f0 (loopback: Implement 64bit stats on 32bit arches)
introduced 64bit stats in loopback driver, using a private seqcount and
private helpers.

David suggested to introduce a generic infrastructure, added in (net:
Introduce u64_stats_sync infrastructure)

This patch reimplements loopback 64bit stats using the u64_stats_sync
infrastructure.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 21:33:16 -07:00
Eric Dumazet
4b4194c40f arp: RCU change in arp_solicit()
Avoid two atomic ops in arp_solicit()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 21:33:16 -07:00
Gerrit Renker
59b80802a8 dccp: make implementation of Syn-RTT symmetric
This patch is thanks to Andre Noll who reported the issue and helped testing.

The Syn-RTT sampled during the initial handshake currently only works for
the client sending the DCCP-Request. TFRC penalizes the absence of an RTT
sample with a very slow initial speed (1 packet per second), which delays
slow-start significantly, resulting in sluggish performance.

This patch mirrors the "Syn RTT" principle by adding a timestamp also onto
the DCCP-Response, producing an RTT sample  when the (Data)Ack completing
the handshake arrives.

Also changed the documentation to 'TFRC' since Syn RTTs are also used by CCID-4.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 21:33:15 -07:00
Gerrit Renker
a7d13fbf85 dccp: remove unused function argument
This removes an unused 'sk' argument from several option-inserting functions.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 21:33:14 -07:00
Eric Benard
87cad5c385 net/fec: clean suspend/resume
Commit 59d4289b83 converted fec to dev_pm_ops but
didn't update the suspend/resume functions thus leading to the following warning :
"initialization from incompatible pointer type" when CONFIG_PM is set.

This patch also fixe a few indentation and style around CONFIG_PM area.

Signed-off-by: Eric Bénard <eric@eukrea.com>
Cc: netdev@vger.kernel.org
Cc: davem@davemloft.net
Cc: amit.kucheria@canonical.com
Cc: s.hauer@pengutronix.de
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 21:33:14 -07:00
Divy Le Ray
d05f6cf01c cxgb3: request 7.10 firmware
The driver requests FW 7.10
Bump up driver version.

Signed-off-by: Divy Le Ray <divy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 21:33:13 -07:00
Divy Le Ray
17acad30bf cxgb3: update FW to 7.10
Update FW to 7.10

Signed-off-by: Divy Le Ray <divy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 21:33:13 -07:00
Joe Perches
f9467eaec3 net/core/pktgen.c: Use pr_<level>
Add pr_fmt(fmt) KBUILD_MODNAME ": " fmt
Remove "pktgen: " from formats
Convert printks to pr_<level>
Added func_enter() for debugging
Moved version to end of string at module_init
Coalesced long formats

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 21:33:12 -07:00
Hagen Paul Pfeifer
01f2f3f6ef net: optimize Berkeley Packet Filter (BPF) processing
Gcc is currenlty not in the ability to optimize the switch statement in
sk_run_filter() because of dense case labels. This patch replace the
OR'd labels with ordered sequenced case labels. The sk_chk_filter()
function is modified to patch/replace the original OPCODES in a
ordered but equivalent form. gcc is now in the ability to transform the
switch statement in sk_run_filter into a jump table of complexity O(1).

Until this patch gcc generates a sequence of conditional branches (O(n) of 567
byte .text segment size (arch x86_64):

7ff: 8b 06                 mov    (%rsi),%eax
801: 66 83 f8 35           cmp    $0x35,%ax
805: 0f 84 d0 02 00 00     je     adb <sk_run_filter+0x31d>
80b: 0f 87 07 01 00 00     ja     918 <sk_run_filter+0x15a>
811: 66 83 f8 15           cmp    $0x15,%ax
815: 0f 84 c5 02 00 00     je     ae0 <sk_run_filter+0x322>
81b: 77 73                 ja     890 <sk_run_filter+0xd2>
81d: 66 83 f8 04           cmp    $0x4,%ax
821: 0f 84 17 02 00 00     je     a3e <sk_run_filter+0x280>
827: 77 29                 ja     852 <sk_run_filter+0x94>
829: 66 83 f8 01           cmp    $0x1,%ax
[...]

With the modification the compiler translate the switch statement into
the following jump table fragment:

7ff: 66 83 3e 2c           cmpw   $0x2c,(%rsi)
803: 0f 87 1f 02 00 00     ja     a28 <sk_run_filter+0x26a>
809: 0f b7 06              movzwl (%rsi),%eax
80c: ff 24 c5 00 00 00 00  jmpq   *0x0(,%rax,8)
813: 44 89 e3              mov    %r12d,%ebx
816: e9 43 03 00 00        jmpq   b5e <sk_run_filter+0x3a0>
81b: 41 89 dc              mov    %ebx,%r12d
81e: e9 3b 03 00 00        jmpq   b5e <sk_run_filter+0x3a0>

Furthermore, I reordered the instructions to reduce cache line misses by
order the most common instruction to the start.

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 21:33:12 -07:00
Ben Hutchings
bd97a63f7d sfc: Log clearer error messages for hardware monitor
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 21:03:32 -07:00
Ben Hutchings
477e54eba4 sfc: Use Toeplitz IPv4 hash for RSS and hash insertion
Insertion of the Falcon hash is unreliable.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 21:03:32 -07:00
Ben Hutchings
5d3a6fca95 sfc: Move siena_nic_data::ipv6_rss_key to efx_nic::rx_hash_key
We will use this hash key for Toeplitz IPv4 hashing too.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 21:03:31 -07:00
Ben Hutchings
604f6049ba sfc: Fix reading of inserted hash
The hash appears immediately before the packet data, not at the
beginning of the buffer. This means we can easily use negative offsets
from the start of packet data, so adjust the data and length at the
top of __efx_rx_packet() instead of wherever we consume the hash.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 21:03:30 -07:00
Vasanthy Kolluri
29046f9b1e enic: Clean ups
1) Update copyright
2) Fix hardware queue descriptor field size CQ_ENET_RQ_DESC_FCOE_SOF_BITS
3) Include rtnetlink.h instead of if_link.h
4) Selectively flush writes to interrupt mask register
5) Use pci_enable_device_mem
6) Remove unused variables and header files
7) Fix size mismatch between memory alloc and free operations of a variable
8) Check for non null arguments to vic_provinfo_alloc

Signed-off-by: Scott Feldman <scofeldm@cisco.com>
Signed-off-by: Vasanthy Kolluri <vkolluri@cisco.com>
Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 20:50:30 -07:00
Vasanthy Kolluri
506e119841 enic: Bug Fix: Handle surprise hardware removals
Handle surprise hardware removals gracefully during devcmd issue and init,
cleanup of queues.

Signed-off-by: Scott Feldman <scofeldm@cisco.com>
Signed-off-by: Vasanthy Kolluri <vkolluri@cisco.com>
Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 20:50:25 -07:00
Vasanthy Kolluri
1825aca667 enic: Feature Add: Add loopback capability to enic devices
Hardware has the loopback capability to queue the packets transmitted from
a device to the receive queue of the same device. enic now supports the
loopback capability.

Signed-off-by: Scott Feldman <scofeldm@cisco.com>
Signed-off-by: Vasanthy Kolluri <vkolluri@cisco.com>
Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 20:50:24 -07:00
Vasanthy Kolluri
b5bab85c15 enic: Use receive queue buffer blocks of 32/64 entries
Change the receive queue buffer allocations into blocks of 32 entries when
ring size is less than 64, otherwise use 64 entries per block.

Signed-off-by: Scott Feldman <scofeldm@cisco.com>
Signed-off-by: Vasanthy Kolluri <vkolluri@cisco.com>
Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 20:50:24 -07:00
Vasanthy Kolluri
70feadf36d enic: Add new firmware devcmds
Add new firmware devcmds - CMD_PROXY_BY_BDF, CMD_PACKET_FILTER_ALL,
CMD_ENABLE_WAIT.

Signed-off-by: Scott Feldman <scofeldm@cisco.com>
Signed-off-by: Vasanthy Kolluri <vkolluri@cisco.com>
Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 20:50:23 -07:00
Vasanthy Kolluri
a7a79debcc enic: Use (netdev|dev|pr)_<level> macro helpers for logging
Replace all printk routines with the (netdev|dev|pr)_<level> macros that
provide verbose logs.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Scott Feldman <scofeldm@cisco.com>
Signed-off-by: Vasanthy Kolluri <vkolluri@cisco.com>
Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 20:50:23 -07:00
Vasanthy Kolluri
383ab92f11 enic: Clean up: Add wrapper routines for firmware devcmd calls
Add wrapper routines that issue devcmds to firmware and ensure that a
devcmd lock is held for each devcmd call.

Signed-off-by: Scott Feldman <scofeldm@cisco.com>
Signed-off-by: Vasanthy Kolluri <vkolluri@cisco.com>
Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 20:46:40 -07:00
Vasanthy Kolluri
99ef563901 enic: Use a lighter reset operation for enic devices
The port profile information for a dynamic enic device is set by the upper
layers, that are oblivious to the device reset operation. We do not want a
reset operation erase the network state of a dynamic enic device as there
is no way to set up the port profile information again. Hence a lighter
reset operation called hang reset is used. Hang reset, unlike soft reset
does not reset the network state and resets the host side state only.

Signed-off-by: Scott Feldman <scofeldm@cisco.com>
Signed-off-by: Vasanthy Kolluri <vkolluri@cisco.com>
Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 20:46:01 -07:00
Vasanthy Kolluri
f8cac14acf enic: Bug Fix: Change hardware ingress vlan rewrite mode
The current ingress vlan rewrite mode setting lets the hardware strip off
the tag control information of a packet received on native vlan. As a
result, the priority bits are also lost. The fix is to change the ingress
vlan rewrite mode setting such that the complete tag control information is
retained for packets that belong to native vlan.

Signed-off-by: Scott Feldman <scofeldm@cisco.com>
Signed-off-by: Vasanthy Kolluri <vkolluri@cisco.com>
Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 20:45:22 -07:00
Vasanthy Kolluri
88132f55d7 enic: Feature Add: Replace LRO with GRO
enic now uses the GRO mechanism instead of LRO to pass skbs to upper
layers.

Signed-off-by: Scott Feldman <scofeldm@cisco.com>
Signed-off-by: Vasanthy Kolluri <vkolluri@cisco.com>
Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 20:45:00 -07:00
Michael Chan
72b8a169db cnic: Update version to 2.1.3.
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 20:37:21 -07:00
Michael Chan
b177a5d5d8 cnic: Further unify kcq handling code.
This eliminates some of the duplicate code for the various devices
that require the same basic kcq handling.

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 20:37:20 -07:00
Michael Chan
644b9d4f8b cnic: Restructure kcq processing.
By doing more work in the common function cnic_get_kcqes(), and
making full use of the kcq_info structure.

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 20:37:20 -07:00
Michael Chan
e6c2889478 cnic: Unify kcq allocation for all devices.
By creating a common data stucture kcq_info for all devices, the kcq
(kernel completion queue) for all devices can be allocated by common
code.

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 20:37:19 -07:00
Michael Chan
66fee9ed03 cnic: Unify IRQ code for all hardware types.
By creating a common cnic_doirq().

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 20:37:19 -07:00
Michael Chan
520efdf44f cnic: Fine-tune CID memory space calculation.
The current code makes assumptions about the CID (context ID) memory
space and starting CID that may not be always correct when firmware
changes.  In particular, BNX2_ISCSI_START_CID may not always be fixed.
We now calculate cp->max_cid_space and cp->iscsi_start_cid dynamically
instead of using fixed constants.  The unused cp->max_iscsi_conn is also
eliminated.

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 20:37:18 -07:00
Ben Hutchings
39c9cf0707 sfc: Record hardware RX hash on each skb where possible
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-24 22:13:24 -07:00
Ben Hutchings
2822235278 sfc: Disable setting feature flags that are not implemented
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-24 22:13:23 -07:00
Ben Hutchings
c5d5f5fdc7 sfc: Replace EFX_DRIVER_NAME with KBUILD_MODNAME
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-24 22:13:23 -07:00
Ben Hutchings
62776d034c sfc: Implement message level control
Replace EFX_ERR() with netif_err(), EFX_INFO() with netif_info(),
EFX_LOG() with netif_dbg() and EFX_TRACE() and EFX_REGDUMP() with
netif_vdbg().

Replace EFX_ERR_RL(), EFX_INFO_RL() and EFX_LOG_RL() using explicit
calls to net_ratelimit().

Implement the ethtool operations to get and set message level flags,
and add a 'debug' module parameter for the initial value.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-24 22:13:22 -07:00
Ben Hutchings
0c605a2061 sfc: Log MTD errors using partition name, not just net device name
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-24 22:13:22 -07:00
Ben Hutchings
5b98c1bfcf sfc: Implement ethtool register dump operation
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-24 22:13:14 -07:00
Konstantin Khorenko
565b7b2d2e tcp: do not send reset to already closed sockets
i've found that tcp_close() can be called for an already closed
socket, but still sends reset in this case (tcp_send_active_reset())
which seems to be incorrect.  Moreover, a packet with reset is sent
with different source port as original port number has been already
cleared on socket.  Besides that incrementing stat counter for
LINUX_MIB_TCPABORTONCLOSE also does not look correct in this case.

Initially this issue was found on 2.6.18-x RHEL5 kernel, but the same
seems to be true for the current mainstream kernel (checked on
2.6.35-rc3).  Please, correct me if i missed something.

How that happens:

1) the server receives a packet for socket in TCP_CLOSE_WAIT state
   that triggers a tcp_reset():

Call Trace:
 <IRQ>  [<ffffffff8025b9b9>] tcp_reset+0x12f/0x1e8
 [<ffffffff80046125>] tcp_rcv_state_process+0x1c0/0xa08
 [<ffffffff8003eb22>] tcp_v4_do_rcv+0x310/0x37a
 [<ffffffff80028bea>] tcp_v4_rcv+0x74d/0xb43
 [<ffffffff8024ef4c>] ip_local_deliver_finish+0x0/0x259
 [<ffffffff80037131>] ip_local_deliver+0x200/0x2f4
 [<ffffffff8003843c>] ip_rcv+0x64c/0x69f
 [<ffffffff80021d89>] netif_receive_skb+0x4c4/0x4fa
 [<ffffffff80032eca>] process_backlog+0x90/0xec
 [<ffffffff8000cc50>] net_rx_action+0xbb/0x1f1
 [<ffffffff80012d3a>] __do_softirq+0xf5/0x1ce
 [<ffffffff8001147a>] handle_IRQ_event+0x56/0xb0
 [<ffffffff8006334c>] call_softirq+0x1c/0x28
 [<ffffffff80070476>] do_softirq+0x2c/0x85
 [<ffffffff80070441>] do_IRQ+0x149/0x152
 [<ffffffff80062665>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff80008a2e>] __handle_mm_fault+0x6cd/0x1303
 [<ffffffff80008903>] __handle_mm_fault+0x5a2/0x1303
 [<ffffffff80033a9d>] cache_free_debugcheck+0x21f/0x22e
 [<ffffffff8006a263>] do_page_fault+0x49a/0x7dc
 [<ffffffff80066487>] thread_return+0x89/0x174
 [<ffffffff800c5aee>] audit_syscall_exit+0x341/0x35c
 [<ffffffff80062e39>] error_exit+0x0/0x84

tcp_rcv_state_process()
...  // (sk_state == TCP_CLOSE_WAIT here)
...
        /* step 2: check RST bit */
        if(th->rst) {
                tcp_reset(sk);
                goto discard;
        }
...
---------------------------------
tcp_rcv_state_process
 tcp_reset
  tcp_done
   tcp_set_state(sk, TCP_CLOSE);
     inet_put_port
      __inet_put_port
       inet_sk(sk)->num = 0;

   sk->sk_shutdown = SHUTDOWN_MASK;

2) After that the process (socket owner) tries to write something to
   that socket and "inet_autobind" sets a _new_ (which differs from
   the original!) port number for the socket:

 Call Trace:
  [<ffffffff80255a12>] inet_bind_hash+0x33/0x5f
  [<ffffffff80257180>] inet_csk_get_port+0x216/0x268
  [<ffffffff8026bcc9>] inet_autobind+0x22/0x8f
  [<ffffffff80049140>] inet_sendmsg+0x27/0x57
  [<ffffffff8003a9d9>] do_sock_write+0xae/0xea
  [<ffffffff80226ac7>] sock_writev+0xdc/0xf6
  [<ffffffff800680c7>] _spin_lock_irqsave+0x9/0xe
  [<ffffffff8001fb49>] __pollwait+0x0/0xdd
  [<ffffffff8008d533>] default_wake_function+0x0/0xe
  [<ffffffff800a4f10>] autoremove_wake_function+0x0/0x2e
  [<ffffffff800f0b49>] do_readv_writev+0x163/0x274
  [<ffffffff80066538>] thread_return+0x13a/0x174
  [<ffffffff800145d8>] tcp_poll+0x0/0x1c9
  [<ffffffff800c56d3>] audit_syscall_entry+0x180/0x1b3
  [<ffffffff800f0dd0>] sys_writev+0x49/0xe4
  [<ffffffff800622dd>] tracesys+0xd5/0xe0

3) sendmsg fails at last with -EPIPE (=> 'write' returns -EPIPE in userspace):

F: tcp_sendmsg1 -EPIPE: sk=ffff81000bda00d0, sport=49847, old_state=7, new_state=7, sk_err=0, sk_shutdown=3

Call Trace:
 [<ffffffff80027557>] tcp_sendmsg+0xcb/0xe87
 [<ffffffff80033300>] release_sock+0x10/0xae
 [<ffffffff8016f20f>] vgacon_cursor+0x0/0x1a7
 [<ffffffff8026bd32>] inet_autobind+0x8b/0x8f
 [<ffffffff8003a9d9>] do_sock_write+0xae/0xea
 [<ffffffff80226ac7>] sock_writev+0xdc/0xf6
 [<ffffffff800680c7>] _spin_lock_irqsave+0x9/0xe
 [<ffffffff8001fb49>] __pollwait+0x0/0xdd
 [<ffffffff8008d533>] default_wake_function+0x0/0xe
 [<ffffffff800a4f10>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800f0b49>] do_readv_writev+0x163/0x274
 [<ffffffff80066538>] thread_return+0x13a/0x174
 [<ffffffff800145d8>] tcp_poll+0x0/0x1c9
 [<ffffffff800c56d3>] audit_syscall_entry+0x180/0x1b3
 [<ffffffff800f0dd0>] sys_writev+0x49/0xe4
 [<ffffffff800622dd>] tracesys+0xd5/0xe0

tcp_sendmsg()
...
        /* Wait for a connection to finish. */
        if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) {
                int old_state = sk->sk_state;
                if ((err = sk_stream_wait_connect(sk, &timeo)) != 0) {
if (f_d && (err == -EPIPE)) {
        printk("F: tcp_sendmsg1 -EPIPE: sk=%p, sport=%u, old_state=%d, new_state=%d, "
                "sk_err=%d, sk_shutdown=%d\n",
                sk, ntohs(inet_sk(sk)->sport), old_state, sk->sk_state,
                sk->sk_err, sk->sk_shutdown);
        dump_stack();
}
                        goto out_err;
                }
        }
...

4) Then the process (socket owner) understands that it's time to close
   that socket and does that (and thus triggers sending reset packet):

Call Trace:
...
 [<ffffffff80032077>] dev_queue_xmit+0x343/0x3d6
 [<ffffffff80034698>] ip_output+0x351/0x384
 [<ffffffff80251ae9>] dst_output+0x0/0xe
 [<ffffffff80036ec6>] ip_queue_xmit+0x567/0x5d2
 [<ffffffff80095700>] vprintk+0x21/0x33
 [<ffffffff800070f0>] check_poison_obj+0x2e/0x206
 [<ffffffff80013587>] poison_obj+0x36/0x45
 [<ffffffff8025dea6>] tcp_send_active_reset+0x15/0x14d
 [<ffffffff80023481>] dbg_redzone1+0x1c/0x25
 [<ffffffff8025dea6>] tcp_send_active_reset+0x15/0x14d
 [<ffffffff8000ca94>] cache_alloc_debugcheck_after+0x189/0x1c8
 [<ffffffff80023405>] tcp_transmit_skb+0x764/0x786
 [<ffffffff8025df8a>] tcp_send_active_reset+0xf9/0x14d
 [<ffffffff80258ff1>] tcp_close+0x39a/0x960
 [<ffffffff8026be12>] inet_release+0x69/0x80
 [<ffffffff80059b31>] sock_release+0x4f/0xcf
 [<ffffffff80059d4c>] sock_close+0x2c/0x30
 [<ffffffff800133c9>] __fput+0xac/0x197
 [<ffffffff800252bc>] filp_close+0x59/0x61
 [<ffffffff8001eff6>] sys_close+0x85/0xc7
 [<ffffffff800622dd>] tracesys+0xd5/0xe0

So, in brief:

* a received packet for socket in TCP_CLOSE_WAIT state triggers
  tcp_reset() which clears inet_sk(sk)->num and put socket into
  TCP_CLOSE state

* an attempt to write to that socket forces inet_autobind() to get a
  new port (but the write itself fails with -EPIPE)

* tcp_close() called for socket in TCP_CLOSE state sends an active
  reset via socket with newly allocated port

This adds an additional check in tcp_close() for already closed
sockets. We do not want to send anything to closed sockets.

Signed-off-by: Konstantin Khorenko <khorenko@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-24 21:54:58 -07:00