Add the multiqueue hardware device support API to the core network
stack. Allow drivers to allocate multiple queues and manage them at
the netdev level if they choose to do so.
Added a new field to sk_buff, namely queue_mapping, for drivers to
know which tx_ring to select based on OS classification of the flow.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a boolean error in the new TX checksum check
that causes bogus TSO packets to be generated.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for configuring secondary unicast addresses on network
devices. To support this devices capable of filtering multiple
unicast addresses need to change their set_multicast_list function
to configure unicast filters as well and assign it to dev->set_rx_mode
instead of dev->set_multicast_list. Other devices are put into promiscous
mode when secondary unicast addresses are present.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use generic net_device address lists for multicast list handling.
Some defines are used to keep drivers working.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce struct dev_addr_list and list maintenance functions
based on dev_mc_list and the related functions. This will be
used by follow-up patches for both multicast and secondary
unicast addresses.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev_mc_add/dev_mc_delete take care of uploading the list when
necessary and thats the only interface other code should use.
Also remove two incorrect calls in DECnet.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The existing model for checksum offload does not correctly handle
devices that can offload IPV4 and IPV6 only. The NETIF_F_HW_CSUM flag
implies device can do any arbitrary protocol.
This patch:
* adds NETIF_F_IPV6_CSUM for those devices
* fixes bnx2 and tg3 devices that need it
* add NETIF_F_IPV6_CSUM to ipv6 output (incl GSO)
* fixes assumptions about NETIF_F_ALL_CSUM in nat
* adjusts bridge union of checksumming computation
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a nested compat attribute type that can be used to convert
attributes that contain a structure to nested attributes in a
backwards compatible way.
The attribute looks like this:
struct {
[ compat contents ]
struct rtattr {
.rta_len = total size,
.rta_type = type,
} rta;
struct old_structure struct;
[ nested top-level attribute ]
struct rtattr {
.rta_len = nest size,
.rta_type = type,
} nest_attr;
[ optional 0 .. n nested attributes ]
struct rtattr {
.rta_len = private attribute len,
.rta_type = private attribute typ,
} nested_attr;
struct nested_data data;
};
Since both userspace and kernel deal correctly with attributes that are
larger than expected old versions will just parse the compat part and
ignore the rest.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently NAT (and others) that want to modify cloned skbs copy them,
even if in the vast majority of cases its not necessary because the
skb is a clone made by TCP and the portion NAT wants to modify is
actually writable because TCP release the header reference before
cloning.
The problem is that there is no clean way for NAT to find out how
long the writable header area is, so this patch introduces skb->hdr_len
to hold this length. When a headerless skb is cloned skb->hdr_len
is set to the current headroom, for regular clones it is copied from
the original. A new function skb_clone_writable(skb, len) returns
whether the skb is writable up to len bytes from skb->data. To avoid
enlarging the skb the mac_len field is reduced to 16 bit and the
new hdr_len field is put in the remaining 16 bit.
I've done a few rough benchmarks of NAT (not with this exact patch,
but a very similar one). As expected it saves huge amounts of system
time in case of sendfile, bringing it down to basically the same
amount as without NAT, with sendmsg it only helps on loopback,
probably because of the large MTU.
Transmit a 1GB file using sendfile/sendmsg over eth0/lo with and
without NAT:
- sendfile eth0, no NAT: sys 0m0.388s
- sendfile eth0, NAT: sys 0m1.835s
- sendfile eth0: NAT + path: sys 0m0.370s (~ -80%)
- sendfile lo, no NAT: sys 0m0.258s
- sendfile lo, NAT: sys 0m2.609s
- sendfile lo, NAT + patch: sys 0m0.260s (~ -90%)
- sendmsg eth0, no NAT: sys 0m2.508s
- sendmsg eth0, NAT: sys 0m2.539s
- sendmsg eth0, NAT + patch: sys 0m2.445s (no change)
- sendmsg lo, no NAT: sys 0m2.151s
- sendmsg lo, NAT: sys 0m3.557s
- sendmsg lo, NAT + patch: sys 0m2.159s (~ -40%)
I expect other users can see a similar performance improvement,
packet mangling iptables targets, ipip and ip_gre come to mind ..
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add rtnetlink API for creating, changing and deleting software devices.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split up rtnl_setlink into a function performing validation and a function
performing the actual changes. This allows to share the modifcation logic
with rtnl_newlink, which is introduced by the next patch.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
>From my recent patch:
> > #1
> > Until kernel ver. 2.6.21 (including) cancel_rearming_delayed_work()
> > required a work function should always (unconditionally) rearm with
> > delay > 0 - otherwise it would endlessly loop. This patch replaces
> > this function with cancel_delayed_work(). Later kernel versions don't
> > require this, so here it's only for uniformity.
But Oleg Nesterov <oleg@tv-sign.ru> found:
> But 2.6.22 doesn't need this change, why it was merged?
>
> In fact, I suspect this change adds a race,
...
His description was right (thanks), so this patch reverts #1.
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Every file should include the headers containing the prototypes for
its global functions.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_clone_fraglist is static so it shouldn't be exported.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
#1
Until kernel ver. 2.6.21 (including) cancel_rearming_delayed_work()
required a work function should always (unconditionally) rearm with
delay > 0 - otherwise it would endlessly loop. This patch replaces
this function with cancel_delayed_work(). Later kernel versions don't
require this, so here it's only for uniformity.
#2
After deleting a timer in cancel_[rearming_]delayed_work() there could
stay a last skb queued in npinfo->txq causing a memory leak after
kfree(npinfo).
Initial patch & testing by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If sky2 device poll routine is called from netpoll_send_skb, it would
deadlock. The netpoll_send_skb held the netif_tx_lock, and the poll
routine could acquire it to clean up skb's. Other drivers might use
same locking model.
The driver is correct, netpoll should not introduce more locking
problems than it causes already. So change the code to drop lock
before calling poll handler.
Signed-off-by: Stephen Hemminger <shemminger@linux.foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Having walked through the entire skbuff, skb_seq_read would leave the
last fragment mapped. As a consequence, the unwary caller would leak
kmaps, and proceed with preempt_count off by one. The only (kind of
non-intuitive) workaround is to use skb_seq_read_abort.
This patch makes sure skb_seq_read always unmaps frag_data after
having cycled through the skb's paged part.
Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This moves the local_irq_enable() call in net_rx_action() to before
calling the CONFIG_NET_DMA's dma_async_memcpy_issue_pending() rather
than after. This shortens the irq disabled window and allows for DMA
drivers that need to do their own irq hold.
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
secmark doesn't depend on CONFIG_NET_SCHED.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When changing the link state from userspace not affecting any other
flags. Two duplicate notification are being sent, once as action
in the NETDEV_UP/NETDEV_DOWN notification chain and a second time
when comparing old and new device flags after the change has been
completed. Although harmless, the duplicates should be avoided.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
ifindex == 0 does not exist and implies we should do a lookup by name if
one was given.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This isn't a bug just yet as only TCP uses sk_setup_caps for GSO.
However, if and when UDP or something else starts using it this is
likely to cause a problem if we forget to add software emulation
for it at the same time.
The problem is that right now we translate GSO emulation to the
bitmask NETIF_F_GSO_MASK, which includes every protocol, even
ones that we cannot emulate.
This patch makes it provide only the ones that we can emulate.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
in4_pton converts a textual representation of an ip4 address
into an integer representation. However, when the textual representation
is of in the form ip:port, e.g. 192.168.1.1:5060, and 'delim' is set to
-1, the function bails out with an error when reading the colon.
It makes sense to allow the colon as a delimiting character without
explicitly having to set it through the 'delim' variable as there can be
no ambiguity in the point where the ip address is completely parsed. This
function is indeed called from nf_conntrack_sip.c in this way to parse
textual ip:port combinations which fails due to the reason stated above.
Signed-off-by: Jerome Borsboom <j.borsboom@erasmusmc.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current IPSEC rule resolution behavior we have does not work for a
lot of people, even though technically it's an improvement from the
-EAGAIN buisness we had before.
Right now we'll block until the key manager resolves the route. That
works for simple cases, but many folks would rather packets get
silently dropped until the key manager resolves the IPSEC rules.
We can't tell these folks to "set the socket non-blocking" because
they don't have control over the non-block setting of things like the
sockets used to resolve DNS deep inside of the resolver libraries in
libc.
With that in mind I coded up the patch below with some help from
Herbert Xu which provides packet-drop behavior during larval state
resolution, controllable via sysctl and off by default.
This lays the framework to either:
1) Make this default at some point or...
2) Move this logic into xfrm{4,6}_policy.c and implement the
ARP-like resolution queue we've all been dreaming of.
The idea would be to queue packets to the policy, then
once the larval state is resolved by the key manager we
re-resolve the route and push the packets out. The
packets would timeout if the rule didn't get resolved
in a certain amount of time.
Signed-off-by: David S. Miller <davem@davemloft.net>
sys_setsockopt() do not check properly timeout values for
SO_RCVTIMEO/SO_SNDTIMEO, for example it's possible to set negative timeout
values. POSIX do not defines behaviour for sys_setsockopt in case negative
timeouts, but requires that setsockopt() shall fail with -EDOM if the send and
receive timeout values are too big to fit into the timeout fields in the socket
structure.
In current implementation negative timeout can lead to error messages like
"schedule_timeout: wrong timeout value".
Proposed patch:
- checks tv_usec and returns -EDOM if it is wrong
- do not allows to set negative timeout values (sets 0 instead) and outputs
ratelimited information message about such attempts.
Signed-off-By: Vasily Averin <vvs@sw.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove some unused variables and function arguments related to the
recently removed wireless extensions over rtnetlink.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
rtnl_setlink doesn't allow to change subsets of the flags, just to override
the set entirely by a new one. This means that for simply setting a device
up or down userspace first needs to query the current flags, change it and
send the changed flags back, which is racy and needlessly complicated.
Mask the flags using ifi_change since this is what it is intended for.
For backwards compatibility treat ifi_change == 0 as ~0 (even though it
seems quite unlikely that anyone has been using this so far).
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kenji Kaneshige found this race between device removal and
registration. On unregister it is possible for the old device to
exist, because sysfs file is still open. A new device with 'eth%d'
will select the same name, but sysfs kobject register will fial.
The following changes the shutdown order slightly. It hold a removes
the sysfs entries earlier (on unregister_netdevice), but holds a
kobject reference. Then when todo runs the actual last put free
happens.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Compiling 2.6.22-rc1 with gcc-3.2.3 for i486 fails with:
gcc -m32 -Wp,-MD,net/core/.skbuff.o.d -nostdinc -isystem /home/mikpe/pkgs/linux-x86/gnu/lib/gcc-lib/i486-pc-linux-gnu/3.2.3/include -D__KERNEL__ -Iinclude -include include/linux/autoconf.h -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -O2 -pipe -msoft-float -mregparm=3 -freg-struct-return -mpreferred-stack-boundary=4 -march=i486 -ffreestanding -maccumulate-outgoing-args -DCONFIG_AS_CFI=1 -Iinclude/asm-i386/mach-default -fomit-frame-pointer -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(skbuff)" -D"KBUILD_MODNAME=KBUILD_STR(skbuff)" -c -o net/core/skbuff.o net/core/skbuff.c
net/core/skbuff.c:648:1: directives may not be used inside a macro argument
net/core/skbuff.c:647:39: unterminated argument list invoking macro "memcpy"
net/core/skbuff.c: In function `pskb_expand_head':
net/core/skbuff.c:651: `memcpy' undeclared (first use in this function)
net/core/skbuff.c:651: (Each undeclared identifier is reported only once
net/core/skbuff.c:651: for each function it appears in.)
net/core/skbuff.c:651: syntax error before "skb"
make[2]: *** [net/core/skbuff.o] Error 1
make[1]: *** [net/core] Error 2
make: *** [net] Error 2
The patch below implements a simple workaround which is to
clone the offending memcpy() call and specialise it for the
two different scenarios.
Other workarounds are of course possible: e.g. bind the varying
parameter in a local variable, or use a macro or inline function
to perform the varying computation.
Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
After initializing dev->_xmit_lock register_netdevice()
sets lockdep class according to dev->type.
Idea of this patch - by David Miller.
Reported & tested by: "Yuriy N. Shkandybin" <jura@netams.com>
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Urgent events may be delayed if we already have a non-urgent event
queued for that device. This patch changes this by making sure that
an urgent event is always looked at immediately.
I've replaced the LW_RUNNING flag by LW_URGENT since whether work
is scheduled is already kept track by the work queue system.
The only complication is that we have to provide some exclusion for
the setting linkwatch_nextevent which is available in the actual
work function.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the jiffies wrap around or when the system boots up for the first
time, down events can be delayed indefinitely since we no longer
update linkwatch_nextevent when only urgent events are processed.
This patch fixes this by setting linkwatch_nextevent when a
wrap-around occurs.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently all link carrier events are delayed by up to a second
before they're processed to prevent link storms. This causes
unnecessary packet loss during that interval.
In fact, we can achieve the same effect in preventing storms by
only delaying down events and unnecssary up events. The latter
is defined as up events when we're already up.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
These days the link watch mechanism is an integral part of the
network subsystem as it manages the carrier status. So it now
makes sense to allocate some memory for it in net_device rather
than allocating it on demand.
In fact, this is necessary because we can't tolerate a memory
allocation failure since that means we'd have to potentially
throw a link up event away.
It also simplifies the code greatly.
In doing so I discovered a subtle race condition in the use
of singleevent. This race condition still exists (and is
somewhat magnified) without singleevent but it's now plugged
thanks to an smp_mb__before_clear_bit.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress. This
patch introduces such notifications and causes them to be used during
suspend and resume transitions. It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).
[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.
Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Upon failure to register "ptype" procfs entry, "softnet_stat" was not
removed, and an incorrect attempt was made to remove the "ptype" entry.
Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cleanup of dev_base list use, with the aim to simplify making device
list per-namespace. In almost every occasion, use of dev_base variable
and dev->next pointer could be easily replaced by for_each_netdev
loop. A few most complicated places were converted to using
first_netdev()/next_netdev().
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add __dev_getfirstbyhwtype for callers that don't want a reference but
some data from the device and thus need to take the rtnl anyway.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu conviced me that a new flag was overkill; every driver
currently overrides get_stats, so we might as well make the internal
one the default. If someone did fail to set get_stats, they would now
get all 0 stats instead of "No statistics available".
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts eefa390628
The simplification made in that change works with the assumption that
the 'offset' parameter to these functions is always positive or zero,
which is not true. It can be and often is negative in order to access
SKB header values in front of skb->data.
Signed-off-by: David S. Miller <davem@davemloft.net>
Make use of add_uevent_var() instead of (often incorrectly) open coding it.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Eric Rannaud <eric.rannaud@gmail.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Provide rename event for when we rename network devices.
Signed-off-by: Jean Tourrilhes <jt@hpl.hp.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch cleans up the call paths from the core code into wext.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves dev/core/wireless.c to net/wireless/wext.c.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide AF_RXRPC sockets that can be used to talk to AFS servers, or serve
answers to AFS clients. KerberosIV security is fully supported. The patches
and some example test programs can be found in:
http://people.redhat.com/~dhowells/rxrpc/
This will eventually replace the old implementation of kernel-only RxRPC
currently resident in net/rxrpc/.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- make the following needlessly global variables static:
- core/rtnetlink.c: struct rtnl_msg_handlers[]
- netfilter/nf_conntrack_proto.c: struct nf_ct_protos[]
- make the following needlessly global functions static:
- core/rtnetlink.c: rtnl_dump_all()
- netlink/af_netlink.c: netlink_queue_skip()
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I noticed recently that, in skb_checksum(), "offset" and "start" are
essentially the same thing and have the same value throughout the
function, despite being computed differently. Using a single variable
allows some cleanups and makes the skb_checksum() function smaller,
more readable, and presumably marginally faster.
We appear to have many other "sk_buff walker" functions built on the
exact same model, so the cleanup applies to them, too. Here is a list
of the functions I found to be affected:
net/appletalk/ddp.c:atalk_sum_skb()
net/core/datagram.c:skb_copy_datagram_iovec()
net/core/datagram.c:skb_copy_and_csum_datagram()
net/core/skbuff.c:skb_copy_bits()
net/core/skbuff.c:skb_store_bits()
net/core/skbuff.c:skb_checksum()
net/core/skbuff.c:skb_copy_and_csum_bit()
net/core/user_dma.c:dma_skb_copy_datagram_iovec()
net/xfrm/xfrm_algo.c:skb_icv_walk()
net/xfrm/xfrm_algo.c:skb_to_sgvec()
OTOH, I admit I'm a bit surprised, the cleanup is rather obvious so I'm
really wondering if I am missing something. Can anyone please comment
on this?
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that Patrick has added the code to deal with GSO in netfilter,
we no longer need the crutch that computes partial checksums just
before transmission.
This patch turns this into a warning again. If this goes OK, we
can then turn it into a BUG_ON and remove the gso_send_check cruft.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
As scheduled, this patch removes the pointless wext over netlink code.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Correct the function name in the comments supplied with
register_netdev()
Signed-off-by: Borislav Petkov <bbpetkov@yahoo.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is far too large to be an inline and not in any hot paths.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spring cleaning time...
There seems to be a lot of places in the network code that have
extra bogus semicolons after conditionals. Most commonly is a
bogus semicolon after: switch() { }
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It isn't any faster to test a boolean global variable than do a simple
check for empty list.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Getting warnings becuase skb_store_bits has skb as constant,
but the function overwrites it. Looks like const was on the
wrong side.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch ingress queueing back to use ingress_lock. qdisc_lock_tree now locks
both the ingress and egress qdiscs on the device. All changes to data that
might be used on both ingress and egress needs to be protected by using
qdisc_lock_tree instead of manually taking dev->queue_lock. Additionally
the qdisc stats_lock needs to be initialized to ingress_lock for ingress
qdiscs.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we're now holding the rtnl during the entire dump operation, we can
remove additional locking for rtnl protected data. This patch does that
for all simple cases (dev_base_lock for dev_base walking, RCU protection
for FIB rule dumping).
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hold rtnl_mutex during the entire netlink dump operation. This allows
to simplify locking in the dump callbacks, since they can now rely on
that no concurrent changes happen.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch cb_lock to mutex and allow netlink kernel users to override it
with a subsystem specific mutex for consistent locking in dump callbacks.
All netlink_dump_start users have been audited not to rely on any
side-effects of the previously used spinlock.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_copy_expand changes the headroom, so it needs to adjust the header
offsets by the difference between the old and the new value.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the bridging hook to be simple function with return value
rather than modifying the skb argument. This could generate better
code and is cleaner.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
When a transmitted packet is looped back directly, CHECKSUM_PARTIAL
maps to the semantics of CHECKSUM_UNNECESSARY. Therefore we should
treat it as such in the stack.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The skb transport pointer is currently used to specify the start
of the checksum region for transmit checksum offload. Unfortunately,
the same pointer is also used during receive side processing.
This creates a problem when we want to retransmit a received
packet with partial checksums since the skb transport pointer
would be overwritten.
This patch solves this problem by creating a new 16-bit csum_start
offset value to replace the skb transport header for the purpose
of checksums. This offset is calculated from skb->head so that
it does not have to change when skb->data changes.
No extra space is required since csum_offset itself fits within
a 16-bit word so we can use the other 16 bits for csum_start.
For backwards compatibility, just before we push a packet with
partial checksums off into the device driver, we set the skb
transport header to what it would have been under the old scheme.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we're increasing the headroom, the header offsets need to be
increased by the same amount as well.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The most common trigger of these errors is that the
config option hasn't been enable wich would make the
functionality available. Therefore returning EOPNOTSUPP
gives a better idea on what is going wrong.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move generic skbuff stuff from XFRM code to generic code so that
AF_RXRPC can use it too.
The kdoc comments I've attached to the functions needs to be checked
by whoever wrote them as I had to make some guesses about the workings
of these functions.
Signed-off-By: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To clearly state the intent of copying to linear sk_buffs, _offset being a
overly long variant but interesting for the sake of saving some bytes.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Network drivers which keep stats allocate their own stats structure
then write a get_stats() function to return them. It would be nice if
this were done by default.
1) Add a new "stats" field to "struct net_device".
2) Add a new feature field to say "this driver uses the internal one"
3) Have a default "get_stats" which returns NULL if that feature not set.
4) Change callers to check result of get_stats call for NULL, not if
->get_stats is set.
This should not break backwards compatibility with older drivers, yet
allow modern drivers to shed some boilerplate code.
Lightly tested: works for a modified lguest network driver.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
To clearly state the intent of copying from linear sk_buffs, _offset being a
overly long variant but interesting for the sake of saving some bytes.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The results of FIB rules lookups are cached in the routing cache
except for IPv6 as no such cache exists. So far, it was the
responsibility of the user to flush the cache after modifying any
rules. This lead to many false bug reports due to misunderstanding
of this concept.
This patch automatically flushes the route cache after inserting
or deleting a rule.
Thanks to Muli Ben-Yehuda <muli@il.ibm.com> for catching a bug
in the previous patch.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The use of nop rules simplifies the usage of goto rules
and adds more flexibility as they allow targets to remain
while the actual content of the branches can change easly.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rules which match against device names in their selector can
remain while the device itself disappears, in fact the device
doesn't have to present when the rule is added in the first
place. The device name is resolved by trying when the rule is
added and later by listening to NETDEV_REGISTER/UNREGISTER
notifications.
This patch adds the flag FIB_RULE_DEV_DETACHED which is set
towards userspace when a rule contains a device match which
is unresolved at the moment. This eases spotting the reason
why certain rules seem not to function properly.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new rule action FR_ACT_GOTO which allows
to skip a set of rules by jumping to another rule. The rule
to jump to is specified via the FRA_GOTO attribute which
carries a rule preference.
Referring to a rule which doesn't exists is explicitely allowed.
Such goto rules are marked with the flag FIB_RULE_UNRESOLVED
and will act like a rule with a non-matching selector. The rule
will become functional as soon as its target is present.
The goto action enables performance optimizations by reducing
the average number of rules that have to be passed per lookup.
Example:
0: from all lookup local
40: not from all to 192.168.23.128 goto 32766
41: from all fwmark 0xa blackhole
42: from all fwmark 0xff blackhole
32766: from all lookup main
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that all users of netlink_dump_start() use netlink_run_queue()
to process the receive queue, it is possible to return -EINTR from
netlink_dump_start() directly, therefore simplying the callers.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The error pointer argument in netlink message handlers is used
to signal the special case where processing has to be interrupted
because a dump was started but no error happened. Instead it is
simpler and more clear to return -EINTR and have netlink_run_queue()
deal with getting the queue right.
nfnetlink passed on this error pointer to its subsystem handlers
but only uses it to signal the start of a netlink dump. Therefore
it can be removed there as well.
This patch also cleans up the error handling in the affected
message handlers to be consistent since it had to be touched anyway.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changes netlink_rcv_skb() to skip netlink controll messages and don't
pass them on to the message handler.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
netlink_rcv_skb() is changed to skip messages which don't have the
NLM_F_REQUEST bit to avoid every netlink family having to perform this
check on their own.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implements a unified, protocol independant rules dumping function
which is capable of both, dumping a specific protocol family or
all of them. This speeds up dumping as less lookups are required.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new interface to register rtnetlink message
handlers replacing the exported rtnl_links[] array which
required many message handlers to be exported unnecessarly.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
I did it just in alloc_skb_from_cache, forgot __alloc_skb, fixed now.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now to convert the last one, skb->data, that will allow many simplifications
and removal of some of the offset helpers.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes
on 64bit architectures, allowing us to combine the 4 bytes hole left by the
layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4
64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN...
:-)
Many calculations that previously required that skb->{transport,network,
mac}_header be first converted to a pointer now can be done directly, being
meaningful as offsets or pointers.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Renaming skb->h to skb->transport_header, skb->nh to skb->network_header and
skb->mac to skb->mac_header, to match the names of the associated helpers
(skb[_[re]set]_{transport,network,mac}_header).
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For the common sequence "skb->h.raw - skb->nh.raw", similar to skb->mac_len,
that is precalculated tho, don't think we need to bloat skb with one more
member, so just use this new helper, reducing the number of non-skbuff.h
references to the layer headers even more.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This unifies the codes to copy netfilter related datas. Note that
__nf_copy() assumes destination skb doesn't have any netfilter
related members.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>