There is no need to set rqstp->rq_server to serv, while serv is initialized as rqstp->rq_server at previous line. And between these two lines, there is no change to rqstp->rq_server.
Signed-off-by: ideawu <ideawu@163.com>
Reviewed-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Allow the NFSv4 server to make use of TCP autotuning behaviour, which
was previously disabled by setting the sk_userlocks variable.
Set the receive buffers to be big enough to receive the whole RPC
request, and set this for the listening socket, not the accept socket.
Remove the code that readjusts the receive/send buffer sizes for the
accepted socket. Previously this code was used to influence the TCP
window management behaviour, which is no longer needed when autotuning
is enabled.
This can improve IO bandwidth on networks with high bandwidth-delay
products, where a large tcp window is required. It also simplifies
performance tuning, since getting adequate tcp buffers previously
required increasing the number of nfsd threads.
Signed-off-by: Olga Kornievskaia <aglo@citi.umich.edu>
Cc: Jim Rees <rees@umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Add /proc/fs/nfsd/pool_stats to export to userspace various
statistics about the operation of rpc server thread pools.
This patch is based on a forward-ported version of
knfsd-add-pool-thread-stats which has been shipping in the SGI
"Enhanced NFS" product since 2006 and which was previously
posted:
http://article.gmane.org/gmane.linux.nfs/10375
It has also been updated thus:
* moved EXPORT_SYMBOL() to near the function it exports
* made the new struct struct seq_operations const
* used SEQ_START_TOKEN instead of ((void *)1)
* merged fix from SGI PV 990526 "sunrpc: use dprintk instead of
printk in svc_pool_stats_*()" by Harshula Jayasuriya.
* merged fix from SGI PV 964001 "Crash reading pool_stats before
nfsds are started".
Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: Harshula Jayasuriya <harshula@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Avoid overloading the CPU scheduler with enormous load averages
when handling high call-rate NFS loads. When the knfsd bottom half
is made aware of an incoming call by the socket layer, it tries to
choose an nfsd thread and wake it up. As long as there are idle
threads, one will be woken up.
If there are lot of nfsd threads (a sensible configuration when
the server is disk-bound or is running an HSM), there will be many
more nfsd threads than CPUs to run them. Under a high call-rate
low service-time workload, the result is that almost every nfsd is
runnable, but only a handful are actually able to run. This situation
causes two significant problems:
1. The CPU scheduler takes over 10% of each CPU, which is robbing
the nfsd threads of valuable CPU time.
2. At a high enough load, the nfsd threads starve userspace threads
of CPU time, to the point where daemons like portmap and rpc.mountd
do not schedule for tens of seconds at a time. Clients attempting
to mount an NFS filesystem timeout at the very first step (opening
a TCP connection to portmap) because portmap cannot wake up from
select() and call accept() in time.
Disclaimer: these effects were observed on a SLES9 kernel, modern
kernels' schedulers may behave more gracefully.
The solution is simple: keep in each svc_pool a counter of the number
of threads which have been woken but have not yet run, and do not wake
any more if that count reaches an arbitrary small threshold.
Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
synthetic client threads simulating an rsync (i.e. recursive directory
listing) workload reading from an i386 RH9 install image (161480
regular files in 10841 directories) on the server. That tree is small
enough to fill in the server's RAM so no disk traffic was involved.
This setup gives a sustained call rate in excess of 60000 calls/sec
before being CPU-bound on the server. The server was running 128 nfsds.
Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
taking 5.2%. This patch drops those contributions to 3.0% and 2.2%.
Load average was over 120 before the patch, and 20.9 after.
This patch is a forward-ported version of knfsd-avoid-nfsd-overload
which has been shipping in the SGI "Enhanced NFS" product since 2006.
It has been posted before:
http://article.gmane.org/gmane.linux.nfs/10374
Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The change to make xfrm_state objects hash on source address
broke the case where such source addresses are wildcarded.
Fix this by doing a two phase lookup, first with fully specified
source address, next using saddr wildcarded.
Reported-by: Nicolas Dichtel <nicolas.dichtel@dev.6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the transport isn't bound, then we should just return ENOTCONN, letting
call_connect_status() and/or call_status() deal with retrying. Currently,
we appear to abort all pending tasks with an EIO error.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We can Oops in both xs_udp_send_request() and xs_tcp_send_request() if the
call to xs_sendpages() returns an error due to the socket not yet being
set up.
Deal with that situation by returning a new error: ENOTSOCK, so that we
know to avoid dereferencing transport->sock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Do not try to "uninitialize" ipv6 if its initialization had been skipped
because module parameter disable=1 had been specified.
Reported-by: Thomas Backlund <tmb@mandriva.org>
Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Acked-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should probably not be testing any flags after we've cleared the
RPC_TASK_RUNNING flag, since rpc_make_runnable() is then free to assign the
rpc_task to another workqueue, which may then destroy it.
We can fix any races with rpc_make_runnable() by ensuring that we only
clear the RPC_TASK_RUNNING flag while holding the rpc_wait_queue->lock that
the task is supposed to be sleeping on (and then checking whether or not
the task really is sleeping).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
freq_diff is unsigned, so test before subtraction
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
As analyzed by Patrick McHardy, vlan needs to reset it's
netdev_ops pointer in it's ->init() function but this
leaves the compat method pointers stale.
Add a netdev_resync_ops() and call it from the vlan code.
Any other driver which changes ->netdev_ops after register_netdevice()
will need to call this new function after doing so too.
With help from Patrick McHardy.
Tested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
A commit c1b56878fb "tc: policing requires
a rate estimator" introduced a test which invalidates previously working
configs, based on examples from iproute2: doc/actions/actions-general.
This is too rigorous: a rate estimator is needed only when police's
"avrate" option is used.
Reported-by: Joao Correia <joaomiguelcorreia@gmail.com>
Diagnosed-by: John Dykstra <john.dykstra1@gmail.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change sctp_ctl_sock_init() to try IPv4 if IPv6 socket registration
fails. Required if the IPv6 module is loaded with "disable=1", else
SCTP will fail to load.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add "disable" module parameter support to ipv6.ko by specifying
"disable=1" on module load. We just do the minimum of initializing
inetsw6[] so calls from other modules to inet6_register_protosw()
won't OOPs, then bail out. No IPv6 addresses or sockets can be
created as a result, and a reboot is required to enable IPv6.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, modular tokenring ("tr") lacks a license and fails to load:
tr: module license 'unspecified' taints kernel.
tr: Unknown symbol proc_net_fops_create
Beacuse of this, no tokenring driver can load if it depends on modular
tr. Fix this by adding GPL module license as it is in the kernel.
With this fix, tr module loads fine and tms380 driver also loads. Well,
it does'nt work but that's a different bug.
Signed-off-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: David S. Miller <davem@davemloft.net>
The callers of netlink_set_err() currently pass a negative value
as parameter for the error code. However, sk->sk_err wants a
positive error value. Without this patch, skb_recv_datagram() called
by netlink_recvmsg() may return a positive value to report an error.
Another choice to fix this is to change callers to pass a positive
error value, but this seems a bit inconsistent and error prone
to me. Indeed, the callers of netlink_set_err() assumed that the
(usual) negative value for error codes was fine before this patch :).
This patch also includes some documentation in docbook format
for netlink_set_err() to avoid this sort of confusion.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It turns out that net_alive is unnecessary, and the original problem
that led to it being added was simply that the icmp code thought
it was a network device and wound up being unable to handle packets
while there were still packets in the network namespace.
Now that icmp and tcp have been fixed to properly register themselves
this problem is no longer present and we have a stronger guarantee
that packets will not arrive in a network namespace then that provided
by net_alive in netif_receive_skb. So remove net_alive allowing
packet reception run a little faster.
Additionally document the strong reason why network namespace cleanup
is safe so that if something happens again someone else will have
a chance of figuring it out.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To remove the possibility of packets flying around when network
devices are being cleaned up use reisger_pernet_subsys instead of
register_pernet_device.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recently I had a kernel panic in icmp_send during a network namespace
cleanup. There were packets in the arp queue that failed to be sent
and we attempted to generate an ICMP host unreachable message, but
failed because icmp_sk_exit had already been called.
The network devices are removed from a network namespace and their
arp queues are flushed before we do attempt to shutdown subsystems
so this error should have been impossible.
It turns out icmp_init is using register_pernet_device instead
of register_pernet_subsys. Which resulted in icmp being shut down
while we still had the possibility of packets in flight, making
a nasty NULL pointer deference in interrupt context possible.
Changing this to register_pernet_subsys fixes the problem in
my testing.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a network namespace is destroyed the network interfaces are
all unregistered, making addrconf_ifdown called by the netdevice
notifier.
In the other hand, the addrconf exit method does a loop on the network
devices and does addrconf_ifdown on each of them. But the ordering of
the netns subsystem is not right because it uses the register_pernet_device
instead of register_pernet_subsys. If we handle the loopback as
any network device, we can safely use register_pernet_subsys.
But if we use register_pernet_subsys, the addrconf exit method will do
exactly what was already done with the unregistering of the network
devices. So in definitive, this code is pointless.
I removed the netns addrconf exit method and moved the code to the
addrconf cleanup function.
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If ERROR chunk is received with too many error causes in ESTABLISHED
state, the kernel get panic.
This is because sctp limit the max length of cmds to 14, but while
ERROR chunk is received, one error cause will add around 2 cmds by
sctp_add_cmd_sf(). So many error causes will fill the limit of cmds
and panic.
This patch fixed the problem.
This bug can be test by SCTP Conformance Test Suite
<http://networktest.sourceforge.net/>.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An extra list_del() during the module load failure and unload
resulted in a crash with a list corruption. Now sctp can
be unloaded again.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's conflicting assumptions in shifting, the caller assumes
that dupsack results in S'ed skbs (or a part of it) for sure but
never gave a hint to tcp_sacktag_one when dsack is actually in
use. Thus DSACK retrans_out -= pcount was not taken and the
counter became out of sync. Remove obstacle from that information
flow to get DSACKs accounted in tcp_sacktag_one as expected.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Tested-by: Denys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netpoll entry checks are required to ensure that we don't
receive normal packets when invoked via netpoll. Unfortunately
it only ever worked for the netif_receive_skb/netif_rx entry
points. The VLAN (and subsequently GRO) entry point didn't
have the check and therefore can trigger all sorts of weird
problems.
This patch adds the netpoll check to all entry points.
I'm still uneasy with receiving at all under netpoll (which
apparently is only used by the out-of-tree kdump code). The
reason is it is perfectly legal to receive all data including
headers into highmem if netpoll is off, but if you try to do
that with netpoll on and someone gets a printk in an IRQ handler
you're going to get a nice BUG_ON.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
drr_change_class lacks a check for NULL of tca[TCA_OPTIONS], so oops
is possible.
Reported-by: Denys Fedoryschenko <denys@visp.net.lb>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We already have a valid net in that place, but this is not just a
cleanup - the tw pointer can be NULL there sometimes, thus causing
an oops in NET_NS=y case.
The same place in ipv4 code already works correctly using existing
net, rather than tw's one.
The bug exists since 2.6.27.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix regression introduded by commit 079aa88 (netfilter: xt_recent: IPv6 support):
From http://bugzilla.kernel.org/show_bug.cgi?id=12753:
Problem Description:
An uninitialized buffer causes IPv4 addresses added manually (via the +IP
command to the proc interface) to never match any packets. Similarly, the -IP
command fails to remove IPv4 addresses.
Details:
In the function recent_entry_lookup, the xt_recent module does comparisons of
the entire nf_inet_addr union value, both for IPv4 and IPv6 addresses. For
addresses initialized from actual packets the remaining 12 bytes not occupied
by the IPv4 are zeroed so this works correctly. However when setting the
nf_inet_addr addr variable in the recent_mt_proc_write function, only the IPv4
bytes are initialized and the remaining 12 bytes contain garbage.
Hence addresses added in this way never match any packets, unless these
uninitialized 12 bytes happened to be zero by coincidence. Similarly, addresses
cannot consistently be removed using the proc interface due to mismatch of the
garbage bytes (although it will sometimes work to remove an address that was
added manually).
Reading the /proc/net/xt_recent/ entries hides this problem because this only
uses the first 4 bytes when displaying IPv4 addresses.
Steps to reproduce:
$ iptables -I INPUT -m recent --rcheck -j LOG
$ echo +169.254.156.239 > /proc/net/xt_recent/DEFAULT
$ cat /proc/net/xt_recent/DEFAULT
src=169.254.156.239 ttl: 0 last_seen: 119910 oldest_pkt: 1 119910
[At this point no packets from 169.254.156.239 are being logged.]
$ iptables -I INPUT -s 169.254.156.239 -m recent --set
$ cat /proc/net/xt_recent/DEFAULT
src=169.254.156.239 ttl: 0 last_seen: 119910 oldest_pkt: 1 119910
src=169.254.156.239 ttl: 255 last_seen: 126184 oldest_pkt: 4 125434, 125684, 125934, 126184
[At this point, adding the address via an iptables rule, packets are being
logged correctly.]
$ echo -169.254.156.239 > /proc/net/xt_recent/DEFAULT
$ cat /proc/net/xt_recent/DEFAULT
src=169.254.156.239 ttl: 0 last_seen: 119910 oldest_pkt: 1 119910
src=169.254.156.239 ttl: 255 last_seen: 126992 oldest_pkt: 10 125434, 125684, 125934, 126184, 126434, 126684, 126934, 126991, 126991, 126992
$ echo -169.254.156.239 > /proc/net/xt_recent/DEFAULT
$ cat /proc/net/xt_recent/DEFAULT
src=169.254.156.239 ttl: 0 last_seen: 119910 oldest_pkt: 1 119910
src=169.254.156.239 ttl: 255 last_seen: 126992 oldest_pkt: 10 125434, 125684, 125934, 126184, 126434, 126684, 126934, 126991, 126991, 126992
[Removing the address via /proc interface failed evidently.]
Possible solutions:
- initialize the addr variable in recent_mt_proc_write
- compare only 4 bytes for IPv4 addresses in recent_entry_lookup
Signed-off-by: Patrick McHardy <kaber@trash.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
net: amend the fix for SO_BSDCOMPAT gsopt infoleak
netns: build fix for net_alloc_generic
The fix for CVE-2009-0676 (upstream commit df0bca04) is incomplete. Note
that the same problem of leaking kernel memory will reappear if someone
on some architecture uses struct timeval with some internal padding (for
example tv_sec 64-bit and tv_usec 32-bit) --- then, you are going to
leak the padded bytes to userspace.
Signed-off-by: Eugene Teo <eugeneteo@kernel.sg>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_alloc_generic was defined in #ifdef CONFIG_NET_NS, but used
unconditionally. Move net_alloc_generic out of #ifdef.
Signed-off-by: Clemens Noss <cnoss@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
netns: fix double free at netns creation
veth : add the set_mac_address capability
sunlance: Beyond ARRAY_SIZE of ib->btx_ring
sungem: another error printed one too early
ISDN: fix sc/shmem printk format warning
SMSC: timeout reaches -1
smsc9420: handle magic field of ethtool_eeprom
sundance: missing parentheses?
smsc9420: fix another postfixed timeout
wimax/i2400m: driver loads firmware v1.4 instead of v1.3
vlan: Update skb->mac_header in __vlan_put_tag().
cxgb3: Add support for PCI ID 0x35.
tcp: remove obsoleted comment about different passes
TG3: &&/|| confusion
ATM: misplaced parentheses?
net/mv643xx: don't disable the mib timer too early and lock properly
net/mv643xx: use GFP_ATOMIC while atomic
atl1c: Atheros L1C Gigabit Ethernet driver
net: Kill skb_truesize_check(), it only catches false-positives.
net: forcedeth: Fix wake-on-lan regression
The CIPSO protocol engine incorrectly stated that the FIPS-188 specification
could be found in the kernel's Documentation directory. This patch corrects
that by removing the comment and directing users to the FIPS-188 documented
hosted online. For the sake of completeness I've also included a link to the
CIPSO draft specification on the NetLabel website.
Thanks to Randy Dunlap for spotting the error and letting me know.
Signed-off-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: James Morris <jmorris@namei.org>
This patch fix a double free when a network namespace fails.
The previous code does a kfree of the net_generic structure when
one of the init subsystem initialization fails.
The 'setup_net' function does kfree(ng) and returns an error.
The caller, 'copy_net_ns', call net_free on error, and this one
calls kfree(net->gen), making this pointer freed twice.
This patch make the code symetric, the net_alloc does the net_generic
allocation and the net_free frees the net_generic.
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is obsolete since the passes got combined.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
When extensions were moved to the NFPROTO_UNSPEC wildcard in
ab4f21e6fb, they disappeared from the
procfs files.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
NFLOG timeout was computed in timer by doing:
flushtimeout*HZ/100
Default value of flushtimeout was HZ (for 1 second delay). This was
wrong for non 100HZ computer. This patch modify the default delay by
using 100 instead of HZ.
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
In NFLOG the per-rule qthreshold should overrides per-instance only
it is set. With current code, the per-rule qthreshold is 1 if not set
and it overrides the per-instance qthreshold.
This patch modifies the default xt_NFLOG threshold from 1 to
0. Thus a value of 0 means there is no per-rule setting and the instance
parameter has to apply.
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch fixes a trivial typo that was adding a new line at end of
the nf_log_packet() prefix. It also make the logging conditionnal by
adding a LOG_INVALID test.
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
A long time ago we had bugs, primarily in TCP, where we would modify
skb->truesize (for TSO queue collapsing) in ways which would corrupt
the socket memory accounting.
skb_truesize_check() was added in order to try and catch this error
more systematically.
However this debugging check has morphed into a Frankenstein of sorts
and these days it does nothing other than catch false-positives.
Signed-off-by: David S. Miller <davem@davemloft.net>
When a non-wimax interface is looked up by the stack, a bad pointer is
returned when the looked-up interface is not found in the list (of
registered WiMAX interfaces). This causes an oops in the caller when
trying to use the pointer.
Fix by properly setting the pointer to NULL if we don't exit from the
list_for_each() with a found entry.
Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In function sock_getsockopt() located in net/core/sock.c, optval v.val
is not correctly initialized and directly returned in userland in case
we have SO_BSDCOMPAT option set.
This dummy code should trigger the bug:
int main(void)
{
unsigned char buf[4] = { 0, 0, 0, 0 };
int len;
int sock;
sock = socket(33, 2, 2);
getsockopt(sock, 1, SO_BSDCOMPAT, &buf, &len);
printf("%x%x%x%x\n", buf[0], buf[1], buf[2], buf[3]);
close(sock);
}
Here is a patch that fix this bug by initalizing v.val just after its
declaration.
Signed-off-by: Clément Lecigne <clement.lecigne@netasq.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We try to find the correct outgoing interface for injected frames
based on the TA, but since this is a hack for hostapd 11w, restrict
the heuristic to AP mode interfaces. At some point we'll add the
ability to give an interface index in radiotap or so and just
remove this heuristic again.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: stable@kernel.org [2.6.28.x]
Signed-off-by: John W. Linville <linville@tuxdriver.com>