linux

Author	SHA1	Message	Date
Patrick McHardy	124dff01af	netfilter: don't reset nf_trace in nf_reset() Commit `130549fe` ("netfilter: reset nf_trace in nf_reset") added code to reset nf_trace in nf_reset(). This is wrong and unnecessary. nf_reset() is used in the following cases: - when passing packets up the the socket layer, at which point we want to release all netfilter references that might keep modules pinned while the packet is queued. nf_trace doesn't matter anymore at this point. - when encapsulating or decapsulating IPsec packets. We want to continue tracing these packets after IPsec processing. - when passing packets through virtual network devices. Only devices on that encapsulate in IPv4/v6 matter since otherwise nf_trace is not used anymore. Its not entirely clear whether those packets should be traced after that, however we've always done that. - when passing packets through virtual network devices that make the packet cross network namespace boundaries. This is the only cases where we clearly want to reset nf_trace and is also what the original patch intended to fix. Add a new function nf_reset_trace() and use it in dev_forward_skb() to fix this properly. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-05 15:38:10 -04:00
Pablo Neira Ayuso	12202fa757	netfilter: remove unneeded variable proc_net_netfilter Now that this supports net namespace for nflog and nfqueue, we can remove the global proc_net_netfilter which has no clients anymore. Based on patch from Gao feng. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-05 21:08:11 +02:00
Gao feng	e817961048	netfilter: nfnetlink_queue: add net namespace support for nfnetlink_queue This patch makes /proc/net/netfilter/nfnetlink_queue pernet. Moreover, there's a pernet instance table and lock. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-05 21:08:11 +02:00
Gao feng	5b023fc8d8	netfilter: enable per netns support for nf_loggers After this patch, all nf_loggers support net namespace. Still xt_LOG and ebt_log require syslog netns support. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-05 21:08:10 +02:00
Gao feng	9368a53c47	netfilter: nfnetlink_log: add net namespace support for nfnetlink_log This patch makes /proc/net/netfilter/nfnetlink_log pernet. Moreover, there's a pernet instance table and lock. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-05 21:08:01 +02:00
Gao feng	355430671a	netfilter: ipt_ULOG: add net namespace support for ipt_ULOG Add pernet support to ipt_ULOG by means of the new nf_log_set function added in (`30e0c6a` netfilter: nf_log: prepare net namespace support for loggers). This patch also make ulog_buffers and netlink socket nflognl per netns. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-05 21:03:43 +02:00
Gao feng	5b175b7ebd	netfilter: ebt_ulog: add net namespace support for ebt_ulog Add pernet support to ebt_ulog by means of the new nf_log_set function added in (`30e0c6a` netfilter: nf_log: prepare net namespace support for loggers). This patch also make ulog_buffers and netlink socket ebtulognl per netns. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-05 20:59:50 +02:00
Gao feng	69b34fb996	netfilter: xt_LOG: add net namespace support for xt_LOG Add pernet support to xt_LOG by means of the new nf_log_set function added in (`30e0c6a` netfilter: nf_log: prepare net namespace support for loggers). Since syslog ns has yet not been implemented, we don't want the containers to DDOS host's syslogd. So only enable ebt_log only from init_net and wait for syslog ns support Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-05 20:58:45 +02:00
Gao feng	7d2789246c	netfilter: ebt_log: add net namespace support for ebt_log Add pernet support to ebt_log by means of the new nf_log_set function added in (`30e0c6a` netfilter: nf_log: prepare net namespace support for loggers). Since syslog ns has yet not been implemented, we don't want the containers to DDOS host's syslogd. So only enable ebt_log only from init_net and wait for syslog ns support. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-05 20:57:27 +02:00
Gao feng	30e0c6a6be	netfilter: nf_log: prepare net namespace support for loggers This patch adds netns support to nf_log and it prepares netns support for existing loggers. It is composed of four major changes. 1) nf_log_register has been split to two functions: nf_log_register and nf_log_set. The new nf_log_register is used to globally register the nf_logger and nf_log_set is used for enabling pernet support from nf_loggers. Per netns is not yet complete after this patch, it comes in separate follow up patches. 2) Add net as a parameter of nf_log_bind_pf. Per netns is not yet complete after this patch, it only allows to bind the nf_logger to the protocol family from init_net and it skips other cases. 3) Adapt all nf_log_packet callers to pass netns as parameter. After this patch, this function only works for init_net. 4) Make the sysctl net/netfilter/nf_log pernet. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-05 20:12:54 +02:00
Gao feng	f3c1a44a22	netfilter: make /proc/net/netfilter pernet This patch makes this proc dentry pernet. So far only init_net had a /proc/net/netfilter directory. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-05 19:35:02 +02:00
Jiri Pirko	34e2ed34a0	net: ipv4: notify when address lifetime changes if userspace changes lifetime of address, send netlink notification and call notifier. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-05 00:51:12 -04:00
Eric W. Biederman	0e82e7f6df	af_unix: If we don't care about credentials coallesce all messages It was reported that the following LSB test case failed https://lsbbugs.linuxfoundation.org/attachment.cgi?id=2144 because we were not coallescing unix stream messages when the application was expecting us to. The problem was that the first send was before the socket was accepted and thus sock->sk_socket was NULL in maybe_add_creds, and the second send after the socket was accepted had a non-NULL value for sk->socket and thus we could tell the credentials were not needed so we did not bother. The unnecessary credentials on the first message cause unix_stream_recvmsg to start verifying that all messages had the same credentials before coallescing and then the coallescing failed because the second message had no credentials. Ignoring credentials when we don't care in unix_stream_recvmsg fixes a long standing pessimization which would fail to coallesce messages when reading from a unix stream socket if the senders were different even if we did not care about their credentials. I have tested this and verified that the in the LSB test case mentioned above that the messages do coallesce now, while the were failing to coallesce without this change. Reported-by: Karel Srot <ksrot@redhat.com> Reported-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-05 00:49:13 -04:00
Eric W. Biederman	25da0e3e9d	Revert "af_unix: dont send SCM_CREDENTIAL when dest socket is NULL" This reverts commit `14134f6584`. The problem that the above patch was meant to address is that af_unix messages are not being coallesced because we are sending unnecesarry credentials. Not sending credentials in maybe_add_creds totally breaks unconnected unix domain sockets that wish to send credentails to other sockets. In practice this break some versions of udev because they receive a message and the sending uid is bogus so they drop the message. Reported-by: Sven Joachim <svenjoac@gmx.de> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-05 00:49:03 -04:00
Vlad Yasevich	4543fbefe6	net: count hw_addr syncs so that unsync works properly. A few drivers use dev_uc_sync/unsync to synchronize the address lists from master down to slave/lower devices. In some cases (bond/team) a single address list is synched down to multiple devices. At the time of unsync, we have a leak in these lower devices, because "synced" is treated as a boolean and the address will not be unsynced for anything after the first device/call. Treat "synced" as a count (same as refcount) and allow all unsync calls to work. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-05 00:18:46 -04:00
David S. Miller	4f4ecd5f2a	Merge branch 'master' of git://1984.lsi.us.es/nf Pablo Neira Ayuso says: ==================== The following patchset contains netfilter updates for your net tree, they are: * Fix missing the skb->trace reset in nf_reset, noticed by Gao Feng while using the TRACE target with several net namespaces. * Fix prefix translation in IPv6 NPT if non-multiple of 32 prefixes are used, from Matthias Schiffer. * Fix invalid nfacct objects with empty name, they are now rejected with -EINVAL, spotted by Michael Zintakis, patch from myself. * A couple of fixes for wrong return values in the error path of nfnetlink_queue and nf_conntrack, from Wei Yongjun. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-04 17:41:53 -04:00
David S. Miller	518314ffe4	Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into wireless John W. Linville says: ==================== Here are some more fixes intended for the 3.9 stream... Regarding the mac80211 bits, Johannes says: "I had changed the idle handling to simplify it, but broken the sequencing of commands, at least for ath9k-htc, one patch restores the sequence. The other patch fixes a crash Jouni found while stress-testing the remain-on-channel code, when an item is deleted the work struct can run twice and crash the second time." As for the iwlwifi bits, Johannes says: "The only fix here is to the passive-no-RX firmware regulatory enforcement driver support code to not drop auth frames in quick succession, leading to not being able to connect to APs on passive channels in certain circumstances." Don't forget the NFC bits, about which Samuel says: "This time we have: - A crash fix for when a DGRAM LLCP socket is listening while the NFC adapter is physically removed. - A potential double skb free when the LLCP socket receive queue is full. - A fix for properly handling multiple and consecutive LLCP connections, and not trash the socket ack log. - A build failure for the MEI microread physical layer, now that the MEI bus APIs have been merged into char-misc-next." On top of that, Stone Piao provides an mwifiex fix to avoid accessing beyond the end of a buffer. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-04 17:39:06 -04:00
Jesper Dangaard Brouer	19952cc4f8	net: frag queue per hash bucket locking This patch implements per hash bucket locking for the frag queue hash. This removes two write locks, and the only remaining write lock is for protecting hash rebuild. This essentially reduce the readers-writer lock to a rebuild lock. This patch is part of "net: frag performance followup" http://thread.gmane.org/gmane.linux.network/263644 of which two patches have already been accepted: Same test setup as previous: (http://thread.gmane.org/gmane.linux.network/257155) Two 10G interfaces, on seperate NUMA nodes, are under-test, and uses Ethernet flow-control. A third interface is used for generating the DoS attack (with trafgen). Notice, I have changed the frag DoS generator script to be more efficient/deadly. Before it would only hit one RX queue, now its sending packets causing multi-queue RX, due to "better" RX hashing. Test types summary (netperf UDP_STREAM): Test-20G64K == 2x10G with 65K fragments Test-20G3F == 2x10G with 3x fragments (3*1472 bytes) Test-20G64K+DoS == Same as 20G64K with frag DoS Test-20G3F+DoS == Same as 20G3F with frag DoS Test-20G64K+MQ == Same as 20G64K with Multi-Queue frag DoS Test-20G3F+MQ == Same as 20G3F with Multi-Queue frag DoS When I rebased this-patch(03) (on top of net-next commit `a210576c`) and removed the _bh spinlock, I saw a performance regression. BUT this was caused by some unrelated change in-between. See tests below. Test (A) is what I reported before for patch-02, accepted in commit `1b5ab0de`. Test (B) verifying-retest of commit `1b5ab0de` corrospond to patch-02. Test (C) is what I reported before for this-patch Test (D) is net-next master HEAD (commit `a210576c`), which reveals some (unknown) performance regression (compared against test (B)). Test (D) function as a new base-test. Performance table summary (in Mbit/s): (#) Test-type: 20G64K 20G3F 20G64K+DoS 20G3F+DoS 20G64K+MQ 20G3F+MQ ---------- ------- ------- ---------- --------- -------- ------- (A) Patch-02 : 18848.7 13230.1 4103.04 5310.36 130.0 440.2 (B) `1b5ab0de` : 18841.5 13156.8 4101.08 5314.57 129.0 424.2 (C) Patch-03v1: 18838.0 13490.5 4405.11 6814.72 196.6 461.6 (D) `a210576c` : 18321.5 11250.4 3635.34 5160.13 119.1 405.2 (E) with _bh : 17247.3 11492.6 3994.74 6405.29 166.7 413.6 (F) without bh: 17471.3 11298.7 3818.05 6102.11 165.7 406.3 Test (E) and (F) is this-patch(03), with(V1) and without(V2) the _bh spinlocks. I cannot explain the slow down for 20G64K (but its an artificial "lab-test" so I'm not worried). But the other results does show improvements. And test (E) "with _bh" version is slightly better. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> ---- V2: - By analysis from Hannes Frederic Sowa and Eric Dumazet, we don't need the spinlock _bh versions, as Netfilter currently does a local_bh_disable() before entering inet_fragment. - Fold-in desc from cover-mail V3: - Drop the chain_len counter per hash bucket. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-04 17:37:05 -04:00
John W. Linville	9306a398e7	Merge branch 'for-john' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211	2013-04-03 14:19:48 -04:00
John W. Linville	407ad2b7ef	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem	2013-04-03 13:50:34 -04:00
Matthias Schiffer	906b1c394d	netfilter: ip6t_NPT: Fix translation for non-multiple of 32 prefix lengths The bitmask used for the prefix mangling was being calculated incorrectly, leading to the wrong part of the address being replaced when the prefix length wasn't a multiple of 32. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-03 12:24:56 +02:00
David S. Miller	d662483264	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull net into net-next to get the synchronize_net() bug fix in bonding. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-03 01:31:54 -04:00
Jacob Keller	8facd5fb73	net: fix smatch warnings inside datagram_poll Commit `7d4c04fc17` ("net: add option to enable error queue packets waking select") has an issue due to operator precedence causing the bit-wise OR to bind to the sock_flags call instead of the result of the terniary conditional. This fixes the *_poll functions to work properly. The old code results in "mask \|= POLLPRI" instead of what was intended, which is to only include POLLPRI when the socket option is enabled. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-02 16:59:16 -04:00
Reilly Grant	990454b5a4	VSOCK: Handle changes to the VMCI context ID. The VMCI context ID of a virtual machine may change at any time. There is a VMCI event which signals this but datagrams may be processed before this is handled. It is therefore necessary to be flexible about the destination context ID of any datagrams received. (It can be assumed to be correct because it is provided by the hypervisor.) The context ID on existing sockets should be updated to reflect how the hypervisor is currently referring to the system. Signed-off-by: Reilly Grant <grantr@vmware.com> Acked-by: Andy King <acking@vmware.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-02 14:39:17 -04:00
Balakumaran Kannan	25fb6ca4ed	net IPv6 : Fix broken IPv6 routing table after loopback down-up IPv6 Routing table becomes broken once we do ifdown, ifup of the loopback(lo) interface. After down-up, routes of other interface's IPv6 addresses through 'lo' are lost. IPv6 addresses assigned to all interfaces are routed through 'lo' for internal communication. Once 'lo' is down, those routing entries are removed from routing table. But those removed entries are not being re-created properly when 'lo' is brought up. So IPv6 addresses of other interfaces becomes unreachable from the same machine. Also this breaks communication with other machines because of NDISC packet processing failure. This patch fixes this issue by reading all interface's IPv6 addresses and adding them to IPv6 routing table while bringing up 'lo'. ==Testing== Before applying the patch: $ route -A inet6 Kernel IPv6 routing table Destination Next Hop Flag Met Ref Use If 2000::20/128 :: U 256 0 0 eth0 fe80::/64 :: U 256 0 0 eth0 ::/0 :: !n -1 1 1 lo ::1/128 :: Un 0 1 0 lo 2000::20/128 :: Un 0 1 0 lo fe80::xxxx:xxxx:xxxx:xxxx/128 :: Un 0 1 0 lo ff00::/8 :: U 256 0 0 eth0 ::/0 :: !n -1 1 1 lo $ sudo ifdown lo $ sudo ifup lo $ route -A inet6 Kernel IPv6 routing table Destination Next Hop Flag Met Ref Use If 2000::20/128 :: U 256 0 0 eth0 fe80::/64 :: U 256 0 0 eth0 ::/0 :: !n -1 1 1 lo ::1/128 :: Un 0 1 0 lo ff00::/8 :: U 256 0 0 eth0 ::/0 :: !n -1 1 1 lo $ After applying the patch: $ route -A inet6 Kernel IPv6 routing table Destination Next Hop Flag Met Ref Use If 2000::20/128 :: U 256 0 0 eth0 fe80::/64 :: U 256 0 0 eth0 ::/0 :: !n -1 1 1 lo ::1/128 :: Un 0 1 0 lo 2000::20/128 :: Un 0 1 0 lo fe80::xxxx:xxxx:xxxx:xxxx/128 :: Un 0 1 0 lo ff00::/8 :: U 256 0 0 eth0 ::/0 :: !n -1 1 1 lo $ sudo ifdown lo $ sudo ifup lo $ route -A inet6 Kernel IPv6 routing table Destination Next Hop Flag Met Ref Use If 2000::20/128 :: U 256 0 0 eth0 fe80::/64 :: U 256 0 0 eth0 ::/0 :: !n -1 1 1 lo ::1/128 :: Un 0 1 0 lo 2000::20/128 :: Un 0 1 0 lo fe80::xxxx:xxxx:xxxx:xxxx/128 :: Un 0 1 0 lo ff00::/8 :: U 256 0 0 eth0 ::/0 :: !n -1 1 1 lo $ Signed-off-by: Balakumaran Kannan <Balakumaran.Kannan@ap.sony.com> Signed-off-by: Maruthi Thotad <Maruthi.Thotad@ap.sony.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-02 14:37:19 -04:00
Paul Gortmaker	5e404cd658	ipconfig: add informative timeout messages while waiting for carrier Commit `3fb72f1e6e` ("ipconfig wait for carrier") added a "wait for carrier on at least one interface" policy, with a worst case maximum wait of two minutes. However, if you encounter this, you won't get any feedback from the console as to the nature of what is going on. You just see the booting process hang for two minutes and then continue. Here we add a message so the user knows what is going on, and hence can take action to rectify the situation (e.g. fix network cable or whatever.) After the 1st 10s pause, output now begins that looks like this: Waiting up to 110 more seconds for network. Waiting up to 100 more seconds for network. Waiting up to 90 more seconds for network. Waiting up to 80 more seconds for network. ... Since most systems will have no problem getting link/carrier in the 1st 10s, the only people who will see these messages are people with genuine issues that need to be resolved. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-02 14:35:33 -04:00
Vasily Averin	f0f6ee1f70	cbq: incorrect processing of high limits currently cbq works incorrectly for limits > 10% real link bandwidth, and practically does not work for limits > 50% real link bandwidth. Below are results of experiments taken on 1 Gbit link In shaper \| Actual Result -----------+--------------- 100M \| 108 Mbps 200M \| 244 Mbps 300M \| 412 Mbps 500M \| 893 Mbps This happen because of q->now changes incorrectly in cbq_dequeue(): when it is called before real end of packet transmitting, L2T is greater than real time delay, q_now gets an extra boost but never compensate it. To fix this problem we prevent change of q->now until its synchronization with real time. Signed-off-by: Vasily Averin <vvs@openvz.org> Reviewed-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-02 14:29:20 -04:00
Eric Dumazet	4a0b5ec12f	bridge: remove a redundant synchronize_net() commit `00cfec3748` (net: add a synchronize_net() in netdev_rx_handler_unregister()) allows us to remove the synchronized_net() call from del_nbp() Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Veaceslav Falico <vfalico@redhat.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-02 12:10:58 -04:00
holger@eitzenberger.org	5c33448c40	netfilter: xt_NFQUEUE: coalesce IPv4 and IPv6 hashing Because rev1 and rev3 of the target share the same hashing generalize it by introduing nfqueue_hash(). Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-02 01:26:10 +02:00
holger@eitzenberger.org	8746ddcf12	netfilter: xt_NFQUEUE: introduce CPU fanout Current NFQUEUE target uses a hash, computed over source and destination address (and other parameters), for steering the packet to the actual NFQUEUE. This, however forgets about the fact that the packet eventually is handled by a particular CPU on user request. If E. g. 1) IRQ affinity is used to handle packets on a particular CPU already (both single-queue or multi-queue case) and/or 2) RPS is used to steer packets to a specific softirq the target easily chooses an NFQUEUE which is not handled by a process pinned to the same CPU. The idea is therefore to use the CPU index for determining the NFQUEUE handling the packet. E. g. when having a system with 4 CPUs, 4 MQ queues and 4 NFQUEUEs it looks like this: +-----+ +-----+ +-----+ +-----+ \|NFQ#0\| \|NFQ#1\| \|NFQ#2\| \|NFQ#3\| +-----+ +-----+ +-----+ +-----+ ^ ^ ^ ^ \| \|NFQUEUE \| \| + + + + +-----+ +-----+ +-----+ +-----+ \|rx-0 \| \|rx-1 \| \|rx-2 \| \|rx-3 \| +-----+ +-----+ +-----+ +-----+ The NFQUEUEs not necessarily have to start with number 0, setups with less NFQUEUEs than packet-handling CPUs are not a problem as well. This patch extends the NFQUEUE target to accept a new NFQ_FLAG_CPU_FANOUT flag. If this is specified the target uses the CPU index for determining the NFQUEUE being used. I have to introduce rev3 for this. The 'flags' are folded into _v2 'bypass'. By changing the way which queue is assigned, I'm able to improve the performance if the processes reading on the NFQUEUs are pinned correctly. Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-02 01:25:44 +02:00
Gao feng	f016588861	netfilter: use IS_ENABLE to replace if defined in TRACE target Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-02 01:03:45 +02:00
Julian Anastasov	ac69269a45	ipvs: do not disable bh for long time We used a global BH disable in LOCAL_OUT hook. Add _bh suffix to all places that need it and remove the disabling from LOCAL_OUT and sync code. Functions like ip_defrag need protection from BH, so add it. As for nf_nat_mangle_tcp_packet, it needs RCU lock. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:58 +02:00
Julian Anastasov	ceec4c3816	ipvs: convert services to rcu This is the final step in RCU conversion. Things that are removed: - svc->usecnt: now svc is accessed under RCU read lock - svc->inc: and some unused code - ip_vs_bind_pe and ip_vs_unbind_pe: no ability to replace PE - __ip_vs_svc_lock: replaced with RCU - IP_VS_WAIT_WHILE: now readers lookup svcs and dests under RCU and work in parallel with configuration Other changes: - before now, a RCU read-side critical section included the calling of the schedule method, now it is extended to include service lookup - ip_vs_svc_table and ip_vs_svc_fwm_table are now using hlist - svc->pe and svc->scheduler remain to the end (of grace period), the schedulers are prepared for such RCU readers even after done_service is called but they need to use synchronize_rcu because last ip_vs_scheduler_put can happen while RCU read-side critical sections use an outdated svc->scheduler pointer - as planned, update_service is removed - empty services can be freed immediately after grace period. If dests were present, the services are freed from the dest trash code Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:58 +02:00
Julian Anastasov	413c2d04e9	ipvs: convert dests to rcu In previous commits the schedulers started to access svc->destinations with _rcu list traversal primitives because the IP_VS_WAIT_WHILE macro still plays the role of grace period. Now it is time to finish the updating part, i.e. adding and deleting of dests with _rcu suffix before removing the IP_VS_WAIT_WHILE in next commit. We use the same rule for conns as for the schedulers: dests can be searched in RCU read-side critical section where ip_vs_dest_hold can be called by ip_vs_bind_dest. Some things are not perfect, for example, calling functions like ip_vs_lookup_dest from updating code under RCU, just because we use some function both from reader and from updater. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:57 +02:00
Julian Anastasov	ba3a3ce14e	ipvs: convert sched_lock to spin lock As all read_locks are gone spin lock is preferred. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:56 +02:00
Julian Anastasov	ed3ffc4e48	ipvs: do not expect result from done_service This method releases the scheduler state, it can not fail. Such change will help to properly replace the scheduler in following patch. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:56 +02:00
Julian Anastasov	578bc3ef1e	ipvs: reorganize dest trash All dests will go to trash, no exceptions. But we have to use new list node t_list for this, due to RCU changes in following patches. Dests will wait there initial grace period and later all conns and schedulers to put their reference. The dests don't get reference for staying in dest trash as before. As result, we do not load ip_vs_dest_put with extra checks for last refcnt and the schedulers do not need to play games with atomic_inc_not_zero while selecting best destination. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:55 +02:00
Julian Anastasov	08cb2d032f	ipvs: convert wrr scheduler to rcu The schedule method now needs _rcu list-traversal primitive for svc->destinations. As the weight for some dest can be reduced during dest selection, change the algorithm to check weights by using minimum weights in the 1 .. max_weight-(di-1) range, with the same step (di). By this way we ensure that there will be always a weight >= 1 check before claiming that all destinations are overloaded. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:54 +02:00
Julian Anastasov	b310faad3e	ipvs: convert wlc scheduler to rcu The schedule method now needs _rcu list-traversal primitive for svc->destinations. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:54 +02:00
Julian Anastasov	1acb7f6761	ipvs: convert sh scheduler to rcu Use the 3 new methods to reassign dests. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:53 +02:00
Julian Anastasov	9be52aba7a	ipvs: convert sed scheduler to rcu The schedule method now needs _rcu list-traversal primitive for svc->destinations. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:52 +02:00
Julian Anastasov	c0d0c0a1c0	ipvs: convert rr scheduler to rcu The schedule method now needs _rcu list-traversal primitive for svc->destinations. As the previous entry could be unlinked, limit the list traversals to 2 when lookup started from previous entry. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:52 +02:00
Julian Anastasov	f92ea8f096	ipvs: convert nq scheduler to rcu The schedule method now needs _rcu list-traversal primitive for svc->destinations. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:51 +02:00
Julian Anastasov	4ebd288b69	ipvs: convert lc scheduler to rcu The schedule method now needs _rcu list-traversal primitive for svc->destinations. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:50 +02:00
Julian Anastasov	c5549571f9	ipvs: convert lblcr scheduler to rcu The schedule method now needs _rcu list-traversal primitive for svc->destinations. The read_lock for sched_lock is removed. The set.lock is removed because now it is used in rare cases, mostly under sched_lock. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:50 +02:00
Julian Anastasov	c2a4ffb70e	ipvs: convert lblc scheduler to rcu The schedule method now needs _rcu list-traversal primitive for svc->destinations. The read_lock for sched_lock is removed. Use a dead flag to prevent new entries to be created while scheduler is reclaimed. Use hlist for the hash table. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:49 +02:00
Julian Anastasov	8f3d0023b9	ipvs: convert dh scheduler to rcu Use the new add_dest and del_dest methods to reassign dests. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:49 +02:00
Julian Anastasov	fca9c20ae1	ipvs: add ip_vs_dest_hold and ip_vs_dest_put ip_vs_dest_hold will be used under RCU lock while ip_vs_dest_put can be called even after dest is removed from service, as it happens for conns and some schedulers. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:48 +02:00
Julian Anastasov	6b6df46663	ipvs: preparations for using rcu in schedulers Allow schedulers to use rcu_dereference when returning destination on lookup. The RCU read-side critical section will allow ip_vs_bind_dest to get dest refcnt as preparation for the step where destinations will be deleted without an IP_VS_WAIT_WHILE guard that holds the packet processing during update. Add new optional scheduler methods add_dest, del_dest and upd_dest. For now the methods are called together with update_service but update_service will be removed in a following change. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:47 +02:00
Julian Anastasov	71dfa982f1	ipvs: change ip_vs_sched_lock to mutex The global list with schedulers ip_vs_schedulers is accessed only from user context - configuration and scheduler module [un]registration. Use ip_vs_sched_mutex instead. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:47 +02:00
Julian Anastasov	9a05475ceb	ipvs: avoid kmem_cache_zalloc in ip_vs_conn_new We have many fields to set and few to reset, use kmem_cache_alloc instead to save some cycles. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:46 +02:00
Julian Anastasov	1845ed0bb2	ipvs: reorder keys in connection structure __ip_vs_conn_in_get and ip_vs_conn_out_get are hot places. Optimize them, so that ports are matched first. By moving net and fwmark below, on 32-bit arch we can fit caddr in 32-byte cache line and all addresses in 64-byte cache line. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:45 +02:00
Julian Anastasov	088339a57d	ipvs: convert connection locking Convert __ip_vs_conntbl_lock_array as follows: - readers that do not modify conn lists will use RCU lock - updaters that modify lists will use spinlock_t Now for conn lookups we will use RCU read-side critical section. Without using __ip_vs_conn_get such places have access to connection fields and can dereference some pointers like pe and pe_data plus the ability to update timer expiration. If full access is required we contend for reference. We add barrier in __ip_vs_conn_put, so that other CPUs see the refcnt operation after other writes. With the introduction of ip_vs_conn_unlink() we try to reorganize ip_vs_conn_expire(), so that unhashing of connections that should stay more time is avoided, even if it is for very short time. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:45 +02:00
Julian Anastasov	60b6aa3b31	ipvs: convert locks used in persistence engines Allow the readers to use RCU lock and for PE module registrations use global mutex instead of spinlock. All PE modules need to use synchronize_rcu in their module exit handler. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:44 +02:00
Julian Anastasov	276472eae0	ipvs: remove rs_lock by using RCU rs_lock was used to protect rs_table (hash table) from updaters (under global mutex) and readers (packet handlers). We can remove rs_lock by using RCU lock for readers. Reclaiming dest only with kfree_rcu is enough because the readers access only fields from the ip_vs_dest structure. Use hlist for rs_table. As we are now using hlist_del_rcu, introduce in_rs_table flag as replacement for the list_empty checks which do not work with RCU. It is needed because only NAT dests are in the rs_table. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:43 +02:00
Julian Anastasov	363c97d743	ipvs: convert app locks We use locks like tcp_app_lock, udp_app_lock, sctp_app_lock to protect access to the protocol hash tables from readers in packet context while the application instances (inc) are [un]registered under global mutex. As the hash tables are mostly read when conns are created and bound to app, use RCU for readers and reclaim app instance after grace period. Simplify ip_vs_app_inc_get because we use usecnt only for statistics and rely on module refcounting. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:43 +02:00
Julian Anastasov	026ace060d	ipvs: optimize dst usage for real server Currently when forwarding requests to real servers we use dst_lock and atomic operations when cloning the dst_cache value. As the dst_cache value does not change most of the time it is better to use RCU and to lock dst_lock only when we need to replace the obsoleted dst. For this to work we keep dst_cache in new structure protected by RCU. For packets to remote real servers we will use noref version of dst_cache, it will be valid while we are in RCU read-side critical section because now dst_release for replaced dsts will be invoked after the grace period. Packets to local real servers that are passed to local stack with NF_ACCEPT need a dst clone. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:42 +02:00
Julian Anastasov	4115ded131	ipvs: consolidate all dst checks on transmit in one place Consolidate the PMTU checks, ICMP sending and skb_dst modification in __ip_vs_get_out_rt and __ip_vs_get_out_rt_v6. Now skb_dst is changed early to simplify the transmitters. Make sure update_pmtu is called only for local clients. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:41 +02:00
Julian Anastasov	f11cb2c2aa	ipvs: do not use skb_share_check We run in contexts like ip_rcv, ipv6_rcv, br_handle_frame, do not expect shared skbs. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:41 +02:00
Julian Anastasov	183dce554a	ipvs: no need to reroute anymore on DNAT over loopback After commit `70e7341673` (ipv4: Show that ip_send_reply() is purely unicast routine.) we do not need to reroute DNAT-ed traffic over loopback because reply uses iph daddr and not rt_spec_dst. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:40 +02:00
Julian Anastasov	d1deae4d3a	ipvs: rename functions related to dst_cache reset Move and give better names to two functions: - ip_vs_dst_reset to __ip_vs_dst_cache_reset - __ip_vs_dev_reset to ip_vs_forget_dev Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:39 +02:00
Julian Anastasov	b8abdf0984	ipvs: convert the IP_VS_XMIT macros to functions It was a bad idea to hide return statements in macros. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:39 +02:00
Julian Anastasov	313eae637f	ipvs: prefer NETDEV_DOWN event to free cached dsts The real server becomes unreachable on down event, no need to wait device unregistration. Should help in releasing dsts early before dst->dev is replaced with lo. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:38 +02:00
Julian Anastasov	c90558dae5	ipvs: avoid routing by TOS for real server Avoid replacing the cached route for real server on every packet with different TOS. I doubt that routing by TOS for real server is used at all, so we should be better with such optimization. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:23:37 +02:00
Julian Anastasov	932bc4d7a5	net: add skb_dst_set_noref_force Rename skb_dst_set_noref to __skb_dst_set_noref and add force flag as suggested by David Miller. The new wrapper skb_dst_set_noref_force will force dst entries that are not cached to be attached as skb dst without taking reference as long as provided dst is reclaimed after RCU grace period. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Simon Horman <horms@verge.net.au>	2013-04-02 00:22:53 +02:00
John W. Linville	b0bb9b392d	This is the 2nd batch of NFC fixes for 3.9. This time we have: - A crash fix for when a DGRAM LLCP socket is listening while the NFC adapter is physically removed. - A potential double skb free when the LLCP socket receive queue is full. - A fix for properly handling multiple and consecutive LLCP connections, and not trash the socket ack log. - A build failure for the MEI microread physical layer, now that the MEI bus APIs have been merged into char-misc-next. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJRWMDxAAoJEIqAPN1PVmxKnaIP/iWPgCdp7xZGtiT8R/bf0f2I hpWtCnYgwyvqEyUqq6+hJ+A4J6DJH/vXLqO0hJIZ6lPuATdxpba9u4mWaoIuU+V5 4khspaWnTE7we72fVNhpT6Ms2ZZubtFynqbmPuxvu9vmk8JCMkJLfH8InQ4OGI0v mScDCByIq9xZiv+q/R2QO9vh6k3t5sJyAZONf7uuQhoJ+AdlrXNsvHPki+gfo+6T dLhgVGqytEDGKkh0AZLGc5Ss1MJFQsmTh76A4wstcJfLCisBtbhM0sZ8zWn6wmzQ KCq5YawbowuA12RYgb+cwxXV7H167lDYJxILZuK3WqL0nm0H3Z0jkakcV7F+IVNu iyh1hRqjdsgThl7aBF1VEER6Scum4WbEDXY5abCpX9l8Jbtk3+iQJWYpy8ldKEnL KqDiZnm6WC5CrcM1GQo4Uy46UPE0IlHPxupfQ3riuFjaFKyPd1x1ok1xUngdOXjn JYpxfMLVSsRBZ2y6tpDhm0r5oRHGmmJmKTzhMPKP+4+CVG8z0vwUQfydgaDlaFRa F3U7Av70vyCGt+Cr0tAXppvoXng1TSUKPHPTHjH+pEHjfVA9l1mAU5AYQ7JDB9aZ fccfkd5UG0f6ApFMhknI8kkISBN5KseNuuSAoQc4WV9jUVKk9mMifAz/Aj9MVYWx m6M8KzocEudWwM8y/BjE =ABey -----END PGP SIGNATURE----- Merge tag 'nfc-fixes-3.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-fixes Samuel Ortiz <sameo@linux.intel.com> says: "This is the 2nd batch of NFC fixes for 3.9. This time we have: - A crash fix for when a DGRAM LLCP socket is listening while the NFC adapter is physically removed. - A potential double skb free when the LLCP socket receive queue is full. - A fix for properly handling multiple and consecutive LLCP connections, and not trash the socket ack log. - A build failure for the MEI microread physical layer, now that the MEI bus APIs have been merged into char-misc-next." Signed-off-by: John W. Linville <linville@tuxdriver.com>	2013-04-01 15:14:22 -04:00
John W. Linville	95bc6b8461	Merge branch 'for-john' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211	2013-04-01 15:09:28 -04:00
David S. Miller	a210576cf8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: net/mac80211/sta_info.c net/wireless/core.h Two minor conflicts in wireless. Overlapping additions of extern declarations in net/wireless/core.h and a bug fix overlapping with the addition of a boolean parameter to __ieee80211_key_free(). Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-01 13:36:50 -04:00
Linus Torvalds	ff3421dee6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) sadb_msg prepared for IPSEC userspace forgets to initialize the satype field, fix from Nicolas Dichtel. 2) Fix mac80211 synchronization during station removal, from Johannes Berg. 3) Fix IPSEC sequence number notifications when they wrap, from Steffen Klassert. 4) Fix cfg80211 wdev tracing crashes when add_virtual_intf() returns an error pointer, from Johannes Berg. 5) In mac80211, don't call into the channel context code with the interface list mutex held. From Johannes Berg. 6) In mac80211, if we don't actually associate, do not restart the STA timer, otherwise we can crash. From Ben Greear. 7) Missing dma_mapping_error() check in e1000, ixgb, and e1000e. From Christoph Paasch. 8) Fix sja1000 driver defines to not conflict with SH port, from Marc Kleine-Budde. 9) Don't call il4965_rs_use_green with a NULL station, from Colin Ian King. 10) Suspend/Resume in the FEC driver fail because the buffer descriptors are not initialized at all the moments in which they should. Fix from Frank Li. 11) cpsw and davinci_emac drivers both use the wrong interface to restart a stopped TX queue. Use netif_wake_queue not netif_start_queue, the latter is for initialization/bringup not active management of the queue. From Mugunthan V N. 12) Fix regression in rate calculations done by psched_ratecfg_precompute(), missing u64 type promotion. From Sergey Popovich. 13) Fix length overflow in tg3 VPD parsing, from Kees Cook. 14) AOE driver fails to allocate enough headroom, resulting in crashes. Fix from Eric Dumazet. 15) RX overflow happens too quickly in sky2 driver because pause packet thresholds are not programmed correctly. From Mirko Lindner. 16) Bonding driver manages arp_interval and miimon settings incorrectly, disabling one unintentionally disables both. Fix from Nikolay Aleksandrov. 17) smsc75xx drivers don't program the RX mac properly for jumbo frames. Fix from Steve Glendinning. 18) Fix off-by-one in Codel packet scheduler. From Vijay Subramanian. 19) Fix packet corruption in atl1c by disabling MSI support, from Hannes Frederic Sowa. 20) netdev_rx_handler_unregister() needs a synchronize_net() to fix crashes in bonding driver unload stress tests. From Eric Dumazet. 21) rxlen field of ks8851 RX packet descriptors not interpreted correctly (it is 12 bits not 16 bits, so needs to be masked after shifting the 32-bit value down 16 bits). Fix from Max Nekludov. 22) Fix missed RX/TX enable in sh_eth driver due to mishandling of link change indications. From Sergei Shtylyov. 23) Fix crashes during spurious ECI interrupts in sh_eth driver, also from Sergei Shtylyov. 24) dm9000 driver initialization is done wrong for revision B devices with DSP PHY, from Joseph CHANG. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (53 commits) DM9000B: driver initialization upgrade sh_eth: make 'link' field of 'struct sh_eth_private' int sh_eth: workaround for spurious ECI interrupt sh_eth: fix handling of no LINK signal ks8851: Fix interpretation of rxlen field. net: add a synchronize_net() in netdev_rx_handler_unregister() MAINTAINERS: Update netxen_nic maintainers list atl1e: drop pci-msi support because of packet corruption net: fq_codel: Fix off-by-one error net: calxedaxgmac: Wake-on-LAN fixes net: calxedaxgmac: fix rx ring handling when OOM net: core: Remove redundant call to 'nf_reset' in 'dev_forward_skb' smsc75xx: fix jumbo frame support net: fix the use of this_cpu_ptr bonding: fix disabling of arp_interval and miimon ipv6: don't accept node local multicast traffic from the wire sky2: Threshold for Pause Packet is set wrong sky2: Receive Overflows not counted aoe: reserve enough headroom on skbs line up comment for ndo_bridge_getlink ...	2013-04-01 08:06:30 -07:00
Keller, Jacob E	7d4c04fc17	net: add option to enable error queue packets waking select Currently, when a socket receives something on the error queue it only wakes up the socket on select if it is in the "read" list, that is the socket has something to read. It is useful also to wake the socket if it is in the error list, which would enable software to wait on error queue packets without waking up for regular data on the socket. The main use case is for receiving timestamped transmit packets which return the timestamp to the socket via the error queue. This enables an application to select on the socket for the error queue only instead of for the regular traffic. -v2- * Added the SO_SELECT_ERR_QUEUE socket option to every architechture specific file * Modified every socket poll function that checks error queue Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Cc: Jeffrey Kirsher <jeffrey.t.kirsher@intel.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Matthew Vick <matthew.vick@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-31 19:44:20 -04:00
Artem Savkov	90e0970f87	cfg80211: sched_scan_mtx lock in cfg80211_conn_work() Introduced in `f9f475292d` ("cfg80211: always check for scan end on P2P device") cfg80211_conn_scan() which requires sched_scan_mtx to be held can be called from cfg80211_conn_work(). Without this we are hitting multiple warnings like the following: WARNING: at net/wireless/sme.c:88 cfg80211_conn_scan+0x1dc/0x3a0 [cfg80211]() Hardware name: 0578A21 Modules linked in: ... Pid: 620, comm: kworker/3:1 Not tainted 3.9.0-rc4-next-20130328+ #326 Call Trace: [<c1036992>] warn_slowpath_common+0x72/0xa0 [<c10369e2>] warn_slowpath_null+0x22/0x30 [<faa4b0ec>] cfg80211_conn_scan+0x1dc/0x3a0 [cfg80211] [<faa4b344>] cfg80211_conn_do_work+0x94/0x380 [cfg80211] [<faa4c3b2>] cfg80211_conn_work+0xa2/0x130 [cfg80211] [<c1051858>] process_one_work+0x198/0x450 Signed-off-by: Artem Savkov <artem.savkov@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2013-03-30 22:33:19 +01:00
Antonio Quartulli	537aadc3a4	ip_gre: don't overwrite iflink during net_dev init iflink is currently set to 0 in __gre_tunnel_init(). This function is invoked in gre_tap_init() and ipgre_tunnel_init() which are both used to initialise the ndo_init field of the respective net_device_ops structs (ipgre.. and gre_tap..) used by GRE interfaces. However, in netdevice_register() iflink is first set to -1, then ndo_init is invoked and then iflink is assigned to a proper value if and only if it still was -1. Assigning 0 to iflink in ndo_init is therefore first preventing netdev_register() to correctly assign it a proper value and then breaking iflink at all since 0 has not correct meaning. Fix this by removing the iflink assignment in __gre_tunnel_init(). Introduced by `c544193214` ("GRE: Refactor GRE tunneling code.") Reported-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Pravin B Shelar <pshelar@nicira.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Antonio Quartulli <ordex@autistici.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-30 17:28:33 -04:00
John W. Linville	9a574cd67a	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless Conflicts: net/mac80211/sta_info.c net/wireless/core.h	2013-03-29 16:41:36 -04:00
Eric Dumazet	00cfec3748	net: add a synchronize_net() in netdev_rx_handler_unregister() commit `35d48903e9` (bonding: fix rx_handler locking) added a race in bonding driver, reported by Steven Rostedt who did a very good diagnosis : <quoting Steven> I'm currently debugging a crash in an old 3.0-rt kernel that one of our customers is seeing. The bug happens with a stress test that loads and unloads the bonding module in a loop (I don't know all the details as I'm not the one that is directly interacting with the customer). But the bug looks to be something that may still be present and possibly present in mainline too. It will just be much harder to trigger it in mainline. In -rt, interrupts are threads, and can schedule in and out just like any other thread. Note, mainline now supports interrupt threads so this may be easily reproducible in mainline as well. I don't have the ability to tell the customer to try mainline or other kernels, so my hands are somewhat tied to what I can do. But according to a core dump, I tracked down that the eth irq thread crashed in bond_handle_frame() here: slave = bond_slave_get_rcu(skb->dev); bond = slave->bond; <--- BUG the slave returned was NULL and accessing slave->bond caused a NULL pointer dereference. Looking at the code that unregisters the handler: void netdev_rx_handler_unregister(struct net_device *dev) { ASSERT_RTNL(); RCU_INIT_POINTER(dev->rx_handler, NULL); RCU_INIT_POINTER(dev->rx_handler_data, NULL); } Which is basically: dev->rx_handler = NULL; dev->rx_handler_data = NULL; And looking at __netif_receive_skb() we have: rx_handler = rcu_dereference(skb->dev->rx_handler); if (rx_handler) { if (pt_prev) { ret = deliver_skb(skb, pt_prev, orig_dev); pt_prev = NULL; } switch (rx_handler(&skb)) { My question to all of you is, what stops this interrupt from happening while the bonding module is unloading? What happens if the interrupt triggers and we have this: CPU0 CPU1 ---- ---- rx_handler = skb->dev->rx_handler netdev_rx_handler_unregister() { dev->rx_handler = NULL; dev->rx_handler_data = NULL; rx_handler() bond_handle_frame() { slave = skb->dev->rx_handler; bond = slave->bond; <-- NULL pointer dereference!!! What protection am I missing in the bond release handler that would prevent the above from happening? </quoting Steven> We can fix bug this in two ways. First is adding a test in bond_handle_frame() and others to check if rx_handler_data is NULL. A second way is adding a synchronize_net() in netdev_rx_handler_unregister() to make sure that a rcu protected reader has the guarantee to see a non NULL rx_handler_data. The second way is better as it avoids an extra test in fast path. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jiri Pirko <jpirko@redhat.com> Cc: Paul E. McKenney <paulmck@us.ibm.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-29 15:38:15 -04:00
Vijay Subramanian	cd68ddd4c2	net: fq_codel: Fix off-by-one error Currently, we hold a max of sch->limit -1 number of packets instead of sch->limit packets. Fix this off-by-one error. Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-29 15:32:23 -04:00
John Fastabend	91f3e7b174	net: rtnetlink: fdb dflt dump must set idx used for cb->arg[0] In rtnl_fdb_dump() when the fdb_dump ndo op is not populated we never set the idx value so that cb->arg[0] is always 0. Resulting in a endless loop of messages. Introduced with this commit, commit `090096bf3d` Author: Vlad Yasevich <vyasevic@redhat.com> Date: Wed Mar 6 15:39:42 2013 +0000 net: generic fdb support for drivers without ndo_fdb_<op> CC: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-29 15:26:27 -04:00
Shmulik Ladkani	a561cf7edf	net: core: Remove redundant call to 'nf_reset' in 'dev_forward_skb' 'nf_reset' is called just prior calling 'netif_rx'. No need to call it twice. Reported-by: Igor Michailov <rgohita@gmail.com> Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-29 15:25:28 -04:00
Pravin B Shelar	54a5d38289	ip_tunnel: Fix off-by-one error in forming dev name. As Ben pointed out following patch fixes bug in checking device name length limits while forming tunnel device name. CC: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-29 15:24:28 -04:00
Li RongQing	278150321a	net: simplify the getting percpu of flow_cache replace per_cpu with per_cpu_ptr to save conversion between address and pointer Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-29 15:14:46 -04:00
Li RongQing	50eab0503a	net: fix the use of this_cpu_ptr flush_tasklet is not percpu var, and percpu is percpu var, and this_cpu_ptr(&info->cache->percpu->flush_tasklet) is not equal to &this_cpu_ptr(info->cache->percpu)->flush_tasklet 1f743b076(use this_cpu_ptr per-cpu helper) introduced this bug. Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-29 15:13:27 -04:00
Hannes Frederic Sowa	1c4a154e52	ipv6: don't accept node local multicast traffic from the wire Erik Hugne's errata proposal (Errata ID: 3480) to RFC4291 has been verified: http://www.rfc-editor.org/errata_search.php?eid=3480 We have to check for pkt_type and loopback flag because either the packets are allowed to travel over the loopback interface (in which case pkt_type is PACKET_HOST and IFF_LOOPBACK flag is set) or they travel over a non-loopback interface back to us (in which case PACKET_TYPE is PACKET_LOOPBACK and IFF_LOOPBACK flag is not set). Cc: Erik Hugne <erik.hugne@ericsson.com> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-29 14:57:33 -04:00
Hong zhi guo	10c9cbb10f	netlink: fix the warning introduced by netlink API replacement Signed-off-by: Hong Zhiguo <honkiko@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-29 14:44:37 -04:00
Linus Torvalds	2c3de1c2d7	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull userns fixes from Eric W Biederman: "The bulk of the changes are fixing the worst consequences of the user namespace design oversight in not considering what happens when one namespace starts off as a clone of another namespace, as happens with the mount namespace. The rest of the changes are just plain bug fixes. Many thanks to Andy Lutomirski for pointing out many of these issues." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: userns: Restrict when proc and sysfs can be mounted ipc: Restrict mounting the mqueue filesystem vfs: Carefully propogate mounts across user namespaces vfs: Add a mount flag to lock read only bind mounts userns: Don't allow creation if the user is chrooted yama: Better permission check for ptraceme pid: Handle the exit of a multi-threaded init. scm: Require CAP_SYS_ADMIN over the current pidns to spoof pids.	2013-03-28 13:43:46 -07:00
Hong zhi guo	c60ee67f45	bridge: remove unused variable ifm Signed-off-by: Hong Zhiguo <honkiko@gmail.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-28 14:41:19 -04:00
John W. Linville	630a216da6	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem	2013-03-28 14:40:06 -04:00
David S. Miller	ea407f0b97	Included changes: - A fix for the network coding component which has been added within the last pull request (so it is in linux-3.10). The problem has been spotted thanks to Fengguang Wu's automated daily checks on our tree. - Implementation of the RTNL API for virtual interface creation/deletion and slave manipulation - substitution of seq_printf with seq_puts when possible - minor cleanups -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iQIcBAABCAAGBQJRUsb7AAoJEADl0hg6qKeOf/sQAJ4VRwpB5WLIjtLPQ1Dvd/qu thPAn4vpv02Is8H7NRULRN9aqUIzWrWtXkhBTAOh0EHq3mki6a+QT4TUQ4qswTPv BaVS70hoTtN1OdcntWTATCUuX2Y2HD/D3Cw2IZIG7X9oqTEpoTpnuQygMsG/c1+s MtNO57/WUKlwLv0pAu9IHh4qCj00eF5YHsPmsWzoWAvi/4aCgESQfXGy4Dm6HqLB E8XuVLOLtw7NSw0V7n//KtEE2llTWkfslm06dSsAmqEtzOfOCCuA4Z0asnMPJt7s 4HknqNE0D0aR0sfS+W6G24gFQbIV8R4SYWnUfMFCvvT7Ds96RONdgkfWw39nM7XB y70kXUbfDBIeHiiV2uqbTZnbuAa61p8Y8jCJvWz0nujI1/QPnUA1bqEgC2AMTc1J I6rS8mhWC6Rg+ohkK7HSQfgiZ5xcWtjFuVuOhKr0cPwhaCI0yP5PCB/B3yxuJxwD egkL58fgPhgii3DOxInAqLS/aPT5vl49J91qC3+z2/xYS3zRxHsyIRkMwgfoI2Ab TofVmzKqDwENA+CnjCPQx2gn6w/QpK8g5xmW1GgbzuKddiCcQFKnezgpeSdnQrsM CP/8Tu6r3QSLlHebysMuhsCwyT1hv8Yt0PTDpBqtIXRpV5enTYa9xW3F+UTU65ZY p9fKQXJYDLLEhfcy/TJM =FoTZ -----END PGP SIGNATURE----- Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge Included changes: - A fix for the network coding component which has been added within the last pull request (so it is in linux-3.10). The problem has been spotted thanks to Fengguang Wu's automated daily checks on our tree. - Implementation of the RTNL API for virtual interface creation/deletion and slave manipulation - substitution of seq_printf with seq_puts when possible - minor cleanups Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-28 14:34:23 -04:00
Hong zhi guo	573ce260b3	net-next: replace obsolete NLMSG_* with type safe nlmsg_* Signed-off-by: Hong Zhiguo <honkiko@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-28 14:25:25 -04:00
Simon Horman	e5c5d22e8d	net: add ETH_P_802_3_MIN Add a new constant ETH_P_802_3_MIN, the minimum ethernet type for an 802.3 frame. Frames with a lower value in the ethernet type field are Ethernet II. Also update all the users of this value that David Miller and I could find to use the new constant. Also correct a bug in util.c. The comparison with ETH_P_802_3_MIN should be >= not >. As suggested by Jesse Gross. Compile tested only. Cc: David Miller <davem@davemloft.net> Cc: Jesse Gross <jesse@nicira.com> Cc: Karsten Keil <isdn@linux-pingi.de> Cc: John W. Linville <linville@tuxdriver.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Bart De Schuymer <bart.de.schuymer@pandora.be> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Gustavo Padovan <gustavo@padovan.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: linux-bluetooth@vger.kernel.org Cc: netfilter-devel@vger.kernel.org Cc: bridge@lists.linux-foundation.org Cc: linux-wireless@vger.kernel.org Cc: linux1394-devel@lists.sourceforge.net Cc: linux-media@vger.kernel.org Cc: netdev@vger.kernel.org Cc: dev@openvswitch.org Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Acked-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-28 01:20:42 -04:00
David S. Miller	0fb031f036	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== 1) Initialize the satype field in key_notify_policy_flush(), this was left uninitialized. From Nicolas Dichtel. 2) The sequence number difference for replay notifications was misscalculated on ESN sequence number wrap. We need a separate replay notify function for esn. 3) Fix an off by one in the esn replay notify function. From Mathias Krause. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-27 14:07:04 -04:00
Sergey Popovich	ea872d7712	sch: add missing u64 in psched_ratecfg_precompute() It seems that commit commit `292f1c7ff6` Author: Jiri Pirko <jiri@resnulli.us> Date: Tue Feb 12 00:12:03 2013 +0000 sch: make htb_rate_cfg and functions around that generic adds little regression. Before: # tc qdisc add dev eth0 root handle 1: htb default ffff # tc class add dev eth0 classid 1:ffff htb rate 5Gbit # tc -s class show dev eth0 class htb 1:ffff root prio 0 rate 5000Mbit ceil 5000Mbit burst 625b cburst 625b Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) rate 0bit 0pps backlog 0b 0p requeues 0 lended: 0 borrowed: 0 giants: 0 tokens: 31 ctokens: 31 After: # tc qdisc add dev eth0 root handle 1: htb default ffff # tc class add dev eth0 classid 1:ffff htb rate 5Gbit # tc -s class show dev eth0 class htb 1:ffff root prio 0 rate 1544Mbit ceil 1544Mbit burst 625b cburst 625b Sent 5073 bytes 41 pkt (dropped 0, overlimits 0 requeues 0) rate 1976bit 2pps backlog 0b 0p requeues 0 lended: 41 borrowed: 0 giants: 0 tokens: 1802 ctokens: 1802 This probably due to lost u64 cast of rate parameter in psched_ratecfg_precompute() (net/sched/sch_generic.c). Signed-off-by: Sergey Popovich <popovich_sergei@mail.ru> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-27 14:06:41 -04:00
Wei Yongjun	fcca143d69	rtnetlink: fix error return code in rtnl_link_fill() Fix to return a negative error code from the error handling case instead of 0(possible overwrite to 0 by ops->fill_xstats call), as returned elsewhere in this function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-27 14:06:40 -04:00
David S. Miller	e2a553dbf1	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: include/net/ipip.h The changes made to ipip.h in 'net' were already included in 'net-next' before that header was moved to another location. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-27 13:52:49 -04:00
Wei Yongjun	5389090b59	netfilter: nf_conntrack: fix error return code Fix to return a negative error code from the error handling case instead of 0, as returned elsewhere in function nf_conntrack_standalone_init(). Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-03-27 18:31:17 +01:00
Jesper Dangaard Brouer	1b5ab0def4	net: use the frag lru_lock to protect netns_frags.nqueues update Move the protection of netns_frags.nqueues updates under the LRU_lock, instead of the write lock. As they are located on the same cacheline, and this is also needed when transitioning to use per hash bucket locking. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-27 12:48:33 -04:00
Jesper Dangaard Brouer	68399ac37e	net: frag, avoid several CPUs grabbing same frag queue during LRU evictor loop The LRU list is protected by its own lock, since commit `3ef0eb0db4` (net: frag, move LRU list maintenance outside of rwlock), and no-longer by a read_lock. This makes it possible, to remove the inet_frag_queue, which is about to be "evicted", from the LRU list head. This avoids the problem, of several CPUs grabbing the same frag queue. Note, cannot remove the inet_frag_lru_del() call in fq_unlink() called by inet_frag_kill(), because inet_frag_kill() is also used in other situations. Thus, we use list_del_init() to allow this double list_del to work. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-27 12:48:33 -04:00
Andy Shevchenko	5a048e3b59	net: core: let's use native isxdigit instead of custom In kernel we have fast and pretty implementation of the isxdigit() function. Let's use it. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-27 12:48:32 -04:00
Jason Wang	40893fd0fd	net: switch to use skb_probe_transport_header() Switch to use the new help skb_probe_transport_header() to do the l4 header probing for untrusted sources. For packets with partial csum, the header should already been set by skb_partial_csum_set(). Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-27 12:48:31 -04:00
Jason Wang	e5d5decaed	net: core: let skb_partial_csum_set() set transport header For untrusted packets with partial checksum, we need to set the transport header for precise packet length estimation. We can just let skb_pratial_csum_set() to do this to avoid extra call to skb_flow_dissect() and simplify the caller. Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-27 12:48:31 -04:00
Antonio Quartulli	0c81465357	batman-adv: use seq_puts instead of seq_printf when the format is constant As reported by checkpatch, seq_puts has to be preferred with respect to seq_printf when the format is a constant string (no va_args) Signed-off-by: Antonio Quartulli <ordex@autistici.org> Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>	2013-03-27 10:29:55 +01:00
Antonio Quartulli	9cb812c54e	batman-adv: update Makefile copyright years Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de> Signed-off-by: Antonio Quartulli <ordex@autistici.org>	2013-03-27 10:29:54 +01:00

1 2 3 4 5 ...

27656 Commits