linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-27 05:11:48 +00:00

Author	SHA1	Message	Date
Samuel Ortiz	b6355e972a	NFC: nci: Handle proprietary response and notifications Allow for drivers to explicitly define handlers for each proprietary notifications and responses they expect to support. Reviewed-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>	2015-06-09 00:34:20 +02:00
Joe Perches	2622e2a03c	NFC: nci: hci: Fix releasing uninitialized skbs Several of these goto exit; uses should be direct returns as skb is not yet initialized by nci_hci_get_param(). Miscellanea: o Use !memcmp instead of memcmp() == 0 o Remove unnecessary goto from if () {... goto exit;} else {...} exit: Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>	2015-06-09 00:34:19 +02:00
Willem de Bruijn	bbbf2df003	net: replace last open coded skb_orphan_frags with function call Commit `70008aa50e` ("skbuff: convert to skb_orphan_frags") replaced open coded tests of SKBTX_DEV_ZEROCOPY and skb_copy_ubufs with calls to helper function skb_orphan_frags. Apply that to the last remaining open coded site. Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-08 12:15:13 -07:00
Josh Hunt	0243508edd	ipv6: Fix protocol resubmission UDP encapsulation is broken on IPv6. This is because the logic to resubmit the nexthdr is inverted, checking for a ret value > 0 instead of < 0. Also, the resubmit label is in the wrong position since we already get the nexthdr value when performing decapsulation. In addition the skb pull is no longer necessary either. This changes the return value check to look for < 0, using it for the nexthdr on the next iteration, and moves the resubmit label to the proper location. With these changes the v6 code now matches what we do in the v4 ip input code wrt resubmitting when decapsulating. Signed-off-by: Josh Hunt <johunt@akamai.com> Acked-by: "Tom Herbert" <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-08 12:13:17 -07:00
Robert Shearman	27e41fcfa6	ipv6: fix possible use after free of dev stats The memory pointed to by idev->stats.icmpv6msgdev, idev->stats.icmpv6dev and idev->stats.ipv6 can each be used in an RCU read context without taking a reference on idev. For example, through IP6_*_STATS_* calls in ip6_rcv. These memory blocks are freed without waiting for an RCU grace period to elapse. This could lead to the memory being written to after it has been freed. Fix this by using call_rcu to free the memory used for stats, as well as idev after an RCU grace period has elapsed. Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-08 12:12:45 -07:00
Marcel Holtmann	781f899f2f	Bluetooth: Fix race condition with user channel and setup stage During the initial setup stage of a controller, the low-level transport is actually active. This means that HCI_UP is true. To avoid toggling the transport off and back on again for normal operation the kernel holds a grace period with HCI_AUTO_OFF that will turn the low-level transport off in case no user is present. The idea of the grace period is important to avoid having to initialize all of the controller twice. So legacy ioctl and the new management interface knows how to clear this grace period and then start normal operation. For the user channel operation this grace period has not been taken into account which results in the problem that HCI_UP and HCI_AUTO_OFF are set and the kernel will return EBUSY. However from a system point of view the controller is ready to be grabbed by either the ioctl, the management interface or the user channel. This patch brings the user channel to the same level as the other two entries for operating a controller. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Cc: stable@vger.kernel.org	2015-06-08 11:04:49 +03:00
Firo Yang	f38b24c905	fib_trie: coding style: Use pointer after check As Alexander Duyck pointed out: struct tnode { ... struct key_vector kv[1]; } The kv[1] member of struct tnode is an array, so referencing it through a NULL pointer will not crash the system, like this: struct tnode *p = NULL; struct key_vector *kv = p->kv; As such p->kv doesn't actually dereference anything, it is simply a means for getting the offset to the array from the pointer p. This patch makes the code more regular to avoid making people feel odd when they look at the code. Signed-off-by: Firo Yang <firogm@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-07 23:45:40 -07:00
Nikolay Aleksandrov	c4c832f89d	bridge: disable softirqs around br_fdb_update to avoid lockup br_fdb_update() can be called in process context in the following way: br_fdb_add() -> __br_fdb_add() -> br_fdb_update() (if NTF_USE flag is set) so we need to disable softirqs because there are softirq users of the hash_lock. One easy way to reproduce this is to modify the bridge utility to set NTF_USE, enable stp and then set maxageing to a low value so br_fdb_cleanup() is called frequently and then just add new entries in a loop. This happens because br_fdb_cleanup() is called from timer/softirq context. The spin locks in br_fdb_update were _bh before commit `f8ae737dee` ("[BRIDGE]: forwarding remove unneeded preempt and bh diasables") and at the time that commit was correct because br_fdb_update() couldn't be called from process context, but that changed after commit: `292d139898` ("bridge: add NTF_USE support") Using local_bh_disable/enable around br_fdb_update() allows us to keep using the spin_lock/unlock in br_fdb_update for the fast-path. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Fixes: `292d139898` ("bridge: add NTF_USE support") Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-07 19:44:13 -07:00
David S. Miller	7ff46e79fb	Revert "bridge: use _bh spinlock variant for br_fdb_update to avoid lockup" This reverts commit `1d7c49037b`. Nikolay Aleksandrov has a better version of this fix. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-07 19:43:47 -07:00
Robert Shearman	25cc8f0763	mpls: fix possible use after free of device The mpls device is used in an RCU read context without a lock being held. As the memory is freed without waiting for the RCU grace period to elapse, the freed memory could still be in use. Address this by using kfree_rcu to free the memory for the mpls device after the RCU grace period has elapsed. Fixes: `03c57747a7` ("mpls: Per-device MPLS state") Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-07 19:37:27 -07:00
Wilson Kok	1d7c49037b	bridge: use _bh spinlock variant for br_fdb_update to avoid lockup br_fdb_update() can be called in process context in the following way: br_fdb_add() -> __br_fdb_add() -> br_fdb_update() (if NTF_USE flag is set) so we need to use spin_lock_bh because there are softirq users of the hash_lock. One easy way to reproduce this is to modify the bridge utility to set NTF_USE, enable stp and then set maxageing to a low value so br_fdb_cleanup() is called frequently and then just add new entries in a loop. This happens because br_fdb_cleanup() is called from timer/softirq context. These locks were _bh before commit `f8ae737dee` ("[BRIDGE]: forwarding remove unneeded preempt and bh diasables") and at the time that commit was correct because br_fdb_update() couldn't be called from process context, but that changed after commit: `292d139898` ("bridge: add NTF_USE support") Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Fixes: `292d139898` ("bridge: add NTF_USE support") Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-07 15:24:54 -07:00
Eric Dumazet	b80c0e7858	tcp: get_cookie_sock() consolidation IPv4 and IPv6 share same implementation of get_cookie_sock(), and there is no point inlining it. We add tcp_ prefix to the common helper name and export it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-07 15:19:52 -07:00
Antonio Quartulli	94d1dd8731	batman-adv: change the MAC of each VLAN upon ndo_set_mac_address The MAC address of the soft-interface is used to initialise the "non-purge" TT entry of each existing VLAN. Therefore when the user invokes ndo_set_mac_address() all the "non-purge" TT entries have to be updated, not only the one belonging to the non-tagged network. Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-07 17:07:20 +02:00
Sven Eckelmann	7dac6d9391	batman-adv: Remove unused post-VLAN ethhdr in batadv_gw_dhcp_recipient_get Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-07 17:07:20 +02:00
Sven Eckelmann	a2f2b6cd41	batman-adv: Clarify calculation precedence for '&' and '?' Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-07 17:07:19 +02:00
Sven Eckelmann	1e2c2a4fe4	batman-adv: Add required includes to all files The header files could not be built independently of each other. This happened because headers didn't include the files for things they've used. This was problematic because the success of a build depended on the knowledge about the right order of local includes. Also source files were not including everything they've used explicitly. Instead they required that transitive includes are always stable. This is problematic because some transitive includes are not obvious, depend on config settings and may not be stable in the future. The order for include blocks are: * primary headers (main.h and the .h file of a .c file) * global linux headers * required local headers * extra forward declarations for pointers in function/struct declarations The only exceptions are linux/bitops.h and linux/if_ether.h in packet.h. This header file is shared with userspace applications like batctl and must therefore build together with userspace applications. The header linux/bitops.h is not part of the uapi headers and linux/if_ether.h conflicts with the musl implementation of netinet/if_ether.h. The maintainers rejected the use of __KERNEL__ preprocessor checks and thus these two headers are only in main.h. All files using packet.h first have to include main.h to work correctly. Reported-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-07 17:07:19 +02:00
Antonio Quartulli	bcef1f3c49	batman-adv: add bat_neigh_free API This API has to be used to let any routing protocol free neighbor specific allocated resources Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-07 17:07:18 +02:00
Antonio Quartulli	6f70eb7552	batman-adv: split name from variable for uint mesh attributes Some mesh attributes are behind substructs in the batadv_priv object and for this reason the name cannot be used anymore to refer to them. This patch allows to specify the variable name where the attribute is stored inside batadv_priv instead of using the name Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-07 17:07:18 +02:00
Sven Eckelmann	36fd61cb80	batman-adv: Use common Jenkins Hash implementation An unoptimized version of the Jenkins one-at-a-time hash function is used and partially copied all over the code wherever an hashtable is used. Instead the optimized version shared between the whole kernel should be used to reduce code duplication and use better optimized code. Only the DAT code must use the old implementation because it is used as distributed hash function which has to be common for all nodes. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-07 17:07:17 +02:00
Alexei Starovoitov	d691f9e8d4	bpf: allow programs to write to certain skb fields allow programs read/write skb->mark, tc_index fields and ((struct qdisc_skb_cb *)cb)->data. mark and tc_index are generically useful in TC. cb[0]-cb[4] are primarily used to pass arguments from one program to another called via bpf_tail_call() which can be seen in sockex3_kern.c example. All fields of 'struct __sk_buff' are readable to socket and tc_cls_act progs. mark, tc_index are writeable from tc_cls_act only. cb[0]-cb[4] are writeable by both sockets and tc_cls_act. Add verifier tests and improve sample code. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-07 02:01:33 -07:00
Alexei Starovoitov	3431205e03	bpf: make programs see skb->data == L2 for ingress and egress eBPF programs attached to ingress and egress qdiscs see inconsistent skb->data. For ingress L2 header is already pulled, whereas for egress it's present. This is known to program writers which are currently forced to use BPF_LL_OFF workaround. Since programs don't change skb internal pointers it is safe to do pull/push right around invocation of the program and earlier taps and later pt->func() will not be affected. Multiple taps via packet_rcv(), tpacket_rcv() are doing the same trick around run_filter/BPF_PROG_RUN even if skb_shared. This fix finally allows programs to use optimized LD_ABS/IND instructions without BPF_LL_OFF for higher performance. tc ingress + cls_bpf + samples/bpf/tcbpf1_kern.o w/o JIT w/JIT before 20.5 23.6 Mpps after 21.8 26.6 Mpps Old programs with BPF_LL_OFF will still work as-is. We can now undo most of the earlier workaround commit: `a166151cbe` ("bpf: fix bpf helpers to use skb->mac_header relative offsets") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-07 02:01:33 -07:00
Eric Dumazet	98da81a426	tcp: remove redundant checks II For the same reasons as in commit `12e25e1041` ("tcp: remove redundant checks"), we can remove redundant checks done for timewait sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-07 01:55:01 -07:00
Alexander Aring	ed65963ba0	mac802154: remove unneeded vif struct This patch removes the virtual interface structure from sub if data struct, because it isn't used anywhere. This structure could be useful for give per interface information at softmac driver layer. Nevertheless there exist no use case currently and it contains the interface type information currently. This information is also stored inside wpan dev which is now used to check on the wpan dev interface type. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Reviewed-by: Varka Bhadram <varkabhadram@gmail.com> Acked-by: Varka Bhadram <varkabhadram@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-06-07 09:13:32 +02:00
Eric Dumazet	90c337da15	inet: add IP_BIND_ADDRESS_NO_PORT to overcome bind(0) limitations When an application needs to force a source IP on an active TCP socket it has to use bind(IP, port=x). As most applications do not want to deal with already used ports, x is often set to 0, meaning the kernel is in charge to find an available port. But kernel does not know yet if this socket is going to be a listener or be connected. It has very limited choices (no full knowledge of final 4-tuple for a connect()) With limited ephemeral port range (about 32K ports), it is very easy to fill the space. This patch adds a new SOL_IP socket option, asking kernel to ignore the 0 port provided by application in bind(IP, port=0) and only remember the given IP address. The port will be automatically chosen at connect() time, in a way that allows sharing a source port as long as the 4-tuples are unique. This new feature is available for both IPv4 and IPv6 (Thanks Neal) Tested: Wrote a test program and checked its behavior on IPv4 and IPv6. strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by connect(). Also getsockname() show that the port is still 0 right after bind() but properly allocated after connect(). 
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5 setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0 bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0 getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0 connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0 getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0 IPv6 test : socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7 setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0 bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0 getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0 connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0 getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0 I was able to bind()/connect() a million concurrent IPv4 sockets, instead of ~32000 before patch. 
lpaa23:~# ulimit -n 1000010 lpaa23:~# ./bind --connect --num-flows=1000000 & 1000000 sockets lpaa23:~# grep TCP /proc/net/sockstat TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66 Check that a given source port is indeed used by many different connections : lpaa23:~# ss -t src :40000 | head -10 State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 127.0.0.2:40000 127.0.202.33:44983 ESTAB 0 0 127.0.0.2:40000 127.2.27.240:44983 ESTAB 0 0 127.0.0.2:40000 127.2.98.5:44983 ESTAB 0 0 127.0.0.2:40000 127.0.124.196:44983 ESTAB 0 0 127.0.0.2:40000 127.2.139.38:44983 ESTAB 0 0 127.0.0.2:40000 127.1.59.80:44983 ESTAB 0 0 127.0.0.2:40000 127.3.6.228:44983 ESTAB 0 0 127.0.0.2:40000 127.0.38.53:44983 ESTAB 0 0 127.0.0.2:40000 127.1.197.10:44983 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-06 23:57:12 -07:00
Loic Poulain	9380f9eacf	Bluetooth: Reorder HCI user channel socket release The hci close method needs to know if we are in user channel context. Only add the index to mgmt once close is performed. Signed-off-by: Loic Poulain <loic.poulain@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-06-06 20:49:04 +02:00
Jaganath Kanakkassery	951b6a0717	Bluetooth: Fix potential NULL dereference in RFCOMM bind callback addr can be NULL and it should not be dereferenced before NULL checking. Signed-off-by: Jaganath Kanakkassery <jaganath.k@samsung.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-06-06 08:44:33 +02:00
Trond Myklebust	0d2a970d0a	SUNRPC: Fix a backchannel race We need to allow the server to send a new request immediately after we've replied to the previous one. Right now, there is a window between the send and the release of the old request in rpc_put_task(), where the server could send us a new backchannel RPC call, and we have no request to service it. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-05 11:15:43 -04:00
Trond Myklebust	1dddda86c0	SUNRPC: Clean up allocation and freeing of back channel requests Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-05 11:15:43 -04:00
Trond Myklebust	0f41979164	SUNRPC: Remove unused argument 'tk_ops' in rpc_run_bc_task Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-05 11:15:42 -04:00
Tom Herbert	b3baa0fbd0	mpls: Add MPLS entropy label in flow_keys In flow dissector if an MPLS header contains an entropy label this is saved in the new keyid field of flow_keys. The entropy label is then represented in the flow hash function input. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-04 15:44:31 -07:00
Tom Herbert	1fdd512c92	net: Add GRE keyid in flow_keys In flow dissector if a GRE header contains a keyid this is saved in the new keyid field of flow_keys. The GRE keyid is then represented in the flow hash function input. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-04 15:44:31 -07:00
Tom Herbert	87ee9e52ff	net: Add IPv6 flow label to flow_keys In flow_dissector set the flow label in flow_keys for IPv6. This also removes the shortcircuiting of flow dissection when a non-zero label is present, the flow label can be considered to provide additional entropy for a hash. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-04 15:44:31 -07:00
Tom Herbert	d34af823ff	net: Add VLAN ID to flow_keys In flow_dissector set vlan_id in flow_keys when VLAN is found. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-04 15:44:31 -07:00
Tom Herbert	45b47fd00c	net: Get rid of IPv6 hash addresses flow keys We don't need to return the IPv6 address hash as part of flow keys. In general, using the IPv6 address hash is risky in a hash value since the underlying use of xor provides no entropy. If someone really needs the hash value they can get it from the full IPv6 addresses in flow keys (e.g. from flow_get_u32_src). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-04 15:44:31 -07:00
Tom Herbert	9f24908901	net: Add keys for TIPC address Add a new flow key for TIPC addresses. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-04 15:44:31 -07:00
Tom Herbert	c3f8324188	net: Add full IPv6 addresses to flow_keys This patch adds full IPv6 addresses into flow_keys and uses them as input to the flow hash function. The implementation supports either IPv4 or IPv6 addresses in a union, and selector is used to determine how many words to input to jhash2. We also add flow_get_u32_dst and flow_get_u32_src functions which are used to get a u32 representation of the source and destination addresses. For IPv6, ipv6_addr_hash is called. These functions retain getting the legacy values of src and dst in flow_keys. With this patch, Ethertype and IP protocol are now included in the flow hash input. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-04 15:44:30 -07:00
Tom Herbert	42aecaa9bb	net: Get skb hash over flow_keys structure This patch changes flow hashing to use jhash2 over the flow_keys structure instead of just doing jhash_3words over src, dst, and ports. This method will allow us to take more input into the hashing function so that we can include full IPv6 addresses, VLAN, flow labels etc. without needing to resort to xor'ing which makes for a poor hash. Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-04 15:44:30 -07:00
Tom Herbert	c468efe2c7	net: Remove superfluous setting of key_basic key_basic is set twice in __skb_flow_dissect which seems unnecessary. Remove second one. Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-04 15:44:30 -07:00
Tom Herbert	ce3b535547	net: Simplify GRE case in flow_dissector Do break when we see routing flag or a non-zero version number in GRE header. Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-04 15:44:30 -07:00
Chuck Lever	ffe1f0df58	rpcrdma: Merge svcrdma and xprtrdma modules into one Bi-directional RPC support means code in svcrdma.ko invokes a bit of code in xprtrdma.ko, and vice versa. To avoid loader/linker loops, merge the server and client side modules together into a single module. When backchannel capabilities are added, the combined module will register all needed transport capabilities so that Upper Layer consumers automatically have everything needed to create a bi-directional transport connection. Module aliases are added for backwards compatibility with user space, which still may expect svcrdma.ko or xprtrdma.ko to be present. This commit reverts commit `2e8c12e1b7` ("xprtrdma: add separate Kconfig options for NFSoRDMA client and server support") and provides a single CONFIG option for enabling the new module. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-06-04 16:56:02 -04:00
Chuck Lever	0380a3f375	svcrdma: Add a separate "max data segs" macro for svcrdma The server and client maximum are architecturally independent. Allow changing one without affecting the other. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-06-04 16:56:01 -04:00
Chuck Lever	b7e0b9a965	svcrdma: Replace GFP_KERNEL in a loop with GFP_NOFAIL At the 2015 LSF/MM, it was requested that memory allocation call sites that request GFP_KERNEL allocations in a loop should be annotated with __GFP_NOFAIL. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-06-04 16:56:00 -04:00
Chuck Lever	30b7e246a6	svcrdma: Keep rpcrdma_msg fields in network byte-order Fields in struct rpcrdma_msg are __be32. Don't byte-swap these fields when decoding RPC calls and then swap them back for the reply. For the most part, they can be left alone. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-06-04 16:55:59 -04:00
Chuck Lever	70747c25a7	svcrdma: Fix byte-swapping in svc_rdma_sendto.c In send_write_chunks(), we have: for (xdr_off = rqstp->rq_res.head[0].iov_len, chunk_no = 0; xfer_len && chunk_no < arg_ary->wc_nchunks; chunk_no++) { . . . } Note that arg_ary->wc_nchunk is in network byte-order. For the comparison to work correctly, both have to be in native byte-order. In send_reply_chunks, we have: write_len = min(xfer_len, htonl(ch->rs_length)); xfer_len is in native byte-order, and ch->rs_length is in network byte-order. be32_to_cpu() is the correct byte swap for ch->rs_length. As an additional clean up, replace ntohl() with be32_to_cpu() in a few other places. This appears to address a problem with large rsize hangs while using PHYSICAL memory registration. I suspect that is the only registration mode that uses more than one chunk element. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=248 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-06-04 16:55:58 -04:00
Alexei Starovoitov	94db13fe5f	bpf: fix build due to missing tc_verd fix build error: net/core/filter.c: In function 'bpf_clone_redirect': net/core/filter.c:1429:18: error: 'struct sk_buff' has no member named 'tc_verd' if (G_TC_AT(skb2->tc_verd) & AT_INGRESS) Fixes: `3896d655f4` ("bpf: introduce bpf_clone_redirect() helper") Reported-by: Or Gerlitz <gerlitz.or@gmail.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-04 11:45:59 -07:00
Varka Bhadram	133be0264f	nl802154: export supported commands This patch will export the supported commands by the devices to the userspace. This will be useful to check if HardMAC drivers can support a specific command or not. Signed-off-by: Varka Bhadram <varkab@cdac.in> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-06-04 12:27:15 +02:00
Lennert Buytenhek	8a70cefa30	ieee802154: Fix sockaddr_ieee802154 implicit padding information leak. The AF_IEEE802154 sockaddr looks like this: struct sockaddr_ieee802154 { sa_family_t family; /* AF_IEEE802154 */ struct ieee802154_addr_sa addr; }; struct ieee802154_addr_sa { int addr_type; u16 pan_id; union { u8 hwaddr[IEEE802154_ADDR_LEN]; u16 short_addr; }; }; On most architectures there will be implicit structure padding here, in two different places: * In struct sockaddr_ieee802154, two bytes of padding between 'family' (unsigned short) and 'addr', so that 'addr' starts on a four byte boundary. * In struct ieee802154_addr_sa, two bytes at the end of the structure, to make the structure 16 bytes. When calling recvmsg(2) on a PF_IEEE802154 SOCK_DGRAM socket, the ieee802154 stack constructs a struct sockaddr_ieee802154 on the kernel stack without clearing these padding fields, and, depending on the addr_type, between four and ten bytes of uncleared kernel stack will be copied to userspace. We can't just insert two 'u16 __pad's in the right places and zero those before copying an address to userspace, as not all architectures insert this implicit padding -- from a quick test it seems that avr32, cris and m68k don't insert this padding, while every other architecture that I have cross compilers for does insert this padding. The easiest way to plug the leak is to just memset the whole struct sockaddr_ieee802154 before filling in the fields we want to fill in, and that's what this patch does. Cc: stable@vger.kernel.org Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-06-04 12:26:58 +02:00
Varka Bhadram	07bd77fa4c	cfg802154: fix rdev-ops naming convension and format specifiers This patch make to use the same naming convention that mac802154 tracing follows and fixes the format specifier for extended addr. Signed-off-by: Varka Bhadram <varkab@cdac.in> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-06-04 12:26:58 +02:00
Wei Liu	c39c4c6abb	tcp: double default TSQ output bytes limit Xen virtual network driver has higher latency than a physical NIC. Having only 128K as limit for TSQ introduced 30% regression in guest throughput. This patch raises the limit to 256K. This reduces the regression to 8%. This buys us more time to work out a proper solution in the long run. Signed-off-by: Wei Liu <wei.liu2@citrix.com> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-04 01:09:36 -07:00
Eric Dumazet	12e25e1041	tcp: remove redundant checks tcp_v4_rcv() checks the following before calling tcp_v4_do_rcv(): if (th->doff < sizeof(struct tcphdr) / 4) goto bad_packet; if (!pskb_may_pull(skb, th->doff * 4)) goto discard_it; So following check in tcp_v4_do_rcv() is redundant and "goto csum_err;" is wrong anyway. if (skb->len < tcp_hdrlen(skb) || ...) goto csum_err; A second check can be removed after no_tcp_socket label for same reason. Same tests can be removed in tcp_v6_do_rcv() Note : short tcp frames are not properly accounted in tcpInErrs MIB, because pskb_may_pull() failure simply drops incoming skb, we might fix this in a separate patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-04 01:04:40 -07:00
Shawn Bohrer	6e54030932	ipv4/udp: Verify multicast group is ours in upd_v4_early_demux() `421b3885bf` "udp: ipv4: Add udp early demux" introduced a regression that allowed sockets bound to INADDR_ANY to receive packets from multicast groups that the socket had not joined. For example a socket that had joined 224.168.2.9 could also receive packets from 225.168.2.9 despite not having joined that group if ip_early_demux is enabled. Fix this by calling ip_check_mc_rcu() in udp_v4_early_demux() to verify that the multicast packet is indeed ours. Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Reported-by: Yurij M. Plotnikov <Yurij.Plotnikov@oktetlabs.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-04 00:46:26 -07:00
Martin Willi	b08b6b7791	xfrm: Define ChaCha20-Poly1305 AEAD XFRM algo for IPsec users Signed-off-by: Martin Willi <martin@strongswan.org> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2015-06-04 15:04:55 +08:00
Scott Feldman	7616dcbb21	switchdev: documentation: use switchdev_port_obj_xxx for IPv4 FIB add/modify/delete ops Clarify in documentation and code that IPV4 FIB add operation is used for both adding a new FIB entry to the device and for modifying an existing FIB entry on the device. Also, remove left-over references to ipv4_fib ops and replace with details on SWITCHDEV_PORT_IPV4_FIB object. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-03 23:47:23 -07:00
David S. Miller	cf71f43e44	Included changes: - code re-arrangement for better reading and understanding - code style fixups - comments corrections - remove unnecessary NULL check in batadv_iv_ogm_update_seqnos() - make boolean functions explicitly return a bool result - remove unnecessary variables in algo_register() and algo_select() -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJVbwnkAAoJEOb/4TMchkvf6iAP/3MyKJDF9ubmOLkZpWKyBq+/ MsTEN4PFRxQQ7Q2+Cct1MshuD0DBBznG+Nu1UwYUB5ahUPUmpntJ8hQoD982jT3u K4h/tHlyEtRVxPzYwW79woE/Q+hjdGqE745eKMHury0K+SkNR4jX3yJ7bjVRwQiC Sdk6uProCCgK5JHX++bxjbTnJobCvqCSy045hjMxuwFuTG4S+5le60m+tVe21D3C tnyT3y6L4OdbhKpBRMMAFkxYUzQONxiEWMYffubM6gk+ziIAttAJemLyE+ViHAH4 Y7ItGd9Z/5+mPaO0OF3Q3jfN1jhGf3IxoYgKy9rL5JWIy6qomx0TTfPoPTDRYFR+ 2iQX59FIayaa9CgYbauHopEiDOJQ/nQ437haPO25xT9ICZbnPNWshdv9Z+zLNV/A uuUQrN+aWNLo9j40iD01s7AfPcYNDYklqygb9hSLTa7yeH/rPCG/RqJJ7zse4IQa /QMl1lUl484gPHFqMTVB7/75KL5G5B+KQdwON3AqnyRR3RrlOm7NbtcvuDTDheeW BAU5g7y/RG3DSoGtwPvFG6MyyPK8C2+niLY7EWUrs1EBWc5DGH+/oeVBR6SL46Fv KY1TiFrzvczjUKA0NyLw3w/jeE3SGxiVEBGN2Wv7veVwuV2Jc3MLGxNZGoKPum/k Vz7vG3ghIRM3aA1dO6Nx =m9Yd -----END PGP SIGNATURE----- Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge Antonio Quartulli says: ==================== pull request: batman-adv 20150603 here you have our second batch of patches intended for net-next. In this patchset you won't find any new features, but quite some code cleanup work, a bunch of code style fixes and also comments corrections by Markus Pargmann. Moreover you have a patch from Sven Eckelmann removing an unnecessary NULL check in batadv_iv_ogm_update_seqnos(). Please pull or let me know of any problem! ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-03 20:22:46 -07:00
Alexei Starovoitov	3896d655f4	bpf: introduce bpf_clone_redirect() helper Allow eBPF programs attached to classifier/actions to call bpf_clone_redirect(skb, ifindex, flags) helper which will mirror or redirect the packet by dynamic ifindex selection from within the program to a target device either at ingress or at egress. Can be used for various scenarios, for example, to load balance skbs into veths, split parts of the traffic to local taps, etc. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-03 20:16:58 -07:00
Jiri Benc	640b2b107c	openvswitch: disable LRO Currently, openvswitch tries to disable LRO from the user space. This does not work correctly when the device added is a vlan interface, though. Instead of dealing with possibly complex stacked cross name space relations in the user space, do the same as bridging does and call dev_disable_lro in the kernel. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Flavio Leitner <fbl@redhat.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-03 19:39:35 -07:00
Chuck Lever	da7049f834	svcrdma: Remove svc_rdma_xdr_decode_deferred_req() svc_rdma_xdr_decode_deferred_req() indexes an array with an un-byte-swapped value off the wire. Fortunately this function isn't used anywhere, so simply remove it. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-06-03 15:15:23 -04:00
Chuck Lever	3f87d5d6ac	SUNRPC: Move EXPORT_SYMBOL for svc_process Clean up. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-06-03 15:15:22 -04:00
Markus Pargmann	f372d09059	batman-adv: Remove unnecessary ret variable in algo_register Remove ret variable and all jumps. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-03 15:57:25 +02:00
Markus Pargmann	9fb6c6519b	batman-adv: Remove unnecessary ret variable We can avoid this indirect return variable by directly returning the error values. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-03 15:57:24 +02:00
Markus Pargmann	f2d5cf2add	batman-adv: main, batadv_compare_eth return bool Declare the return type of batadv_compare_eth as bool. The function called inside this helper function (ether_addr_equal_unaligned) also uses bool as return value, so there is no need to return int. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-03 15:57:24 +02:00
Markus Pargmann	e8ad3b1acf	batman-adv: main, Convert is_my_mac() to bool It is much clearer to see a bool type as return value than 'int' for functions that are supposed to return true or false. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-03 15:57:24 +02:00
Sven Eckelmann	a0c77227ff	batman-adv: Remove unnecessary check for orig_ifinfo not NULL orig_ifinfo is dereferenced multiple times in batadv_iv_ogm_update_seqnos before the check for NULL is done. The function also exits at the beginning when orig_ifinfo would have been NULL. This makes the check at the end unnecessary and only confuses the reader/code analyzers. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-03 15:57:23 +02:00
Markus Pargmann	21102626da	batman-adv: types, Fix comment on bcast_own batadv_orig_bat_iv->bcast_own is actually not a bitfield, it is an array. Adjust the comment accordingly. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-03 15:57:23 +02:00
Markus Pargmann	d491dbb68b	batman-adv: iv_ogm, fix comment function name This is a small copy paste fix for batadv_ing_buffer_avg. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-03 10:58:32 +02:00
Markus Pargmann	6c4a1622e2	batman-adv: iv_ogm, fix coding style The kernel coding style says, that there should not be multiple assignments in one row. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-03 10:58:31 +02:00
Markus Pargmann	9f52ee19c3	batman-adv: iv_ogm, Fix dup_status comment Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-03 10:58:31 +02:00
Markus Pargmann	23badd6dbe	batman-adv: iv_ogm_orig_update, style, add missing brackets CodingStyle describes that either none or both branches of a conditional have to have brackets. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-03 10:58:31 +02:00
Markus Pargmann	564891510e	batman-adv: iv_ogm_queue_add, Simplify expressions Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-03 10:58:30 +02:00
Markus Pargmann	940d156f52	batman-adv: iv_ogm_aggregate_new, simplify error handling It is just a bit easier to put the error handling at one place and let multiple error paths use the same calls. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-06-03 10:58:30 +02:00
Johannes Berg	c526a46767	mac80211: rename single hw-scan flag to follow naming convention The naming convention is to always have the flags prefixed with IEEE80211_HW_ so they're 'namespaced', make this flag follow it. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-06-02 20:32:00 +02:00
Johannes Berg	ea1b2b45f5	mac80211: remove short slot/short preamble incapable flags There are no drivers setting IEEE80211_HW_2GHZ_SHORT_SLOT_INCAPABLE or IEEE80211_HW_2GHZ_SHORT_PREAMBLE_INCAPABLE, so any code using the two flags is dead; it's also exceedingly unlikely that any new driver could ever need to set these flags. The wcn36xx code is almost certainly broken, but this preserves the previous behaviour. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-06-02 20:28:58 +02:00
Chuck Lever	632dda833e	SUNRPC: Clean up bc_send() Clean up: Merge bc_send() into bc_svc_process(). Note: even though this touches svc.c, it is a client-side change. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-02 13:30:35 -04:00
Trond Myklebust	1193d58f75	SUNRPC: Backchannel handle socket nospace If the socket was busy due to a socket nospace error, then we should retry the send. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-02 13:30:35 -04:00
Varka Bhadram	0ecc4e688b	mac802154: add trace functionality for driver ops This patch adds trace events for driver operations. Signed-off-by: Varka Bhadram <varkab@cdac.in> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-06-02 19:21:09 +02:00
Alexander Aring	1caf6f476e	ieee802154: 6lowpan: set ackreq when needed This patch sets the acknowledge request bit inside the 802.15.4 mac header when frame retries is 0 or above. The other frame retries value which is -1 indicates that the transmitter doesn't care about an acknowledge frame which will be ignored after transmitting if the node sends anyway an ack frame after receiving. This is currently unnecessary traffic if the max frame retries parameter is -1. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-06-02 17:09:35 +02:00
Doug Ledford	b806ef3bbe	Merge branch 'for-4.2-misc' into k.o/for-4.2	2015-06-02 09:33:22 -04:00
Wengang Wang	d655a9fbc8	rds: re-entry of rds_ib_xmit/rds_iw_xmit The BUG_ON at line 452/453 is triggered in function rds_send_xmit. 441 while (ret) { 442 tmp = min_t(int, ret, sg->length - 443 conn->c_xmit_data_off); 444 conn->c_xmit_data_off += tmp; 445 ret -= tmp; 446 if (conn->c_xmit_data_off == sg->length) { 447 conn->c_xmit_data_off = 0; 448 sg++; 449 conn->c_xmit_sg++; 450 if (ret != 0 && conn->c_xmit_sg == rm->data.op_nents) 451 printk(KERN_ERR "conn %p rm %p sg %p ret %d\n", conn, rm, sg, ret); 452 BUG_ON(ret != 0 && 453 conn->c_xmit_sg == rm->data.op_nents); 454 } 455 } It complains that the total sent length is bigger than what we want to send. rds_ib_xmit() returns a wrong value on the second entry for the same rds_message. The sg and off passed by rds_send_xmit to rds_ib_xmit are based on scatterlist.offset/length, but rds_ib_xmit acts on scatterlist.dma_address/dma_length. If dma_length is larger than length, there is a problem. For the 2nd and later entries of rds_ib_xmit for the same rds_message, at least one of the following two is wrong: 1) the scatterlist to start with; the chosen one can be far beyond the correct one. 2) the offset to start with within the scatterlist. Fix: add op_dmasg and op_dmaoff to the rm_data_op structure, indicating respectively the scatterlist and the offset within it to start with for rds_ib_xmit. op_dmasg and op_dmaoff are initialized to zero when doing dma mapping for the first see of the message and are changed when filling send slots. The same applies to rds_iw_xmit too. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2015-06-02 09:22:31 -04:00
Trond Myklebust	88de6af24f	SUNRPC: Fix a memory leak in the backchannel code req->rq_private_buf isn't initialised when xprt_setup_backchannel calls xprt_free_allocation. Fixes: `fb7a0b9add` ("nfs41: New backchannel helper routines") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-02 08:55:28 -04:00
Stefan Hajnoczi	9300fdba25	SUNRPC: drop stale doc comments in xprtsock.c Several functions have outdated arguments listed in the doc comments. Drop documentation for arguments that no longer exist. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-02 08:55:28 -04:00
Johannes Berg	3b79af973c	mac80211: stop using pointers as userspace cookies Even if the pointers are really only accessible to root and used pretty much only by wpa_supplicant, this is still not great; even for debugging it'd be easier to have something that's easier to read and guaranteed to never get reused. With the recent change to make mac80211 create an ack_skb for the mgmt-tx path this becomes possible, only the client probe method needs to also allocate an ack_skb, and we can store the cookie in that skb. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-06-02 13:07:59 +02:00
Johannes Berg	b2eb0ee6d0	mac80211: copy nl80211 mgmt TX SKB for status When we return the TX status for an nl80211 mgmt TX SKB, we should also return the original frame with the status to allow userspace to match up the submission (it could also use the cookie but both ways are permissible.) As TX SKBs could be encrypted, at least in the case of ANQP while associated with the AP, copy the original SKB, store it with an ACK frame ID and restructure the status path to use that to return status with the original SKB. Otherwise, userspace (in particular wpa_supplicant) will get confused. Reported-by: Matti Gottlieb <matti.gottlieb@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-06-02 13:07:55 +02:00
Johannes Berg	db388a567f	mac80211: move TX PN to public part of key struct For drivers supporting TSO or similar features, but that still have PN assignment in software, there's a need to have some memory to store the current PN value. As mac80211 already stores this and it's somewhat complicated to add a per-driver area to the key struct (due to the dynamic sizing thereof) it makes sense to just move the TX PN to the keyconf, i.e. the public part of the key struct. As TKIP is more complicated and we won't be able to offload it in this way right now (fast-xmit is skipped for TKIP unless the HW does it all, and our hardware needs MMIC calculation in software) I've not moved that for now - it's possible but requires exposing a lot of the internal TKIP state. As a bonus side effect, we can remove a lot of code by assuming the keyseq struct has a certain layout - with BUILD_BUG_ON to verify it. This might also improve performance, since now TX and RX no longer share a cacheline. Reviewed-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-06-02 11:16:35 +02:00
David S. Miller	dda922c831	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/phy/amd-xgbe-phy.c drivers/net/wireless/iwlwifi/Kconfig include/net/mac80211.h iwlwifi/Kconfig and mac80211.h were both trivial overlapping changes. The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and the bug fix that happened on the 'net' side is already integrated into the rest of the amd-xgbe driver. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-01 22:51:30 -07:00
David S. Miller	e453581dd5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fix for net The following patch reverts the ebtables chunk that enforces counters that was introduced in the recently applied `d26e2c9ffa` ('Revert "netfilter: ensure number of counters is >0 in do_replace()"') since this breaks ebtables. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-01 16:56:43 -07:00
Toshiaki Makita	66e5133f19	vlan: Add GRO support for non hardware accelerated vlan Currently packets with non-hardware-accelerated vlan cannot be handled by GRO. This causes low performance for 802.1ad and stacked vlan, as their vlan tags are currently not stripped by hardware. This patch adds GRO support for non-hardware-accelerated vlan and improves receive performance of them. Test Environment: vlan device (.1Q) on vlan device (.1ad) on ixgbe (82599) Result: - Before $ netperf -t TCP_STREAM -H 192.168.20.2 -l 60 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 60.00 5233.17 Rx side CPU usage: %usr %sys %irq %soft %idle 0.27 58.03 0.00 41.70 0.00 - After $ netperf -t TCP_STREAM -H 192.168.20.2 -l 60 Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 60.00 7586.85 Rx side CPU usage: %usr %sys %irq %soft %idle 0.50 25.83 0.00 59.53 14.14 [ Register VLAN offloads with priority 10 -DaveM ] Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-01 16:50:52 -07:00
Steffen Klassert	ccd740cbc6	vti6: Add pmtu handling to vti6_xmit. We currently rely on the PMTU discovery of xfrm. However if a packet is locally sent, the PMTU mechanism of xfrm tries to do local socket notification, which might not work for applications like ping that don't check for this. So add pmtu handling to vti6_xmit to report MTU changes immediately. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-01 16:03:43 -07:00
Neil McKee	ccea74457b	openvswitch: include datapath actions with sampled-packet upcall to userspace If new optional attribute OVS_USERSPACE_ATTR_ACTIONS is added to an OVS_ACTION_ATTR_USERSPACE action, then include the datapath actions in the upcall. This directly associates the sampled packet with the path it takes through the virtual switch. Path information currently includes mangling, encapsulation and decapsulation actions for tunneling protocols GRE, VXLAN, Geneve, MPLS and QinQ, but this extension requires no further changes to accommodate datapath actions that may be added in the future. Adding path information enhances visibility into complex virtual networks. Signed-off-by: Neil McKee <neil.mckee@inmon.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-01 15:05:40 -07:00
David S. Miller	bdef7de4b8	net: Add priority to packet_offload objects. When we scan a packet for GRO processing, we want to see the most common packet types in the front of the offload_base list. So add a priority field so we can handle this properly. IPv4/IPv6 get the highest priority with the implicit zero priority field. Next comes ethernet with a priority of 10, and then we have the MPLS types with a priority of 15. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Suggested-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-01 14:56:09 -07:00
David S. Miller	18ec898ee5	Revert "net: core: 'ethtool' issue with querying phy settings" This reverts commit `f96dee13b8`. It isn't right, ethtool is meant to manage one PHY instance per netdevice at a time, and this is selected by the SET command. Therefore by definition the GET command must only return the settings for the configured and selected PHY. Reported-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-01 14:43:50 -07:00
Bernhard Thaler	d26e2c9ffa	Revert "netfilter: ensure number of counters is >0 in do_replace()" This partially reverts commit `1086bbe97a` ("netfilter: ensure number of counters is >0 in do_replace()") in net/bridge/netfilter/ebtables.c. Setting rules with ebtables does not work any more with `1086bbe97a` in place. There is an error message and no rules set in the end. e.g. ~# ebtables -t nat -A POSTROUTING --src 12:34:56:78:9a:bc -j DROP Unable to update the kernel. Two possible causes: 1. Multiple ebtables programs were executing simultaneously. The ebtables userspace tool doesn't by default support multiple ebtables programs running Reverting the ebtables part of `1086bbe97a` makes this work again. Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-06-01 19:45:47 +02:00
Johannes Berg	c9c99f8938	mac80211: act upon and report deauth while associating When trying to associate, the AP could send a deauth frame instead. Currently mac80211 drops that frame and doesn't report it to the supplicant, which, in some versions and/or in certain circumstances will simply keep trying to associate over and over again instead of trying authentication again. Fix this by reacting to deauth frames while associating, reporting them to the supplicant and dropping the association attempt (which is bound to fail.) Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-06-01 14:10:27 +02:00
Florian Fainelli	24595346d7	net: dsa: Properly propagate errors from dsa_switch_setup_one While shuffling some code around, dsa_switch_setup_one() was introduced, and it was modified to return either an error code using ERR_PTR() or a NULL pointer when running out of memory or failing to set up a switch. This is a problem for its caller, dsa_switch_setup(), which uses IS_ERR() and expects to find an error code, not a NULL pointer, so we still try to proceed with dsa_switch_setup() and operate on invalid memory addresses. This can be easily reproduced by having e.g. the bcm_sf2 driver built in, but having no such switch, such that drv->setup will fail. Fix this by using ERR_PTR() consistently, which is both more informative and avoids the need for the caller to use IS_ERR_OR_NULL(). Fixes: `df197195a5` ("net: dsa: split dsa_switch_setup into two functions") Reported-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 21:50:34 -07:00
Neal Cardwell	9f950415e4	tcp: fix child sockets to use system default congestion control if not set Linux 3.17 and earlier are explicitly engineered so that if the app doesn't specifically request a CC module on a listener before the SYN arrives, then the child gets the system default CC when the connection is established. See tcp_init_congestion_control() in 3.17 or earlier, which says "if no choice made yet assign the current value set as default". The change ("net: tcp: assign tcp cong_ops when tcp sk is created") altered these semantics, so that children got their parent listener's congestion control even if the system default had changed after the listener was created. This commit returns to those original semantics from 3.17 and earlier, since they are the original semantics from 2007 in `4d4d3d1e8` ("[TCP]: Congestion control initialization."), and some Linux congestion control workflows depend on that. In summary, if a listener socket specifically sets TCP_CONGESTION to "x", or the route locks the CC module to "x", then the child gets "x". Otherwise the child gets current system default from net.ipv4.tcp_congestion_control. That's the behavior in 3.17 and earlier, and this commit restores that. Fixes: `55d8694fa8` ("net: tcp: assign tcp cong_ops when tcp sk is created") Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <dborkman@redhat.com> Cc: Glenn Judd <glenn.judd@morganstanley.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 21:49:14 -07:00
Sowmini Varadhan	8ba38460f3	net/rds Add getsockopt support for SO_RDS_TRANSPORT The currently attached transport for a PF_RDS socket may be obtained from user space by invoking getsockopt(2) using the SO_RDS_TRANSPORT option at the SOL_RDS level. The integer optval returned will be one of the RDS_TRANS_* constants defined in linux/rds.h. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 21:47:23 -07:00
Sowmini Varadhan	d97dac54bf	net/rds: Add setsockopt support for SO_RDS_TRANSPORT An application may deterministically attach the underlying transport for a PF_RDS socket by invoking setsockopt(2) with the SO_RDS_TRANSPORT option at the SOL_RDS level. The integer argument to setsockopt must be one of the RDS_TRANS_* transport types, e.g., RDS_TRANS_TCP. The option must be specified before invoking bind(2) on the socket, and may only be used once on the socket. An attempt to set the option on a bound socket, or to invoke the option after a successful SO_RDS_TRANSPORT attachment, will return EOPNOTSUPP. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 21:47:23 -07:00
Sowmini Varadhan	a28c257c9e	net/rds: Declare SO_RDS_TRANSPORT and RDS_TRANS_* constants in uapi/linux/rds.h User space applications that desire to explicitly select the underlying transport for a PF_RDS socket may do so by using the SO_RDS_TRANSPORT socket option at the SOL_RDS level before bind(). The integer argument provided to the socket option would be one of the RDS_TRANS_* values, e.g., RDS_TRANS_TCP. This commit exports the constant values needed by such applications via <linux/rds.h> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 21:47:23 -07:00
Daniel Borkmann	17ca8cbf49	ebpf: allow bpf_ktime_get_ns_proto also for networking As this is already exported from tracing side via commit `d9847d310a` ("tracing: Allow BPF programs to call bpf_ktime_get_ns()"), we might as well want to move it to the core, so also networking users can make use of it, e.g. to measure diffs for certain flows from ingress/egress. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 21:44:44 -07:00
Eric Dumazet	beb39db59d	udp: fix behavior of wrong checksums We have two problems in UDP stack related to bogus checksums : 1) We return -EAGAIN to application even if receive queue is not empty. This breaks applications using edge trigger epoll() 2) Under UDP flood, we can loop forever without yielding to other processes, potentially hanging the host, especially on non SMP. This patch is an attempt to make things better. We might in the future add extra support for rt applications wanting to better control time spent doing a recv() in a hostile environment. For example we could validate checksums before queuing packets in socket receive queue. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 21:42:18 -07:00
David S. Miller	d803731462	As we get closer to the merge window, here are a few more things for -next: * disconnect TDLS stations on CSA to avoid issues * fix a memory leak introduced in a recent commit * switch rfkill and cfg80211 to PM ops * in an unlikely scenario, prevent a bookkeeping value to get corrupted leading to dropped packets * fix a crash in VLAN assignment * switch rfkill-gpio to more modern gpiod API * send disconnected event to userspace with proper local/remote indication -----BEGIN PGP SIGNATURE----- iQIcBAABCAAGBQJVaEyAAAoJEDBSmw7B7bqrPiIQAKOrX4g2UNtyoTWJzA7YRu+g GEUu/CE4LQKCodCpBiEhFlhQo2WzXsHoLj5+Nr56aFAZx19VZjXWVC5JS785wYn5 r8hpOVWUUA3MVnXeL/+yz4chm0wTYN9pSpElZ4FHlUI0OkCMh2rPCTvdrbSKoGzV MN8NEO0jVE89AgOMF8gHk5YKpJ6B4QibZuUuZpgkqdwIi5udaCcrPFFrUg/NfRpA nTauP6blFUPOUV0sxbhS78uC3rqGQuYsnvab/QeGc9PDKk5ukrXzFdgRCVZq8224 Ge0JcPzwzWldk892oEJoc2OfGkg5HOil9HtC+S2ehBGuK0yEXOBIkO1ZgudTH1kC 0rLOPWVKRzTWE+sq+gWK/OjfaA7Dl6HFYYHRQ2dhm1XkqtAw8SwGQMDSIPJYWr4O jp4gYpwKVjnMmsEAg7FdKWyIiTgLyI07VnIciORXDyefddYMuofXI2pJkfzUeFeH HjCVYm2NYXDty6uneP4RC1nUbNc53FKJ5O9fW3BPMyVXD4pTjam50p9H6N7OcDN3 k3dEevWiVgvBjZPVc3HI8RaCzS/Ww1ym+MYgV97QkMfgiuE2VkiFwK+zhWn9axbc eutkzFEdDcIACCZ74hIWqMJjsMnZm9E11Uq7tifAE0bi1Wpku1xPAnxMPnI+0eiF Dgo2bmlQ/d1dHr3N3FC0 =KmwY -----END PGP SIGNATURE----- Merge tag 'mac80211-next-for-davem-2015-05-29' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== As we get closer to the merge window, here are a few more things for -next: * disconnect TDLS stations on CSA to avoid issues * fix a memory leak introduced in a recent commit * switch rfkill and cfg80211 to PM ops * in an unlikely scenario, prevent a bookkeeping value to get corrupted leading to dropped packets * fix a crash in VLAN assignment * switch rfkill-gpio to more modern gpiod API * send disconnected event to userspace with proper local/remote indication ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 17:34:26 -07:00
Lennert Buytenhek	daf4e2c892	ieee802154: Fix EUI-64 station address validation. Refuse to allow setting an EUI-64 group address as an interface address, as those are not valid station addresses. Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-31 13:40:53 +02:00
David S. Miller	a9ab2184f4	Included changes: - checkpatch fixes - code cleanup - debugfs component is now compiled only if DEBUG_FS is selected - update copyright years - disable by default not-so-user-safe features -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJVaCjMAAoJEOb/4TMchkvfQS0QANDW/0eOT8azlbik5+MZTC5i d+K1Xbc7Qn7ebo5F27eRGNrgV5a8Wwx1JUCANXhAfjSURItj3KoHbjLYN2lJLn5L mBoU7IWwqUzX2garm7xKm94TTaN3Q6t/NGYVeQqJXNcWBDJQcNAr7ECg8tpV16Ec +o6FPsuZBX1dKNijvcy77VNGAaauhAbfMuAYRJDx6CtCIyWg+f/vcAeTR2PCmbMD FP2qD2zHBnR5feQF9YtrCOUHX3SzKlnCBQ1DyUzWbC40eGJWQPZiml+CC0r7fNrI buOlk2yDI1Pc0/TIDrm3B3f0LqoQhmC4h0EDP/tazoiHAe/Vh06D4dmsC81XBM+H 9wEzU+C20DUjDVIyTzboIDjcSNwTN5TxK0dG72vc+yDfSSAmJVtLQ8dqQevRp6cd NPVebjCyJKXoBZWd1o7KO0s41dTbFBVHrA5ZLaEu5TcCMpKHzicJJyMr+OLgqTQE tqLMzqR+7VPmJfIwXuHX+wqHlsJCkrU1zyiuOyBn6uQ4rvbg503eadJffOAaLeCH FpOtKkQ34HNDUchgmiFVWWV1w6r3Si3/a7WRJN55B49sIZqJxxQfB2Evlk8vYNzT sVDFsNk8QnbaL2yCwxJEXj/Kgyfxj/PLAoxDnkt+cHWOF6nbGPHyIdDJQGSAHFrp NcZisqImn5iJS+2QV68a =2UBm -----END PGP SIGNATURE----- Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge Antonio Quartulli says: ==================== Included changes: - checkpatch fixes - code cleanup - debugfs component is now compiled only if DEBUG_FS is selected - update copyright years - disable by default not-so-user-safe features ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 01:07:06 -07:00
Wang Long	282c320d33	netevent: remove automatic variable in register_netevent_notifier() Remove automatic variable 'err' in register_netevent_notifier() and return the result of atomic_notifier_chain_register() directly. Signed-off-by: Wang Long <long.wanglong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 00:03:21 -07:00
David S. Miller	583d3f5af2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next, they are: 1) default CONFIG_NETFILTER_INGRESS to y for easier compile-testing of all options. 2) Allow to bind a table to net_device. This introduces the internal NFT_AF_NEEDS_DEV flag to perform a mandatory check for this binding. This is required by the next patch. 3) Add the 'netdev' table family, this new table allows you to create ingress filter basechains. This provides access to the existing nf_tables features from ingress. 4) Kill unused argument from compat_find_calc_{match,target} in ip_tables and ip6_tables, from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 00:02:30 -07:00
Julia Lawall	3d2f6d41d1	ipv6: drop unneeded goto Delete jump to a label on the next line, when that label is not used elsewhere. A simplified version of the semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @r@ identifier l; @@ -if (...) goto l; -l: // </smpl> Also remove the unnecessary ret variable. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-30 23:48:36 -07:00
Eric Dumazet	71d9f6149c	bridge: fix br_multicast_query_expired() bug br_multicast_query_expired() querier argument is a pointer to a struct bridge_mcast_querier : struct bridge_mcast_querier { struct br_ip addr; struct net_bridge_port __rcu *port; }; Intent of the code was to clear port field, not the pointer to querier. Fixes: `2cd4143192` ("bridge: memorize and export selected IGMP/MLD querier port") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Acked-by: Linus Lüssing <linus.luessing@c0d3.blue> Cc: Linus Lüssing <linus.luessing@web.de> Cc: Steinar H. Gunderson <sesse@samfundet.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-30 23:31:28 -07:00
David S. Miller	9d52bf0a23	Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2015-05-28 Here's a set of patches intended for 4.2. The majority of the changes are on the 802.15.4 side of things rather than Bluetooth related: - All sorts of cleanups & fixes to ieee802154 and related drivers - Rework of tx power support in ieee802154 and its drivers - Support for setting ieee802154 tx power through nl802154 - New IDs for the btusb driver - Various cleanups & smaller fixes to btusb - New btrtl driver for Realtek devices - Fix suspend/resume for Realtek devices Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-30 23:26:45 -07:00
Ying Xue	1ea23a2117	tipc: unconditionally put sock refcnt when sock timer to be deleted is pending As sock refcnt is taken when sock timer is started in sk_reset_timer(), the sock refcnt should be put when sock timer to be deleted is in pending state no matter what "probing_state" value of tipc sock is. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-30 18:08:37 -07:00
Alexei Starovoitov	37e82c2f97	bpf: allow BPF programs access skb->skb_iif and skb->dev->ifindex fields classic BPF already exposes skb->dev->ifindex via SKF_AD_IFINDEX extension. Allow eBPF program to access it as well. Note that classic aborts execution of the program if 'skb->dev == NULL' (which is inconvenient for program writers), whereas eBPF returns zero in such case. Also expose the 'skb_iif' field, since programs triggered by redirected packet need to know the original interface index. Summary: __skb->ifindex -> skb->dev->ifindex __skb->ingress_ifindex -> skb->skb_iif Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-30 17:51:13 -07:00
Sorin Dumitru	8133534c76	net: limit tcp/udp rmem/wmem to SOCK_{RCV,SND}BUF_MIN This is similar to b1cb59cf2efe(net: sysctl_net_core: check SNDBUF and RCVBUF for min length). I don't think too small values can cause crashes in the case of udp and tcp, but I've seen this set to too small values which triggered awful performance. It also makes the setting consistent across all the wmem/rmem sysctls. Signed-off-by: Sorin Dumitru <sdumitru@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-30 17:37:44 -07:00
Uwe Kleine-König	f7959e9c73	net: rfkill: gpio: make better use of gpiod API Since `39b2bbe3d7` (gpio: add flags argument to gpiod_get() functions) which appeared in v3.17-rc1, the gpiod_get functions take an additional parameter that allows to specify direction and initial value for output. Furthermore there is devm_gpiod_get_optional which is designed to get optional gpios. Simplify driver accordingly. Note this makes error checking more strict because only -ENOENT is ignored when searching for the GPIOs which is good. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-05-29 13:13:45 +02:00
Michal Kazior	6cbfb1bb66	cfg80211: ignore netif running state when changing iftype It was possible for mac80211 to be coerced into an unexpected flow causing sdata union to become corrupted. Station pointer was put into sdata->u.vlan.sta memory location while it was really master AP's sdata->u.ap.next_beacon. This led to station entry being later freed as next_beacon before __sta_info_flush() in ieee80211_stop_ap() and a subsequent invalid pointer dereference crash. The problem was that ieee80211_ptr->use_4addr wasn't cleared on interface type changes. This could be reproduced with the following steps: # host A and host B have just booted; no # wpa_s/hostapd running; all vifs are down host A> iw wlan0 set type station host A> iw wlan0 set 4addr on host A> printf 'interface=wlan0\nssid=4addrcrash\nchannel=1\nwds_sta=1' > /tmp/hconf host A> hostapd -B /tmp/conf host B> iw wlan0 set 4addr on host B> ifconfig wlan0 up host B> iw wlan0 connect -w hostAssid host A> pkill hostapd # host A crashed: [ 127.928192] BUG: unable to handle kernel NULL pointer dereference at 00000000000006c8 [ 127.929014] IP: [<ffffffff816f4f32>] __sta_info_flush+0xac/0x158 ... [ 127.934578] [<ffffffff8170789e>] ieee80211_stop_ap+0x139/0x26c [ 127.934578] [<ffffffff8100498f>] ? dump_trace+0x279/0x28a [ 127.934578] [<ffffffff816dc661>] __cfg80211_stop_ap+0x84/0x191 [ 127.934578] [<ffffffff816dc7ad>] cfg80211_stop_ap+0x3f/0x58 [ 127.934578] [<ffffffff816c5ad6>] nl80211_stop_ap+0x1b/0x1d [ 127.934578] [<ffffffff815e53f8>] genl_family_rcv_msg+0x259/0x2b5 Note: This isn't a revert of `f8cdddb8d6` ("cfg80211: check iface combinations only when iface is running") as far as functionality is considered because `b6a550156b` ("cfg80211/mac80211: move more combination checks to mac80211") moved the logic somewhere else already. 
Fixes: `f8cdddb8d6` ("cfg80211: check iface combinations only when iface is running") Signed-off-by: Michal Kazior <michal.kazior@tieto.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-05-29 13:05:40 +02:00
Michal Kazior	ab499db80f	mac80211: prevent possible crypto tx tailroom corruption There was a possible race between ieee80211_reconfig() and ieee80211_delayed_tailroom_dec(). This could result in inability to transmit data if driver crashed during roaming or rekeying and subsequent skbs with insufficient tailroom appeared. This race was probably never seen in the wild because a device driver would have to crash AND recover within 0.5s which is very unlikely. I was able to prove this race exists after changing the delay to 10s locally and crashing ath10k via debugfs immediately after GTK rekeying. In case of ath10k the counter went below 0. This was harmless but other drivers which actually require tailroom (e.g. for WEP ICV or MMIC) could end up with the counter at 0 instead of >0 and introduce insufficient skb tailroom failures because mac80211 would not resize skbs appropriately anymore. Fixes: `8d1f7ecd2a` ("mac80211: defer tailroom counter manipulation when roaming") Signed-off-by: Michal Kazior <michal.kazior@tieto.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-05-29 13:04:46 +02:00
Antonio Quartulli	8ea64e2708	batman-adv: Use common declaration order in *_send_skb_(packet\|unicast) Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>	2015-05-29 10:13:37 +02:00
Markus Pargmann	01b97a3eed	batman-adv: iv_ogm_orig_update, remove unnecessary brackets Remove these unnecessary brackets inside a condition. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:37 +02:00
Markus Pargmann	8f34b38878	batman-adv: iv_ogm_can_aggregate, code readability This patch tries to increase code readability by negating the first if block and rearranging some of the other conditional blocks. This way we save an indentation level, we also save some allocation that is not necessary for one of the conditions. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:37 +02:00
Marek Lindner	fc1f869366	batman-adv: checkpatch - spaces preferred around that '*' Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:37 +02:00
Marek Lindner	00f548bf54	batman-adv: checkpatch - comparison to NULL could be rewritten Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:37 +02:00
Sven Eckelmann	dab7b62190	batman-adv: Use safer default config for optional features The current default settings for optional features in batman-adv seem to be based around the idea that the user only compiles what he requires. They will automatically be enabled when they are compiled in. For example the network coding part of batman-adv is by default disabled in the out-of-tree module but will be enabled when the code is compiled during the module build. But distributions like Debian just enable all features of the batman-adv kernel module and hope that more experimental features or features with possible negative effects have to be enabled using some runtime configuration interface. The network_coding feature can help in specific setups but also has drawbacks and is not disabled by default in the out-of-tree module. Disabling by default in the runtime config seems to be also quite sane. The bridge_loop_avoidance is the only feature which is disabled by default but may be necessary even in simple setups. Packet loops may even be created during the initial node setup when this is not enabled. This is different than STP on bridges because mesh is usually used on Adhoc WiFi. Having two nodes (by accident) in the same LAN segment and in the same mesh network is rather common in this situation. Signed-off-by: Sven Eckelmann <sven@narfation.org> Acked-by: Martin Hundebøll <martin@hundeboll.net> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:36 +02:00
Markus Pargmann	de12baece9	batman-adv: iv_ogm_send_to_if, declare char* as const This string pointer is later assigned to a constant string, so it should be defined constant at the beginning. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:36 +02:00
Markus Pargmann	9fd9b19ea0	batman-adv: iv_ogm_aggr_packet, bool return value This function returns bool values, so it should be defined to return them instead of the whole int range. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:36 +02:00
Markus Pargmann	42d9f2cbd4	batman-adv: iv_ogm_iface_enable, direct return values Directly return error values. No need to use a return variable. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:36 +02:00
Markus Pargmann	9fc1883ef2	batman-adv: Makefile, Sort alphabetically The whole Makefile is sorted, just the multicast rule is not at the right position. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:36 +02:00
Markus Pargmann	16b9ce83fb	batman-adv: tvlv realloc, move error handling into if block Instead of hiding the normal function flow inside an if block, we should just put the error handling into the if block. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:36 +02:00
Markus Pargmann	9bb218828c	batman-adv: debugfs, avoid compiling for !DEBUG_FS Normally the debugfs framework will return error pointer with -ENODEV for function calls when DEBUG_FS is not set. batman does not notice this error code and continues trying to create debugfs files and executes more code. We can avoid this code execution by disabling compiling debugfs.c when DEBUG_FS is not set. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:35 +02:00
Sven Eckelmann	83e8b87721	batman-adv: Use only queued fragments when merging The fragment queueing code now validates the total_size of each fragment, checks when enough fragments are queued to allow to merge them into a single packet and if the fragments have the correct size. Therefore, it is not required to have any other parameter for the merging function than a list of queued fragments. This change should avoid problems like in the past when the different skb from the list and the function parameter were mixed incorrectly. Signed-off-by: Sven Eckelmann <sven@narfation.org> Acked-by: Martin Hundebøll <martin@hundeboll.net> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:35 +02:00
Sven Eckelmann	53e771457e	batman-adv: Check total_size when queueing fragments The fragmentation code was replaced in `610bfc6bc9` ("batman-adv: Receive fragmented packets and merge") by an implementation which handles the queueing+merging of fragments based on their size and the total_size of the non-fragmented packet. This total_size is announced by each fragment. The new implementation doesn't check if the total_size information of the packets inside one chain is consistent. This consistency check is recommended to allow using any of the packets in the queue to decide whether all fragments of a packet are received or not. Signed-off-by: Sven Eckelmann <sven@narfation.org> Acked-by: Martin Hundebøll <martin@hundeboll.net> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:35 +02:00
Sven Eckelmann	9f6446c7f9	batman-adv: update copyright years for 2015 Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>	2015-05-29 10:13:35 +02:00
Simon Wunderlich	70e717762d	batman-adv: Start new development cycle Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2015-05-29 10:13:35 +02:00
David S. Miller	5aab0e8a45	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2015-05-28 1) Fix a race in xfrm_state_lookup_byspi, we need to take the refcount before we release xfrm_state_lock. From Li RongQing. 2) Fix IV generation on ESN state. We used just the low order sequence numbers for IV generation on ESN, as a result the IV can repeat on the same state. Fix this by using the high order sequence number bits too and make sure to always initialize the high order bits with zero. These patches are serious stable candidates. Fixes from Herbert Xu. 3) Fix the skb->mark handling on vti. We don't reset skb->mark in skb_scrub_packet anymore, so vti must care to restore the original value back after it was used to lookup the vti policy and state. Fixes from Alexander Duyck. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-28 20:41:35 -07:00
David S. Miller	a74eab639e	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2015-05-28 1) Remove xfrm_queue_purge as this is the same as skb_queue_purge. 2) Optimize policy and state walk. 3) Use a sane return code if afinfo registration fails. 4) Only check for an acquire state if the state is not valid. 5) Remove an unnecessary NULL check before xfrm_pol_hold as it checks the input for NULL. 6) Return directly if the xfrm hold queue is empty, avoid taking a lock as there is nothing to do in this case. 7) Optimize the inexact policy search and allow for matching of policies with priority ~0U. All from Li RongQing. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-28 20:23:01 -07:00
Alexander Duyck	d55c670cbc	ip_vti/ip6_vti: Preserve skb->mark after rcv_cb call The vti6_rcv_cb and vti_rcv_cb calls were leaving the skb->mark modified after completing the function. This resulted in the original skb->mark value being lost. Since we only need skb->mark to be set for xfrm_policy_check we can pull the assignment into the rcv_cb calls and then just restore the original mark after xfrm_policy_check has been completed. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2015-05-28 06:23:32 +02:00
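
The save-and-restore pattern described in commit d55c670cbc above can be illustrated with a small sketch. This is a Python stand-in, not the kernel code: `Packet`, `temporary_mark`, and the mark values are hypothetical names chosen for illustration, and the `with` body stands in for the point where the commit message says xfrm_policy_check runs.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class Packet:
    """Hypothetical stand-in for an skb; only the mark field is modelled."""
    mark: int


@contextmanager
def temporary_mark(pkt: Packet, mark: int):
    """Override pkt.mark for the duration of a check, then restore it."""
    saved = pkt.mark          # remember the original mark
    pkt.mark = mark           # value needed only while the check runs
    try:
        yield pkt
    finally:
        pkt.mark = saved      # the original mark survives the call


pkt = Packet(mark=7)
with temporary_mark(pkt, 42):
    mark_during_check = pkt.mark   # a policy lookup would see 42 here
mark_after_check = pkt.mark        # back to 7 once the check is done
```

The point of the pattern is that the caller never observes the temporary value: whatever happens inside the check, the mark is put back before control returns.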
Alexander Duyck	049f8e2e28	xfrm: Override skb->mark with tunnel->parm.i_key in xfrm_input This change makes it so that if a tunnel is defined we just use the mark from the tunnel instead of the mark from the skb header. By doing this we can avoid the need to set skb->mark inside of the tunnel receive functions. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2015-05-28 06:23:31 +02:00
Alexander Duyck	cd5279c194	ip_vti/ip6_vti: Do not touch skb->mark on xmit Instead of modifying skb->mark we can simply modify the flowi_mark that is generated as a result of the xfrm_decode_session. By doing this we don't need to actually touch the skb->mark and it can be preserved as it passes out through the tunnel. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2015-05-28 06:23:31 +02:00
Herbert Xu	957e0fe629	mac80211: Switch to new AEAD interface This patch makes use of the new AEAD interface which uses a single SG list instead of separate lists for the AD and plain text. Tested-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2015-05-28 11:23:20 +08:00
Herbert Xu	25528fdae4	mac802154: Switch to new AEAD interface This patch makes use of the new AEAD interface which uses a single SG list instead of separate lists for the AD and plain text. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2015-05-28 11:23:20 +08:00
Herbert Xu	000ae7b269	esp6: Switch to new AEAD interface This patch makes use of the new AEAD interface which uses a single SG list instead of separate lists for the AD and plain text. The IV generation is also now carried out through normal AEAD methods. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2015-05-28 11:23:20 +08:00
Herbert Xu	7021b2e1cd	esp4: Switch to new AEAD interface This patch makes use of the new AEAD interface which uses a single SG list instead of separate lists for the AD and plain text. The IV generation is also now carried out through normal AEAD methods. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2015-05-28 11:23:20 +08:00
Herbert Xu	69b0137f61	ipsec: Add IV generator information to xfrm_state This patch adds IV generator information to xfrm_state. This is currently obtained from our own list of algorithm descriptions. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2015-05-28 11:23:20 +08:00
Herbert Xu	165ecc6373	xfrm: Add IV generator information to xfrm_algo_desc This patch adds IV generator information for each AEAD and block cipher to xfrm_algo_desc. This will be used to access the new AEAD interface. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2015-05-28 11:23:19 +08:00
Luis R. Rodriguez	9c27847dda	kernel/params: constify struct kernel_param_ops uses Most code already uses consts for the struct kernel_param_ops, sweep the kernel for the last offending stragglers. Other than include/linux/moduleparam.h and kernel/params.c all other changes were generated with the following Coccinelle SmPL patch. Merge conflicts between trees can be handled with Coccinelle. In the future git could get Coccinelle merge support to deal with patch --> fail --> grammar --> Coccinelle --> new patch conflicts automatically for us on patches where the grammar is available and the patch is of high confidence. Consider this a feature request. Test compiled on x86_64 against: * allnoconfig * allmodconfig * allyesconfig @ const_found @ identifier ops; @@ const struct kernel_param_ops ops = { }; @ const_not_found depends on !const_found @ identifier ops; @@ -struct kernel_param_ops ops = { +const struct kernel_param_ops ops = { }; Generated-by: Coccinelle SmPL Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Junio C Hamano <gitster@pobox.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: cocci@systeme.lip6.fr Cc: linux-kernel@vger.kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2015-05-28 11:32:10 +09:30
Linus Torvalds	8f98bcdf8f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Don't use MMIO on certain iwlwifi devices otherwise we get a firmware crash. 2) Don't corrupt the GRO lists of mac80211 contexts by doing sends via timer interrupt, from Johannes Berg. 3) SKB tailroom is miscalculated in AP_VLAN crypto code, from Michal Kazior. 4) Fix fw_status memory leak in iwlwifi, from Haim Dreyfuss. 5) Fix use after free in iwl_mvm_d0i3_enable_tx(), from Eliad Peller. 6) JIT'ing of large BPF programs is broken on x86, from Alexei Starovoitov. 7) EMAC driver ethtool register dump size is miscalculated, from Ivan Mikhaylov. 8) Fix PHY initial link mode when autonegotiation is disabled in amd-xgbe, from Tom Lendacky. 9) Fix NULL deref on SOCK_DEAD socket in AF_UNIX and CAIF protocols, from Mark Salyzyn. 10) credit_bytes not initialized properly in xen-netback, from Ross Lagerwall. 11) Fallback from MSI-X to INTx interrupts not handled properly in mlx4 driver, fix from Benjamin Poirier. 12) Perform ->attach() after binding dev->qdisc in packet scheduler, otherwise we can crash. From Cong WANG. 13) Don't clobber data in sctp_v4_map_v6(). From Jason Gunthorpe. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (30 commits) sctp: Fix mangled IPv4 addresses on a IPv6 listening socket net_sched: invoke ->attach() after setting dev->qdisc xen-netfront: properly destroy queues when removing device mlx4_core: Fix fallback from MSI-X to INTx xen/netback: Properly initialize credit_bytes net: netxen: correct sysfs bin attribute return code tools: bpf_jit_disasm: fix segfault on disabled debugging log output unix/caif: sk_socket can disappear when state is unlocked amd-xgbe-phy: Fix initial mode when autoneg is disabled net: dp83640: fix improper double spin locking. net: dp83640: reinforce locking rules. net: dp83640: fix broken calibration routine. 
net: stmmac: create one debugfs dir per net-device net/ibm/emac: fix size of emac dump memory areas x86: bpf_jit: fix compilation of large bpf programs net: phy: bcm7xxx: Fix 7425 PHY ID and flags iwlwifi: mvm: avoid use-after-free on iwl_mvm_d0i3_enable_tx() iwlwifi: mvm: clean net-detect info if device was reset during suspend iwlwifi: mvm: take the UCODE_DOWN reference when resuming iwlwifi: mvm: BT Coex - duplicate the command if sent ASYNC ...	2015-05-27 13:41:13 -07:00
Eric Dumazet	ed2dfd9009	tcp/dccp: warn user for preferred ip_local_port_range After commit `07f4c90062` ("tcp/dccp: try to not exhaust ip_local_port_range in connect()") it is advised to have an even number of ports described in /proc/sys/net/ipv4/ip_local_port_range This means start/end values should have a different parity. Let's warn sysadmins of this, so that they can update their settings if they want to. Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-27 14:35:36 -04:00
Eric Dumazet	e2baad9e4b	tcp: connect() from bound sockets can be faster __inet_hash_connect() does not use its third argument (port_offset) if socket was already bound to a source port. No need to perform useless but expensive md5 computations. Reported-by: Crestez Dan Leonard <cdleonard@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-27 14:30:10 -04:00
WANG Cong	86e363dc3b	net_sched: invoke ->attach() after setting dev->qdisc For mq qdisc, we add per tx queue qdisc to root qdisc for display purpose, however, that happens too early, before the new dev->qdisc is finally set, this causes q->list points to an old root qdisc which is going to be freed right before assigning with a new one. Fix this by moving ->attach() after setting dev->qdisc. For the record, this fixes the following crash: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 975 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98() list_del corruption. prev->next should be ffff8800d1998ae8, but was 6b6b6b6b6b6b6b6b CPU: 1 PID: 975 Comm: tc Not tainted 4.1.0-rc4+ #1019 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 0000000000000009 ffff8800d73fb928 ffffffff81a44e7f 0000000047574756 ffff8800d73fb978 ffff8800d73fb968 ffffffff810790da ffff8800cfc4cd20 ffffffff814e725b ffff8800d1998ae8 ffffffff82381250 0000000000000000 Call Trace: [<ffffffff81a44e7f>] dump_stack+0x4c/0x65 [<ffffffff810790da>] warn_slowpath_common+0x9c/0xb6 [<ffffffff814e725b>] ? __list_del_entry+0x5a/0x98 [<ffffffff81079162>] warn_slowpath_fmt+0x46/0x48 [<ffffffff81820eb0>] ? dev_graft_qdisc+0x5e/0x6a [<ffffffff814e725b>] __list_del_entry+0x5a/0x98 [<ffffffff814e72a7>] list_del+0xe/0x2d [<ffffffff81822f05>] qdisc_list_del+0x1e/0x20 [<ffffffff81820cd1>] qdisc_destroy+0x30/0xd6 [<ffffffff81822676>] qdisc_graft+0x11d/0x243 [<ffffffff818233c1>] tc_get_qdisc+0x1a6/0x1d4 [<ffffffff810b5eaf>] ? mark_lock+0x2e/0x226 [<ffffffff817ff8f5>] rtnetlink_rcv_msg+0x181/0x194 [<ffffffff817ff72e>] ? rtnl_lock+0x17/0x19 [<ffffffff817ff72e>] ? rtnl_lock+0x17/0x19 [<ffffffff817ff774>] ? __rtnl_unlock+0x17/0x17 [<ffffffff81855dc6>] netlink_rcv_skb+0x4d/0x93 [<ffffffff817ff756>] rtnetlink_rcv+0x26/0x2d [<ffffffff818544b2>] netlink_unicast+0xcb/0x150 [<ffffffff81161db9>] ? 
might_fault+0x59/0xa9 [<ffffffff81854f78>] netlink_sendmsg+0x4fa/0x51c [<ffffffff817d6e09>] sock_sendmsg_nosec+0x12/0x1d [<ffffffff817d8967>] sock_sendmsg+0x29/0x2e [<ffffffff817d8cf3>] ___sys_sendmsg+0x1b4/0x23a [<ffffffff8100a1b8>] ? native_sched_clock+0x35/0x37 [<ffffffff810a1d83>] ? sched_clock_local+0x12/0x72 [<ffffffff810a1fd4>] ? sched_clock_cpu+0x9e/0xb7 [<ffffffff810def2a>] ? current_kernel_time+0xe/0x32 [<ffffffff810b4bc5>] ? lock_release_holdtime.part.29+0x71/0x7f [<ffffffff810ddebf>] ? read_seqcount_begin.constprop.27+0x5f/0x76 [<ffffffff810b6292>] ? trace_hardirqs_on_caller+0x17d/0x199 [<ffffffff811b14d5>] ? __fget_light+0x50/0x78 [<ffffffff817d9808>] __sys_sendmsg+0x42/0x60 [<ffffffff817d9838>] SyS_sendmsg+0x12/0x1c [<ffffffff81a50e97>] system_call_fastpath+0x12/0x6f ---[ end trace ef29d3fb28e97ae7 ]--- For long term, we probably need to clean up the qdisc_graft() code in case it hides other bugs like this. Fixes: `95dc19299f` ("pkt_sched: give visibility to mq slave qdiscs") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-27 14:09:55 -04:00
Eric Dumazet	07f4c90062	tcp/dccp: try to not exhaust ip_local_port_range in connect() A long-standing problem on busy servers is the tiny available TCP port range (/proc/sys/net/ipv4/ip_local_port_range) and the default sequential allocation of source ports in the connect() system call. If a host has a lot of active TCP sessions, chances are very high that all ports are in use by at least one flow, and subsequent bind(0) attempts fail, or have to scan a big portion of space to find a slot. In this patch, I changed the starting point in __inet_hash_connect() so that we try to favor even [1] ports, leaving odd ports for bind() users. We still perform a sequential search, so there is no guarantee, but if connect() targets are very different, end result is we leave more ports available to bind(), and we spread them all over the range, lowering time for both connect() and bind() to find a slot. This strategy only works well if /proc/sys/net/ipv4/ip_local_port_range covers an even number of ports, i.e. if start/end values have different parity. Therefore, default /proc/sys/net/ipv4/ip_local_port_range was changed to 32768 - 60999 (instead of 32768 - 61000). There is no change on security aspects here, only some poor hashing schemes could be eventually impacted by this change. [1] : The odd/even property depends on ip_local_port_range values parity Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-27 13:30:44 -04:00
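
The parity condition in commit 07f4c90062 above can be checked with a short sketch. This is a minimal illustration, not kernel code: the function name `has_even_port_count` is hypothetical, and the range is assumed to be inclusive at both ends, as the default values quoted in the message suggest.

```python
def has_even_port_count(low: int, high: int) -> bool:
    """Return True if the inclusive range [low, high] holds an even number of ports."""
    if low > high:
        raise ValueError("low must not exceed high")
    count = high - low + 1
    # An inclusive range has an even count exactly when its two end
    # points have different parity (one even, one odd).
    return count % 2 == 0


new_default_ok = has_even_port_count(32768, 60999)   # range quoted as the new default
old_default_ok = has_even_port_count(32768, 61000)   # range quoted as the old default
```

With the two ranges quoted in the message, the new default passes the check and the old one does not, which matches the stated reason for the change.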
Alexander Aring	b69644c1c7	nl802154: add support to set cca ed level This patch adds support for setting the current cca ed level value over nl802154. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Reviewed-by: Varka Bhadram <varkabhadram@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-27 19:29:42 +02:00
Alexander Aring	e4390592a4	nl802154: add support for cca ed level info This patch adds information about the current cca ed level when the phy is dumped over nl802154. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Reviewed-by: Varka Bhadram <varkabhadram@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-27 19:29:42 +02:00
Florian Westphal	d6b915e29f	ip_fragment: don't forward defragmented DF packet We currently always send fragments without DF bit set. Thus, given following setup: mtu1500 - mtu1500:1400 - mtu1400:1280 - mtu1280 A R1 R2 B Where R1 and R2 run linux with netfilter defragmentation/conntrack enabled, then if Host A sent a fragmented packet _with_ DF set to B, R1 will respond with icmp too big error if one of these fragments exceeded 1400 bytes. However, if R1 receives fragment sizes 1200 and 100, it would forward the reassembled packet without refragmenting, i.e. R2 will send an icmp error in response to a packet that was never sent, citing mtu that the original sender never exceeded. The other minor issue is that a refragmentation on R1 will conceal the MTU of R2-B since refragmentation does not set DF bit on the fragments. This modifies ip_fragment so that we track largest fragment size seen both for DF and non-DF packets, and set frag_max_size to the largest value. If the DF fragment size is larger or equal to the non-df one, we will consider the packet a path mtu probe: We set DF bit on the reassembled skb and also tag it with a new IPCB flag to force refragmentation even if skb fits outdev mtu. We will also set DF bit on each fragment in this case. Joint work with Hannes Frederic Sowa. Reported-by: Jesse Gross <jesse@nicira.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-27 13:03:31 -04:00
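
The bookkeeping described in commit d6b915e29f above can be sketched as follows. This is a simplified Python model under stated assumptions, not the kernel implementation: the class `FragQueue` and its field names are hypothetical, fragment sizes are plain integers, and the guard against an empty queue is an addition of this sketch rather than something the message states.

```python
from dataclasses import dataclass


@dataclass
class FragQueue:
    """Hypothetical model of one reassembly queue."""
    max_df_size: int = 0       # largest fragment seen with DF set
    max_nondf_size: int = 0    # largest fragment seen without DF

    def add(self, size: int, df: bool) -> None:
        """Record one received fragment of the given size."""
        if df:
            self.max_df_size = max(self.max_df_size, size)
        else:
            self.max_nondf_size = max(self.max_nondf_size, size)

    @property
    def frag_max_size(self) -> int:
        """Largest fragment size seen, regardless of the DF bit."""
        return max(self.max_df_size, self.max_nondf_size)

    @property
    def is_pmtu_probe(self) -> bool:
        """DF size larger than or equal to the non-DF size: treat as a path MTU probe."""
        # The > 0 check only keeps an empty queue from counting as a probe.
        return self.max_df_size > 0 and self.max_df_size >= self.max_nondf_size


all_df = FragQueue()
all_df.add(1200, df=True)
all_df.add(100, df=True)

mixed = FragQueue()
mixed.add(1200, df=True)
mixed.add(1400, df=False)
```

In this model the `all_df` queue (the 1200 and 100 byte fragments from the example in the message) is classified as a probe, so the reassembled packet would keep DF and be refragmented; the `mixed` queue is not, because a larger non-DF fragment was seen.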
Florian Westphal	c5501eb340	net: ipv4: avoid repeated calls to ip_skb_dst_mtu helper ip_skb_dst_mtu is a small inline helper, but it's called in several places. before: 17061 44 0 17105 42d1 net/ipv4/ip_output.o after: 16805 44 0 16849 41d1 net/ipv4/ip_output.o Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-27 13:03:30 -04:00
Varka Bhadram	dec169eccc	ieee802154: fix typo for file name Signed-off-by: Varka Bhadram <varkab@cdac.in> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-27 13:32:46 +02:00
Varka Bhadram	0f999b09f5	ieee802154: add set transmit power support This patch adds transmission power setting support for IEEE-802.15.4 devices via nl802154. Signed-off-by: Varka Bhadram <varkab@cdac.in> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-27 13:29:25 +02:00
David S. Miller	ffa915d071	ipv4: Fix fib_trie.c build, missing linux/vmalloc.h include. We used to get this indirectly I supposed, but no longer do. Either way, an explicit include should have been done in the first place. net/ipv4/fib_trie.c: In function '__node_free_rcu': >> net/ipv4/fib_trie.c:293:3: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration] vfree(n); ^ net/ipv4/fib_trie.c: In function 'tnode_alloc': >> net/ipv4/fib_trie.c:312:3: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration] return vzalloc(size); ^ >> net/ipv4/fib_trie.c:312:3: warning: return makes pointer from integer without a cast cc1: some warnings being treated as errors Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-27 00:19:03 -04:00
Eric Dumazet	d6a4e26afb	tcp: tcp_tso_autosize() minimum is one packet By making sure sk->sk_gso_max_segs minimal value is one, and sysctl_tcp_min_tso_segs minimal value is one as well, tcp_tso_autosize() will return a non zero value. We can then revert `843925f33f` ("tcp: Do not apply TSO segment limit to non-TSO packets") and save few cpu cycles in fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-26 23:21:29 -04:00
Mark Salyzyn	b48732e4a4	unix/caif: sk_socket can disappear when state is unlocked got a rare NULL pointer dereference in clear_bit Signed-off-by: Mark Salyzyn <salyzyn@android.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> ---- v2: switch to sock_flag(sk, SOCK_DEAD) and added net/caif/caif_socket.c v3: return -ECONNRESET in upstream caller of wait function for SOCK_DEAD Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-26 23:19:29 -04:00
Eric Dumazet	095dc8e0c3	tcp: fix/cleanup inet_ehash_locks_alloc() If tcp ehash table is constrained to a very small number of buckets (eg boot parameter thash_entries=128), then we can crash if spinlock array has more entries. While we are at it, un-inline inet_ehash_locks_alloc() and make following changes : - Budget 2 cache lines per cpu worth of 'spinlocks' - Try to kmalloc() the array to avoid extra TLB pressure. (Most servers at Google allocate 8192 bytes for this hash table) - Get rid of various #ifdef Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-26 19:48:46 -04:00
Jon Paul Maloy	f3903bcc00	tipc: fix bug in link protocol message create function In commit `dd3f9e70f5` ("tipc: add packet sequence number at instant of transmission") we made a change with the consequence that packets in the link backlog queue don't contain valid sequence numbers. However, when we create a link protocol message, we still use the sequence number of the first packet in the backlog, if there is any, as "next_sent" indicator in the message. This may entail unnecessary retransmissions or stale packet transmission when there is very low traffic on the link. This commit fixes this issue by only using the current value of tipc_link::snd_nxt as indicator. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-26 19:43:03 -04:00
David S. Miller	fe9066ade6	Merge tag 'mac80211-for-davem-2015-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== We have three more fixes: * AP_VLAN tailroom calculation fix, the bug leads to warnings along with dropped packets * NAPI context issue, calling napi_gro_receive() from a timer (obviously) can lead to crashes * remain-on-channel combining leads to dropped requests and not being able to finish certain operations, so remove it ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-26 19:38:53 -04:00
Alexander Aring	fc4f805243	nl802154: fix cca mode wpan phy flag This patch fixes the handling for calling the cca mode setting. If the phy isn't flagged, then the driver doesn't support this setting. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Reported-by: Varka Bhadram <varkabhadram@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-26 23:45:15 +02:00
Lennert Buytenhek	641459ca33	mac802154: mac802154_mlme_start_req() optimisation. mac802154_mlme_start_req() calls ieee802154_mlme_ops(dev)->llsec->set_params() on the net_device passed into it, however, this net_device will always be a mac802154 net_device, so just call mac802154_set_params() directly instead. Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-26 20:26:10 +02:00
Lennert Buytenhek	66a3297f6d	ieee802154 socket: No need to check for ARPHRD_IEEE802154 in raw_bind(). ieee802154_get_dev() only returns devices that have dev->type == ARPHRD_IEEE802154, therefore, there is no need to check this again in raw_bind(). Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-26 20:26:10 +02:00
Lennert Buytenhek	01c8d2bbd4	ieee802154: Remove 802.15.4/6LoWPAN checks for interface MTU. In the past, 802.15.4 interfaces and 6LoWPAN interfaces used the same dev->type (ARPHRD_IEEE802154), and 802.15.4 interfaces were distinguished from 6LoWPAN interfaces by their differing dev->mtu. 6LoWPAN interfaces have their own ARPHRD type now, so there is no longer any need to check dev->mtu to distinguish 802.15.4 devices from 6LoWPAN devices. Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-26 20:26:09 +02:00
Pablo Neira Ayuso	ed6c4136f1	netfilter: nf_tables: add netdev table to filter from ingress This allows us to create netdev tables that contain ingress chains. Use skb_header_pointer() as we may see shared sk_buffs at this stage. This change provides access to the existing nf_tables features from the ingress hook. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-26 18:41:23 +02:00
Pablo Neira Ayuso	ebddf1a8d7	netfilter: nf_tables: allow to bind table to net_device This patch adds the internal NFT_AF_NEEDS_DEV flag to indicate that you must attach this table to a net_device. This change is required by the follow up patch that introduces the new netdev table. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-26 18:41:17 +02:00
Pablo Neira Ayuso	529985de20	netfilter: default CONFIG_NETFILTER_INGRESS to y Useful to compile-test all options. Suggested-by: Alexei Stavoroitov <ast@plumgrid.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-26 18:41:06 +02:00
Florian Westphal	2f06550b3b	netfilter: remove unused comefrom hookmask argument Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-26 18:40:30 +02:00
Lennert Buytenhek	c032705ebf	ieee802154 socket: Return EMSGSIZE from raw_sendmsg() if packet too big. The proper return code for trying to send a packet that exceeds the outgoing interface's MTU is EMSGSIZE, not EINVAL, so patch ieee802154's raw_sendmsg() to do the right thing. (Its dgram_sendmsg() was already returning EMSGSIZE for this case.) Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-26 18:07:39 +02:00
Lennert Buytenhek	e34fd879f5	mac802154: Avoid rtnl deadlock in mac802154_wpan_ioctl(). ->ndo_do_ioctl() can be entered with the rtnl lock already held, for example when sending a wext ioctl to a device (in which case the rtnl lock is taken by wext_ioctl_dispatch()), but mac802154_wpan_ioctl() currently unconditionally takes the rtnl lock on entry, which can cause deadlocks. To fix this, bail out of mac802154_wpan_ioctl() before taking the rtnl lock if the ioctl cmd is not one of the cmds we implement. Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-26 18:03:01 +02:00
Johannes Berg	80279fb7ba	cfg80211: properly send NL80211_ATTR_DISCONNECTED_BY_AP in disconnect When we disconnect from the AP, drivers call cfg80211_disconnect(). This doesn't know whether the disconnection was initiated locally or by the AP though, which can cause problems with the supplicant, for example with WPS. This issue obviously doesn't show up with any mac80211 based driver since mac80211 doesn't call this function. Fix this by requiring drivers to indicate whether the disconnect is locally generated or not. I've tried to update the drivers, but may not have gotten the values correct, and some drivers may currently not be able to report correct values. In case of doubt I left it at false, which is the current behaviour. For libertas, make adjustments as indicated by Dan Williams. Reported-by: Matthieu Mauger <matthieux.mauger@intel.com> Tested-by: Matthieu Mauger <matthieux.mauger@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-05-26 15:21:27 +02:00
Geert Uytterhoeven	069d4a7b58	netfilter: ebtables: fix comment grammar s/stongly inspired on/strongly inspired by/ Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2015-05-26 15:10:01 +02:00
Eric Dumazet	05c985436d	net: fix inet_proto_csum_replace4() sparse errors make C=2 CF=-D__CHECK_ENDIAN__ net/core/utils.o ... net/core/utils.c:307:72: warning: incorrect type in argument 2 (different base types) net/core/utils.c:307:72: expected restricted __wsum [usertype] addend net/core/utils.c:307:72: got restricted __be32 [usertype] from net/core/utils.c:308:34: warning: incorrect type in argument 2 (different base types) net/core/utils.c:308:34: expected restricted __wsum [usertype] addend net/core/utils.c:308:34: got restricted __be32 [usertype] to net/core/utils.c:310:70: warning: incorrect type in argument 2 (different base types) net/core/utils.c:310:70: expected restricted __wsum [usertype] addend net/core/utils.c:310:70: got restricted __be32 [usertype] from net/core/utils.c:310:77: warning: incorrect type in argument 2 (different base types) net/core/utils.c:310:77: expected restricted __wsum [usertype] addend net/core/utils.c:310:77: got restricted __be32 [usertype] to net/core/utils.c:312:72: warning: incorrect type in argument 2 (different base types) net/core/utils.c:312:72: expected restricted __wsum [usertype] addend net/core/utils.c:312:72: got restricted __be32 [usertype] from net/core/utils.c:313:35: warning: incorrect type in argument 2 (different base types) net/core/utils.c:313:35: expected restricted __wsum [usertype] addend net/core/utils.c:313:35: got restricted __be32 [usertype] to Note we can use csum_replace4() helper Fixes: `58e3cac561` ("net: optimise inet_proto_csum_replace4()") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 22:56:47 -04:00
Eric Dumazet	68319052d1	net: remove a sparse error in secure_dccpv6_sequence_number() make C=2 CF=-D__CHECK_ENDIAN__ net/core/secure_seq.o net/core/secure_seq.c:157:50: warning: restricted __be32 degrades to integer Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 22:55:37 -04:00
Florian Grandel	f72186d22a	Bluetooth: mgmt: fix typos A few comments had minor typos. These are being fixed. Signed-off-by: Florian Grandel <fgrandel@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-26 03:57:56 +02:00
Wilson Kok	eb8d7baae2	bridge: skip fdb add if the port shouldn't learn Check in fdb_add_entry() if the source port should learn, similar check is used in br_fdb_update. Note that new fdb entries which are added manually or as local ones are still permitted. This patch has been tested by running traffic via a bridge port and switching the port's state, also by manually adding/removing entries from the bridge's fdb. Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 20:29:54 -04:00
Eric Dumazet	d496958145	pktgen: remove one sparse error net/core/pktgen.c:2672:43: warning: incorrect type in assignment (different base types) net/core/pktgen.c:2672:43: expected unsigned short [unsigned] [short] [usertype] <noident> net/core/pktgen.c:2672:43: got restricted __be16 [usertype] protocol Let's use proper struct ethhdr instead of hard coding everything. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 20:27:50 -04:00
Eric Dumazet	7f1598678d	ipv6: ipv6_select_ident() returns a __be32 ipv6_select_ident() returns a 32bit value in network order. Fixes: `286c2349f6` ("ipv6: Clean up ipv6_select_ident() and ip6_fragment()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: kbuild test robot <fengguang.wu@intel.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 20:27:11 -04:00
Nicholas Mc Guire	005e8709c6	irda: use msecs_to_jiffies for conversion to jiffies API compliance scanning with coccinelle flagged: ./net/irda/timer.c:63:35-37: use of msecs_to_jiffies probably perferable Converting milliseconds to jiffies by "val * HZ / 1000" technically is not a clean solution as it does not handle all corner cases correctly. By changing the conversion to use msecs_to_jiffies(val) conversion is correct in all cases. Further the () around the arithmetic expression was dropped. Patch was compile tested for x86_64_defconfig + CONFIG_IRDA=m Patch is against 4.1-rc4 (localversion-next is -next-20150522) Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 17:46:21 -04:00
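The corner case behind the irda commit above is easiest to see with small timeouts. The following is a minimal userspace sketch in Python, not kernel code: `naive_jiffies` and `round_up_jiffies` are hypothetical names, and the second function only models rounding up to the next whole tick. It does not reproduce the other cases the kernel helper handles, such as negative or very large values.

```python
def naive_jiffies(ms, hz):
    """Open-coded conversion "val * HZ / 1000" with integer division.

    Truncates, so a short non-zero timeout can become 0 ticks.
    """
    return ms * hz // 1000


def round_up_jiffies(ms, hz):
    """Simplified model of a conversion that rounds up to a whole tick.

    Assumption: only the rounding behaviour is modelled here.
    """
    return (ms * hz + 999) // 1000
```

With HZ=100, a 5 ms timeout becomes 0 ticks under the open-coded form but 1 tick when rounding up, while a 1000 ms timeout is 100 ticks in both.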
Linus Lüssing	6ae4ae8e51	bridge: allow setting hash_max + multicast_router if interface is down Network managers like netifd (used in OpenWRT for instance) try to configure interface options after creation but before setting the interface up. Unfortunately the sysfs / bridge currently only allows configuring the hash_max and multicast_router options when the bridge interface is up. But since br_multicast_init() doesn't start any timers and only sets default values and initializes timers, it should be safe to reconfigure the default values after that, before things actually get active after the bridge is set up. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 17:28:01 -04:00
Florian Westphal	485fca664d	ipv6: don't increase size when refragmenting forwarded ipv6 skbs since commit `6aafeef03b` ("netfilter: push reasm skb through instead of original frag skbs") we will end up sometimes re-fragmenting skbs that we've reassembled. ipv6 defrag preserves the original skbs using the skb frag list, i.e. as long as the skb frag list is preserved there is no problem since we keep original geometry of fragments intact. However, in the rare case where the frag list is munged or skb is linearized, we might send larger fragments than what we originally received. A router in the path might then send packet-too-big errors even if sender never sent fragments exceeding the reported mtu: mtu 1500 - 1500:1400 - 1400:1280 - 1280 A R1 R2 B 1 - A sends to B, fragment size 1400 2 - R2 sends pkttoobig error for 1280 3 - A sends to B, fragment size 1280 4 - R2 sends pkttoobig error for 1280 again because it sees fragments of size 1400. make sure ip6_fragment always caps MTU at largest packet size seen when defragmented skb is forwarded. Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 17:22:23 -04:00
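The capping rule from the ipv6 refragmentation commit above can be sketched as follows. This is a simplified userspace model in Python under stated assumptions: `refragment_mtu` is a hypothetical name, a `frag_max_size` of 0 is taken to mean "the packet was not reassembled from fragments", and the sketch leaves out any error path for the case where a received fragment was already larger than the outgoing path MTU.

```python
def refragment_mtu(path_mtu, frag_max_size):
    """Pick the fragment size limit for a forwarded, defragmented packet.

    path_mtu: MTU of the outgoing path.
    frag_max_size: largest fragment seen during reassembly, or 0 if the
        packet was never fragmented (assumption of this sketch).
    """
    if frag_max_size == 0:
        # Nothing was reassembled, so only the path MTU applies.
        return path_mtu
    # Never emit fragments larger than the ones originally received.
    return min(path_mtu, frag_max_size)
```

In the scenario from the commit message, once A sends 1280-byte fragments, R1 (outgoing MTU 1400) would refragment at `refragment_mtu(1400, 1280)`, that is 1280, so R2 no longer sees 1400-byte fragments.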
Martin KaFai Lau	d52d3997f8	ipv6: Create percpu rt6_info After the patch 'ipv6: Only create RTF_CACHE routes after encountering pmtu exception', we need to compensate the performance hit (bouncing dst->__refcnt). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:35 -04:00
Martin KaFai Lau	83a09abd1a	ipv6: Break up ip6_rt_copy() This patch breaks up ip6_rt_copy() into ip6_rt_copy_init() and ip6_rt_cache_alloc(). In the later patch, we need to create a percpu rt6_info copy. Hence, refactor the common rt6_info init codes to ip6_rt_copy_init(). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:34 -04:00
Martin KaFai Lau	8d0b94afdc	ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister This patch keeps track of the DST_NOCACHE routes in a list and replaces its dev with loopback during the iface down/unregister event. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:34 -04:00
Martin KaFai Lau	3da59bd945	ipv6: Create RTF_CACHE clone when FLOWI_FLAG_KNOWN_NH is set This patch always creates RTF_CACHE clone with DST_NOCACHE when FLOWI_FLAG_KNOWN_NH is set so that the rt6i_dst is set to the fl6->daddr. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Julian Anastasov <ja@ssi.bg> Tested-by: Julian Anastasov <ja@ssi.bg> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:34 -04:00
Martin KaFai Lau	48e8aa6e31	ipv6: Set FLOWI_FLAG_KNOWN_NH at flowi6_flags The neighbor look-up used to depend on the rt6i_gateway (if there is a gateway) or the rt6i_dst (if it is a RTF_CACHE clone) as the nexthop address. Note that rt6i_dst is set to fl6->daddr for the RTF_CACHE clone where fl6->daddr is the one used to do the route look-up. Now, we only create RTF_CACHE clone after encountering exception. When doing the neighbor look-up with a route that is neither a gateway nor a RTF_CACHE clone, the daddr in skb will be used as the nexthop. In some cases, the daddr in skb is not the one used to do the route look-up. One example is in ip_vs_dr_xmit_v6() where the real nexthop server address is different from the one in the skb. This patch is going to follow the IPv4 approach and ask the ip6_pol_route() callers to set the FLOWI_FLAG_KNOWN_NH properly. In the next patch, ip6_pol_route() will honor the FLOWI_FLAG_KNOWN_NH and create a RTF_CACHE clone. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Julian Anastasov <ja@ssi.bg> Tested-by: Julian Anastasov <ja@ssi.bg> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:34 -04:00
Martin KaFai Lau	b197df4f0f	ipv6: Add rt6_get_cookie() function Instead of doing the rt6->rt6i_node check whenever we need to get the route's cookie. Refactor it into rt6_get_cookie(). It is a prep work to handle FLOWI_FLAG_KNOWN_NH and also percpu rt6_info later. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:34 -04:00
Martin KaFai Lau	45e4fd2668	ipv6: Only create RTF_CACHE routes after encountering pmtu exception This patch creates RTF_CACHE routes only after encountering a pmtu exception. After ip6_rt_update_pmtu() has inserted the RTF_CACHE route to the fib6 tree, the rt->rt6i_node->fn_sernum is bumped which will fail the ip6_dst_check() and trigger a relookup. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:33 -04:00
Martin KaFai Lau	8b9df26577	ipv6: Combine rt6_alloc_cow and rt6_alloc_clone A prep work for creating RTF_CACHE on exception only. After this patch, the same condition (rt->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY)) is checked twice. This redundancy will be removed in the later patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:33 -04:00
Martin KaFai Lau	2647a9b070	ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST When creating a RTF_CACHE route, RTF_ANYCAST is set based on rt6i_dst. Also, rt6i_gateway is always set to the nexthop while the nexthop could be a gateway or the rt6i_dst.addr. After removing the rt6i_dst and rt6i_src dependency in the last patch, we also need to stop the caller from depending on rt6i_gateway and RTF_ANYCAST. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:33 -04:00
Martin KaFai Lau	fd0273d793	ipv6: Remove external dependency on rt6i_dst and rt6i_src This patch removes the assumptions that the returned rt is always a RTF_CACHE entry with the rt6i_dst and rt6i_src containing the destination and source address. The dst and src can be recovered from the calling site. We may consider to rename (rt6i_dst, rt6i_src) to (rt6i_key_dst, rt6i_key_src) later. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:32 -04:00
Martin KaFai Lau	286c2349f6	ipv6: Clean up ipv6_select_ident() and ip6_fragment() This patch changes the ipv6_select_ident() signature to return a fragment id instead of taking a whole frag_hdr as a param to only set the frag_hdr->identification. It also cleans up ip6_fragment() to obtain the fragment id at the beginning instead of using multiple "if" later to check fragment id has been generated or not. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 13:25:32 -04:00
Florian Westphal	cf82624432	ip: reject too-big defragmented DF-skb when forwarding Send icmp pmtu error if we find that the largest fragment of df-skb exceeded the output path mtu. The ip output path will still catch this later on but we can avoid the forward/postrouting hook traversal by rejecting right away. This is what ipv6 already does. Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 00:08:48 -04:00
Hannes Frederic Sowa	2b514574f7	net: af_unix: implement splice for stream af_unix sockets unix_stream_recvmsg is refactored to unix_stream_read_generic in this patch and enhanced to deal with pipe splicing. The refactoring is inneglible, we mostly have to deal with a non-existing struct msghdr argument. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 00:06:59 -04:00
Hannes Frederic Sowa	a60e3cc7c9	net: make skb_splice_bits more configureable Prepare skb_splice_bits to be able to deal with AF_UNIX sockets. AF_UNIX sockets don't use lock_sock/release_sock and thus we have to use a callback to make the locking and unlocking configureable. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 00:06:59 -04:00
Hannes Frederic Sowa	869e7c6248	net: af_unix: implement stream sendpage support This patch implements sendpage support for AF_UNIX SOCK_STREAM sockets. This is also required for a complete splice implementation. The implementation is a bit tricky because we append to already existing skbs and so have to hold unix_sk->readlock to protect the reading side from either advancing UNIXCB.consumed or freeing the skb at the socket receive tail. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 00:06:58 -04:00
Hannes Frederic Sowa	be12a1fe29	net: skbuff: add skb_append_pagefrags and use it Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 00:06:58 -04:00
Linus Torvalds	086e8ddb56	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull two Ceph fixes from Sage Weil: "These fix an issue with the RBD notifications when there are topology changes in the cluster" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: Revert "libceph: clear r_req_lru_item in __unregister_linger_request()" libceph: request a new osdmap if lingering request maps to no osd	2015-05-23 11:28:25 -07:00
Alexander Aring	c947f7e1e3	mac802154: remove mib lock This patch removes the mib lock. The new locking mechanism is to protect the mib values with the rtnl lock. Note that this isn't always necessary: if an interface is up, most mib values are read-only (e.g. address settings). With this behaviour we can remove locking in hot paths like frame parsing completely. Whether we need to hold the rtnl lock depends on context; this makes the callbacks of ieee802154_mlme_ops unnecessary, because these callbacks always hold the locks. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-23 17:57:08 +02:00
Alexander Aring	344f8c119d	mac802154: use atomic ops for sequence incrementation This patch will use atomic operations for sequence number incrementation while MAC header generation. Upper layers like af_802154 or 6LoWPAN could call this function in a parallel context while generating 802.15.4 MAC header before queuing into wpan interfaces transmit queue. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-23 17:57:08 +02:00
Alexander Aring	4a3a8c0c3a	mac802154: remove pib lock This patch removes the pib lock which is now replaced by rtnl lock. The new interface already use the rtnl lock only. Nevertheless this patch will fix issues while using new and old interface at the same time. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-23 17:57:08 +02:00
Alexander Aring	4a669f7d72	mac802154: fix hold rtnl while ioctl This patch fixes an issue with setting the address configuration via ioctl. Accessing the mib requires the rtnl lock, and ndo_do_ioctl doesn't hold the rtnl lock when this callback is called. This patch does that manually. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Reported-by: Matteo Petracca <matteo.petracca@sssup.it> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-23 17:57:07 +02:00
David S. Miller	36583eb54d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/cadence/macb.c drivers/net/phy/phy.c include/linux/skbuff.h net/ipv4/tcp.c net/switchdev/switchdev.c Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD} renaming overlapping with net-next changes of various sorts. phy.c was a case of two changes, one adding a local variable to a function whilst the second was removing one. tcp.c overlapped a deadlock fix with the addition of new tcp_info statistic values. macb.c involved the addition of two zyncq device entries. skbuff.h involved adding back ipv4_daddr to nf_bridge_info whilst net-next changes put two other existing members of that struct into a union. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-23 01:22:35 -04:00
Jesper Dangaard Brouer	4020726479	pktgen: make /proc/net/pktgen/pgctrl report fail on invalid input Giving /proc/net/pktgen/pgctrl an invalid command just returns shell success and prints a warning in dmesg. This is not very useful for shell scripting, as a script can only detect the error by parsing dmesg. Instead return -EINVAL when the command is unknown, as this provides userspace shell scripting a way of detecting this. Also bump version tag to 2.75, because (1) reading /proc/net/pktgen/pgctrl outputs this version number, which allows detecting this small semantic change, and (2) the pktgen version tag has not been updated since 2010. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 23:59:16 -04:00
Jesper Dangaard Brouer	d079abd181	pktgen: adjust spacing in proc file interface output Too many spaces were introduced in commit `63adc6fb8a` ("pktgen: cleanup checkpatch warnings"), thus misaligning "src_min:" to other columns. Fixes: `63adc6fb8a` ("pktgen: cleanup checkpatch warnings") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 23:59:16 -04:00
Eric Dumazet	93a33a584e	bridge: fix lockdep splat Following lockdep splat was reported : [ 29.382286] =============================== [ 29.382315] [ INFO: suspicious RCU usage. ] [ 29.382344] 4.1.0-0.rc0.git11.1.fc23.x86_64 #1 Not tainted [ 29.382380] ------------------------------- [ 29.382409] net/bridge/br_private.h:626 suspicious rcu_dereference_check() usage! [ 29.382455] other info that might help us debug this: [ 29.382507] rcu_scheduler_active = 1, debug_locks = 0 [ 29.382549] 2 locks held by swapper/0/0: [ 29.382576] #0: (((&p->forward_delay_timer))){+.-...}, at: [<ffffffff81139f75>] call_timer_fn+0x5/0x4f0 [ 29.382660] #1: (&(&br->lock)->rlock){+.-...}, at: [<ffffffffa0450dc1>] br_forward_delay_timer_expired+0x31/0x140 [bridge] [ 29.382754] stack backtrace: [ 29.382787] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.1.0-0.rc0.git11.1.fc23.x86_64 #1 [ 29.382838] Hardware name: LENOVO 422916G/LENOVO, BIOS A1KT53AUS 04/07/2015 [ 29.382882] 0000000000000000 3ebfc20364115825 ffff880666603c48 ffffffff81892d4b [ 29.382943] 0000000000000000 ffffffff81e124e0 ffff880666603c78 ffffffff8110bcd7 [ 29.383004] ffff8800785c9d00 ffff88065485ac58 ffff880c62002800 ffff880c5fc88ac0 [ 29.383065] Call Trace: [ 29.383084] <IRQ> [<ffffffff81892d4b>] dump_stack+0x4c/0x65 [ 29.383130] [<ffffffff8110bcd7>] lockdep_rcu_suspicious+0xe7/0x120 [ 29.383178] [<ffffffffa04520f9>] br_fill_ifinfo+0x4a9/0x6a0 [bridge] [ 29.383225] [<ffffffffa045266b>] br_ifinfo_notify+0x11b/0x4b0 [bridge] [ 29.383271] [<ffffffffa0450d90>] ? br_hold_timer_expired+0x70/0x70 [bridge] [ 29.383320] [<ffffffffa0450de8>] br_forward_delay_timer_expired+0x58/0x140 [bridge] [ 29.383371] [<ffffffffa0450d90>] ? br_hold_timer_expired+0x70/0x70 [bridge] [ 29.383416] [<ffffffff8113a033>] call_timer_fn+0xc3/0x4f0 [ 29.383454] [<ffffffff81139f75>] ? call_timer_fn+0x5/0x4f0 [ 29.383493] [<ffffffff8110a90f>] ? lock_release_holdtime.part.29+0xf/0x200 [ 29.383541] [<ffffffffa0450d90>] ? br_hold_timer_expired+0x70/0x70 [bridge] [ 29.383587] [<ffffffff8113a6a4>] run_timer_softirq+0x244/0x490 [ 29.383629] [<ffffffff810b68cc>] __do_softirq+0xec/0x670 [ 29.383666] [<ffffffff810b70d5>] irq_exit+0x145/0x150 [ 29.383703] [<ffffffff8189f506>] smp_apic_timer_interrupt+0x46/0x60 [ 29.383744] [<ffffffff8189d523>] apic_timer_interrupt+0x73/0x80 [ 29.383782] <EOI> [<ffffffff816f131f>] ? cpuidle_enter_state+0x5f/0x2f0 [ 29.383832] [<ffffffff816f131b>] ? cpuidle_enter_state+0x5b/0x2f0 Problem here is that br_forward_delay_timer_expired() is a timer handler, calling br_ifinfo_notify() which assumes either rcu_read_lock() or RTNL are held. Simplest fix seems to add rcu read lock section. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Josh Boyer <jwboyer@fedoraproject.org> Reported-by: Dominick Grift <dac.override@gmail.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 16:23:56 -04:00
Arun Parameswaran	f96dee13b8	net: core: 'ethtool' issue with querying phy settings When trying to configure the settings for PHY1, using commands like 'ethtool -s eth0 phyad 1 speed 100', the 'ethtool' seems to modify other settings apart from the speed of the PHY1, in the above case. The ethtool seems to query the settings for PHY0, and use this as the base to apply the new settings to the PHY1. This is causing the other settings of the PHY 1 to be wrongly configured. The issue is caused by the '_ethtool_get_settings()' API, which gets called because of the 'ETHTOOL_GSET' command, is clearing the 'cmd' pointer (of type 'struct ethtool_cmd') by calling memset. This clears all the parameters (if any) passed for the 'ETHTOOL_GSET' cmd. So the driver's callback is always invoked with 'cmd->phy_address' as '0'. The '_ethtool_get_settings()' is called from other files in the 'net/core'. So the fix is applied to the 'ethtool_get_settings()' which is only called in the context of the 'ethtool'. Signed-off-by: Arun Parameswaran <aparames@broadcom.com> Reviewed-by: Ray Jui <rjui@broadcom.com> Reviewed-by: Scott Branden <sbranden@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 16:14:17 -04:00
Thadeu Lima de Souza Cascardo	47cc84ce0c	bridge: fix parsing of MLDv2 reports When more than a multicast address is present in a MLDv2 report, all but the first address is ignored, because the code breaks out of the loop if there has not been an error adding that address. This has caused failures when two guests connected through the bridge tried to communicate using IPv6. Neighbor discoveries would not be transmitted to the other guest when both used a link-local address and a static address. This only happens when there is a MLDv2 querier in the network. The fix will only break out of the loop when there is a failure adding a multicast address. The mdb before the patch: dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp dev ovirtmgmt port bond0.86 grp ff02::2 temp After the patch: dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp dev ovirtmgmt port bond0.86 grp ff02::fb temp dev ovirtmgmt port bond0.86 grp ff02::2 temp dev ovirtmgmt port bond0.86 grp ff02::d temp dev ovirtmgmt port vnet0 grp ff02::1:ff00:76 temp dev ovirtmgmt port bond0.86 grp ff02::16 temp dev ovirtmgmt port vnet1 grp ff02::1:ff00:77 temp dev ovirtmgmt port bond0.86 grp ff02::1:ff00:def temp dev ovirtmgmt port bond0.86 grp ff02::1:ffa1:40bf temp Fixes: `08b202b672` ("bridge br_multicast: IPv6 MLD support.") Reported-by: Rik Theys <Rik.Theys@esat.kuleuven.be> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Tested-by: Rik Theys <Rik.Theys@esat.kuleuven.be> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 15:08:20 -04:00
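The loop behaviour described in the MLDv2 fix above can be sketched in plain C. This is a minimal userspace illustration, not the bridge code: `add_group()`, `process_report()`, and the integer "group" values are hypothetical stand-ins, and negative values are assumed to represent an add failure.

```c
#include <stddef.h>

/* Hypothetical stand-in for adding one multicast group to the mdb.
 * Returns 0 on success, a negative value on failure. In this sketch a
 * negative group id is treated as invalid. */
static int add_group(int group, int *mdb, size_t *count)
{
    if (group < 0)
        return -1;
    mdb[(*count)++] = group;
    return 0;
}

/* Walk every record of a report and stop only when an add fails.
 * The caller must supply an mdb array with room for n entries. */
static int process_report(const int *groups, size_t n, int *mdb, size_t *count)
{
    int err = 0;

    for (size_t i = 0; i < n; i++) {
        err = add_group(groups[i], mdb, count);
        if (err)    /* the pre-fix behaviour left the loop when there was no error */
            break;
    }
    return err;
}
```

With the inverted test, a report carrying three valid groups would record only the first one, which matches the short "before the patch" mdb listing in the commit message.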
Michal Kubeček	d4e64c2909	ipv4: fill in table id when replacing a route When replacing an IPv4 route, tb_id member of the new fib_alias structure is not set in the replace code path so that the new route is ignored. Fixes: `0ddcf43d5d` ("ipv4: FIB Local/MAIN table collapse") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 14:33:17 -04:00
David S. Miller	572152adfb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contain Netfilter fixes for your net tree, they are: 1) Fix a race in nfnetlink_log and nfnetlink_queue that can lead to a crash. This problem is due to wrong order in the per-net registration and netlink socket events. Patch from Francesco Ruggeri. 2) Make sure that counters that userspace pass us are higher than 0 in all the x_tables frontends. Discovered via Trinity, patch from Dave Jones. 3) Revert a patch for br_netfilter to rely on the conntrack status bits. This breaks stateless IPv6 NAT transformations. Patch from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 14:25:45 -04:00
Eric W. Biederman	381c759d99	ipv4: Avoid crashing in ip_error ip_error does not check if in_dev is NULL before dereferencing it. The following sequence of calls is possible: CPU A CPU B ip_rcv_finish ip_route_input_noref() ip_route_input_slow() inetdev_destroy() dst_input() With the result that a network device can be destroyed while processing an input packet. A crash was triggered with only unicast packets in flight, and forwarding enabled on the only network device. The error condition was created by the removal of the network device. As such it is likely that the error code was -EHOSTUNREACH, and the action taken by ip_error (if in_dev had been accessible) would have been to not increment any counters and to have tried and likely failed to send an icmp error as the network device is going away. Therefore handle this weird case by just dropping the packet if !in_dev. It will result in dropping the packet sooner, and will not result in an actual change of behavior. Fixes: `251da41301` ("ipv4: Cache ip_error() routes even when not forwarding.") Reported-by: Vittorio Gambaletta <linuxbugs@vittgam.net> Tested-by: Vittorio Gambaletta <linuxbugs@vittgam.net> Signed-off-by: Vittorio Gambaletta <linuxbugs@vittgam.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 14:23:40 -04:00
Jiri Pirko	12c227ec89	flow_dissector: do not break if ports are not needed in flowlabel This restored previous behaviour. If caller does not want ports to be filled, we should not break. Fixes: `06635a35d1` ("flow_dissect: use programable dissector in skb_flow_dissect and friends") Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 13:59:02 -04:00
Eric Dumazet	d654976cbf	tcp: fix a potential deadlock in tcp_get_info() Taking socket spinlock in tcp_get_info() can deadlock, as inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i], while packet processing can use the reverse locking order. We could avoid this locking for TCP_LISTEN states, but lockdep would certainly get confused as all TCP sockets share same lockdep classes. [ 523.722504] ====================================================== [ 523.728706] [ INFO: possible circular locking dependency detected ] [ 523.734990] 4.1.0-dbg-DEV #1676 Not tainted [ 523.739202] ------------------------------------------------------- [ 523.745474] ss/18032 is trying to acquire lock: [ 523.750002] (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360 [ 523.758129] [ 523.758129] but task is already holding lock: [ 523.763968] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0 [ 523.774661] [ 523.774661] which lock already depends on the new lock. [ 523.774661] [ 523.782850] [ 523.782850] the existing dependency chain (in reverse order) is: [ 523.790326] -> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}: [ 523.796599] [<ffffffff811126bb>] lock_acquire+0xbb/0x270 [ 523.802565] [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50 [ 523.808628] [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110 [ 523.815273] [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350 [ 523.822067] [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500 [ 523.828199] [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0 [ 523.834331] [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10 [ 523.840202] [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0 [ 523.847214] [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0 [ 523.853440] [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0 [ 523.859624] [<ffffffff81659db7>] ip_rcv+0x307/0x420 Lets use u64_sync infrastructure instead. As a bonus, 64bit arches get optimized, as these are nop for them. 
Fixes: `0df48c26d8` ("tcp: add tcpi_bytes_acked to tcp_info") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 13:46:06 -04:00
Marcelo Ricardo Leitner	2efd055c53	tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info This patch tracks the total number of inbound and outbound segments on a TCP socket. One may use this number to have an idea on connection quality when compared against the retransmissions. RFC4898 named these : tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut These are a 32bit field each and can be fetched both from TCP_INFO getsockopt() if one has a handle on a TCP socket, or from inet_diag netlink facility (iproute2/ss patch will follow) Note that tp->segs_out was placed near tp->snd_nxt for good data locality and minimal performance impact, while tp->segs_in was placed near tp->bytes_received for the same reason. Joint work with Eric Dumazet. Note that received SYN are accounted on the listener, but sent SYNACK are not accounted. Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-21 23:25:21 -04:00
Florian Westphal	48ed7b26fa	ipv6: reject locally assigned nexthop addresses ip -6 addr add dead::1/128 dev eth0 sleep 5 ip -6 route add default via dead::1/128 -> fails ip -6 addr add dead::1/128 dev eth0 ip -6 route add default via dead::1/128 -> succeeds reason is that if (nonsensical) route above is added, dead::1 is still subject to DAD, so the route lookup will pick eth0 as outdev due to the prefix route that is added before DAD work is started. Add explicit test that checks if nexthop gateway is a local address. Link: https://bugzilla.redhat.com/show_bug.cgi?id=1167969 Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-21 23:23:38 -04:00
Eric Dumazet	946f9eb226	tcp: improve REUSEADDR/NOREUSEADDR cohabitation inet_csk_get_port() randomization effort tends to spread sockets on all the available range (ip_local_port_range) This is unfortunate because SO_REUSEADDR sockets have fewer requirements than non SO_REUSEADDR ones. If an application uses SO_REUSEADDR hint, it is to try to allow source ports being shared. So instead of picking a random port number in ip_local_port_range, let's try first in the first half of the range. This gives more chances to use upper half of the range for the sockets with strong requirements (not using SO_REUSEADDR) Note this patch does not add a new sysctl, and only changes the way we try to pick port number. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <mleitner@redhat.com> Cc: Flavio Leitner <fbl@redhat.com> Acked-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-21 18:55:32 -04:00
Eric Dumazet	f5af1f57a2	inet_hashinfo: remove bsocket counter We no longer need bsocket atomic counter, as inet_csk_get_port() calls bind_conflict() regardless of its value, after commit `2b05ad33e1` ("tcp: bind() fix autoselection to share ports") This patch removes overhead of maintaining this counter and double inet_csk_get_port() calls under pressure. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <mleitner@redhat.com> Cc: Flavio Leitner <fbl@redhat.com> Acked-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-21 18:55:32 -04:00
Jason Baron	ce5ec44099	tcp: ensure epoll edge trigger wakeup when write queue is empty We currently rely on the setting of SOCK_NOSPACE in the write() path to ensure that we wake up any epoll edge trigger waiters when acks return to free space in the write queue. However, if we fail to allocate even a single skb in the write queue, we could end up waiting indefinitely. Fix this by explicitly issuing a wakeup when we detect the condition of an empty write queue and a return value of -EAGAIN. This allows userspace to re-try as we expect this to be a temporary failure. I've tested this approach by artificially making sk_stream_alloc_skb() return NULL periodically. In that case, epoll edge trigger waiters will hang indefinitely in epoll_wait() without this patch. Signed-off-by: Jason Baron <jbaron@akamai.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-21 18:52:47 -04:00
Daniel Borkmann	c78e1746d3	net: sched: fix call_rcu() race on classifier module unloads Vijay reported that a loop as simple as ... while true; do tc qdisc add dev foo root handle 1: prio tc filter add dev foo parent 1: u32 match u32 0 0 flowid 1 tc qdisc del dev foo root rmmod cls_u32 done ... will panic the kernel. Moreover, he bisected the change apparently introducing it to `78fd1d0ab0` ("netlink: Re-add locking to netlink_lookup() and seq walker"). The removal of synchronize_net() from the netlink socket triggering the qdisc to be removed, seems to have uncovered an RCU resp. module reference count race from the tc API. Given that RCU conversion was done after `e341694e3e` ("netlink: Convert netlink_lookup() to use RCU protected hash table") which added the synchronize_net() originally, occasion of hitting the bug was less likely (not impossible though): When qdiscs that i) support attaching classifiers and, ii) have at least one of them attached, get deleted, they invoke tcf_destroy_chain(), and thus call into ->destroy() handler from a classifier module. After RCU conversion, all classifiers that have an internal prio list unlink them and initiate freeing via call_rcu() deferral. Meanwhile, tcf_destroy() already releases the reference to the tp->ops->owner module before the queued RCU callback handler has been invoked. Subsequent rmmod on the classifier module is then not prevented since all module references are already dropped. By the time the kernel invokes the RCU callback handler from the module, that function address is then invalid. One way to fix it would be to add an rcu_barrier() to unregister_tcf_proto_ops() to wait for all pending call_rcu()s to complete. synchronize_rcu() is not appropriate as under heavy RCU callback load, registered call_rcu()s could be deferred longer than a grace period. In case we don't have any pending call_rcu()s, the barrier is allowed to return immediately. 
Since we came here via unregister_tcf_proto_ops(), there are no users of a given classifier anymore. Further nested call_rcu()s pointing into the module space are not being done anywhere. Only cls_bpf_delete_prog() may schedule a work item, to unlock pages eventually, but that is not in the range/context of cls_bpf anymore. Fixes: `25d8c0d55f` ("net: rcu-ify tcf_proto") Fixes: `9888faefe1` ("net: sched: cls_basic use RCU") Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.r.fastabend@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Thomas Graf <tgraf@suug.ch> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-21 18:48:18 -04:00
Alexei Starovoitov	04fd61ab36	bpf: allow bpf programs to tail-call other bpf programs introduce bpf_tail_call(ctx, &jmp_table, index) helper function which can be used from BPF programs like: int bpf_prog(struct pt_regs *ctx) { ... bpf_tail_call(ctx, &jmp_table, index); ... } that is roughly equivalent to: int bpf_prog(struct pt_regs *ctx) { ... if (jmp_table[index]) return (*jmp_table[index])(ctx); ... } The important detail is that it's not a normal call, but a tail call. The kernel stack is precious, so this helper reuses the current stack frame and jumps into another BPF program without adding extra call frame. It's trivially done in interpreter and a bit trickier in JITs. In case of x64 JIT the bigger part of generated assembler prologue is common for all programs, so it is simply skipped while jumping. Other JITs can do similar prologue-skipping optimization or do stack unwind before jumping into the next program. bpf_tail_call() arguments: ctx - context pointer jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table index - index in the jump table Since all BPF programs are identified by file descriptor, user space needs to populate the jmp_table with FDs of other BPF programs. If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere and program execution continues as normal. New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can populate this jmp_table array with FDs of other bpf programs. Programs can share the same jmp_table array or use multiple jmp_tables. The chain of tail calls can form unpredictable dynamic loops therefore tail_call_cnt is used to limit the number of calls and currently is set to 32. Use cases: Acked-by: Daniel Borkmann <daniel@iogearbox.net> ========== - simplify complex programs by splitting them into a sequence of small programs - dispatch routine For tracing and future seccomp the program may be triggered on all system calls, but processing of syscall arguments will be different. 
It's more efficient to implement them as: int syscall_entry(struct seccomp_data *ctx) { bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */); ... default: process unknown syscall ... } int sys_write_event(struct seccomp_data *ctx) {...} int sys_read_event(struct seccomp_data *ctx) {...} syscall_jmp_table[__NR_write] = sys_write_event; syscall_jmp_table[__NR_read] = sys_read_event; For networking the program may call into different parsers depending on packet format, like: int packet_parser(struct __sk_buff *skb) { ... parse L2, L3 here ... __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol)); bpf_tail_call(skb, &ipproto_jmp_table, ipproto); ... default: process unknown protocol ... } int parse_tcp(struct __sk_buff *skb) {...} int parse_udp(struct __sk_buff *skb) {...} ipproto_jmp_table[IPPROTO_TCP] = parse_tcp; ipproto_jmp_table[IPPROTO_UDP] = parse_udp; - for TC use case, bpf_tail_call() allows to implement reclassify-like logic - bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table are atomic, so user space can build chains of BPF programs on the fly Implementation details: ======================= - high performance of bpf_tail_call() is the goal. It could have been implemented without JIT changes as a wrapper on top of BPF_PROG_RUN() macro, but with two downsides: . all programs would have to pay performance penalty for this feature and tail call itself would be slower, since mandatory stack unwind, return, stack allocate would be done for every tailcall. . tailcall would be limited to programs running preempt_disabled, since generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would need to be either global per_cpu variable accessed by helper and by wrapper or global variable protected by locks. In this implementation x64 JIT bypasses stack unwind and jumps into the callee program after prologue. 
- bpf_prog_array_compatible() ensures that prog_type of callee and caller are the same and JITed/non-JITed flag is the same, since calling JITed program from non-JITed is invalid, since stack frames are different. Similarly calling kprobe type program from socket type program is invalid. - jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map' abstraction, its user space API and all of verifier logic. It's in the existing arraymap.c file, since several functions are shared with regular array map. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-21 17:07:59 -04:00
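The tail-call semantics described in the bpf commit above (jump through a table slot, fall through when the slot is empty, stop after a bounded number of jumps) can be modelled in userspace C with a trampoline loop. This is only a sketch under stated assumptions: `struct ctx`, `tail_call()`, `run()`, `TABLE_SIZE`, and the `TAIL_CALL_TAKEN` sentinel are hypothetical names invented here, and the kernel interpreter and JITs implement the jump quite differently (by reusing the current stack frame rather than returning to a loop). Only the limit of 32 is taken from the commit message.

```c
#include <stddef.h>

#define MAX_TAIL_CALL_CNT 32   /* limit named in the commit message */
#define TABLE_SIZE 4           /* hypothetical jump table size */
#define TAIL_CALL_TAKEN (-1)   /* hypothetical sentinel: "jump to ctx->next" */

struct ctx;
typedef int (*prog_fn)(struct ctx *ctx);

struct ctx {
    prog_fn *jmp_table;          /* stands in for a BPF_MAP_TYPE_PROG_ARRAY map */
    prog_fn next;                /* program selected by the last tail call */
    unsigned int tail_call_cnt;  /* jumps taken so far in this run */
    unsigned int index;          /* slot the demo dispatcher jumps through */
};

/* Returns 1 if a jump was scheduled, 0 if the caller should fall through. */
static int tail_call(struct ctx *ctx, unsigned int index)
{
    if (index >= TABLE_SIZE || !ctx->jmp_table[index])
        return 0;                /* empty slot: execution continues as normal */
    if (ctx->tail_call_cnt >= MAX_TAIL_CALL_CNT)
        return 0;                /* chain too long: stop jumping */
    ctx->tail_call_cnt++;
    ctx->next = ctx->jmp_table[index];
    return 1;
}

/* Trampoline: replaces the running program instead of nesting calls,
 * so the call depth stays constant however long the chain is. */
static int run(prog_fn entry, struct ctx *ctx)
{
    prog_fn prog = entry;
    int ret;

    ctx->tail_call_cnt = 0;
    while ((ret = prog(ctx)) == TAIL_CALL_TAKEN)
        prog = ctx->next;
    return ret;
}

/* Demo programs (hypothetical). */
static int leaf(struct ctx *ctx)
{
    (void)ctx;
    return 7;
}

static int dispatcher(struct ctx *ctx)
{
    if (tail_call(ctx, ctx->index))
        return TAIL_CALL_TAKEN;
    return 0;                    /* default: nothing registered for this index */
}

/* Jumps to slot 0 on every run; reports how many jumps were allowed. */
static int self_loop(struct ctx *ctx)
{
    if (tail_call(ctx, 0))
        return TAIL_CALL_TAKEN;
    return (int)ctx->tail_call_cnt;
}
```

Calling `run(dispatcher, &c)` with `leaf` stored in the selected slot returns the value of `leaf`; with an empty slot the dispatcher falls through to its default path. Storing `self_loop` in slot 0 shows the bounded chain: the sketch stops jumping once the counter reaches `MAX_TAIL_CALL_CNT`.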
Daniel Borkmann	e7582bab5d	net: dev: reduce both ingress hook ifdefs Reduce ifdef pollution slightly, no functional change. We can simply remove the extra alternative definition of handle_ing() and nf_ingress(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-21 16:58:53 -04:00
Eric Dumazet	eb9344781a	tcp: add a force_schedule argument to sk_stream_alloc_skb() In commit `8e4d980ac2` ("tcp: fix behavior for epoll edge trigger") we fixed a possible hang of TCP sockets under memory pressure, by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule() if no packet is in socket write queue. It turns out there are other cases where we want to force memory schedule : tcp_fragment() & tso_fragment() need to split a big TSO packet into two smaller ones. If we block here because of TCP memory pressure, we can effectively block TCP socket from sending new data. If no further ACK is coming, this hang would be definitive, and socket has no chance to effectively reduce its memory usage. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-21 16:56:40 -04:00
Erik Kline	765c9c639f	neigh: Better handling of transition to NUD_PROBE state [1] When entering NUD_PROBE state via neigh_update(), perhaps received from userspace, correctly (re)initialize the probes count to zero. This is useful for forcing revalidation of a neighbor (for example if the host is attempting to do DNA [IPv4 4436, IPv6 6059]). [2] Notify listeners when a neighbor goes into NUD_PROBE state. By sending notifications on entry to NUD_PROBE state listeners get more timely warnings of imminent connectivity issues. The current notifications on entry to NUD_STALE have somewhat limited usefulness: NUD_STALE is a perfectly normal state, as is NUD_DELAY, whereas notifications on entry to NUD_FAILURE come after a neighbor reachability problem has been confirmed (typically after three probes). Signed-off-by: Erik Kline <ek@google.com> Acked-By: Lorenzo Colitti <lorenzo@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-21 16:52:17 -04:00
Herbert Xu	407d34ef29	xfrm: Always zero high-order sequence number bits As we're now always including the high bits of the sequence number in the IV generation process we need to ensure that they don't contain crap. This patch ensures that the high sequence bits are always zeroed so that we don't leak random data into the IV. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2015-05-21 06:56:23 +02:00
Doug Ledford	175e8efe69	Merge branches 'bart-srp', 'generic-errors', 'ira-cleanups' and 'mwang-v8' into k.o/for-4.2	2015-05-20 16:12:40 -04:00
Ira Weiny	5d9fb04406	IB/core: Change rdma_protocol_iboe to roce After discussion upstream, it was agreed to transition the usage of iboe in the kernel to roce. This keeps our terminology consistent with what was finalized in the IBTA Annex 16 and IBTA Annex 17 publications. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2015-05-20 15:58:19 -04:00
Ilya Dryomov	521a04d06a	Revert "libceph: clear r_req_lru_item in __unregister_linger_request()" This reverts commit `ba9d114ec5`. .. which introduced a regression that prevented all lingering requests requeued in kick_requests() from ever being sent to the OSDs, resulting in a lot of missed notifies. In retrospect it's pretty obvious that r_req_lru_item item in the case of lingering requests can be used not only for notarget, but also for unsent linkage due to how tightly actual map and enqueue operations are coupled in __map_request(). The assertion that was being silenced is taken care of in the previous ("libceph: request a new osdmap if lingering request maps to no osd") commit: by always kicking homeless lingering requests we ensure that none of them ends up on the notarget list outside of the critical section guarded by request_mutex. Cc: stable@vger.kernel.org # 3.18+, needs `b049453221` "libceph: request a new osdmap if lingering request maps to no osd" Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>	2015-05-20 21:02:46 +03:00
Ilya Dryomov	b049453221	libceph: request a new osdmap if lingering request maps to no osd This commit does two things. First, if there are any homeless lingering requests, we now request a new osdmap even if the osdmap that is being processed brought no changes, i.e. if a given lingering request turned homeless in one of the previous epochs and remained homeless in the current epoch. Not doing so leaves us with a stale osdmap and as a result we may miss our window for reestablishing the watch and lose notifies. MON=1 OSD=1: # cat linger-needmap.sh #!/bin/bash rbd create --size 1 test DEV=$(rbd map test) ceph osd out 0 rbd map dne/dne # obtain a new osdmap as a side effect (!) sleep 1 ceph osd in 0 rbd resize --size 2 test # rbd info test \| grep size -> 2M # blockdev --getsize $DEV -> 1M N.B.: Not obtaining a new osdmap in between "osd out" and "osd in" above is enough to make it miss that resize notify, but that is a bug^Wlimitation of ceph watch/notify v1. Second, homeless lingering requests are now kicked just like those lingering requests whose mapping has changed. This is mainly to recognize that a homeless lingering request makes no sense and to preserve the invariant that a registered lingering request is not sitting on any of r_req_lru_item lists. This spares us a WARN_ON, which commit `ba9d114ec5` ("libceph: clear r_req_lru_item in __unregister_linger_request()") tried to fix the _wrong_ way. Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>	2015-05-20 21:02:14 +03:00
Michal Kubeček	2759647247	ipv6: fix ECMP route replacement When replacing an IPv6 multipath route with "ip route replace", i.e. NLM_F_CREATE \| NLM_F_REPLACE, fib6_add_rt2node() replaces only first matching route without fixing its siblings, resulting in corrupted siblings linked list; removing one of the siblings can then end in an infinite loop. IPv6 ECMP implementation is a bit different from IPv4 so that route replacement cannot work in exactly the same way. This should be a reasonable approximation: 1. If the new route is ECMP-able and there is a matching ECMP-able one already, replace it and all its siblings (if any). 2. If the new route is ECMP-able and no matching ECMP-able route exists, replace first matching non-ECMP-able (if any) or just add the new one. 3. If the new route is not ECMP-able, replace first matching non-ECMP-able route (if any) or add the new route. We also need to remove the NLM_F_REPLACE flag after replacing old route(s) by first nexthop of an ECMP route so that each subsequent nexthop does not replace previous one. Fixes: `51ebd31815` ("ipv6: add support of equal cost multipath (ECMP)") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-20 12:02:26 -04:00
Michal Kubeček	35f1b4e96b	ipv6: do not delete previously existing ECMP routes if add fails If adding a nexthop of an IPv6 multipath route fails, comment in ip6_route_multipath() says we are going to delete all nexthops already added. However, current implementation deletes even the routes it hasn't even tried to add yet. For example, running ip route add 1234:5678::/64 \ nexthop via fe80::aa dev dummy1 \ nexthop via fe80::bb dev dummy1 \ nexthop via fe80::cc dev dummy1 twice results in removing all routes first command added. Limit the second (delete) run to nexthops that succeeded in the first (add) run. Fixes: `51ebd31815` ("ipv6: add support of equal cost multipath (ECMP)") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-20 12:02:25 -04:00
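The rollback rule from the ECMP commit above (on failure, delete only the nexthops that the add pass actually inserted) can be illustrated with a small userspace model. Everything here is a hypothetical sketch rather than the ip6_route_multipath() code: `struct table`, `route_add()`, `route_del()`, and `multipath_add()` are invented names, nexthops are plain integers, and adding a nexthop that already exists is assumed to fail.

```c
#include <stddef.h>

#define MAX_ROUTES 8   /* hypothetical table capacity */

struct table {
    int nh[MAX_ROUTES];
    size_t len;
};

/* Add one nexthop; fails if the table is full or the nexthop exists. */
static int route_add(struct table *t, int nh)
{
    if (t->len == MAX_ROUTES)
        return -1;
    for (size_t i = 0; i < t->len; i++)
        if (t->nh[i] == nh)
            return -1;
    t->nh[t->len++] = nh;
    return 0;
}

/* Remove one nexthop if present (entry order is not preserved). */
static void route_del(struct table *t, int nh)
{
    for (size_t i = 0; i < t->len; i++) {
        if (t->nh[i] == nh) {
            t->len--;
            t->nh[i] = t->nh[t->len];
            return;
        }
    }
}

/* Add all nexthops; on failure undo only those added by this call. */
static int multipath_add(struct table *t, const int *nh, size_t n)
{
    size_t added = 0;
    int err = 0;

    for (; added < n; added++) {
        err = route_add(t, nh[added]);
        if (err)
            break;
    }
    if (err)
        while (added > 0)
            route_del(t, nh[--added]);   /* never touches nh[added..n-1] */
    return err;
}
```

Running `multipath_add()` twice with the same three nexthops mirrors the example in the commit message: the second call fails on its first nexthop, rolls back nothing, and the routes from the first call stay in place.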
Arik Nemtsov	c5a71688e1	mac80211: disconnect TDLS stations on STA CSA When a station does a channel switch, it's not well defined what its TDLS peers would do. Avoid a situation when the local side marks a potentially disconnected peer as a TDLS peer. Keeping peers connected through CSA is doubly problematic with the upcoming TDLS WIDER-BW feature which allows peers to widen the BSS channel. The new channel transitioned-to might not be compatible and would require a re-negotiation anyway. Make sure to disallow new TDLS link during CSA. Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-05-20 15:14:54 +02:00
Michal Kazior	f9dca80b98	mac80211: fix AP_VLAN crypto tailroom calculation Some splats I was seeing: (a) WARNING: CPU: 1 PID: 0 at /devel/src/linux/net/mac80211/wep.c:102 ieee80211_wep_add_iv (b) WARNING: CPU: 1 PID: 0 at /devel/src/linux/net/mac80211/wpa.c:73 ieee80211_tx_h_michael_mic_add (c) WARNING: CPU: 3 PID: 0 at /devel/src/linux/net/mac80211/wpa.c:433 ieee80211_crypto_ccmp_encrypt I've seen (a) and (b) with ath9k hw crypto and (c) with ath9k sw crypto. All of them were related to insufficient skb tailroom and I was able to trigger these with ping6 program. AP_VLANs may inherit crypto keys from parent AP. This wasn't considered and yielded problems in some setups resulting in inability to transmit data because mac80211 wouldn't resize skbs when necessary and subsequently drop some packets due to insufficient tailroom. For efficiency purposes don't inspect both AP_VLAN and AP sdata looking for tailroom counter. Instead update AP_VLAN tailroom counters whenever their master AP tailroom counter changes. Signed-off-by: Michal Kazior <michal.kazior@tieto.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-05-20 15:10:11 +02:00
Johannes Berg	252ec2b3aa	mac80211: don't split remain-on-channel for coalescing Due to remain-on-channel scheduling delays, when we split an ROC while coalescing, we'll usually get a picture like this: existing ROC: \|------------------\| current time: ^ new ROC: \|------\| \|-------\| If the expected response frames are then transmitted by the peer in the hole between the two fragments of the new ROC, we miss them and the process (e.g. ANQP query) fails. mac80211 expects that the window to miss something is small: existing ROC: \|------------------\| new ROC: \|------\|\|-------\| but that's normally not the case. To avoid this problem, coalesce only if the new ROC's duration is <= the remaining time on the existing one: existing ROC: \|------------------\| new ROC: \|-----\| and never split a new one but schedule it afterwards instead: existing ROC: \|------------------\| new ROC: \|-------------\| type=bugfix bug=not-tracked fixes=unknown Reported-by: Matti Gottlieb <matti.gottlieb@intel.com> Reviewed-by: EliadX Peller <eliad@wizery.com> Reviewed-by: Matti Gottlieb <matti.gottlieb@intel.com> Tested-by: Matti Gottlieb <matti.gottlieb@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-05-20 15:09:22 +02:00
Michal Kazior	464daaf04c	mac80211: check fast-xmit on station change Drivers with fast-xmit (e.g. ath10k) running in AP_VLAN setups would fail to communicate with connected 4addr stations. The reason was that when a new station associates, it first goes into the master AP interface. It is not until later that a dedicated AP_VLAN is created for it and the station itself is moved there. After that, Tx directed at the station should use a 4addr header. However, fast-xmit wasn't recalculated and the 3addr header continued to be used. This in turn caused the connected 4addr stations to drop packets coming from the AP until some other event caused fast-xmit to be recalculated for that station (which might never come). Signed-off-by: Michal Kazior <michal.kazior@tieto.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-05-20 15:08:36 +02:00
Lars-Peter Clausen	262918d847	cfg80211: Switch to PM ops Use dev_pm_ops instead of the legacy suspend/resume callbacks for the wiphy class suspend and resume operations. Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-05-20 15:00:12 +02:00
Lars-Peter Clausen	28f297a7af	net: rfkill: Switch to PM ops Use dev_pm_ops instead of the legacy suspend/resume callbacks for the rfkill class suspend and resume operations. Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-05-20 15:00:00 +02:00
Florian Westphal	faecbb45eb	Revert "netfilter: bridge: query conntrack about skb dnat" This reverts commit `c055d5b03b`. There are two issues: 'dnat_took_place' made me think that this is related to -j DNAT/MASQUERADE. But that's only one part of the story. This is also relevant for SNAT when we undo snat translation in reverse/reply direction. Furthermore, I originally wanted to do this mainly to avoid storing ipv6 addresses once we make DNAT/REDIRECT work for ipv6 on bridges. However, I forgot about SNPT/DNPT which is stateless. So we can't escape storing the address for ipv6 anyway. Might as well do it for ipv4 too. Reported-and-tested-by: Bernhard Thaler <bernhard.thaler@wvnet.at> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-20 13:51:25 +02:00
Dave Jones	1086bbe97a	netfilter: ensure number of counters is >0 in do_replace() After improving setsockopt() coverage in trinity, I started triggering vmalloc failures pretty reliably from this code path: warn_alloc_failed+0xe9/0x140 __vmalloc_node_range+0x1be/0x270 vzalloc+0x4b/0x50 __do_replace+0x52/0x260 [ip_tables] do_ipt_set_ctl+0x15d/0x1d0 [ip_tables] nf_setsockopt+0x65/0x90 ip_setsockopt+0x61/0xa0 raw_setsockopt+0x16/0x60 sock_common_setsockopt+0x14/0x20 SyS_setsockopt+0x71/0xd0 It turns out we don't validate that the num_counters field in the struct we pass in from userspace is initialized. The same problem also exists in ebtables, arptables, ipv6, and the compat variants. Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-20 13:46:49 +02:00
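A minimal sketch of the kind of validation the commit above describes: reject a userspace-supplied counter count of zero (and an oversized one) before sizing an allocation from it. The struct, the function name and the specific error codes are assumptions for illustration, not the netfilter definitions.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-rule packet/byte counter pair. */
struct sketch_counter {
	uint64_t pcnt;
	uint64_t bcnt;
};

/*
 * Validate a counter count that came from userspace before it is used
 * to size an allocation. Returns 0 if usable, a negative errno otherwise.
 */
static int sketch_check_num_counters(unsigned int num_counters)
{
	/* Guard against a count whose byte size would overflow an int. */
	if (num_counters >= INT_MAX / sizeof(struct sketch_counter))
		return -ENOMEM;
	/* A zero count means the field was never initialized. */
	if (num_counters == 0)
		return -EINVAL;
	return 0;
}
```

Doing the check up front means the allocator is never asked for a zero-sized or absurdly large buffer on behalf of an unvalidated request.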
Francesco Ruggeri	3bfe049807	netfilter: nfnetlink_{log,queue}: Register pernet in first place nfnetlink_{log,queue}_init() register the netlink callback nf*_rcv_nl_event before registering the pernet_subsys, but the callback relies on data structures allocated by pernet init functions. When nfnetlink_{log,queue} is loaded, if a netlink message is received after the netlink callback is registered but before the pernet_subsys is registered, the kernel will panic in the sequence nfulnl_rcv_nl_event nfnl_log_pernet net_generic BUG_ON(id == 0) where id is nfnl_log_net_id. The panic can be easily reproduced in 4.0.3 by: while true ;do modprobe nfnetlink_log ; rmmod nfnetlink_log ; done & while true ;do ip netns add dummy ; ip netns del dummy ; done & This patch moves register_pernet_subsys to earlier in nfnetlink_log_init. Notice that the BUG_ON hit in 4.0.3 was recently removed in `2591ffd308` ["netns: remove BUG_ONs from net_generic()"]. Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-20 13:46:48 +02:00
Johannes Berg	94c78cb452	mac80211: fix memory leak My recent change here introduced a possible memory leak if the driver registers an invalid cipher scheme. This won't really happen in practice, but fix the leak nonetheless. Fixes: `e3a55b5399` ("mac80211: validate cipher scheme PN length better") Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-05-20 11:37:38 +02:00
Daniel Borkmann	492135557d	tcp: add rfc3168, section 6.1.1.1. fallback This work as a follow-up of commit `f7b3bec6f5` ("net: allow setting ecn via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing ECN connections. In other words, this work adds a retry with a non-ECN setup SYN packet, as suggested from the RFC on the first timeout: [...] A host that receives no reply to an ECN-setup SYN within the normal SYN retransmission timeout interval MAY resend the SYN and any subsequent SYN retransmissions with CWR and ECE cleared. [...] Schematic client-side view when assuming the server is in tcp_ecn=2 mode, that is, Linux default since 2009 via commit `255cac91c3` ("tcp: extend ECN sysctl to allow server-side only ECN"): 1) Normal ECN-capable path: SYN ECE CWR -----> <----- SYN ACK ECE ACK -----> 2) Path with broken middlebox, when client has fallback: SYN ECE CWR ----X crappy middlebox drops packet (timeout, rtx) SYN -----> <----- SYN ACK ACK -----> In case we would not have the fallback implemented, the middlebox drop point would basically end up as: SYN ECE CWR ----X crappy middlebox drops packet (timeout, rtx) SYN ECE CWR ----X crappy middlebox drops packet (timeout, rtx) SYN ECE CWR ----X crappy middlebox drops packet (timeout, rtx) In any case, it's rather a smaller percentage of sites where there would occur such additional setup latency: it was found in end of 2014 that ~56% of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the fallback would mitigate with a slight latency trade-off. Recent related paper on this topic: Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth, Gorry Fairhurst, and Richard Scheffenegger: "Enabling Internet-Wide Deployment of Explicit Congestion Notification." Proc. PAM 2015, New York. 
http://ecn.ethz.ch/ecn-pam15.pdf Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168, section 6.1.1.1. fallback on timeout. For users explicitly not wanting this which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that allows for disabling the fallback. tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but rather we let tcp_ecn_rcv_synack() take that over on input path in case a SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent ECN being negotiated eventually in that case. Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch> Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch> Cc: Eric Dumazet <edumazet@google.com> Cc: Dave That <dave.taht@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-19 16:53:37 -04:00
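The client-side behaviour described in the commit above can be modelled as a function that picks the SYN flags for each attempt: the first SYN is an ECN-setup SYN (ECE and CWR set), and with the fallback enabled any retransmitted SYN has both bits cleared. This is a simplified sketch; the function and macro names are hypothetical, and only the flag bit values are the standard TCP header bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard TCP header flag bits; the macro names are illustrative. */
#define SKETCH_TCP_SYN 0x02
#define SKETCH_TCP_ECE 0x40
#define SKETCH_TCP_CWR 0x80

/*
 * Flags for the SYN sent on a given attempt.
 * ecn_requested: the client wants to negotiate ECN (tcp_ecn=1 case).
 * ecn_fallback:  models the net.ipv4.tcp_ecn_fallback knob.
 * retransmits:   number of SYN timeouts seen so far (0 = first SYN).
 */
static uint8_t sketch_syn_flags(bool ecn_requested, bool ecn_fallback,
				unsigned int retransmits)
{
	uint8_t flags = SKETCH_TCP_SYN;

	/* Keep ECE|CWR unless the fallback applies to a retransmission. */
	if (ecn_requested && !(ecn_fallback && retransmits > 0))
		flags |= SKETCH_TCP_ECE | SKETCH_TCP_CWR;
	return flags;
}
```

With the fallback disabled, every retransmission stays an ECN-setup SYN, which corresponds to the repeated-drop schematic in the commit message.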
David S. Miller	892bd6291a	Merge tag 'mac80211-for-davem-2015-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== This has just a single fix, for a WEP tailroom check problem that leads to dropped frames. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-19 16:44:25 -04:00
David S. Miller	b7a3a8e31f	Merge tag 'mac80211-next-for-davem-2015-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== This just has a few fixes: * LED throughput trigger was crashing * fast-xmit wasn't treating QoS changes in IBSS correctly * TDLS could use the wrong channel definition * using a reserved channel context could use the wrong channel width ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-19 16:43:17 -04:00
Yuchung Cheng	b7b0ed910c	tcp: don't over-send F-RTO probes After sending the new data packets to probe (step 2), F-RTO may incorrectly send more probes if the next ACK advances SND_UNA and does not sack new packet. However F-RTO RFC 5682 probes at most once. This bug may cause sender to always send new data instead of repairing holes, inducing longer HoL blocking on the receiver for the application. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-19 16:36:57 -04:00
Yuchung Cheng	da34ac7626	tcp: only undo on partial ACKs in CA_Loss Undo based on TCP timestamps should only happen on ACKs that advance SND_UNA, according to the Eifel algorithm in RFC 3522: Section 3.2: (4) If the value of the Timestamp Echo Reply field of the acceptable ACK's Timestamps option is smaller than the value of RetransmitTS, then proceed to step (5), Section Terminology: We use the term 'acceptable ACK' as defined in [RFC793]. That is an ACK that acknowledges previously unacknowledged data. This is because upon receiving an out-of-order packet, the receiver returns the last timestamp that advances RCV_NXT, not the current timestamp of the packet in the DUPACK. Without checking the flag, the DUPACK will cause tcp_packet_delayed() to return true and tcp_try_undo_loss() will revert cwnd reduction. Note that we check the condition in CA_Recovery already by only calling tcp_try_undo_partial() if FLAG_SND_UNA_ADVANCED is set or tcp_try_undo_recovery() if snd_una crosses high_seq. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-19 16:36:57 -04:00
Henning Rogge	33b4b015e1	net/ipv6/udp: Fix ipv6 multicast socket filter regression Commit `5cf3d46192` ("udp: Simplify__udp_lib_mcast_deliver") simplified the filter for incoming IPv6 multicast but removed the check of the local socket address and the UDP destination address. This patch restores the filter to prevent sockets bound to an IPv6 multicast IP from receiving other UDP traffic, like unicast. Signed-off-by: Henning Rogge <hrogge@gmail.com> Fixes: `5cf3d46192` ("udp: Simplify__udp_lib_mcast_deliver") Cc: "David S. Miller" <davem@davemloft.net> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-19 16:34:43 -04:00
Eric B Munson	aea0929e51	tcp: Return error instead of partial read for saved syn headers Currently the getsockopt() requesting the cached contents of the syn packet headers will fail silently if the caller uses a buffer that is too small to contain the requested data. Rather than fail silently and discard the headers, getsockopt() should return an error and report the required size to hold the data. Signed-off-by: Eric B Munson <emunson@akamai.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-19 16:33:34 -04:00
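The behaviour change in the commit above (fail loudly and report the required size instead of silently discarding data) is a common buffer-query pattern. The sketch below models it in plain C; `sketch_copy_saved_headers()` is a hypothetical helper, and the choice of `-EINVAL` as the error code is an assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy saved header bytes into a caller-supplied buffer.
 * On entry *buf_len is the buffer capacity. On success it is set to the
 * number of bytes copied. If the buffer is too small nothing is copied,
 * *buf_len is set to the size the caller needs, and an error is returned.
 */
static int sketch_copy_saved_headers(const unsigned char *saved,
				     size_t saved_len,
				     unsigned char *buf, size_t *buf_len)
{
	if (*buf_len < saved_len) {
		*buf_len = saved_len; /* tell the caller how much to allocate */
		return -EINVAL;
	}
	memcpy(buf, saved, saved_len);
	*buf_len = saved_len;
	return 0;
}
```

A caller can retry with a buffer of the reported size, and the stored data is still available for that second call.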
Johan Hedberg	011c391a09	Bluetooth: Add debug logs for legacy SMP crypto functions To help debug legacy SMP crypto functions add debug logs of the various values involved. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-19 21:07:29 +02:00
Arnd Bergmann	73e85ed36a	mac802154: select CRYPTO when needed The mac802154 subsystem uses functions from the crypto layer and correctly selects the individual crypto algorithms, but fails to build when the crypto layer is disabled altogether: crypto/built-in.o: In function `crypto_ctr_free': :(.text+0x80): undefined reference to `crypto_drop_spawn' crypto/built-in.o: In function `crypto_rfc3686_free': :(.text+0xac): undefined reference to `crypto_drop_spawn' crypto/built-in.o: In function `crypto_ctr_crypt': :(.text+0x2f0): undefined reference to `blkcipher_walk_virt_block' :(.text+0x2f8): undefined reference to `crypto_inc' To solve that, this patch also selects the core crypto code, like all other users of that code do. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-19 19:35:48 +02:00
Thomas Gleixner	c3b5d3cea5	Merge branch 'linus' into timers/core Make sure the upstream fixes are applied before adding further modifications.	2015-05-19 16:12:32 +02:00
Johannes Berg	22d3a3c829	mac80211: don't use napi_gro_receive() outside NAPI context No matter how the driver manages its NAPI context, there's no way sending frames to it from a timer can be correct, since it would corrupt the internal GRO lists. To avoid that, always use the non-NAPI path when releasing frames from the timer. Cc: stable@vger.kernel.org Reported-by: Jean Trivelly <jean.trivelly@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-05-19 15:46:21 +02:00
Alexander Aring	3862eba691	mac802154: tx: allow xmit complete from hard irq Replace consume_skb with dev_consume_skb_any in ieee802154_xmit_complete which can be called in hard irq and other contexts. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-19 11:44:45 +02:00
Alexander Aring	0e66545701	nl802154: add support for dump phy capabilities This patch adds support to nl802154 for dumping all phy capabilities that are inside the wpan_phy_supported struct. We also introduce a new method for dumping supported channels. The new method offers an easier interface and generates less netlink traffic. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-19 11:44:44 +02:00
Alexander Aring	65318680c9	ieee802154: add iftypes capability This patch adds capability flags for supported interface types. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-19 11:44:42 +02:00
Alexander Aring	edea8f7c75	cfg802154: introduce wpan phy flags This patch introduces a flag property for the wpan phy structure. The current flag settings in ieee802154_hw are accessible in the mac802154 layer only, which is okay for flags that indicate MAC handling done by the phy. The new flags cover real PHY layer settings like cca mode, transmit power and cca energy detection level. The difference between these flags is that the MAC handling flags are only handled in the mac802154/HardMac layer, e.g. on an interface up. The phy settings are direct netlink calls from nl802154 into the driver layer, and nl802154 needs a chance to check whether the driver supports this handling before sending to the next layer. We now also check PHY flags while dumping and setting pib attributes. For MIB attributes, by comparison, 802.15.4 gives us a default value which we assume when a transceiver implements less functionality, so for MIB settings the nl802154 layer doesn't need to check the ieee802154_hw flags. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-19 11:44:42 +02:00
Alexander Aring	8329fcf11f	mac802154: remove check if operation is supported This patch removes the check if operation is supported by driver layer. This is done now by capabilities flags, if these are valid then the driver should support the operation, otherwise a WARN_ON occurs. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-19 11:44:42 +02:00
Alexander Aring	791021bf13	mac802154: check for really changes This patch adds a check whether the value has really changed inside pib/mib. If a transceiver supports only one value for e.g. max_be, this also means the driver layer doesn't need to care about handling a set operation for that single value. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-19 11:44:42 +02:00
Alexander Aring	fea3318d20	ieee802154: add several phy supported handling This patch adds phy supported handling for all other already existing 802.15.4 functionality. We now assume a fully 802.15.4 compliant transceiver at phy allocation. If a transceiver can support 802.15.4 default values only, then the values should be overwritten by values the transceiver supports. If the transceiver doesn't set the corresponding hardware flags, we now assume the 802.15.4 defaults, which cannot be changed. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Suggested-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-19 11:44:42 +02:00
Alexander Aring	72f655e44d	ieee802154: introduce wpan_phy_supported This patch introduces the wpan_phy_supported struct for wpan_phy. There is currently no way to check if a transceiver can handle IEEE 802.15.4 compliant values. With this struct we can check whether the transceiver supports these values before sending them to the driver layer. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Suggested-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Acked-by: Varka Bhadram <varkabhadram@gmail.com> Cc: Alan Ott <alan@signal11.us> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-19 11:44:42 +02:00
Alexander Aring	32b23550ad	ieee802154: change cca ed level to mbm This patch changes the handling of the cca energy detection level from dbm to mbm. This prepares for handling floating point cca energy detection level values. The old netlink 802.15.4 interface will convert the dbm value to mbm for backward compatibility. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-19 11:44:42 +02:00
Alexander Aring	e2eb173aaa	ieee802154: change transmit power to mbm This patch changes the handling of the transmit power level from dbm to mbm. This prepares for handling floating point transmit power level values. The old netlink 802.15.4 interface will convert the dbm value to mbm for backward compatibility. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-19 11:44:41 +02:00
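The dbm to mbm change in the commits above rests on a fixed-point convention: 1 dBm equals 100 mBm, so fractional levels such as 3.5 dBm can be carried as the integer 350 without floating point. A minimal sketch of the conversion follows; the helper names are hypothetical and not taken from the ieee802154 code.

```c
#include <stdint.h>

/* Convert a whole-dBm value (legacy s8 interface) to mBm (1 dBm = 100 mBm). */
static int32_t sketch_dbm_to_mbm(int8_t dbm)
{
	return (int32_t)dbm * 100;
}

/* Convert mBm back to whole dBm, discarding the fractional part. */
static int32_t sketch_mbm_to_dbm(int32_t mbm)
{
	return mbm / 100;
}
```

The widening from s8 to s32 matters here: an s8 can hold at most 127, which would not be enough to represent even 2 dBm in mBm units.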
Alexander Aring	1a19cb680b	ieee802154: change transmit power to s32 This patch changes the transmit power from s8 to s32. This prepares for storing an mbm value instead of dbm inside the transmit power variable. The old interface keeps an s8 dbm value, which should remain backward compatible when assigning s8 to s32. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-19 11:44:41 +02:00
Alexander Aring	673692faf3	ieee802154: move validation check out of softmac This patch moves the value validation out of softmac layer. We need to be sure now that this value is accepted by the transceiver/mac802154 or "possible" hardmac drivers before calling rdev-ops. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-19 11:44:41 +02:00
Alexander Aring	0cf0879acd	nl802154: cleanup invalid argument handling This patch cleans up the -EINVAL cases by combining them in one condition. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-19 11:44:41 +02:00
Andy Zhou	49d16b23cd	bridge_netfilter: No ICMP packet on IPv4 fragmentation error When bridge netfilter re-fragments an IP packet for output, all packets that cannot be re-fragmented to their original input size should be silently discarded. However, the current bridge netfilter output path generates an ICMP packet with a 'size exceeded MTU' message for such packets; this is a bug. This patch refactors the ip_fragment() API to allow two separate use cases. The bridge netfilter use case will not send ICMP, the routing output will, as before. Signed-off-by: Andy Zhou <azhou@nicira.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-19 00:15:39 -04:00
Andy Zhou	8bc04864ac	IPv4: skip ICMP for bridge contrack users when defrag expires Users in [IP_DEFRAG_CONNTRACK_BRIDGE_IN, __IP_DEFRAG_CONNTRACK_BR_IN] should not send an ICMP message either. Reported-by: Florian Westphal <fw@strlen.de> Signed-off-by: Andy Zhou <azhou@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-19 00:15:27 -04:00
Andy Zhou	5cf4228082	ipv4: introduce frag_expire_skip_icmp() Improve readability of skip ICMP for de-fragmentation expiration logic. This change will also make the logic easier to maintain when the following patches in this series are applied. Signed-off-by: Andy Zhou <azhou@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-19 00:15:26 -04:00
David S. Miller	456cdf53ef	Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Johan Hedberg says: ==================== pull request: bluetooth 2015-05-17 A couple more Bluetooth updates for 4.1: - New USB IDs for ath3k & btusb - Fix for remote name resolving during device discovery Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-18 16:15:31 -04:00
David S. Miller	0bc4c07046	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next. Briefly speaking, cleanups and minor fixes for ipset from Jozsef Kadlecsik and Sergey Popovich, more incremental updates to make br_netfilter a better place from Florian Westphal, ARP support to the x_tables mark match / target from Zhang Chunyu and the addition of context to know that the x_tables runs through nft_compat. More specifically, they are: 1) Fix sparse warning in ipset/ip_set_hash_ipmark.c when fetching the IPSET_ATTR_MARK netlink attribute, from Jozsef Kadlecsik. 2) Rename STREQ macro to STRNCMP in ipset, also from Jozsef. 3) Use skb->network_header to calculate the transport offset in ip_set_get_ip{4,6}_port(). From Alexander Drozdov. 4) Reduce memory consumption per element due to size miscalculation, this patch and follow up patches from Sergey Popovich. 5) Expand nomatch field from 1 bit to 8 bits to allow to simplify mtype_data_reset_flags(), also from Sergey. 6) Small clean for ipset macro trickery. 7) Fix error reporting when both ip_set_get_hostipaddr4() and ip_set_get_extensions() from per-set uadt functions. 8) Simplify IPSET_ATTR_PORT netlink attribute validation. 9) Introduce HOST_MASK instead of hardcoded 32 in ipset. 10) Return true/false instead of 0/1 in functions that return boolean in the ipset code. 11) Validate maximum length of the IPSET_ATTR_COMMENT netlink attribute. 12) Allow to dereference from ext_() ipset macros. 13) Get rid of incorrect definitions of HKEY_DATALEN. 14) Include linux/netfilter/ipset/ip_set.h in the x_tables set match. 15) Reduce nf_bridge_info size in br_netfilter, from Florian Westphal. 16) Release nf_bridge_info after POSTROUTING since this is only needed from the physdev match, also from Florian. 
17) Reduce size of ipset code by deinlining ip_set_put_extensions(), from Denys Vlasenko. 18) Oneliner to add ARP support to the x_tables mark match/target, from Zhang Chunyu. 19) Add context to know if the x_tables extension runs from nft_compat, to address minor problems with three existing extensions. 20) Correct return value in several seqfile _show() functions in the netfilter tree, from Joe Perches. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-18 14:47:36 -04:00
Sagi Grimberg	3c88f3dcff	RDS: Switch to generic logging helpers Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2015-05-18 13:44:23 -04:00
Sagi Grimberg	76357c715f	xprtrdma, svcrdma: Switch to generic logging helpers Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Anna Schumaker <anna.schumaker@netapp.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2015-05-18 13:44:23 -04:00
Michael Wang	bc0f1d7153	IB/Verbs: Use management helper rdma_cap_read_multi_sge() Introduce helper rdma_cap_read_multi_sge() to help us check if the port of an IB device support RDMA Read Multiple Scatter-Gather Entries. Signed-off-by: Michael Wang <yun.wang@profitbricks.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Tested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Sean Hefty <sean.hefty@intel.com> Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Tested-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2015-05-18 13:35:05 -04:00
Michael Wang	3de2c31ce7	IB/Verbs: Reform IB-ulp xprtrdma Use raw management helpers to reform IB-ulp xprtrdma. Signed-off-by: Michael Wang <yun.wang@profitbricks.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Tested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Sean Hefty <sean.hefty@intel.com> Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Tested-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2015-05-18 13:35:04 -04:00
Li RongQing	8faf491e64	xfrm: optimise to search the inexact policy list The policies are kept in a list in ascending order of priority, so it is unnecessary to keep looping once the priority of the current policy is larger than or equal to the priority of the policy found in the policy_bydst list. This also allows matching a policy with ~0U priority in the inexact list. Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2015-05-18 10:31:56 +02:00
Herbert Xu	b9fbe709de	netlink: Use random autobind rover Currently we use a global rover to select a port ID that is unique. This used to work consistently when it was protected with a global lock. However as we're now lockless, the global rover can exhibit pathological behaviour should multiple threads all stomp on it at the same time. Granted this will eventually resolve itself but the process is suboptimal. This patch replaces the global rover with a pseudorandom starting point to avoid this issue. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-17 23:43:31 -04:00
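The idea in the commit above, starting the search for a free port ID at a pseudorandom point instead of a shared global rover, can be sketched as follows. This is a simplified single-threaded model: the ID range, `sketch_autobind_pick()` and the `demo_in_use()` predicate are assumptions for illustration, and the caller is expected to supply the pseudorandom start value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed upper bound of the negative ID range used for autobinding. */
#define SKETCH_ROVER_MAX (-4097)

typedef bool (*sketch_in_use_fn)(int32_t id);

/* Example lookup: pretend two IDs are already bound. */
static bool demo_in_use(int32_t id)
{
	return id == -5000 || id == -5001;
}

/*
 * Walk downward from a caller-chosen pseudorandom start in
 * [INT32_MIN, SKETCH_ROVER_MAX] until a free ID is found, wrapping
 * around at INT32_MIN. Loops forever if every ID is taken, which this
 * sketch does not guard against.
 */
static int32_t sketch_autobind_pick(int32_t start, sketch_in_use_fn in_use)
{
	int32_t id = start;

	while (in_use(id))
		id = (id == INT32_MIN) ? SKETCH_ROVER_MAX : id - 1;
	return id;
}
```

Because each caller begins at an independent random point, concurrent binders rarely probe the same IDs, which is the contention the shared rover suffered from.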
Florent Fourcot	21858cd02d	tcp/ipv6: fix flow label setting in TIME_WAIT state Commit `1d13a96c74` ("ipv6: tcp: fix flowlabel value in ACK messages send from TIME_WAIT") added the flow label in the last TCP packets. Unfortunately, it was not cast properly. This patch replaces the buggy shift with be32_to_cpu/cpu_to_be32. Fixes: `1d13a96c74` ("ipv6: tcp: fix flowlabel value in ACK messages") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-17 23:41:59 -04:00
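To illustrate why an explicit byte-order conversion is preferable to a hand-written shift, the sketch below stores a 20-bit IPv6 flow label in host order and restores it to network order. It uses the userspace `ntohl()`/`htonl()` functions as stand-ins for the kernel's be32_to_cpu/cpu_to_be32; the helper names and the mask macro are hypothetical.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* The IPv6 flow label occupies the low 20 bits of the first header word. */
#define SKETCH_FLOWLABEL_MASK 0x000FFFFFu

/* Extract the flow label from a big-endian header word into host order. */
static uint32_t sketch_flowlabel_store(uint32_t flow_be)
{
	return ntohl(flow_be) & SKETCH_FLOWLABEL_MASK;
}

/* Turn a stored host-order label back into a big-endian value. */
static uint32_t sketch_flowlabel_load(uint32_t label_host)
{
	return htonl(label_host & SKETCH_FLOWLABEL_MASK);
}
```

The conversion functions give the same result on little-endian and big-endian machines, whereas a fixed shift is only correct for one of them.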
WANG Cong	de133464c9	netns: make nsid_lock per net The spinlock is used to protect netns_ids which is per net, so there is no need to use a global spinlock. Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-17 23:41:11 -04:00
Jiri Pirko	74b80e841b	flow_dissector: remove bogus return in tipc section Fixes: `06635a35d1` ("flow_dissect: use programable dissector in skb_flow_dissect and friends") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-17 23:38:23 -04:00
Samudrala, Sridhar	45d4122ca7	switchdev: add support for fdb add/del/dump via switchdev_port_obj ops. - introduce port fdb obj and generic switchdev_port_fdb_add/del/dump() - use switchdev_port_fdb_add/del/dump in rocker/team/bonding ndo ops. - add support for fdb obj in switchdev_port_obj_add/del/dump() - switch rocker to implement fdb ops via switchdev_ops v3: updated to sync with named union changes. Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-17 22:49:09 -04:00
Eric Dumazet	b66e91ccbc	tcp: halves tcp_mem[] limits Allowing tcp to use ~19% of physical memory is way too much, and allowed bugs to be hidden. Add to this that some drivers use a full page per incoming frame, so real cost can be twice the advertized one. Reduce tcp_mem by 50 % as a first step to sanity. tcp_mem[0,1,2] defaults are now 4.68%, 6.25%, 9.37% of physical memory. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-17 22:45:49 -04:00
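The three percentages quoted in the commit above are consistent with deriving all limits from one base value of 1/16 of memory pages: 6.25% for the middle limit, three quarters of that (about 4.68%) for the low limit, and twice the low limit (about 9.37%) for the high limit. The sketch below reproduces that arithmetic; the struct, the function name and the 128-page floor are assumptions for illustration.

```c
/* Hypothetical container for the three tcp_mem[] thresholds, in pages. */
struct sketch_tcp_mem {
	unsigned long low;      /* tcp_mem[0] */
	unsigned long pressure; /* tcp_mem[1] */
	unsigned long high;     /* tcp_mem[2] */
};

/* Derive default thresholds from the number of usable memory pages. */
static struct sketch_tcp_mem sketch_tcp_mem_defaults(unsigned long pages)
{
	struct sketch_tcp_mem m;
	unsigned long limit = pages / 16; /* 6.25% of memory */

	if (limit < 128)                  /* assumed floor for tiny systems */
		limit = 128;
	m.low = limit / 4 * 3;            /* 4.6875% */
	m.pressure = limit;               /* 6.25% */
	m.high = m.low * 2;               /* 9.375% */
	return m;
}
```

For 1,600,000 pages this yields 75,000, 100,000 and 150,000 pages, which matches the 4.68%, 6.25% and 9.37% figures.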
Eric Dumazet	76dfa60820	tcp: allow one skb to be received per socket under memory pressure While testing tight tcp_mem settings, I found tcp sessions could be stuck because we do not allow even one skb to be received on them. By allowing one skb to be received, we introduce fairness and eventually force memory hogs to release their allocation. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-17 22:45:49 -04:00
Eric Dumazet	8e4d980ac2	tcp: fix behavior for epoll edge trigger Under memory pressure, tcp_sendmsg() can fail to queue a packet while no packet is present in write queue. If we return -EAGAIN with no packet in write queue, no ACK packet will ever come to raise EPOLLOUT. We need to allow one skb per TCP socket, and make sure that tcp sockets can release their forward allocations under pressure. This is a followup to commit `790ba4566c` ("tcp: set SOCK_NOSPACE under memory pressure") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-17 22:45:48 -04:00
Eric Dumazet	b8da51ebb1	tcp: introduce tcp_under_memory_pressure() Introduce an optimized version of sk_under_memory_pressure() for TCP. Our intent is to use it in fast paths. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-17 22:45:48 -04:00
Eric Dumazet	a6c5ea4ccf	tcp: rename sk_forced_wmem_schedule() to sk_forced_mem_schedule() We plan to use sk_forced_wmem_schedule() in input path as well, so make it non static and rename it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-17 22:45:48 -04:00
Eric Dumazet	1a24e04e4b	net: fix sk_mem_reclaim_partial() sk_mem_reclaim_partial() goal is to ensure each socket has one SK_MEM_QUANTUM forward allocation. This is needed both for performance and better handling of memory pressure situations in follow up patches. SK_MEM_QUANTUM is currently a page, but might be reduced to 4096 bytes as some arches have 64KB pages. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-17 22:45:48 -04:00
Nicolas Dichtel	ed2a80ab7b	rtnl/bond: don't send rtnl msg for unregistered iface Before the patch, the command 'ip link add bond2 type bond mode 802.3ad' causes the kernel to send a rtnl message for the bond2 interface, with an ifindex 0. 'ip monitor' shows: 0: bond2: <BROADCAST,MULTICAST,MASTER> mtu 1500 state DOWN group default link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff 9: bond2@NONE: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN group default link/ether ea:3e:1f:53:92:7b brd ff:ff:ff:ff:ff:ff [snip] The patch fixes the spotted bug by checking in bond driver if the interface is registered before calling the notifier chain. It also adds a check in rtmsg_ifinfo() to prevent this kind of bug in the future. Fixes: `d4261e5650` ("bonding: create netlink event when bonding option is changed") CC: Jiri Pirko <jiri@resnulli.us> Reported-by: Julien Meunier <julien.meunier@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-17 22:43:07 -04:00
Willem de Bruijn	4633c9e07b	net-packet: fix null pointer exception in rollover mode Rollover can be enabled as flag or mode. Allocate state in both cases. This solves a NULL pointer exception in fanout_demux_rollover on referencing po->rollover if using mode rollover. Also make sure that in rollover mode each silo is tried (contrary to rollover flag, where the main socket is excluded after an initial try_self). Tested: Passes tools/testing/net/psock_fanout.c, which tests both modes and flag. My previous tests were limited to bench_rollover, which only stresses the flag. The test now completes safely. it still gives an error for mode rollover, because it does not expect the new headroom (ROOM_NORMAL) requirement. I will send a separate patch to the test. Fixes: `0648ab70af` ("packet: rollover prepare: per-socket state") Signed-off-by: Willem de Bruijn <willemb@google.com> ---- I should have run this test and caught this before submission, of course. Apologies for the oversight. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-17 22:41:38 -04:00
Eric Dumazet	ba6d05641c	netfilter: synproxy: fix sparse errors Fix verbose sparse errors : make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/netfilter/ipt_SYNPROXY.o Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-17 13:08:29 -04:00
Eric Dumazet	252a8fbe81	ipip: fix one sparse error make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/ipip.o CHECK net/ipv4/ipip.c net/ipv4/ipip.c:254:27: warning: incorrect type in assignment (different base types) net/ipv4/ipip.c:254:27: expected restricted __be32 [addressable] [usertype] o_key net/ipv4/ipip.c:254:27: got restricted __be16 [addressable] [usertype] i_flags Fixes: `3b7b514f44` ("ipip: fix a regression in ioctl") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-17 13:08:29 -04:00
Joe Perches	861fb1078f	netfilter: Use correct return for seq_show functions Using seq_has_overflowed doesn't produce the right return value. Either 0 or -1 is, but 0 is much more common and works well when seq allocation retries. I believe this doesn't matter as the initial allocation is always sufficient, this is just a correctness patch. Miscellanea: o Don't use strlen, use *ptr to determine if a string should be emitted like all the other tests here o Delete unnecessary return statements Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-17 17:25:35 +02:00
Herbert Xu	c0bb07df7d	netlink: Reset portid after netlink_insert failure The commit `c5adde9468` ("netlink: eliminate nl_sk_hash_lock") breaks the autobind retry mechanism because it doesn't reset portid after a failed netlink_insert. This means that should autobind fail the first time around, then the socket will be stuck in limbo as it can never be bound again since it already has a non-zero portid. Fixes: `c5adde9468` ("netlink: eliminate nl_sk_hash_lock") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-16 17:08:57 -04:00
David S. Miller	1d6057019e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== The following patchset contains Netfilter fixes for your net tree, they are: 1) Fix a leak in IPVS, the sysctl table is not released accordingly when destroying a netns, patch from Tommi Rantala. 2) Fix a build error when TPROXY and socket are built-in but IPv6 defrag is compiled as module, from Florian Westphal. 3) Fix TCP tracker wrt. RFC5961 challenge ACK when in LAST_ACK state, patch from Jesper Dangaard Brouer. 4) Fix a bogus WARN_ON() in nf_tables when deleting a set element that stores a map, from Mirek Kratochvil. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-16 16:40:22 -04:00
Mirek Kratochvil	960bd2c264	netfilter: nf_tables: fix bogus warning in nft_data_uninit() The values 0x00000000-0xfffffeff are reserved for userspace datatype. When deleting set elements with maps, a bogus warning is triggered. WARNING: CPU: 0 PID: 11133 at net/netfilter/nf_tables_api.c:4481 nft_data_uninit+0x35/0x40 [nf_tables]() This fixes the check according to the enum definition in include/linux/netfilter/nf_tables.h Fixes: https://bugzilla.netfilter.org/show_bug.cgi?id=1013 Signed-off-by: Mirek Kratochvil <exa.exa@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-15 22:07:30 +02:00
Jesper Dangaard Brouer	b3cad287d1	conntrack: RFC5961 challenge ACK confuse conntrack LAST-ACK transition In compliance with RFC5961, the network stack sends a challenge ACK in response to spurious SYN packets, since commit `0c228e833c` ("tcp: Restore RFC5961-compliant behavior for SYN packets"). This poses a problem for netfilter conntrack in state LAST_ACK, because this challenge ACK is (falsely) seen as ACKing the last FIN, causing a false state transition (into TIME_WAIT). The challenge ACK is hard to distinguish from a real last ACK. Thus, the solution introduces a flag that tracks the potential for seeing a challenge ACK, in case a SYN packet is let through and the current state is LAST_ACK. When the conntrack transition from LAST_ACK to TIME_WAIT happens, this flag is used to determine whether we are expecting a challenge ACK. Scapy-based reproducer script available here: https://github.com/netoptimizer/network-testing/blob/master/scapy/tcp_hacks_3WHS_LAST_ACK.py Fixes: `0c228e833c` ("tcp: Restore RFC5961-compliant behavior for SYN packets") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-15 20:50:56 +02:00
Florian Westphal	595ca5880b	netfilter: avoid build error if TPROXY/SOCKET=y && NF_DEFRAG_IPV6=m With TPROXY=y but DEFRAG_IPV6=m we get build failure: net/built-in.o: In function `tproxy_tg_init': net/netfilter/xt_TPROXY.c:588: undefined reference to `nf_defrag_ipv6_enable' If DEFRAG_IPV6 is modular, TPROXY must be too. (or both must be builtin). This enforces =m for both. Reported-and-tested-by: Liu Hua <liusdu@126.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-15 20:18:27 +02:00
Pablo Neira Ayuso	55917a21d0	netfilter: x_tables: add context to know if extension runs from nft_compat Currently, we have four xtables extensions that cannot be used from the xt over nft compat layer. The problem is that they need real access to the full blown xt_entry to validate that the rule comes with the right dependencies. This check was introduced to overcome the lack of sufficient userspace dependency validation in iptables. To resolve this problem, this patch introduces a new field to the xt_tgchk_param structure that tell us if the extension is run from nft_compat context. The three affected extensions are: 1) CLUSTERIP, this target has been superseded by xt_cluster. So just bail out by returning -EINVAL. 2) TCPMSS. Relax the checking when used from nft_compat. If used with the wrong configuration, it will corrupt !syn packets by adding TCP MSS option. 3) ebt_stp. Relax the check to make sure it uses the reserved destination MAC address for STP. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Tested-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>	2015-05-15 20:14:07 +02:00
Frederic Danis	cffd2eedf9	Bluetooth: Fix calls to __hci_cmd_sync() Remove test of command reply status as it is already performed by __hci_cmd_sync(). __hci_cmd_sync_ev() function already returns an error if it got a non-zero status either through a Command Complete or a Command Status event. For both of these events the status is collected up in the event handlers called by hci_event_packet() and then passed as the second parameter to req_complete_skb(). The req_complete_skb() callback in turn is hci_req_sync_complete() for __hci_cmd_sync_ev() which stores the status in hdev->req_result. The hdev->req_result is then further converted through bt_to_errno() back in __hci_cmd_sync_ev(). Signed-off-by: Frederic Danis <frederic.danis@linux.intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-15 16:04:49 +02:00
Roopa Prabhu	eea39946a1	rename RTNH_F_EXTERNAL to RTNH_F_OFFLOAD RTNH_F_EXTERNAL today is printed as "offload" in iproute2 output. This patch renames the flag to be consistent with what the user sees. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 22:45:39 -04:00
Florian Westphal	3365495c18	net: core: set qdisc pkt len before tc_classify commit `d2788d3488` ("net: sched: further simplify handle_ing") removed the call to qdisc_enqueue_root(). However, after this removal we no longer set qdisc pkt length. This breaks traffic policing on ingress. This is the minimum fix: set qdisc pkt length before tc_classify. Only setting the length does remove support for 'stab' on ingress, but as Alexei pointed out: "Though it was allowed to add qdisc_size_table to ingress, it's useless. Nothing takes advantage of recomputed qdisc_pkt_len". Jamal suggested to use qdisc_pkt_len_init(), but as Eric mentioned that would result in qdisc_pkt_len_init to no longer get inlined due to the additional 2nd call site. Ingress policing is rare and GRO doesn't really work that well with police on ingress, as we see packets > mtu and drop skbs that -- without aggregation -- would still have fitted the policer budget. Thus to have reliable/smooth ingress policing GRO has to be turned off. Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Fixes: `d2788d3488` ("net: sched: further simplify handle_ing") Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 22:44:40 -04:00
Nicolas Dichtel	0c58a2db91	netns: fix unbalanced spin_lock on error Unlock was missing on error path. Fixes: `95f38411df` ("netns: use a spin_lock to protect nsid management") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 22:36:31 -04:00
Alexander Duyck	c24a59649f	ip_tunnel: Report Rx dropped in ip_tunnel_get_stats64 The rx_dropped stat wasn't being reported when ip_tunnel_get_stats64 was called. This was leading to some confusing results in my debug as I was seeing rx_errors increment but no other value which pointed me toward the type of error being seen. This change corrects that by using netdev_stats_to_stats64 to copy all available dev stats instead of just the few that were hand picked. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 22:30:54 -04:00
Vlad Yasevich	e87a468eb9	ipv6: Fix udp checksums with raw sockets It was reported that traceroute6 would cause a kernel to crash when trying to compute checksums on raw UDP packets. The cause was the check in __ip6_append_data that would attempt to use partial checksums on the packet. However, raw sockets do not initialize partial checksum fields so partial checksums can't be used. Solve this the same way IPv4 does it. raw sockets pass transhdrlen value of 0 to ip_append_data which causes the checksum to be computed in software. Use the same check in ip6_append_data (check transhdrlen). Reported-by: Wolfgang Walter <linux@stwm.de> CC: Wolfgang Walter <linux@stwm.de> CC: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 22:27:03 -04:00
Eric Dumazet	91dd93f956	netlink: move nl_table in read_mostly section netlink sockets creation and deletion heavily modify nl_table_users and nl_table_lock. If nl_table is sharing one cache line with one of them, netlink performance is really bad on SMP. ffffffff81ff5f00 B nl_table ffffffff81ff5f0c b nl_table_users Putting nl_table in read_mostly section increased performance of my open/delete netlink sockets test by about 80 % This came up while diagnosing a getaddrinfo() problem. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 17:49:06 -04:00
Willem de Bruijn	54d7c01d3e	packet: fix warnings in rollover lock contention Avoid two xchg calls whose return values were unused, causing a warning on some architectures. The relevant variable is a hint and read without mutual exclusion. This fix makes all writers hold the receive_queue lock. Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 17:40:54 -04:00
Ying Xue	fa787ae062	tipc: use sock_create_kern interface to create kernel socket After commit `eeb1bd5c40` ("net: Add a struct net parameter to sock_create_kern"), we should use sock_create_kern() to create kernel socket as the interface doesn't reference count struct net any more. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 13:39:33 -04:00
Brian Haley	dd3aa3b5fb	cls_flower: Fix compile error Fix compile error in net/sched/cls_flower.c net/sched/cls_flower.c: In function ‘fl_set_key’: net/sched/cls_flower.c:240:3: error: implicit declaration of function ‘tcf_change_indev’ [-Werror=implicit-function-declaration] err = tcf_change_indev(net, tb[TCA_FLOWER_INDEV]); Introduced in `77b9900ef5` Fixes: `77b9900ef5` ("tc: introduce Flower classifier") Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 13:34:35 -04:00
Jon Paul Maloy	dd3f9e70f5	tipc: add packet sequence number at instant of transmission Currently, the packet sequence number is updated and added to each packet at the moment a packet is added to the link backlog queue. This is wasteful, since it forces the code to traverse the send packet list packet by packet when adding them to the backlog queue. It would be better to just splice the whole packet list into the backlog queue when that is the right action to do. In this commit, we do this change. Also, since the sequence numbers cannot now be assigned to the packets at the moment they are added the backlog queue, we do instead calculate and add them at the moment of transmission, when the backlog queue has to be traversed anyway. We do this in the function tipc_link_push_packet(). Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:24:46 -04:00
Jon Paul Maloy	f21e897ecc	tipc: improve link congestion algorithm The link congestion algorithm used until now implies two problems. - It is too generous towards lower-level messages in situations of high load by giving "absolute" bandwidth guarantees to the different priority levels. LOW traffic is guaranteed 10%, MEDIUM is guaranteed 20%, HIGH is guaranteed 30%, and CRITICAL is guaranteed 40% of the available bandwidth. But, in the absence of higher level traffic, the ratio between two distinct levels becomes unreasonable. E.g. if there is only LOW and MEDIUM traffic on a system, the former is guaranteed 1/3 of the bandwidth, and the latter 2/3. This again means that if there is e.g. one LOW user and 10 MEDIUM users, the former will have 33.3% of the bandwidth, and the others will have to compete for the remainder, i.e. each will end up with 6.7% of the capacity. - Packets of type MSG_BUNDLER are created at SYSTEM importance level, but only after the packets bundled into it have passed the congestion test for their own respective levels. Since bundled packets don't result in incrementing the level counter for their own importance, only occasionally for the SYSTEM level counter, they do in practice obtain SYSTEM level importance. Hence, the current implementation provides a gap in the congestion algorithm that in the worst case may lead to a link reset. We now refine the congestion algorithm as follows: - A message is accepted to the link backlog only if its own level counter, and all superior level counters, permit it. - The importance of a created bundle packet is set according to its contents. A bundle packet created from messages at levels LOW to CRITICAL is given importance level CRITICAL, while a bundle created from a SYSTEM level message is given importance SYSTEM. In the latter case only subsequent SYSTEM level messages are allowed to be bundled into it. 
This solves the first problem described above, by making the bandwidth guarantee relative to the total number of users at all levels; only the upper limit for each level remains absolute. In the example described above, the single LOW user would use 1/11th of the bandwidth, the same as each of the ten MEDIUM users, but he still has the same guarantee against starvation as the latter ones. The fix also solves the second problem. If the CRITICAL level is filled up by bundle packets of that level, no lower level packets will be accepted any more. Suggested-by: Gergely Kiss <gergely.kiss@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:24:46 -04:00
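The acceptance rule in the tipc congestion entry above ("its own level counter, and all superior level counters, permit it") can be sketched as a short loop. All names here (`backlog_sketch`, `backlog_accepts`, the `IMP_*` constants) are hypothetical and the data layout is an assumption made for illustration; this is not the TIPC implementation.

```c
#include <stdbool.h>

/* Importance levels as named in the commit message, lowest first. */
enum { IMP_LOW, IMP_MEDIUM, IMP_HIGH, IMP_CRITICAL, IMP_SYSTEM, IMP_COUNT };

/* Hypothetical per-link backlog accounting. */
struct backlog_sketch {
    unsigned int len[IMP_COUNT];   /* messages currently queued per level */
    unsigned int limit[IMP_COUNT]; /* upper limit per level */
};

/* Accept a message only if its own level and every superior level
 * still have room. */
static bool backlog_accepts(const struct backlog_sketch *b, int imp)
{
    for (int lvl = imp; lvl < IMP_COUNT; lvl++) {
        if (b->len[lvl] >= b->limit[lvl])
            return false;
    }
    return true;
}
```

Under this rule a full CRITICAL level blocks LOW, MEDIUM and HIGH traffic as well, which corresponds to the behaviour the message describes for bundle packets filling up the CRITICAL level.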
Jon Paul Maloy	cd4eee3c2e	tipc: simplify link supervision checkpointing We change the sequence number checkpointing that is performed by the timer in order to discover if the peer is active. Currently, we store a checkpoint of the next expected sequence number "rcv_nxt" at each timer expiration, and compare it to the current expected number at next timeout expiration. Instead, we now use the already existing field "silent_intv_cnt" for this task. We step the counter at each timeout expiration, and zero it at each valid received packet. If no valid packet has been received from the peer after "abort_limit" number of silent timer intervals, the link is declared faulty and reset. We also remove the multiple instances of timer activation from inside the FSM function "link_state_event()", and now do it at only one place; at the end of the timer function itself. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:24:46 -04:00
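The supervision scheme described in the entry above (step a counter at each timeout, zero it on each valid packet, reset after "abort_limit" silent intervals) might look roughly like the following. The struct and function names are hypothetical, only the two field names come from the commit message, and the exact comparison against the limit is an assumption.

```c
#include <stdbool.h>

/* Hypothetical, simplified link state; only the two counters named in
 * the commit message are modelled. */
struct link_sketch {
    unsigned int silent_intv_cnt; /* timer intervals without a valid packet */
    unsigned int abort_limit;     /* silent intervals tolerated before reset */
};

/* Called for every valid packet received from the peer. */
static void link_sketch_rcv(struct link_sketch *l)
{
    l->silent_intv_cnt = 0;
}

/* Called at each timer expiration; returns true when the link should be
 * declared faulty and reset (assumed threshold: count reaches the limit). */
static bool link_sketch_timeout(struct link_sketch *l)
{
    l->silent_intv_cnt++;
    return l->silent_intv_cnt >= l->abort_limit;
}
```

Compared with checkpointing "rcv_nxt", this needs no stored copy of the sequence number, only a counter that any valid reception clears.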
Jon Paul Maloy	a97b9d3fa9	tipc: rename fields in struct tipc_link We rename some fields in struct tipc_link, in order to give them more descriptive names: next_in_no -> rcv_nxt next_out_no-> snd_nxt fsm_msg_cnt-> silent_intv_cnt cont_intv -> keepalive_intv last_retransmitted -> last_retransm There are no functional changes in this commit. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:24:46 -04:00
Jon Paul Maloy	e4bf4f7696	tipc: simplify packet sequence number handling Although the sequence number in the TIPC protocol is 16 bits, we have until now stored it internally as an unsigned 32-bit integer. We got around this by always doing explicit modulo-65536 operations whenever we need to access a sequence number. We now change the incoming and outgoing sequence numbers to unsigned 16-bit integers, and remove the modulo operations where applicable. We also move the arithmetic inline functions for 16-bit integers to core.h, and the function buf_seqno() to msg.h, so they can easily be accessed from anywhere in the code. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:24:46 -04:00
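To illustrate why native 16-bit storage removes the need for explicit modulo operations, here is a minimal sketch of wraparound-safe sequence arithmetic. The helper names `seq_next` and `seq_less_eq` are hypothetical and are not claimed to match the inline functions moved to core.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: advance a 16-bit sequence number; the cast makes
 * the wrap from 65535 to 0 explicit. */
static inline uint16_t seq_next(uint16_t seq)
{
    return (uint16_t)(seq + 1u);
}

/* Hypothetical helper: true if 'left' is at or before 'right' in the
 * circular 16-bit number space (distance of less than half the space). */
static inline bool seq_less_eq(uint16_t left, uint16_t right)
{
    return (uint16_t)(right - left) < 0x8000u;
}
```

For example, `seq_less_eq(65535, 0)` holds because 0 is one step after 65535 once the counter wraps, while `seq_less_eq(0, 65535)` does not.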
Jon Paul Maloy	a6bf70f792	tipc: simplify include dependencies When we try to add new inline functions in the code, we sometimes run into circular include dependencies. The main problem is that the file core.h, which really should be at the root of the dependency chain, instead is a leaf. I.e., core.h includes a number of header files that themselves should be allowed to include core.h. In reality this is unnecessary, because core.h does not need to know the full signature of any of the structs it refers to, only their type declaration. In this commit, we remove all dependencies from core.h towards any other tipc header file. As a consequence of this change, we can now move the function tipc_own_addr(net) from addr.c to addr.h, and make it inline. There are no functional changes in this commit. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:24:45 -04:00
Jon Paul Maloy	75b44b018e	tipc: simplify link timer handling Prior to this commit, the link timer has been running at a "continuity interval" of configured link tolerance/4. When a timer wakes up and discovers that there has been no sign of life from the peer during the previous interval, it divides its own timer interval by another factor four, and starts sending one probe per new interval. When the configured link tolerance time has passed without answer, i.e. after 16 unacked probes, the link is declared faulty and reset. This is unnecessarily complex. It is sufficient to continue with the original continuity interval, and instead reset the link after four missed probe responses. This makes the timer handling in the link simpler, and opens up for some planned later changes in this area. This commit implements this change. Reviewed-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:24:45 -04:00
Jon Paul Maloy	b1c29f6b10	tipc: simplify resetting and disabling of bearers Since commit 4b475e3f2f8e4e241de101c8240f1d74d0470494 ("tipc: eliminate delayed link deletion at link failover") the extra boolean parameter "shutting_down" is no longer needed for the functions bearer_disable() and tipc_link_delete_list(). Furthermore, the function tipc_link_reset_links(), called from bearer_reset(), is now unnecessary. We can just as well delete all the links, as we do in bearer_disable(), and start over with creating new links. This commit introduces those changes. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:24:45 -04:00
Zhang Chunyu	12b7ed29bd	netfilter: xt_MARK: Add ARP support Add arpt_MARK to xt_mark. The corresponding userspace update is available at: http://git.netfilter.org/arptables/commit/?id=4bb2f8340783fd3a3f70aa6f8807428a280f8474 Signed-off-by: Zhang Chunyu <zhangcy@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-14 13:00:27 +02:00
Denys Vlasenko	a3b1c1eb50	netfilter: ipset: deinline ip_set_put_extensions() On x86 allyesconfig build: The function compiles to 489 bytes of machine code. It has 25 callsites. text data bss dec hex filename 82441375 22255384 20627456 125324215 7784bb7 vmlinux.before 82434909 22255384 20627456 125317749 7783275 vmlinux Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> CC: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> CC: Eric W. Biederman <ebiederm@xmission.com> CC: David S. Miller <davem@davemloft.net> CC: Jan Engelhardt <jengelh@medozas.de> CC: Jiri Pirko <jpirko@redhat.com> CC: linux-kernel@vger.kernel.org CC: netdev@vger.kernel.org CC: netfilter-devel@vger.kernel.org Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-14 12:51:19 +02:00
Florian Westphal	a9fcc6a41d	netfilter: bridge: free nf_bridge info on xmit nf_bridge information is only needed for -m physdev, so we can always free it after POST_ROUTING. This has the advantage that allocation and free will typically happen on the same cpu. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-14 12:43:49 +02:00
Florian Westphal	7fb48c5bc3	netfilter: bridge: neigh_head and physoutdev can't be used at same time The neigh_header is only needed when we detect DNAT after prerouting and neigh cache didn't have a mac address for us. The output port has not been chosen yet so we can re-use the storage area, bringing struct size down to 32 bytes on x86_64. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-14 12:43:48 +02:00
Wesley Kuo	177d0506a9	Bluetooth: Fix remote name event return directly. This patch fixes hci_remote_name_evt not resolving the name while the discovery status is RESOLVING. Before simultaneous dual mode scan enabled, hci_check_pending_name will set discovery status to STOPPED eventually. Signed-off-by: Wesley Kuo <wesley.kuo@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2015-05-14 10:35:04 +02:00
Pablo Neira	e687ad60af	netfilter: add netfilter ingress hook after handle_ing() under unique static key This patch adds the Netfilter ingress hook just after the existing tc ingress hook, that seems to be the consensus solution for this. Note that the Netfilter hook resides under the global static key that enables ingress filtering. Nonetheless, Netfilter still also has its own static key for minimal impact on the existing handle_ing(). * Without this patch: Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags) 16086246pps 7721Mb/sec (7721398080bps) errors: 100000000 42.46% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core 25.92% kpktgend_0 [kernel.kallsyms] [k] kfree_skb 7.81% kpktgend_0 [pktgen] [k] pktgen_thread_worker 5.62% kpktgend_0 [kernel.kallsyms] [k] ip_rcv 2.70% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal 2.34% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk 1.44% kpktgend_0 [kernel.kallsyms] [k] __build_skb * With this patch: Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags) 16090536pps 7723Mb/sec (7723457280bps) errors: 100000000 41.23% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core 26.57% kpktgend_0 [kernel.kallsyms] [k] kfree_skb 7.72% kpktgend_0 [pktgen] [k] pktgen_thread_worker 5.55% kpktgend_0 [kernel.kallsyms] [k] ip_rcv 2.78% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal 2.06% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk 1.43% kpktgend_0 [kernel.kallsyms] [k] __build_skb * Without this patch + tc ingress: tc filter add dev eth4 parent ffff: protocol ip prio 1 \ u32 match ip dst 4.3.2.1/32 Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags) 10788648pps 5178Mb/sec (5178551040bps) errors: 100000000 40.99% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core 17.50% kpktgend_0 [kernel.kallsyms] [k] kfree_skb 11.77% kpktgend_0 [cls_u32] [k] u32_classify 5.62% kpktgend_0 [kernel.kallsyms] [k] tc_classify_compat 5.18% kpktgend_0 [pktgen] [k] 
pktgen_thread_worker 3.23% kpktgend_0 [kernel.kallsyms] [k] tc_classify 2.97% kpktgend_0 [kernel.kallsyms] [k] ip_rcv 1.83% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal 1.50% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk 0.99% kpktgend_0 [kernel.kallsyms] [k] __build_skb * With this patch + tc ingress: tc filter add dev eth4 parent ffff: protocol ip prio 1 \ u32 match ip dst 4.3.2.1/32 Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags) 10743194pps 5156Mb/sec (5156733120bps) errors: 100000000 42.01% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core 17.78% kpktgend_0 [kernel.kallsyms] [k] kfree_skb 11.70% kpktgend_0 [cls_u32] [k] u32_classify 5.46% kpktgend_0 [kernel.kallsyms] [k] tc_classify_compat 5.16% kpktgend_0 [pktgen] [k] pktgen_thread_worker 2.98% kpktgend_0 [kernel.kallsyms] [k] ip_rcv 2.84% kpktgend_0 [kernel.kallsyms] [k] tc_classify 1.96% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal 1.57% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk Note that the results are very similar before and after. I can see gcc gets the code under the ingress static key out of the hot path. Then, on that cold branch, it generates the code to accommodate the netfilter ingress static key. My explanation for this is that this reduces the pressure on the instruction cache for non-users as the new code is out of the hot path, and it comes with minimal impact for tc ingress users. Using gcc version 4.8.4 on: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 [...] L1d cache: 16K L1i cache: 64K L2 cache: 2048K L3 cache: 8192K Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 01:10:05 -04:00
Pablo Neira	1cf51900f8	net: add CONFIG_NET_INGRESS to enable ingress filtering This new config switch enables the ingress filtering infrastructure that is controlled through the ingress_needed static key. This prepares the introduction of the Netfilter ingress hook that resides under this unique static key. Note that CONFIG_SCH_INGRESS automatically selects this, that should be no problem since this also depends on CONFIG_NET_CLS_ACT. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 01:10:05 -04:00
Pablo Neira	f719148346	netfilter: add hook list to nf_hook_state Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 01:10:05 -04:00
Vlad Yasevich	be346ffaad	vlan: Correctly propagate promisc\|allmulti flags in notifier. Currently vlan notifier handler will try to update all vlans for a device when that device comes up. A problem occurs, however, when the vlan device was set to promiscuous, but not by the user (ex: a bridge). In that case, dev->gflags are not updated. What results is that the lower device ends up with an extra promiscuity count. Here are the backtraces that prove this: [62852.052179] [<ffffffff814fe248>] __dev_set_promiscuity+0x38/0x1e0 [62852.052186] [<ffffffff8160bcbb>] ? _raw_spin_unlock_bh+0x1b/0x40 [62852.052188] [<ffffffff814fe4be>] ? dev_set_rx_mode+0x2e/0x40 [62852.052190] [<ffffffff814fe694>] dev_set_promiscuity+0x24/0x50 [62852.052194] [<ffffffffa0324795>] vlan_dev_open+0xd5/0x1f0 [8021q] [62852.052196] [<ffffffff814fe58f>] __dev_open+0xbf/0x140 [62852.052198] [<ffffffff814fe88d>] __dev_change_flags+0x9d/0x170 [62852.052200] [<ffffffff814fe989>] dev_change_flags+0x29/0x60 The above comes from setting the vlan device to IFF_UP state. [62852.053569] [<ffffffff814fe248>] __dev_set_promiscuity+0x38/0x1e0 [62852.053571] [<ffffffffa032459b>] ? vlan_dev_set_rx_mode+0x2b/0x30 [8021q] [62852.053573] [<ffffffff814fe8d5>] __dev_change_flags+0xe5/0x170 [62852.053645] [<ffffffff814fe989>] dev_change_flags+0x29/0x60 [62852.053647] [<ffffffffa032334a>] vlan_device_event+0x18a/0x690 [8021q] [62852.053649] [<ffffffff8161036c>] notifier_call_chain+0x4c/0x70 [62852.053651] [<ffffffff8109d456>] raw_notifier_call_chain+0x16/0x20 [62852.053653] [<ffffffff814f744d>] call_netdevice_notifiers+0x2d/0x60 [62852.053654] [<ffffffff814fe1a3>] __dev_notify_flags+0x33/0xa0 [62852.053656] [<ffffffff814fe9b2>] dev_change_flags+0x52/0x60 [62852.053657] [<ffffffff8150cd57>] do_setlink+0x397/0xa40 And this one comes from the notification code. What we end up with is a vlan with a promiscuity count of 1 and a physical device with a promiscuity count of 2. They should both have a count of 1. 
To resolve this issue, vlan code can use dev_get_flags() api which correctly masks promiscuity and allmulti flags. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 00:54:32 -04:00
Alexander Duyck	a080e7bd0a	net: Reserve skb headroom and set skb->dev even if using __alloc_skb When I had inlined __alloc_rx_skb into __netdev_alloc_skb and __napi_alloc_skb I had overlooked the fact that there was a return in the __alloc_rx_skb. As a result we weren't reserving headroom or setting the skb->dev in certain cases. This change corrects that by adding a couple of jump labels to jump to depending on __alloc_skb either succeeding or failing. Fixes: `9451980a66` ("net: Use cached copy of pfmemalloc to avoid accessing page") Reported-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Tested-by: Kevin Hilman <khilman@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 18:07:24 -04:00
John W. Linville	d37d29c305	geneve_core: identify as driver library in modules description Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:59:13 -04:00
John W. Linville	11e1fa46b4	geneve: Rename support library as geneve_core net/ipv4/geneve.c -> net/ipv4/geneve_core.c This name better reflects the purpose of the module. Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:59:13 -04:00
John W. Linville	35d32e8fe4	geneve: move definition of geneve_hdr() to geneve.h This is a static inline with identical definitions in multiple places... Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:59:13 -04:00
John W. Linville	125907ae5e	geneve: remove MODULE_ALIAS_RTNL_LINK from net/ipv4/geneve.c This file is essentially a library for implementing the geneve encapsulation protocol. The file does not register any rtnl_link_ops, so the MODULE_ALIAS_RTNL_LINK macro is inappropriate here. Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:59:12 -04:00
Willem de Bruijn	a9b6391814	packet: rollover statistics Rollover indicates exceptional conditions. Export a counter to inform socket owners of this state. If no socket with sufficient room is found, rollover fails. Also count these events. Finally, also count when flows are rolled over early thanks to huge flow detection, to validate its correctness. Tested: Read counters in bench_rollover on all other tests in the patchset Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:43:00 -04:00
Willem de Bruijn	3b3a5b0aab	packet: rollover huge flows before small flows Migrate flows from a socket to another socket in the fanout group not only when the socket is full. Start migrating huge flows early, to divert possible 4-tuple attacks without affecting normal traffic. Introduce fanout_flow_is_huge(). This detects huge flows, which are defined as taking up more than half the load. It does so cheaply, by storing the rxhashes of the N most recent packets. If over half of these are the same rxhash as the current packet, then drop it. This only protects against 4-tuple attacks. N is chosen to fit all data in a single cache line. Tested: Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input. lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s cpu rx rx.k drop.k rollover r.huge r.failed 0 14 14 0 0 0 0 1 20 20 0 0 0 0 2 16 16 0 0 0 0 3 6168824 6168824 0 4867721 4867721 0 4 4867741 4867741 0 0 0 0 5 12 12 0 0 0 0 6 15 15 0 0 0 0 7 17 17 0 0 0 0 Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:43:00 -04:00
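
The huge-flow check described in the row above can be sketched in userspace C. This is a minimal illustration, not the kernel's `fanout_flow_is_huge()`: the names `flow_history`, `flow_is_huge` and `HISTORY_LEN` are hypothetical, the history length of 16 assumes 4-byte hashes and a 64-byte cache line, and the round-robin slot replacement is an assumption made to keep the example deterministic.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 16 entries x 4 bytes fit one 64-byte cache line. */
#define HISTORY_LEN 16

/* Hypothetical per-socket state: hashes of the most recent packets. */
struct flow_history {
	uint32_t hash[HISTORY_LEN];
	size_t next;	/* slot to overwrite next (round-robin, an assumption) */
};

/*
 * Return true when more than half of the remembered packets carry the
 * same hash as the current one, then record the current hash.
 */
bool flow_is_huge(struct flow_history *h, uint32_t rxhash)
{
	size_t i, count = 0;

	for (i = 0; i < HISTORY_LEN; i++)
		if (h->hash[i] == rxhash)
			count++;

	h->hash[h->next] = rxhash;
	h->next = (h->next + 1) % HISTORY_LEN;

	return count > HISTORY_LEN / 2;
}
```

Because the whole history sits in one small array, the check costs a single linear scan per packet and touches no other memory, which is the property the commit message calls "cheap".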
Willem de Bruijn	2ccdbaa6d5	packet: rollover lock contention avoidance Rollover has to call packet_rcv_has_room on sockets in the fanout group to find a socket to migrate to. This operation is expensive especially if the packet sockets use rings, when a lock has to be acquired. Avoid pounding on the lock by all sockets by temporarily marking a socket as "under memory pressure" when such pressure is detected. While set, only the socket owner may call packet_rcv_has_room on the socket. Once it detects normal conditions, it clears the flag. The socket is not used as a victim by any other socket in the meantime. Under reasonably balanced load, each socket writer frequently calls packet_rcv_has_room and clears its own pressure field. As a backup for when the socket is rarely written to, also clear the flag on reading (packet_recvmsg, packet_poll) if this can be done cheaply (i.e., without calling packet_rcv_has_room). This is only for edge cases. Tested: Ran bench_rollover: a process with 8 sockets in a single fanout group, each pinned to a single cpu that receives one nic recv interrupt. RPS and RFS are disabled. The benchmark uses packet rx_ring, which has to take a lock when determining whether a socket has room. Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread uniformly across the packet sockets (and inserted an iptables rule to drop in PREROUTING to avoid protocol stack processing). Without this patch, all sockets try to migrate traffic to neighbors, causing lock contention when searching for a non-empty neighbor. The lock is the top 9 entries. 
perf record -a -g sleep 5 - 17.82% bench_rollover [kernel.kallsyms] [k] _raw_spin_lock - _raw_spin_lock - 99.00% spin_lock + 81.77% packet_rcv_has_room.isra.41 + 18.23% tpacket_rcv + 0.84% packet_rcv_has_room.isra.41 + 5.20% ksoftirqd/6 [kernel.kallsyms] [k] _raw_spin_lock + 5.15% ksoftirqd/1 [kernel.kallsyms] [k] _raw_spin_lock + 5.14% ksoftirqd/2 [kernel.kallsyms] [k] _raw_spin_lock + 5.12% ksoftirqd/7 [kernel.kallsyms] [k] _raw_spin_lock + 5.12% ksoftirqd/5 [kernel.kallsyms] [k] _raw_spin_lock + 5.10% ksoftirqd/4 [kernel.kallsyms] [k] _raw_spin_lock + 4.66% ksoftirqd/0 [kernel.kallsyms] [k] _raw_spin_lock + 4.45% ksoftirqd/3 [kernel.kallsyms] [k] _raw_spin_lock + 1.55% bench_rollover [kernel.kallsyms] [k] packet_rcv_has_room.isra.41 On net-next with this patch, this lock contention is no longer a top entry. Most time is spent in the actual read function. Next up are other locks: + 15.52% bench_rollover bench_rollover [.] reader + 4.68% swapper [kernel.kallsyms] [k] memcpy_erms + 2.77% swapper [kernel.kallsyms] [k] packet_lookup_frame.isra.51 + 2.56% ksoftirqd/1 [kernel.kallsyms] [k] memcpy_erms + 2.16% swapper [kernel.kallsyms] [k] tpacket_rcv + 1.93% swapper [kernel.kallsyms] [k] mlx4_en_process_rx_cq Looking closer at the remaining _raw_spin_lock, the cost of probing in rollover is now comparable to the cost of taking the lock later in tpacket_rcv. - 1.51% swapper [kernel.kallsyms] [k] _raw_spin_lock - _raw_spin_lock + 33.41% packet_rcv_has_room + 28.15% tpacket_rcv + 19.54% enqueue_to_backlog + 6.45% __free_pages_ok + 2.78% packet_rcv_fanout + 2.13% fanout_demux_rollover + 2.01% netif_receive_skb_internal Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:43:00 -04:00
Willem de Bruijn	9954729bc3	packet: rollover only to socket with headroom Only migrate flows to sockets that have sufficient headroom, where sufficient is defined as having at least 25% empty space. The kernel has three different buffer types: a regular socket, a ring with frames (TPACKET_V[12]) or a ring with blocks (TPACKET_V3). The latter two do not expose a read pointer to the kernel, so headroom is not computed easily. All three need a different implementation to estimate free space. Tested: Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input. bench_rollover has as many sockets as there are NIC receive queues in the system. Each socket is owned by a process that is pinned to one of the receive cpus. RFS is disabled. RPS is enabled with an identity mapping (cpu x -> cpu x), to count drops with softnettop. lpbb5:/export/hda3/willemb# ./bench_rollover -r -l 1000 -s Press [Enter] to exit cpu rx rx.k drop.k rollover r.huge r.failed 0 16 16 0 0 0 0 1 21 21 0 0 0 0 2 5227502 5227502 0 0 0 0 3 18 18 0 0 0 0 4 6083289 6083289 0 5227496 0 0 5 22 22 0 0 0 0 6 21 21 0 0 0 0 7 9 9 0 0 0 0 Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:42:59 -04:00
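
For the simplest of the three buffer types (a regular socket buffer with a known capacity and fill level), the 25% headroom rule from the row above might look like the sketch below. The function name `has_headroom` and its parameters are hypothetical, and the ring-based TPACKET variants, which the commit says need their own estimates, are deliberately not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: a buffer qualifies as a rollover target only if
 * at least a quarter of its capacity is still free.
 */
bool has_headroom(size_t capacity, size_t used)
{
	/* Clamp so that an over-full buffer reports zero free space. */
	size_t avail = used < capacity ? capacity - used : 0;

	return capacity > 0 && avail >= capacity / 4;
}
```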
Willem de Bruijn	0648ab70af	packet: rollover prepare: per-socket state Replace rollover state per fanout group with state per socket. Future patches will add fields to the new structure. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:42:59 -04:00
Willem de Bruijn	ad377cab49	packet: rollover prepare: move code out of callsites packet_rcv_fanout calls fanout_demux_rollover twice. Move all rollover logic into the callee to simplify these callsites, especially with upcoming changes. The main differences between the two callsites is that the FLAG variant tests whether the socket previously selected by another mode (RR, RND, HASH, ..) has room before migrating flows, whereas the rollover mode has no original socket to test. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:42:59 -04:00
Eric Dumazet	7d771aaac7	ipv4: __ip_local_out_sk() is static __ip_local_out_sk() is only used from net/ipv4/ip_output.c net/ipv4/ip_output.c:94:5: warning: symbol '__ip_local_out_sk' was not declared. Should it be static? Fixes: `7026b1ddb6` ("netfilter: Pass socket pointer down through okfn().") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:21:33 -04:00
Eric Dumazet	216f8bb9f6	tcp/dccp: tw_timer_handler() is static tw_timer_handler() is only used from net/ipv4/inet_timewait_sock.c Fixes: `789f558cfb` ("tcp/dccp: get rid of central timewait timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:21:33 -04:00
Jiri Pirko	77b9900ef5	tc: introduce Flower classifier This patch introduces a flow-based filter. So far, the very essential packet fields are supported. This patch is only the first step. There is a lot of potential performance improvements possible to implement. Also a lot of features are missing now. They will be addressed in follow-up patches. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:19:48 -04:00
Jiri Pirko	59346afe7a	flow_dissector: change port array into src, dst tuple Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:19:47 -04:00
Jiri Pirko	67a900cc04	flow_dissector: introduce support for Ethernet addresses Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:19:47 -04:00
Jiri Pirko	b924933cbb	flow_dissector: introduce support for ipv6 addresses So far, only hashes made out of ipv6 addresses could be dissected. This patch introduces support for dissection of full ipv6 addresses. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:19:47 -04:00
Jiri Pirko	06635a35d1	flow_dissect: use programmable dissector in skb_flow_dissect and friends Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:19:47 -04:00
Jiri Pirko	fbff949e3b	flow_dissector: introduce programmable flow_dissector Introduce dissector infrastructure which allows the user to specify which parts of the skb to dissect. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:19:47 -04:00
Jiri Pirko	0db89b8b32	flow_dissector: fix doc for skb_get_poff Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:19:46 -04:00
Jiri Pirko	638b2a699f	net: move netdev_pick_tx and dependencies to net/core/dev.c next to its user. No relation to flow_dissector so it makes no sense to have it in flow_dissector.c Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:19:46 -04:00
Jiri Pirko	5605c76240	net: move __skb_tx_hash to dev.c __skb_tx_hash function has no relation to flow_dissect so just move it to dev.c Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:19:46 -04:00
Jiri Pirko	d4fd327571	flow_dissector: fix doc for __skb_get_hash and remove couple of empty lines Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:19:46 -04:00
Jiri Pirko	10b89ee43e	net: move *skb_get_poff declarations into correct header Since these functions are defined in flow_dissector.c, move header declarations from skbuff.h into flow_dissector.h Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:19:45 -04:00
Jiri Pirko	1bd758eb1c	net: change name of flow_dissector header to match the .c file name add couple of empty lines on the way. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:19:45 -04:00
Florian Westphal	e578d9c025	net: sched: use counter to break reclassify loops Seems all we want here is to avoid endless 'goto reclassify' loop. tc_classify_compat even resets this counter when something other than TC_ACT_RECLASSIFY is returned, so this skb-counter doesn't break hypothetical loops induced by something other than perpetual TC_ACT_RECLASSIFY return values. skb_act_clone is now identical to skb_clone, so just use that. Tested with following (bogus) filter: tc filter add dev eth0 parent ffff: \ protocol ip u32 match u32 0 0 police rate 10Kbit burst \ 64000 mtu 1500 action reclassify Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:08:14 -04:00
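
The idea in the row above, bounding the reclassify loop with a plain local counter rather than state carried in the skb, can be sketched as follows. All names here (`enum verdict`, `MAX_RECLASSIFY`, `classify_bounded` and the two example classifiers) are hypothetical stand-ins, and the limit of 4 is an assumption for illustration; the kernel's own constants and return codes are not reproduced.

```c
#include <stddef.h>

/* Hypothetical verdicts standing in for the TC_ACT_* return values. */
enum verdict { VERDICT_OK, VERDICT_RECLASSIFY, VERDICT_DROP };

/* Assumed bound on consecutive reclassify requests. */
#define MAX_RECLASSIFY 4

typedef enum verdict (*classify_fn)(void *pkt);

/* Run the classifier until it stops asking for reclassification. */
enum verdict classify_bounded(classify_fn classify, void *pkt)
{
	int limit = 0;	/* local counter; nothing is stored in the packet */

	for (;;) {
		enum verdict v = classify(pkt);

		if (v != VERDICT_RECLASSIFY)
			return v;
		if (limit++ >= MAX_RECLASSIFY)
			return VERDICT_DROP;	/* give up on a perpetual loop */
	}
}

/* Example classifier that never settles, like the bogus test filter. */
enum verdict always_reclassify(void *pkt)
{
	(void)pkt;
	return VERDICT_RECLASSIFY;
}

/* Example classifier that accepts on its third invocation. */
enum verdict accept_on_third(void *pkt)
{
	int *calls = pkt;

	return ++*calls < 3 ? VERDICT_RECLASSIFY : VERDICT_OK;
}
```

Since the counter lives on the stack, it only guards against a run of reclassify verdicts within one call, which matches the commit's remark that the old per-skb counter never protected against other kinds of loops anyway.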
David S. Miller	b04096ff33	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Four minor merge conflicts: 1) qca_spi.c renamed the local variable used for the SPI device from spi_device to spi, meanwhile the spi_set_drvdata() call got moved further up in the probe function. 2) Two changes were both adding new members to codel params structure, and thus we had overlapping changes to the initializer function. 3) 'net' was making a fix to sk_release_kernel() which is completely removed in 'net-next'. 4) In net_namespace.c, the rtnl_net_fill() call for GET operations had the command value fixed, meanwhile 'net-next' adjusted the argument signature a bit. This also matches example merge resolutions posted by Stephen Rothwell over the past two days. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 14:31:43 -04:00
Scott Feldman	42275bd8fc	switchdev: don't use anonymous union on switchdev attr/obj structs Older gcc versions (e.g. gcc version 4.4.6) don't like anonymous unions which was causing build issues on the newly added switchdev attr/obj structs. Fix this by using named union on structs. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Reported-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 14:20:59 -04:00
Scott Feldman	7a7ee5312d	switchdev: sparse warning: pass ipv4 fib dst as network-byte order And let driver convert it to host-byte order as needed. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 12:26:27 -04:00
Scott Feldman	22c1f67ea5	switchdev: sparse warning: make __switchdev_port_obj_add static Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 12:26:27 -04:00
Jozsef Kadlecsik	a9756e6f63	netfilter: ipset: Use better include files in xt_set.c Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-13 18:21:13 +02:00
Sergey Popovich	1823fb79e5	netfilter: ipset: Improve preprocessor macros checks Check if mandatory MTYPE, HTYPE and HOST_MASK macros defined. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-13 18:21:13 +02:00
Sergey Popovich	58cc06daea	netfilter: ipset: Fix hashing for ipv6 sets HKEY_DATALEN remains defined after first inclusion of ip_set_hash_gen.h, so it is incorrectly reused for IPv6 code. Undefine HKEY_DATALEN in ip_set_hash_gen.h at the end. Also remove some useless defines of HKEY_DATALEN in ip_set_hash_{ip{,mark,port},netiface}.c as ip_set_hash_gen.h defines it correctly for such set types anyway. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-13 18:21:12 +02:00
Sergey Popovich	037261866c	netfilter: ipset: Check for comment netlink attribute length Ensure userspace supplies string not longer than IPSET_MAX_COMMENT_SIZE. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-13 13:25:47 +02:00
Sergey Popovich	728a7e6903	netfilter: ipset: Return bool values instead of int Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-13 13:25:47 +02:00
Sergey Popovich	cabfd139aa	netfilter: ipset: Use HOST_MASK literal to represent host address CIDR len Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-13 13:25:47 +02:00
Sergey Popovich	d25472e470	netfilter: ipset: Check IPSET_ATTR_PORT only once We do not need to check tb[IPSET_ATTR_PORT] != NULL before retrieving port, as this attribute is known to exist due to ip_set_attr_netorder() returning true only when attribute exists and it is in network byte order. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-13 13:25:46 +02:00
Sergey Popovich	8e55d2e590	netfilter: ipset: Return ipset error instead of bool Statement ret = func1() \|\| func2() returns 0 when both func1() and func2() return 0, or 1 if func1() or func2() returns non-zero. However in our case func1() and func2() returns error code on failure, so it seems good to propagate such error codes, rather than returning 1 in case of failure. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-13 13:25:46 +02:00
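
The difference between the two return conventions in the row above is easy to show in isolation. In this sketch `check_first()` and `check_second()` are hypothetical stand-ins for `func1()` and `func2()`, assumed to return 0 on success and a negative errno value on failure.

```c
#include <errno.h>

/* Hypothetical checks: 0 on success, negative errno on failure. */
static int check_first(int fail)
{
	return fail ? -EINVAL : 0;
}

static int check_second(int fail)
{
	return fail ? -ERANGE : 0;
}

/* Old pattern: logical OR collapses every failure to the value 1. */
int validate_bool(int fail1, int fail2)
{
	return check_first(fail1) || check_second(fail2);
}

/* New pattern: hand the first non-zero error code back unchanged. */
int validate_errno(int fail1, int fail2)
{
	int ret = check_first(fail1);

	if (ret)
		return ret;
	return check_second(fail2);
}
```

With the second form a caller can tell which check failed, whereas `validate_bool()` can only report that something did.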
Sergey Popovich	43ef29c91a	netfilter: ipset: Preprocessor directives cleanup * Undefine mtype_data_reset_elem before defining. * Remove duplicated mtype_gc_init undefine, move mtype_gc_init define closer to mtype_gc define. * Use htype instead of HTYPE in IPSET_TOKEN(HTYPE, _create)(). * Remove PF definition from sets: no longer used. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-13 13:25:46 +02:00
Sergey Popovich	2b67d6e01d	netfilter: ipset: No need to make nomatch bitfield We do not store cidr packed with no match, so there is no need to make nomatch bitfield. This simplifies mtype_data_reset_flags() a bit. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-13 13:25:45 +02:00
Sergey Popovich	caed0ed35b	netfilter: ipset: Properly calculate extensions offsets and total length Offsets and total length returned by the ip_set_elem_len() are calculated incorrectly as initial set element length (i.e. len parameter) is used multiple times in offset calculations, also affecting set element total length. Use initial set element length as start offset, do not add aligned extension offset to the offset. Return offset as total length of the set element. This reduces memory requirements on a per-element basis for the hash:* type of sets. For example output from 'ipset -terse list test-1' on 64-bit PC, where test-1 is generated via following script: #!/bin/bash set_name='test-1' ipset create "$set_name" hash:net family inet \ timeout 10800 counters comment \ hashsize 65536 maxelem 65536 declare -i o3 o4 fmt="add $set_name 192.168.%u.%u\n" for ((o3 = 0; o3 < 256; o3++)); do for ((o4 = 0; o4 < 256; o4++)); do printf "$fmt" $o3 $o4 done done \|ipset -exist restore BEFORE this patch is applied # ipset -terse list test-1 Name: test-1 Type: hash:net Revision: 6 Header: family inet hashsize 65536 maxelem 65536 timeout 10800 counters comment Size in memory: 26348440 and AFTER applying patch # ipset -terse list test-1 Name: test-1 Type: hash:net Revision: 6 Header: family inet hashsize 65536 maxelem 65536 timeout 10800 counters comment Size in memory: 7706392 References: 0 Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-13 13:25:45 +02:00
Alexander Drozdov	3e4e8d126c	netfilter: ipset: make ip_set_get_ip*_port to use skb_network_offset All the ipset functions respect skb->network_header value, except for ip_set_get_ip4_port() & ip_set_get_ip6_port(). The functions should use skb_network_offset() to get the transport header offset. Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-13 13:25:45 +02:00
Jozsef Kadlecsik	22496f098b	netfilter: ipset: Give a better name to a macro in ip_set_core.c Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-13 13:25:44 +02:00
Jozsef Kadlecsik	2006aa4a8c	netfilter: ipset: Fix sparse warning "warning: cast to restricted __be32" warnings are fixed Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-05-13 13:25:44 +02:00
Herbert Xu	6d7258ca93	esp6: Use high-order sequence number bits for IV generation I noticed we were only using the low-order bits for IV generation when ESN is enabled. This is very bad because it means that the IV can repeat. We must use the full 64 bits. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2015-05-13 09:34:54 +02:00
Herbert Xu	64aa42338e	esp4: Use high-order sequence number bits for IV generation I noticed we were only using the low-order bits for IV generation when ESN is enabled. This is very bad because it means that the IV can repeat. We must use the full 64 bits. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2015-05-13 09:34:53 +02:00
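
Both ESP rows above make the same point: with extended sequence numbers (ESN) enabled, the value fed into IV generation has to combine the high and low 32-bit halves, otherwise it repeats every 2^32 packets. A minimal sketch of that combination follows; the function name `esn_full_seq` is hypothetical, and byte-order conversion and the IV derivation itself are left out.

```c
#include <stdint.h>

/*
 * Hypothetical helper: build the full 64-bit sequence number from the
 * two 32-bit halves tracked when ESN is in use.
 */
uint64_t esn_full_seq(uint32_t hi, uint32_t low)
{
	return ((uint64_t)hi << 32) | low;
}
```

Two packets that share the same low word but differ in the high word now yield different values, which is what keeps the derived IV from repeating.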
Linus Torvalds	110bc76729	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Handle max TX power properly wrt VIFs and the MAC in iwlwifi, from Avri Altman. 2) Use the correct FW API for scan completions in iwlwifi, from Avraham Stern. 3) FW monitor in iwlwifi accidentally uses unmapped memory, fix from Liad Kaufman. 4) rhashtable conversion of mac80211 station table was buggy, the virtual interface was not taken into account. Fix from Johannes Berg. 5) Fix deadlock in rtlwifi by not using a zero timeout for usb_control_msg(), from Larry Finger. 6) Update reordering state before calculating loss detection, from Yuchung Cheng. 7) Fix off by one in bluetooth firmware parsing, from Dan Carpenter. 8) Fix extended frame handling in xilinx_can driver, from Jeppe Ledet-Pedersen. 9) Fix CODEL packet scheduler behavior in the presence of TSO packets, from Eric Dumazet. 10) Fix NAPI budget testing in fm10k driver, from Alexander Duyck. 11) macvlan needs to propagate promisc settings down the lower device, from Vlad Yasevich. 12) igb driver can oops when changing number of rings, from Toshiaki Makita. 13) Source specific default routes not handled properly in ipv6, from Markus Stenberg. 14) Use after free in tc_ctl_tfilter(), from WANG Cong. 15) Use softirq spinlocking in netxen driver, from Tony Camuso. 16) Two ARM bpf JIT fixes from Nicolas Schichan. 17) Handle MSG_DONTWAIT properly in ring based AF_PACKET sends, from Mathias Kretschmer. 18) Fix x86 bpf JIT implementation of FROM_{BE16,LE16,LE32}, from Alexei Starovoitov. 19) ll_temac driver DMA maps TX packet header with incorrect length, fix from Michal Simek. 20) We removed pm_qos bits from netdevice.h, but some indirect references remained. Kill them. From David Ahern. 
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits) net: Remove remaining remnants of pm_qos from netdevice.h e1000e: Add pm_qos header net: phy: micrel: Fix regression in kszphy_probe net: ll_temac: Fix DMA map size bug x86: bpf_jit: fix FROM_BE16 and FROM_LE16/32 instructions netns: return RTM_NEWNSID instead of RTM_GETNSID on a get Update be2net maintainers' email addresses net_sched: gred: use correct backlog value in WRED mode pppoe: drop pppoe device in pppoe_unbind_sock_work net: qca_spi: Fix possible race during probe net: mdio-gpio: Allow for unspecified bus id af_packet / TX_RING not fully non-blocking (w/ MSG_DONTWAIT). bnx2x: limit fw delay in kdump to 5s after boot ARM: net: delegate filter to kernel interpreter when imm_offset() return value can't fit into 12bits. ARM: net fix emit_udiv() for BPF_ALU \| BPF_DIV \| BPF_K instruction. mpls: Change reserved label names to be consistent with netbsd usbnet: avoid integer overflow in start_xmit netxen_nic: use spin_[un]lock_bh around tx_clean_lock (2) net: xgene_enet: Set hardware dependency net: amd-xgbe: Add hardware dependency ...	2015-05-12 21:10:38 -07:00
Ying Xue	9449c3cd90	net: make skb_dst_pop routine static As xfrm_output_one() is the only caller of skb_dst_pop(), we should make skb_dst_pop() localized. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 23:19:49 -04:00
Alexei Starovoitov	9eea922264	pktgen: fix packet generation pkt_gen->last_ok was not set properly, so after the first burst pktgen instead of allocating new packet, will reuse old one, advance eth_type_trans further, which would mean the stack will be seeing very short bogus packets. Fixes: `62f64aed62` ("pktgen: introduce xmit_mode '<start_xmit\|netif_receive>'") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 23:09:52 -04:00
Denys Vlasenko	a2029240e5	net: deinline netif_tx_stop_all_queues(), remove WARN_ON in netif_tx_stop_queue() These functions compile to 60 bytes of machine code each. With this .config: http://busybox.net/~vda/kernel_config there are 617 calls of netif_tx_stop_queue() and 49 calls of netif_tx_stop_all_queues() in vmlinux. To fix this, remove WARN_ON in netif_tx_stop_queue() as suggested by davem, and deinline netif_tx_stop_all_queues(). Change in code size is about 20k: text data bss dec hex filename 82426986 22255416 20627456 125309858 77813a2 vmlinux.before 82406248 22255416 20627456 125289120 777c2a0 vmlinux gcc-4.7.2 still creates deinlined version of netif_tx_stop_queue sometimes: $ nm --size-sort vmlinux \| grep netif_tx_stop_queue \| wc -l 190 ffffffff81b558a8 <netif_tx_stop_queue>: ffffffff81b558a8: 55 push %rbp ffffffff81b558a9: 48 89 e5 mov %rsp,%rbp ffffffff81b558ac: f0 80 8f e0 01 00 00 lock orb $0x1,0x1e0(%rdi) ffffffff81b558b3: 01 ffffffff81b558b4: 5d pop %rbp ffffffff81b558b5: c3 retq This needs additional fixing. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> CC: Alexei Starovoitov <alexei.starovoitov@gmail.com> CC: Alexander Duyck <alexander.duyck@gmail.com> CC: Joe Perches <joe@perches.com> CC: David S. Miller <davem@davemloft.net> CC: Jiri Pirko <jpirko@redhat.com> CC: linux-kernel@vger.kernel.org CC: netdev@vger.kernel.org CC: netfilter-devel@vger.kernel.org Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 23:05:35 -04:00
Nicolas Dichtel	e3d8ecb70e	netns: return RTM_NEWNSID instead of RTM_GETNSID on a get Usually, RTM_NEWxxx is returned on a get (same as a dump). Fixes: `0c7aecd4bd` ("netns: add rtnl cmd to add and get peer netns ids") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 18:53:25 -04:00
Scott Feldman	7889cbee83	switchdev: remove NETIF_F_HW_SWITCH_OFFLOAD feature flag Roopa said remove the feature flag for this series and she'll work on bringing it back if needed at a later date. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 18:43:55 -04:00
Scott Feldman	58c2cb16b1	switchdev: convert fib_ipv4_add/del over to switchdev_port_obj_add/del The IPv4 FIB ops convert nicely to the switchdev objs and we're left with only four switchdev ops: port get/set and port add/del. Other objs will follow, such as FDB. So go ahead and convert IPv4 FIB over to switchdev obj for consistency, anticipating more objs to come. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 18:43:55 -04:00
Scott Feldman	8793d0a664	switchdev: add new switchdev_port_bridge_getlink Like bridge_setlink, add switchdev wrapper to handle bridge_getlink and call into port driver to get port attrs. For now, only BR_LEARNING and BR_LEARNING_SYNC are returned. To add more, we'll probably want to break away from ndo_dflt_bridge_getlink() and build the netlink skb directly in the switchdev code. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 18:43:55 -04:00
Scott Feldman	8508025c59	bridge: revert br_dellink change back to original This is revert of: commit `68e331c785` ("bridge: offload bridge port attributes to switch asic if feature flag set") Restore br_dellink back to original and don't call into SELF port driver. rtnetlink.c:bridge_dellink() already does a call into port driver for SELF. bridge vlan add/del cmd defaults to MASTER. From man page for bridge vlan add/del cmd: self the vlan is configured on the specified physical device. Required if the device is the bridge device. master the vlan is configured on the software bridge (default). Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 18:43:55 -04:00
Scott Feldman	87a5dae59e	switchdev: remove unused switchdev_port_bridge_dellink Now we can remove old wrappers for dellink. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 18:43:55 -04:00
Scott Feldman	5c34e02214	switchdev: add new switchdev_port_bridge_dellink Same change as setlink. Provide the wrapper op for SELF ndo_bridge_dellink and call into the switchdev driver to delete afspec VLANs. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 18:43:55 -04:00
Scott Feldman	41c498b935	bridge: restore br_setlink back to original This is revert of: commit `68e331c785` ("bridge: offload bridge port attributes to switch asic if feature flag set") Restore br_setlink back to original and don't call into SELF port driver. rtnetlink.c:bridge_setlink() already does a call into port driver for SELF. bridge set link cmd defaults to MASTER. From man page for bridge link set cmd: self link setting is configured on specified physical device master link setting is configured on the software bridge (default) The link setting has two values: the device-side value and the software bridge-side value. These are independent and settable using the bridge link set cmd by specifying some combination of [master] \| [self]. Furthermore, the device-side and bridge-side settings have their own initial value, viewable from bridge -d link show cmd. Restoring br_setlink back to original makes rocker (the only in-kernel user of SELF link settings) work as first implemented: two-sided values. It's true that when both MASTER and SELF are specified from the command, two netlink notifications are generated, one for each side of the settings. The user-space app can distinguish between the two notifications by observing the MASTER or SELF flag. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 18:43:54 -04:00
Scott Feldman	e71f220b34	switchdev: remove old switchdev_port_bridge_setlink New attr-based bridge_setlink can recurse lower devs and recover on err, so remove old wrapper (including ndo_dflt_switchdev_port_bridge_setlink). Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 18:43:54 -04:00
Scott Feldman	47f8328bb1	switchdev: add new switchdev bridge setlink Add new switchdev_port_bridge_setlink that can be used by drivers implementing .ndo_bridge_setlink to set switchdev bridge attributes. Basically turn the raw rtnl_bridge_setlink netlink into switchdev attr sets. Proper netlink attr policy checking is done on the protinfo part of the netlink msg. Currently, for protinfo, only bridge port attrs BR_LEARNING and BR_LEARNING_SYNC are parsed and passed to port driver. For afspec, VLAN objs are passed so switchdev driver can set VLANs assigned to SELF. To illustrate with iproute2 cmd, we have: bridge vlan add vid 10 dev sw1p1 self master To add VLAN 10 to port sw1p1 for both the bridge (master) and the device (self). Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 18:43:54 -04:00
Scott Feldman	491d0f1533	switchdev: introduce switchdev add/del obj ops Like switchdev attr get/set, add new switchdev obj add/del. switchdev objs will be things like VLANs or FIB entries, so add/del fits better for objects than get/set used for attributes. Use same two-phase prepare-commit transaction model as in attr set. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 18:43:53 -04:00
Scott Feldman	3563606258	switchdev: convert STP update to switchdev attr set STP update is just a settable port attribute, so convert switchdev_port_stp_update to an attr set. For DSA, the prepare phase is skipped and STP updates are only done in the commit phase. This is because currently the DSA drivers don't need to allocate any memory for STP updates and the STP update will not fail to HW (unless something horrible goes wrong on the MDIO bus, in which case the prepare phase wouldn't have been able to predict anyway). Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 18:43:53 -04:00
Scott Feldman	f8e20a9f87	switchdev: convert parent_id_get to switchdev attr get Switch ID is just a gettable port attribute. Convert switchdev op switchdev_parent_id_get to a switchdev attr. Note: for sysfs and netlink interfaces, SWITCHDEV_ATTR_PORT_PARENT_ID is called with SWITCHDEV_F_NO_RECURSE to limit switch ID user-visibility to only port netdevs. So when a port is stacked under bond/bridge, the user can only query switch id via the switch ports, but not via the upper devices. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 18:43:53 -04:00
Scott Feldman	3094333d90	switchdev: introduce get/set attrs ops Add two new swdev ops for get/set switch port attributes. Most swdev interactions on a port are gets or sets on port attributes, so rather than adding ops for each attribute, let's define clean get/set ops for all attributes, and then we can have clear, consistent rules on how attributes propagate on stacked devs. Add the basic algorithms for get/set attr ops. Use the same recursive algo to walk lower devs we've used for STP updates, for example. For get, compare attr value for each lower dev and only return success if attr values match across all lower devs. For sets, set the same attr value for all lower devs. We'll use a two-phase prepare-commit transaction model for sets. In the first phase, the driver(s) are asked if attr set is OK. If all OK, then commit the attr set in the second phase. A driver would NACK the prepare phase if it can't set the attr due to lack of resources or support, within its control. RTNL lock must be held across both phases because we'll recurse all lower devs first in prepare phase, and then recurse all lower devs again in commit phase. If any lower dev fails the prepare phase, we need to abort the transaction for all lower devs. If lower dev recursion isn't desired, allow a flag SWITCHDEV_F_NO_RECURSE to indicate get/set only work on port (lowest) device. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 18:43:53 -04:00
Jiri Pirko	9d47c0a2d9	switchdev: s/swdev_/switchdev_/ Turned out that "switchdev" sticks. So just unify all related terms to use this prefix. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 18:43:53 -04:00
Jiri Pirko	ebb9a03a59	switchdev: s/netdev_switch_/switchdev_/ and s/NETDEV_SWITCH_/SWITCHDEV_/ Turned out that "switchdev" sticks. So just unify all related terms to use this prefix. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 18:43:52 -04:00
David Ward	a3eb95f891	net_sched: gred: add TCA_GRED_LIMIT attribute In a GRED qdisc, if the default "virtual queue" (VQ) does not have drop parameters configured, then packets for the default VQ are not subjected to RED and are only dropped if the queue is larger than the net_device's tx_queue_len. This behavior is useful for WRED mode, since these packets will still influence the calculated average queue length and (therefore) the drop probability for all of the other VQs. However, for some drivers tx_queue_len is zero. In other cases the user may wish to make the limit the same for all VQs (including the default VQ with no drop parameters). This change adds a TCA_GRED_LIMIT attribute to set the GRED queue limit, in bytes, during qdisc setup. (This limit is in bytes to be consistent with the drop parameters.) The default limit is the same as for a bfifo queue (tx_queue_len * psched_mtu). If the drop parameters of any VQ are configured with a smaller limit than the GRED queue limit, that VQ will still observe the smaller limit instead. Signed-off-by: David Ward <david.ward@ll.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 18:22:49 -04:00
Alexander Duyck	181edb2bfa	net: Add skb_free_frag to replace use of put_page in freeing skb->head This change adds a function called skb_free_frag which is meant to complement the function netdev_alloc_frag. The general idea is to enable a more lightweight version of page freeing since we don't actually need all the overhead of a put_page, and we don't quite fit the model of __free_pages. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 10:39:26 -04:00
Alexander Duyck	b63ae8ca09	mm/net: Rename and move page fragment handling from net/ to mm/ This change moves the __alloc_page_frag functionality out of the networking stack and into the page allocation portion of mm. The idea is to help make this maintainable by placing it with other page allocation functions. Since we are moving it from skbuff.c to page_alloc.c I have also renamed the basic defines and structure from netdev_alloc_cache to page_frag_cache to reflect that this is now part of a different kernel subsystem. I have also added a simple __free_page_frag function which can handle freeing the frags based on the skb->head pointer. The model for this is based off of __free_pages since we don't actually need to deal with all of the cases that put_page handles. I incorporated the virt_to_head_page call and compound_order into the function as it actually allows for a significant size reduction by reducing code duplication. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 10:39:26 -04:00
Alexander Duyck	0e39250845	net: Store virtual address instead of page in netdev_alloc_cache This change makes it so that we store the virtual address of the page in the netdev_alloc_cache instead of the page pointer. The idea behind this is to avoid multiple calls to page_address since the virtual address is required for every access, but the page pointer is only needed at allocation or reset of the page. While I was at it I also reordered the netdev_alloc_cache structure a bit so that the size is always 16 bytes by dropping size in the case where PAGE_SIZE is greater than or equal to 32KB. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 10:39:26 -04:00
Alexander Duyck	9451980a66	net: Use cached copy of pfmemalloc to avoid accessing page While testing I found that the testing for pfmemalloc in build_skb was rather expensive. I found the issue to be two-fold. First we have to get from the virtual address to the head page and that comes at the cost of something like 11 cycles. Then there is the cost for reading pfmemalloc out of the head page which can be cache cold due to the fact that put_page_testzero is likely invalidating the cache-line on one or more CPUs as the fragments can be shared. To avoid this extra expense I have added a pfmemalloc member to the netdev_alloc_cache. I then pushed pieces of __alloc_rx_skb into __napi_alloc_skb and __netdev_alloc_skb so that I could rewrite them to make use of the cached pfmemalloc value. The result is that my perf traces show a reduction from 9.28% overhead to 3.7% for the code covered by build_skb, __alloc_rx_skb, and __napi_alloc_skb when performing a test with the packet being dropped instead of being handed to napi_gro_receive. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 10:39:26 -04:00
Eric Dumazet	b396cca6fa	net: sched: deprecate enqueue_root() The only remaining enqueue_root() user is netem, and it does not look necessary: qdisc_skb_cb(skb)->pkt_len is preserved after one skb_clone() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-11 14:17:32 -04:00
David Ward	145a42b3a9	net_sched: gred: use correct backlog value in WRED mode In WRED mode, the backlog for a single virtual queue (VQ) should not be used to determine queue behavior; instead the backlog is summed across all VQs. This sum is currently used when calculating the average queue lengths. It also needs to be used when determining if the queue's hard limit has been reached, or when reporting each VQ's backlog via netlink. q->backlog will only be used if the queue switches out of WRED mode. Signed-off-by: David Ward <david.ward@ll.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-11 13:26:26 -04:00
Johannes Berg	658358cec9	mac80211: fix throughput LED trigger As I was testing with hwsim, I missed that my previous commit to make LED work depend on activation broke the code because I missed removing the old trigger struct and some code was still using it, now erroneously, causing crashes. Fix this by always using the correct struct. Reported-by: Felix Fietkau <nbd@openwrt.org> Tested-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2015-05-11 19:16:04 +02:00
Daniel Borkmann	d2788d3488	net: sched: further simplify handle_ing Ingress qdisc has no other purpose than calling into tc_classify() that executes attached classifier(s) and action(s). It has a 1:1 relationship to dev->ingress_queue. After having commit `087c1a601a` ("net: sched: run ingress qdisc without locks") removed the central ingress lock, one major contention point is gone. The extra indirection layers however, are not necessary for calling into ingress qdisc. pktgen calling locally into netif_receive_skb() with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon E3-1240: before ~21,1 Mpps, after patch ~22,9 Mpps. We can redirect the private classifier list to the netdev directly, without changing any classifier API bits (!) and execute on that from handle_ing() side. The __QDISC_STATE_DEACTIVATE test can be removed, ingress qdisc doesn't have a queue and thus dev_deactivate_queue() is also not applicable, ingress_cl_list provides similar behaviour. In other words, ingress qdisc acts like TCQ_F_BUILTIN qdisc. One next possible step is the removal of the dev's ingress (dummy) netdev_queue, and to only have the list member in the netdevice itself. Note, the filter chain is RCU protected and individual filter elements are being kfree'd by sched subsystem after RCU grace period. RCU read lock is being held by __netif_receive_skb_core(). Joint work with Alexei Starovoitov. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-11 11:10:35 -04:00
Daniel Borkmann	c9e99fd078	net: sched: consolidate handle_ing and ing_filter Given quite some code has been removed from ing_filter(), we can just consolidate that function into handle_ing() and get rid of a few instructions at the same time. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-11 11:10:34 -04:00
Eric W. Biederman	affb9792f1	net: kill sk_change_net and sk_release_kernel These functions are no longer needed and no longer used, so kill them. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-11 10:50:18 -04:00
Eric W. Biederman	13d3078e22	netlink: Create kernel netlink sockets in the proper network namespace Utilize the new functionality of sk_alloc so that nothing needs to be done to suppress the reference counting on kernel sockets. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-11 10:50:18 -04:00
Eric W. Biederman	26abe14379	net: Modify sk_alloc to not reference count the netns of kernel sockets. Now that sk_alloc knows when a kernel socket is being allocated modify it to not reference count the network namespace of kernel sockets. Keep track of if a socket needs reference counting by adding a flag to struct sock called sk_net_refcnt. Update all of the callers of sock_create_kern to stop using sk_change_net and sk_release_kernel as those hacks are no longer needed, to avoid reference counting a kernel socket. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-11 10:50:18 -04:00

38272 Commits