linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-16 08:02:17 +00:00

Author	SHA1	Message	Date
Roopa Prabhu	94a57f1f8a	mpls: find_outdev: check for err ptr in addition to NULL check find_outdev calls inet{,6}_fib_lookup_dev() or dev_get_by_index() to find the output device. In case of an error, inet{,6}_fib_lookup_dev() returns error pointer and dev_get_by_index() returns NULL. But the function only checks for NULL and thus can end up calling dev_put on an ERR_PTR. This patch adds an additional check for err ptr after the NULL check. Before: Trying to add an mpls route with no oif from user, no available path to 10.1.1.8 and no default route: $ip -f mpls route add 100 as 200 via inet 10.1.1.8 [ 822.337195] BUG: unable to handle kernel NULL pointer dereference at 00000000000003a3 [ 822.340033] IP: [<ffffffff8148781e>] mpls_nh_assign_dev+0x10b/0x182 [ 822.340033] PGD 1db38067 PUD 1de9e067 PMD 0 [ 822.340033] Oops: 0000 [#1] SMP [ 822.340033] Modules linked in: [ 822.340033] CPU: 0 PID: 11148 Comm: ip Not tainted 4.5.0-rc7+ #54 [ 822.340033] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014 [ 822.340033] task: ffff88001db82580 ti: ffff88001dad4000 task.ti: ffff88001dad4000 [ 822.340033] RIP: 0010:[<ffffffff8148781e>] [<ffffffff8148781e>] mpls_nh_assign_dev+0x10b/0x182 [ 822.340033] RSP: 0018:ffff88001dad7a88 EFLAGS: 00010282 [ 822.340033] RAX: ffffffffffffff9b RBX: ffffffffffffff9b RCX: 0000000000000002 [ 822.340033] RDX: 00000000ffffff9b RSI: 0000000000000008 RDI: 0000000000000000 [ 822.340033] RBP: ffff88001ddc9ea0 R08: ffff88001e9f1768 R09: 0000000000000000 [ 822.340033] R10: ffff88001d9c1100 R11: ffff88001e3c89f0 R12: ffffffff8187e0c0 [ 822.340033] R13: ffffffff8187e0c0 R14: ffff88001ddc9e80 R15: 0000000000000004 [ 822.340033] FS: 00007ff9ed798700(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000 [ 822.340033] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 822.340033] CR2: 00000000000003a3 CR3: 000000001de89000 CR4: 00000000000006f0 [ 822.340033] Stack: [ 
822.340033] 0000000000000000 0000000100000000 0000000000000000 0000000000000000 [ 822.340033] 0000000000000000 0801010a00000000 0000000000000000 0000000000000000 [ 822.340033] 0000000000000004 ffffffff8148749b ffffffff8187e0c0 000000000000001c [ 822.340033] Call Trace: [ 822.340033] [<ffffffff8148749b>] ? mpls_rt_alloc+0x2b/0x3e [ 822.340033] [<ffffffff81488e66>] ? mpls_rtm_newroute+0x358/0x3e2 [ 822.340033] [<ffffffff810e7bbc>] ? get_page+0x5/0xa [ 822.340033] [<ffffffff813b7d94>] ? rtnetlink_rcv_msg+0x17e/0x191 [ 822.340033] [<ffffffff8111794e>] ? __kmalloc_track_caller+0x8c/0x9e [ 822.340033] [<ffffffff813c9393>] ? rht_key_hashfn.isra.20.constprop.57+0x14/0x1f [ 822.340033] [<ffffffff813b7c16>] ? __rtnl_unlock+0xc/0xc [ 822.340033] [<ffffffff813cb794>] ? netlink_rcv_skb+0x36/0x82 [ 822.340033] [<ffffffff813b4507>] ? rtnetlink_rcv+0x1f/0x28 [ 822.340033] [<ffffffff813cb2b1>] ? netlink_unicast+0x106/0x189 [ 822.340033] [<ffffffff813cb5b3>] ? netlink_sendmsg+0x27f/0x2c8 [ 822.340033] [<ffffffff81392ede>] ? sock_sendmsg_nosec+0x10/0x1b [ 822.340033] [<ffffffff81393df1>] ? ___sys_sendmsg+0x182/0x1e3 [ 822.340033] [<ffffffff810e4f35>] ? __alloc_pages_nodemask+0x11c/0x1e4 [ 822.340033] [<ffffffff8110619c>] ? PageAnon+0x5/0xd [ 822.340033] [<ffffffff811062fe>] ? __page_set_anon_rmap+0x45/0x52 [ 822.340033] [<ffffffff810e7bbc>] ? get_page+0x5/0xa [ 822.340033] [<ffffffff810e85ab>] ? __lru_cache_add+0x1a/0x3a [ 822.340033] [<ffffffff81087ea9>] ? current_kernel_time64+0x9/0x30 [ 822.340033] [<ffffffff813940c4>] ? __sys_sendmsg+0x3c/0x5a [ 822.340033] [<ffffffff8148f597>] ? 
entry_SYSCALL_64_fastpath+0x12/0x6a [ 822.340033] Code: 83 08 04 00 00 65 ff 00 48 8b 3c 24 e8 40 7c f2 ff eb 13 48 c7 c3 9f ff ff ff eb 0f 89 ce e8 f1 ae f1 ff 48 89 c3 48 85 db 74 15 <48> 8b 83 08 04 00 00 65 ff 08 48 81 fb 00 f0 ff ff 76 0d eb 07 [ 822.340033] RIP [<ffffffff8148781e>] mpls_nh_assign_dev+0x10b/0x182 [ 822.340033] RSP <ffff88001dad7a88> [ 822.340033] CR2: 00000000000003a3 [ 822.435363] ---[ end trace 98cc65e6f6b8bf11 ]--- After patch: $ip -f mpls route add 100 as 200 via inet 10.1.1.8 RTNETLINK answers: Network is unreachable Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reported-by: David Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-08 12:43:20 -04:00
Jakub Sitnicki	3ba3458fb9	ipv6: Count in extension headers in skb->network_header When sending a UDPv6 message longer than MTU, account for the length of fragmentable IPv6 extension headers in skb->network_header offset. Same as we do in alloc_new_skb path in __ip6_append_data(). This ensures that later on __ip6_make_skb() will make space in headroom for fragmentable extension headers: /* move skb->data to ip header from ext header */ if (skb->data < skb_network_header(skb)) __skb_pull(skb, skb_network_offset(skb)); Prevents a splat due to skb_under_panic: skbuff: skb_under_panic: text:ffffffff8143397b len:2126 put:14 \ head:ffff880005bacf50 data:ffff880005bacf4a tail:0x48 end:0xc0 dev:lo ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:104! invalid opcode: 0000 [#1] KASAN CPU: 0 PID: 160 Comm: reproducer Not tainted 4.6.0-rc2 #65 [...] Call Trace: [<ffffffff813eb7b9>] skb_push+0x79/0x80 [<ffffffff8143397b>] eth_header+0x2b/0x100 [<ffffffff8141e0d0>] neigh_resolve_output+0x210/0x310 [<ffffffff814eab77>] ip6_finish_output2+0x4a7/0x7c0 [<ffffffff814efe3a>] ip6_output+0x16a/0x280 [<ffffffff815440c1>] ip6_local_out+0xb1/0xf0 [<ffffffff814f1115>] ip6_send_skb+0x45/0xd0 [<ffffffff81518836>] udp_v6_send_skb+0x246/0x5d0 [<ffffffff8151985e>] udpv6_sendmsg+0xa6e/0x1090 [...] Reported-by: Ji Jianwen <jiji@redhat.com> Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-07 22:41:37 -04:00
Jon Paul Maloy	5b7066c3dd	tipc: stricter filtering of packets in bearer layer Resetting a bearer/interface, with the consequence of resetting all its pertaining links, is not an atomic action. This becomes particularly evident in very large clusters, where a lot of traffic may happen on the remaining links while we are busy shutting them down. In extreme cases, we may even see links being re-created and re-established before we are finished with the job. To solve this, we now introduce a solution where we temporarily detach the bearer from the interface when the bearer is reset. This inhibits all packet reception, while sending still is possible. For the latter, we use the fact that the device's user pointer now is zero to filter out which packets can be sent during this situation; i.e., outgoing RESET messages only. This filtering serves to speed up the neighbors' detection of the loss event, and saves us from unnecessary probing. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-07 17:00:13 -04:00
Jon Paul Maloy	4e801fa14f	tipc: eliminate buffer leak in bearer layer When enabling a bearer we create a 'neighbor discoverer' instance by calling the function tipc_disc_create() before the bearer is actually registered in the list of enabled bearers. Because of this, the very first discovery broadcast message, created by the mentioned function, is lost, since it cannot find any valid bearer to use. Furthermore, the used send function, tipc_bearer_xmit_skb() does not free the given buffer when it cannot find a bearer, resulting in the leak of exactly one send buffer each time a bearer is enabled. This commit fixes this problem by introducing two changes: 1) Instead of attempting to send the discovery message directly, we let tipc_disc_create() return the discovery buffer to the calling function, tipc_enable_bearer(), so that the latter can send it when the enabling sequence is finished. 2) In tipc_bearer_xmit_skb(), as well as in the two other transmit functions at the bearer layer, we now free the indicated buffer or buffer chain when a valid bearer cannot be found. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-07 17:00:13 -04:00
shamir rabinovitch	579ba85552	RDS: fix congestion map corruption for PAGE_SIZE > 4k When PAGE_SIZE > 4k, a single page can contain 2 RDS fragments. If 'rds_ib_cong_recv' ignores the RDS fragment offset into the page, it reads the data fragment as a far congestion map update, which corrupts the RDS connection's far congestion map. Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-07 16:58:28 -04:00
shamir rabinovitch	e98499ac63	RDS: memory allocated must be align to 8 Fix issue in 'rds_ib_cong_recv' when accessing unaligned memory allocated by 'rds_page_remainder_alloc' using uint64_t pointer. Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-07 16:58:27 -04:00
Alexander Duyck	a0ca153f98	GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU This patch fixes an issue I found in which we were dropping frames if we had enabled checksums on GRE headers that were encapsulated by either FOU or GUE. Without this patch I was barely able to get 1 Gb/s of throughput. With this patch applied I am now at least getting around 6 Gb/s. The issue is due to the fact that with FOU or GUE applied we do not provide a transport offset pointing to the GRE header, nor do we offload it in software as the GRE header is completely skipped by GSO and treated like a VXLAN or GENEVE type header. As such we need to prevent the stack from generating it and also prevent GRE from generating it via any interface we create. Fixes: `c3483384ee` ("gro: Allow tunnel stacking in the case of FOU/GUE") Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-07 16:56:33 -04:00
Tom Herbert	46aa2f30aa	udp: Remove udp_offloads Now that the UDP encapsulation GRO functions have been moved to the UDP socket, we no longer need the udp_offload infrastructure, so remove it. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-07 16:53:30 -04:00
Tom Herbert	d92283e338	fou: change to use UDP socket GRO Adapt gue_gro_receive, gue_gro_complete to take a socket argument. Don't set udp_offloads any more. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-07 16:53:29 -04:00
Tom Herbert	38fd2af24f	udp: Add socket based GRO and config Add gro_receive and gro_complete to struct udp_tunnel_sock_cfg. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-07 16:53:29 -04:00
Tom Herbert	a6024562ff	udp: Add GRO functions to UDP socket This patch adds GRO functions (gro_receive and gro_complete) to UDP sockets. udp_gro_receive is changed to perform socket lookup on a packet. If a socket is found the related GRO functions are called. This feature obsoletes using the UDP offload infrastructure for GRO (udp_offload). This has the advantage of not being limited to providing offload on a per-port basis; GRO is now applied to whatever individual UDP sockets are bound to. This also allows the possibility of "application defined GRO", that is, we can attach something like a BPF program to a UDP socket to perform GRO on an application layer protocol. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-07 16:53:29 -04:00
Tom Herbert	63058308cd	udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skb Add externally visible functions to look up a UDP socket by skb. This will be used for GRO in UDP sockets. These functions also check if skb->dst is set, and if it is not, skb->dev is used to get dev_net. This allows calling lookup functions before dst has been set on the skbuff. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-07 16:53:14 -04:00
Hannes Frederic Sowa	8ced425ee6	tun: use socket locks for sk_{attach,detatch}_filter This reverts commit `5a5abb1fa3` ("tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter") and replaces it to use lock_sock around sk_{attach,detach}_filter. The checks inside filter.c are updated with lockdep_sock_is_held to check for proper socket locks. It keeps the code cleaner by ensuring that only one lock governs the socket filter instead of two independent locks. Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-07 16:44:14 -04:00
Hannes Frederic Sowa	1e1d04e678	net: introduce lockdep_is_held and update various places to use it The socket is either locked if we hold the slock spin_lock for lock_sock_fast and unlock_sock_fast or we own the lock (sk_lock.owned != 0). Check for this and at the same time improve that the current thread/cpu is really holding the lock. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-07 16:44:14 -04:00
Hannes Frederic Sowa	61881cfb5a	sock: fix lockdep annotation in release_sock During release_sock we use callbacks to finish the processing of outstanding skbs on the socket. We actually are still locked, sk_lock.owned == 1, but we already told lockdep that the mutex is released. This could lead to false positives in lockdep for lockdep_sock_is_held (we don't hold the slock spinlock while processing the outstanding skbs). I took over this patch from Eric Dumazet and tested it. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-07 16:44:14 -04:00
Haishuang Yan	85f1e7c29a	netfilter: ipv6: unnecessary to check whether ip6_route_output() returns NULL ip6_route_output() never returns NULL, so it is not appropriate to check if the return value is NULL. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-04-07 18:53:08 +02:00
Eric Dumazet	8501786929	tcp/dccp: fix inet_reuseport_add_sock() David Ahern reported panics in __inet_hash() caused by my recent commit. The reason is inet_reuseport_add_sock() was still using sk_nulls_for_each_rcu() instead of sk_for_each_rcu(). SO_REUSEPORT enabled listeners were causing an instant crash. While chasing this bug, I found that I forgot to clear SOCK_RCU_FREE flag, as it is inherited from the parent at clone time. Fixes: `3b24d854cb` ("tcp/dccp: do not touch listener sk_refcnt under synflood") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-07 12:02:33 -04:00
Dexuan Cui	0a1a37b6d6	net: add the AF_KCM entries to family name tables This is for the recent kcm driver, which introduces AF_KCM(41) in b7ac4eb(kcm: Kernel Connection Multiplexor module). Signed-off-by: Dexuan Cui <decui@microsoft.com> Cc: Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-06 16:59:01 -04:00
Jiri Benc	a6d5bbf34e	ip_tunnel: implement __iptunnel_pull_header Allow calling of iptunnel_pull_header without special casing ETH_P_TEB inner protocol. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-06 16:50:32 -04:00
Jorgen Hansen	8ab18d71de	VSOCK: Detach QP check should filter out non matching QPs. The check in vmci_transport_peer_detach_cb should only allow a detach when the qp handle of the transport matches the one in the detach message. Testing: Before this change, a detach from a peer on a different socket would cause an active stream socket to register a detach. Reviewed-by: George Zhang <georgezhang@vmware.com> Signed-off-by: Jorgen Hansen <jhansen@vmware.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-06 16:39:09 -04:00
Dave Jones	6ae81ced37	af_packet: tone down the Tx-ring unsupported spew. Trinity and other fuzzers can hit this WARN on far too easily, resulting in a tainted kernel that hinders automated fuzzing. Replace it with a rate-limited printk. Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-06 16:05:20 -04:00
David S. Miller	32fa270c8a	Revert "bridge: Fix incorrect variable assignment on error path in br_sysfs_addbr" This reverts commit `c862cc9b70`. Patch lacks a real-name Signed-off-by. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-06 15:42:45 -04:00
Jeff Mahoney	b4201cc4fc	mac80211: fix "warning: ‘target_metric’ may be used uninitialized" This fixes: net/mac80211/mesh_hwmp.c:603:26: warning: ‘target_metric’ may be used uninitialized in this function target_metric is only consumed when reply = true so no bug exists here, but not all versions of gcc realize it. Initialize to 0 to remove the warning. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-06 15:10:25 +02:00
Jouni Malinen	4ce2bd9c4c	cfg80211: Allow reassociation to be requested with internal SME If the user space issues a NL80211_CMD_CONNECT with NL80211_ATTR_PREV_BSSID when there is already a connection, allow this to proceed as a reassociation instead of rejecting the new connect command with EALREADY. Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> [validate prev_bssid] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-06 15:09:28 +02:00
Jouni Malinen	ba6fbacf9c	cfg80211: Add option to specify previous BSSID for Connect command This extends NL80211_CMD_CONNECT to allow the NL80211_ATTR_PREV_BSSID attribute to be used similarly to the way this was already allowed with NL80211_CMD_ASSOCIATE. This allows user space to request reassociation (instead of association) when already connected to an AP. This provides an option to reassociate within an ESS without having to disconnect and associate with the AP. Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-06 13:18:21 +02:00
Felix Fietkau	918fe04b28	mac80211: minstrel_ht: set A-MSDU tx limits based on selected max_prob_rate Prevents excessive A-MSDU aggregation at low data rates or bad conditions. Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-06 13:18:20 +02:00
Felix Fietkau	6e0456b545	mac80211: add A-MSDU tx support Requires software tx queueing and fast-xmit support. For good performance, drivers need frag_list support as well. This avoids the need for copying data of aggregated frames. Running without it is only supported for debugging purposes. To avoid performance and packet size issues, the rate control module or driver needs to limit the maximum A-MSDU size by setting max_rc_amsdu_len in struct ieee80211_sta. Signed-off-by: Felix Fietkau <nbd@openwrt.org> [fix locking issue] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-06 13:18:19 +02:00
Johannes Berg	c9c5962b56	mac80211: enable collecting station statistics per-CPU If the driver advertises the new HW flag USE_RSS, make the station statistics on the fast-rx path per-CPU. This will enable calling the RX in parallel, only hitting locking or shared cachelines when the fast-RX path isn't available. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-06 13:18:19 +02:00
Johannes Berg	49ddf8e6e2	mac80211: add fast-rx path The regular RX path has a lot of code, but with a few assumptions on the hardware it's possible to reduce the amount of code significantly. Currently the assumptions on the driver are the following: * hardware/driver reordering buffer (if supporting aggregation) * hardware/driver decryption & PN checking (if using encryption) * hardware/driver did de-duplication * hardware/driver did A-MSDU deaggregation * AP_LINK_PS is used (in AP mode) * no client powersave handling in mac80211 (in client mode) of which some are actually checked per packet: * de-duplication * PN checking * decryption and additionally packets must * not be A-MSDU (have been deaggregated by driver/device) * be data packets * not be fragmented * be unicast * have RFC 1042 header Additionally dynamically we assume: * no encryption or CCMP/GCMP, TKIP/WEP/other not allowed * station must be authorized * 4-addr format not enabled Some data needed for the RX path is cached in a new per-station "fast_rx" structure, so that we only need to look at this and the packet, no other memory when processing packets on the fast RX path. After doing the above per-packet checks, the data path collapses down to a pretty simple conversion function taking advantage of the data cached in the small fast_rx struct. This should speed up the RX processing, and will make it easier to reason about parallelizing RX (for which statistics will need to be per-CPU still.) Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-06 13:18:18 +02:00
Johannes Berg	0f9c5a61d4	mac80211: fix RX u64 stats consistency on 32-bit platforms On 32-bit platforms, the 64-bit counters we keep need to be protected to be consistently read. Use the u64_stats_sync mechanism to do that. In order to not end up with overly long lines, refactor the tidstats assignments a bit. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-06 13:18:17 +02:00
Johannes Berg	4f6b1b3daa	mac80211: fix last RX rate data consistency When storing the last_rate_* values in the RX code, there's nothing to guarantee consistency, so a concurrent reader could see, e.g. last_rate_idx on the new value, but last_rate_flag still on the old, getting completely bogus values in the end. To fix this, I lifted the sta_stats_encode_rate() function from my old rate statistics code, which encodes the entire rate data into a single 16-bit value, avoiding the consistency issue. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-06 13:18:17 +02:00
Johannes Berg	b8da6b6a99	mac80211: add separate last_ack variable Instead of touching the rx_stats.last_rx from the status path, introduce and use a status_stats.last_ack variable. This will make rx_stats.last_rx indicate when the last frame was received, making it available for real "last_rx" and statistics gathering; statistics, when done per-CPU, will need to figure out which place was updated last for those items where the "last" value is exposed. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-06 13:18:16 +02:00
Johannes Berg	2df8bfd724	mac80211: remove rx_stats.last_rx update after sta alloc There's no need to update rx_stats.last_rx after allocating a station since it's already updated during allocation. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-06 13:18:15 +02:00
Johannes Berg	0be6ed1338	mac80211: move averaged values out of rx_stats Move the averaged values out of rx_stats and into rx_stats_avg, to cleanly split them out. The averaged ones cannot be supported for parallel RX in a per-CPU fashion, while the other values can be collected per CPU and then combined/selected when needed. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-06 13:18:15 +02:00
Johannes Berg	8ebaa5b0a7	mac80211: move semicolon out of CALL_RXH macro Move the semicolon; people typically assume that, and one line already put a semicolon behind the "call". Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-06 13:18:14 +02:00
Johannes Berg	de8f18d3a8	mac80211: count MSDUs in A-MSDU properly For the RX MSDU statistics, we need to count the number of MSDUs created and accepted from an A-MSDU. Right now, all frames in any A-MSDUs were completely ignored. Fix this by moving the RX MSDU statistics accounting into the deliver function. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-06 13:18:13 +02:00
Johannes Berg	d63b548fff	mac80211: allow passing transmitter station on RX Sometimes drivers already looked up, or know out-of-band from their device, which station transmitted a given RX frame. Allow them to pass the station pointer to mac80211 to save the extra lookup. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-06 13:18:13 +02:00
Aaron Conole	4da46cebbd	net/core/dev: Warn on a too-short GRO frame When signaling that a GRO frame is ready to be processed, the network stack correctly checks length and aborts processing when a frame is less than 14 bytes. However, such a condition is really indicative of a broken driver, and should be loudly signaled, rather than silently dropped as the case is today. Convert the condition to use net_warn_ratelimited() to ensure the stack loudly complains about such broken drivers. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-05 19:58:39 -04:00
Thadeu Lima de Souza Cascardo	b6ee376cb0	ip6_tunnel: set rtnl_link_ops before calling register_netdevice When creating an ip6tnl tunnel with ip tunnel, rtnl_link_ops is not set before ip6_tnl_create2 is called. When register_netdevice is called, there is no linkinfo attribute in the NEWLINK message because of that. Setting rtnl_link_ops before calling register_netdevice fixes that. Fixes: `0b11245722` ("ip6tnl: add support of link creation via rtnl") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-05 19:48:51 -04:00
Bjorn Helgaas	727ceaa49b	Revert "netpoll: Fix extra refcount release in netpoll_cleanup()" This reverts commit `543e3a8da5`. Direct callers of __netpoll_setup() depend on it to set np->dev, so we can't simply move that assignment up to netpoll_setup(). Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-05 19:34:44 -04:00
samanthakumar	627d2d6b55	udp: enable MSG_PEEK at non-zero offset Enable peeking at UDP datagrams at the offset specified with socket option SOL_SOCKET/SO_PEEK_OFF. Peek at any datagram in the queue, up to the end of the given datagram. Implement the SO_PEEK_OFF semantics introduced in commit `ef64a54f6e` ("sock: Introduce the SO_PEEK_OFF sock option"). Increase the offset on peek, decrease it on regular reads. When peeking, always checksum the packet immediately, to avoid recomputation on subsequent peeks and final read. The socket lock is not held for the duration of udp_recvmsg, so peek and read operations can run concurrently. Only the last store to sk_peek_off is preserved. Signed-off-by: Sam Kumar <samanthakumar@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-05 16:29:37 -04:00
samanthakumar	e6afc8ace6	udp: remove headers from UDP packets before queueing Remove UDP transport headers before queueing packets for reception. This change simplifies a follow-up patch to add MSG_PEEK support. Signed-off-by: Sam Kumar <samanthakumar@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-05 16:29:37 -04:00
Marcelo Ricardo Leitner	e43569e6d3	sctp: flush if we can't fit another DATA chunk There is no point in delaying the packet if we can't fit a single byte of data on it anymore. So let's just reduce the threshold by the amount that a data chunk with 4 bytes (rounding) would use. v2: based on the right tree Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-05 15:39:44 -04:00
Bob Copeland	e596af8279	mac80211: mesh: flush paths outside of plink lock Lockdep warned of a lock dependency between the mesh_plink lock and the internal lock for the rhashtable. The problem is that the rhashtable code uses a spin lock with softirqs enabled, while mesh_plink_timer executes a walk (to flush paths on a state change) inside a softirq with the plink lock held. This leads to the following deadlock if the timer fires while rht lock is held on this CPU, and plink lock is held on another CPU: CPU0 CPU1 ---- ---- lock(&(&ht->lock)->rlock); local_irq_disable(); lock(&(&sta->mesh->plink_lock)->rlock); lock(&(&ht->lock)->rlock); <Interrupt> lock(&(&sta->mesh->plink_lock)->rlock); * DEADLOCK * Fix by waiting until we drop the plink lock to flush paths. Fixes: d48a1b7cd439 ("mac80211: mesh: convert path table to rhashtable") Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 21:34:54 +02:00
Bob Copeland	0371a08fbb	mac80211: mesh: fix cleanup for mesh pathtable The mesh path table needs to be around for the entire time the interface is in mesh mode, as users can perform an mpath dump at any time. The existing path table lifetime is instead tied to the mesh BSS which can cause crashes when different MBSSes are joined in the context of a single interface, or when the path table is dumped when no MBSS is joined. Introduce a new function to perform the final teardown of the interface and perform path table cleanup there. We already free the individual path elements when the leaving the mesh so no additional cleanup is needed there. This fixes the following crash: [ 47.753026] BUG: unable to handle kernel paging request at fffffff0 [ 47.753026] IP: [<c0239765>] kthread_data+0xa/0xe [ 47.753026] pde = 00741067 pte = 00000000 [ 47.753026] Oops: 0000 [#4] PREEMPT [ 47.753026] Modules linked in: ppp_generic slhc 8021q garp mrp sch_fq_codel iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat ip_tables ath9k_htc ath5k 8139too ath10k_pci ath10k_core arc4 ath9k ath9k_common ath9k_hw mac80211 ath cfg80211 cpufreq_powersave br_netfilter bridge stp llc ipw usb_wwan sierra_net usbnet af_alg natsemi via_rhine mii iTCO_wdt iTCO_vendor_support gpio_ich sierra coretemp pcspkr i2c_i801 lpc_ich ata_generic ata_piix libata ide_pci_generic piix e1000e igb i2c_algo_bit ptp pps_core [last unloaded: 8139too] [ 47.753026] CPU: 0 PID: 12 Comm: kworker/u2:1 Tainted: G D W 4.5.0-wt-V3 #6 [ 47.753026] Hardware name: To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080016 11/07/2014 [ 47.753026] task: f645a0c0 ti: f6462000 task.ti: f6462000 [ 47.753026] EIP: 0060:[<c0239765>] EFLAGS: 00010002 CPU: 0 [ 47.753026] EIP is at kthread_data+0xa/0xe [ 47.753026] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000 [ 47.753026] ESI: f645a0c0 EDI: f645a2fc EBP: f6463a80 ESP: f6463a78 [ 47.753026] DS: 007b ES: 007b FS: 0000 
GS: 0000 SS: 0068 [ 47.753026] CR0: 8005003b CR2: 00000014 CR3: 353e5000 CR4: 00000690 [ 47.753026] Stack: [ 47.753026] c0236866 00000000 f6463aac c05768b4 00000009 f6463ba8 f6463ab0 c0247010 [ 47.753026] 00000000 f645a0c0 f6464000 00000009 f6463ba8 f6463ab8 c0576eb2 f645a0c0 [ 47.753026] f6463aec c0228be4 c06335a4 f6463adc f6463ad0 c06c06d4 f6463ae4 c02471b0 [ 47.753026] Call Trace: [ 47.753026] [<c0236866>] ? wq_worker_sleeping+0xb/0x78 [ 47.753026] [<c05768b4>] __schedule+0xda/0x587 [ 47.753026] [<c0247010>] ? vprintk_default+0x12/0x14 [ 47.753026] [<c0576eb2>] schedule+0x72/0x89 [ 47.753026] [<c0228be4>] do_exit+0xb8/0x71d [ 47.753026] [<c02471b0>] ? kmsg_dump+0xa9/0xae [ 47.753026] [<c0203576>] oops_end+0x69/0x70 [ 47.753026] [<c021dcdb>] no_context+0x1bb/0x1c5 [ 47.753026] [<c021de1b>] __bad_area_nosemaphore+0x136/0x140 [ 47.753026] [<c021e2ef>] ? vmalloc_sync_all+0x19a/0x19a [ 47.753026] [<c021de32>] bad_area_nosemaphore+0xd/0x10 [ 47.753026] [<c021e0a1>] __do_page_fault+0x26c/0x320 [ 47.753026] [<c021e2ef>] ? vmalloc_sync_all+0x19a/0x19a [ 47.753026] [<c021e2fa>] do_page_fault+0xb/0xd [ 47.753026] [<c05798f8>] error_code+0x58/0x60 [ 47.753026] [<c021e2ef>] ? vmalloc_sync_all+0x19a/0x19a [ 47.753026] [<c0239765>] ? kthread_data+0xa/0xe [ 47.753026] [<c0236866>] ? wq_worker_sleeping+0xb/0x78 [ 47.753026] [<c05768b4>] __schedule+0xda/0x587 [ 47.753026] [<c0247010>] ? vprintk_default+0x12/0x14 [ 47.753026] [<c0576eb2>] schedule+0x72/0x89 [ 47.753026] [<c0228be4>] do_exit+0xb8/0x71d [ 47.753026] [<c02471b0>] ? kmsg_dump+0xa9/0xae [ 47.753026] [<c0203576>] oops_end+0x69/0x70 [ 47.753026] [<c021dcdb>] no_context+0x1bb/0x1c5 [ 47.753026] [<c021de1b>] __bad_area_nosemaphore+0x136/0x140 [ 47.753026] [<c021e2ef>] ? vmalloc_sync_all+0x19a/0x19a [ 47.753026] [<c021de32>] bad_area_nosemaphore+0xd/0x10 [ 47.753026] [<c021e0a1>] __do_page_fault+0x26c/0x320 [ 47.753026] [<c021e2ef>] ? 
vmalloc_sync_all+0x19a/0x19a [ 47.753026] [<c021e2fa>] do_page_fault+0xb/0xd [ 47.753026] [<c05798f8>] error_code+0x58/0x60 [ 47.753026] [<c021e2ef>] ? vmalloc_sync_all+0x19a/0x19a [ 47.753026] [<c0239765>] ? kthread_data+0xa/0xe [ 47.753026] [<c0236866>] ? wq_worker_sleeping+0xb/0x78 [ 47.753026] [<c05768b4>] __schedule+0xda/0x587 [ 47.753026] [<c0391e32>] ? put_io_context_active+0x6d/0x95 [ 47.753026] [<c0576eb2>] schedule+0x72/0x89 [ 47.753026] [<c02291f8>] do_exit+0x6cc/0x71d [ 47.753026] [<c0203576>] oops_end+0x69/0x70 [ 47.753026] [<c021dcdb>] no_context+0x1bb/0x1c5 [ 47.753026] [<c021de1b>] __bad_area_nosemaphore+0x136/0x140 [ 47.753026] [<c021e2ef>] ? vmalloc_sync_all+0x19a/0x19a [ 47.753026] [<c021de32>] bad_area_nosemaphore+0xd/0x10 [ 47.753026] [<c021e0a1>] __do_page_fault+0x26c/0x320 [ 47.753026] [<c03b9160>] ? debug_smp_processor_id+0x12/0x16 [ 47.753026] [<c02015e2>] ? __switch_to+0x24/0x40e [ 47.753026] [<c021e2ef>] ? vmalloc_sync_all+0x19a/0x19a [ 47.753026] [<c021e2fa>] do_page_fault+0xb/0xd [ 47.753026] [<c05798f8>] error_code+0x58/0x60 [ 47.753026] [<c021e2ef>] ? vmalloc_sync_all+0x19a/0x19a [ 47.753026] [<c03b59d2>] ? rhashtable_walk_init+0x5c/0x93 [ 47.753026] [<f9843221>] mesh_path_tbl_expire.isra.24+0x19/0x82 [mac80211] [ 47.753026] [<f984408b>] mesh_path_expire+0x11/0x1f [mac80211] [ 47.753026] [<f9842bb7>] ieee80211_mesh_work+0x73/0x1a9 [mac80211] [ 47.753026] [<f98207d1>] ieee80211_iface_work+0x2ff/0x311 [mac80211] [ 47.753026] [<c0235fa3>] process_one_work+0x14b/0x24e [ 47.753026] [<c0236313>] worker_thread+0x249/0x343 [ 47.753026] [<c02360ca>] ? process_scheduled_works+0x24/0x24 [ 47.753026] [<c0239359>] kthread+0x9e/0xa3 [ 47.753026] [<c0578e50>] ret_from_kernel_thread+0x20/0x40 [ 47.753026] [<c02392bb>] ? 
kthread_parkme+0x18/0x18 [ 47.753026] Code: 6b c0 85 c0 75 05 e8 fb 74 fc ff 89 f8 84 c0 75 08 8d 45 e8 e8 34 dd 33 00 83 c4 28 5b 5e 5f 5d c3 55 8b 80 10 02 00 00 89 e5 5d <8b> 40 f0 c3 55 b9 04 00 00 00 89 e5 52 8b 90 10 02 00 00 8d 45 [ 47.753026] EIP: [<c0239765>] kthread_data+0xa/0xe SS:ESP 0068:f6463a78 [ 47.753026] CR2: 00000000fffffff0 [ 47.753026] ---[ end trace 867ca0bdd0767790 ]--- Fixes: 3b302ada7f0a ("mac80211: mesh: move path tables into if_mesh") Reported-by: Fred Veldini <fred.veldini@gmail.com> Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 21:34:53 +02:00
Bob Copeland	68bb54b47e	mac80211: mesh: fix mesh path kerneldoc Several of the mesh path fields are undocumented and some of the documentation is no longer correct or relevant after the switch to rhashtable. Clean up the kernel doc accordingly and reorder some fields to match the structure layout. Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 21:34:53 +02:00
Bob Copeland	3257523bed	mac80211: mesh: reorder structure members Reduce padding waste in struct mesh_table and struct rmc_entry by moving the smaller fields to the end. Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 21:34:52 +02:00
Bob Copeland	18b27ff7d2	mac80211: mesh: embed gates hlist head directly Since we have converted the mesh path tables to rhashtable, we are no longer swapping out the entire mesh_pathtbl pointer with RCU. As a result, we no longer need indirection to the hlist head for the gates list and can simply embed it, saving a pair of pointer-sized allocations. Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 21:34:51 +02:00
Bob Copeland	47a0489ce1	mac80211: mesh: use hlist for rmc cache The RMC cache has 256 list heads plus a u32, which puts it at the unfortunate size of 4104 bytes with padding. kmalloc() will then round this up to the next power-of-two, so we wind up actually using two pages here where most of the second is wasted. Switch to hlist heads here to reduce the structure size down to fit within a page. Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 21:34:51 +02:00
Bob Copeland	0aa7fabbd5	mac80211: mesh: handle failed alloc for rmc cache In the unlikely case that mesh_rmc_init() fails with -ENOMEM, the rmc pointer will be left as NULL but the interface is still operational because ieee80211_mesh_init_sdata() is not allowed to fail. If this happens, we would blindly dereference rmc when checking whether a multicast frame is in the cache. Instead just drop the frames in the forwarding path. Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 21:34:50 +02:00
Bob Copeland	749329594b	mac80211: mesh: fix crash in mesh_path_timer The mesh_path_reclaim() function, called from an rcu callback, cancels the mesh_path_timer associated with a mesh path. Unfortunately, this call can happen much later, perhaps after the hash table itself is destroyed. Such a situation led to the following crash in mesh_path_send_to_gates() when dereferencing the tbl pointer: [ 23.901661] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 [ 23.905516] IP: [<ffffffff814c910b>] mesh_path_send_to_gates+0x2b/0x740 [ 23.908757] PGD 99ca067 PUD 99c4067 PMD 0 [ 23.910789] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 23.913485] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.5.0-rc6-wt+ #43 [ 23.916675] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014 [ 23.920471] task: ffffffff81685500 ti: ffffffff81678000 task.ti: ffffffff81678000 [ 23.922619] RIP: 0010:[<ffffffff814c910b>] [<ffffffff814c910b>] mesh_path_send_to_gates+0x2b/0x740 [ 23.925237] RSP: 0018:ffff88000b403d30 EFLAGS: 00010286 [ 23.926739] RAX: 0000000000000000 RBX: ffff880009bc0d20 RCX: 0000000000000102 [ 23.928796] RDX: 000000000000002e RSI: 0000000000000001 RDI: ffff880009bc0d20 [ 23.930895] RBP: ffff88000b403e18 R08: 0000000000000001 R09: 0000000000000001 [ 23.932917] R10: 0000000000000000 R11: 0000000000000001 R12: ffff880009c20940 [ 23.936370] R13: ffff880009bc0e70 R14: ffff880009c21c40 R15: ffff880009bc0d20 [ 23.939823] FS: 0000000000000000(0000) GS:ffff88000b400000(0000) knlGS:0000000000000000 [ 23.943688] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 23.946429] CR2: 0000000000000008 CR3: 00000000099c5000 CR4: 00000000000006b0 [ 23.949861] Stack: [ 23.950840] 000000000000002e ffff880009c20940 ffff88000b403da8 ffffffff8109e551 [ 23.954467] ffffffff82711be2 000000000000002e 0000000000000000 ffffffff8166a5f5 [ 23.958141] 0000000000685ce8 0000000000000246 ffff880009bc0d20 ffff880009c20940 [ 23.961801] Call Trace: [ 23.962987] <IRQ> 
[ 23.963963] [<ffffffff8109e551>] ? vprintk_emit+0x351/0x5e0 [ 23.966782] [<ffffffff8109e8ff>] ? vprintk_default+0x1f/0x30 [ 23.969529] [<ffffffff810ffa41>] ? printk+0x48/0x50 [ 23.971956] [<ffffffff814ceef3>] mesh_path_timer+0x133/0x160 [ 23.974707] [<ffffffff814cedc0>] ? mesh_nexthop_resolve+0x230/0x230 [ 23.977775] [<ffffffff810b04ee>] call_timer_fn+0xce/0x330 [ 23.980448] [<ffffffff810b0425>] ? call_timer_fn+0x5/0x330 [ 23.983126] [<ffffffff814cedc0>] ? mesh_nexthop_resolve+0x230/0x230 [ 23.986091] [<ffffffff810b097c>] run_timer_softirq+0x22c/0x390 Instead of cancelling in the RCU callback, set a new flag to prevent the timer from being rearmed, and then cancel the timer synchronously when freeing the mesh path. This leaves mesh_path_reclaim() doing nothing but kfree, so switch to kfree_rcu(). Fixes: 3b302ada7f0a ("mac80211: mesh: move path tables into if_mesh") Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 21:34:49 +02:00
Ayala Beker	52cfa1d614	mac80211: track and tell driver about GO client P2P PS abilities Legacy clients don't support P2P power save mechanism, and thus if a P2P GO has a legacy client connected to it, it should disable P2P PS mechanisms. Let the driver know about this with a new bss_conf parameter. Signed-off-by: Ayala Beker <ayala.beker@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 21:34:49 +02:00
Ayala Beker	17b9424786	cfg80211: allow userspace to specify client P2P PS support Legacy clients don't support P2P power save mechanisms, and thus if a P2P GO has a legacy client connected to it, it has to make some changes in the PS behavior. To handle this, add an attribute to specify whether a station supports P2P PS or not. If the attribute was not specified cfg80211 will assume that station supports it for P2P GO interface, and does NOT support it for AP interface, matching the current assumptions in the code. Signed-off-by: Ayala Beker <ayala.beker@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 21:34:47 +02:00
Johannes Berg	b100e5d622	mac80211: avoid useless memory write on each frame RX In the likely case that probe_count is 0, don't write to the memory there. Also use ifmgd consistently in the function, instead of using sdata->u.mgd as well. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 21:34:21 +02:00
Johannes Berg	2c61cf9c56	mac80211: fix cipher scheme function name The code is only used with iwlwifi, but still should have proper mac80211 naming scheme; fix that. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 12:12:41 +02:00
Johannes Berg	c84387d2f2	mac80211: clean up station flags debugfs Avoid the really strange %s%s%s expression, use an array of flag names and check that all flags are present. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 12:12:26 +02:00
Johannes Berg	602fae425c	mac80211: don't start dynamic PS timer if not needed If the device implements dynamic PS itself, there's no need to ever start the dynamic powersave timer on RX. While at it, fix up some indentation in this code. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 12:11:54 +02:00
Johannes Berg	fc4a25c5b7	mac80211: remove sta_info debugfs sub-struct Since the previous patch, the struct only has a single member, so remove the struct and leave just the single member. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 11:59:05 +02:00
Mohammed Shafi Shajakhan	96f321c9d4	mac80211: Remove unused variable in per STA debugfs struct Remove unused variable in per STA debugfs structure, 'commit `34e895075e` ("mac80211: allow station add/remove to sleep")' removed the only user of 'add_has_run'. Signed-off-by: Mohammed Shafi Shajakhan <mohammed@qti.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 11:57:11 +02:00
Sara Sharon	1e0bbebaae	mac80211: enable starting BA session with custom timeout Currently the debugfs entry for starting aggregation session starts it with timeout of 5 seconds. Allow opening a session with a custom timeout (according to spec 0 is no timeout). While at it, refactor the function and remove the magic numbers. Signed-off-by: Sara Sharon <sara.sharon@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 11:46:05 +02:00
Sara Sharon	e03521232d	mac80211: add NETIF_F_RXCSUM to features white list NETIF_F_RXCSUM is not in the white list, though some drivers may want to set it in order to enable seeing the actual RX checksum status in ethtool. Signed-off-by: Sara Sharon <sara.sharon@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 11:45:51 +02:00
Emmanuel Grumbach	f278ce4ffa	mac80211: Set global RRM capability Allow publishing RRM capabilities for features that are not HW dependent. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 11:44:46 +02:00
Felix Fietkau	06171e9c0b	mac80211: minstrel_ht: improve sample rate skip logic There were a few issues that were slowing down the process of finding the optimal rate, especially on devices with multi-rate retry limitations: When max_tp_rate[0] was slower than max_tp_rate[1], the code did not sample max_tp_rate[1], which would often allow it to switch places with max_tp_rate[0] (e.g. if only the first sampling attempts were bad, but the rate is otherwise good). Also, sample attempts of rates between max_tp_rate[0] and [1] were being ignored in this case, because the code only checked if the rate was slower than [1]. Fix this by checking against the fastest / second fastest max_tp_rate instead of assuming a specific order between the two. In my tests this patch significantly reduces the time until minstrel_ht finds the optimal rate right after assoc. Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 11:40:06 +02:00
Emmanuel Grumbach	f6d4671a08	mac80211: close the SP when we enqueue frames during the SP Since we enqueued the frame that was supposed to be sent during the SP, and that frame may very well carry the IEEE80211_TX_STATUS_EOSP bit, we may never close the SP (WLAN_STA_SP will never be cleared). If that happens, we will not open any new SP and will never respond to any poll frame from the client. Clear WLAN_STA_SP manually if a frame that was polled during the SP is queued because of a starting A-MPDU session. The client may not see the EOSP bit, but it will at least be able to poll new frames in another SP. Reported-by: Alesya Shapira <alesya.shapira@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> [remove erroneous comment] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 11:33:49 +02:00
Ilan Peer	4b559ec0bf	mac80211: Fix BW upgrade for TDLS peers It is possible that the station is connected to an AP with bandwidth of 80+80MHz or 160MHz. In such cases there is no need to perform an upgrade as the maximal supported bandwidth is 80MHz. In addition, when upgrading and setting center_freq1 and bandwidth to 80MHz also set center_freq2 to 0. Fixes: `0fabfaafec` ("mac80211: upgrade BW of TDLS peers when possible") Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 11:26:33 +02:00
Emmanuel Grumbach	facde7f332	mac80211: don't send deferred frames outside the SP Frames that are sent between ampdu_action(IEEE80211_AMPDU_TX_START) and the move to the HT_AGG_STATE_OPERATIONAL state are buffered. If we try to start an A-MPDU session while the peer is sleeping and polling frames with U-APSD, we may have frames that will be buffered by ieee80211_tx_prep_agg. These frames have IEEE80211_TX_CTL_NO_PS_BUFFER set since they are sent to a sleeping client and possibly IEEE80211_TX_STATUS_EOSP. If the frame is buffered, we need to clear these two flags since they will be re-sent after the move to HT_AGG_STATE_OPERATIONAL state which is very likely to happen after the SP ends. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 11:16:50 +02:00
Luis de Bethencourt	c2d45923e3	mac80211: remove description of dropped member Commit `976bd9efda` ("mac80211: move beacon_loss_count into ifmgd") removed the member from the sta_info struct but the description stayed lingering. Remove it. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 11:12:09 +02:00
Ben Greear	b6bf8c688e	mac80211: ensure no limits on station rhashtable By default, the rhashtable logic will fail to insert objects if the key-chains are too long and un-balanced. In the degenerate case where mac80211 is creating many virtual interfaces connected to the same peer(s), this case can happen. Set insecure_elasticity to true to allow chains to grow as long as needed. Signed-off-by: Ben Greear <greearb@candelatech.com> [remove message, change commit message slightly] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 11:06:58 +02:00
Johannes Berg	62b14b241c	mac80211: properly deal with station hashtable insert errors The original hand-implemented hash-table in mac80211 couldn't result in insertion errors, and while converting to rhashtable I evidently forgot to check the errors. This surfaced now only because Ben is adding many identical keys and that resulted in hidden insertion errors. Cc: stable@vger.kernel.org Fixes: `7bedd0cfad` ("mac80211: use rhashtable for station table") Reported-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:58:30 +02:00
Felix Fietkau	07310a6314	mac80211: do not pass injected frames without a valid rate to the driver Fall back to rate control if the requested bitrate was not found. Fixes: `dfdfc2beb0` ("mac80211: Parse legacy and HT rate in injected frames") Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:58:21 +02:00
Sven Eckelmann	f66b60f652	mac80211: fix parsing of 40Mhz in injected radiotap header The MCS bandwidth part of the radiotap header is 2 bits wide. The full 2 bits have to be compared against IEEE80211_RADIOTAP_MCS_BW_40 and not only if the first bit is set. Otherwise IEEE80211_RADIOTAP_MCS_BW_40 can be confused with IEEE80211_RADIOTAP_MCS_BW_20U. Fixes: `dfdfc2beb0` ("mac80211: Parse legacy and HT rate in injected frames") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:58:16 +02:00
Arend van Spriel	38de03d2a2	nl80211: add feature for BSS selection support Introducing a new feature that the driver can use to indicate the driver/firmware supports configuration of BSS selection criteria upon CONNECT command. This can be useful when multiple BSS-es are found belonging to the same ESS, i.e. Infra-BSS with same SSID. The criteria can then be used to offload selection of a preferred BSS. Reviewed-by: Hante Meuleman <meuleman@broadcom.com> Reviewed-by: Franky (Zhenhui) Lin <frankyl@broadcom.com> Reviewed-by: Pieter-Paul Giesberts <pieterpg@broadcom.com> Reviewed-by: Lei Zhang <leizh@broadcom.com> Signed-off-by: Arend van Spriel <arend@broadcom.com> [move wiphy support check into parse_bss_select()] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:56:34 +02:00
Sara Sharon	f59374eb42	mac80211: synchronize driver rx queues before removing a station Some devices, like iwlwifi, have RSS queues. This may cause a situation where a disassociation is handled in control path and results in station removal while there are prior RX frames that were still not processed in other queues. When they are processed, the station will be gone, and the frames will be dropped. Add a synchronization interface to avoid that. When the driver returns from the synchronization, mac80211 may remove the station. Signed-off-by: Sara Sharon <sara.sharon@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:56:34 +02:00
Bob Copeland	60854fd945	mac80211: mesh: convert path table to rhashtable In the time since the mesh path table was implemented as an RCU-traversable, dynamically growing hash table, a generic RCU hashtable implementation was added to the kernel. Switch the mesh path table over to rhashtable to remove some code and also gain some features like automatic shrinking. Cc: Thomas Graf <tgraf@suug.ch> Cc: netdev@vger.kernel.org Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:56:33 +02:00
Bob Copeland	8f6fd83c6c	rhashtable: accept GFP flags in rhashtable_walk_init In certain cases, the 802.11 mesh pathtable code wants to iterate over all of the entries in the forwarding table from the receive path, which is inside an RCU read-side critical section. Enable walks inside atomic sections by allowing GFP_ATOMIC allocations for the walker state. Change all existing callsites to pass in GFP_KERNEL. Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Bob Copeland <me@bobcopeland.com> [also adjust gfs2/glock.c and rhashtable tests] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:56:32 +02:00
Bob Copeland	947c2a0ecc	mac80211: mesh: embed known gates list in struct mesh_path The mesh path table uses a struct mesh_node in its hlists in order to support a resizable hash table: the mesh_node provides an indirection to the actual mesh path so that two different bucket lists can point to the same path entry. However, for the known gates list, we don't need this indirection because there is only ever one list. So we can just embed the hlist_node in the mesh path itself, which simplifies things a bit and saves a linear search whenever we need to find an item in the list. Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:56:32 +02:00
Bob Copeland	b15dc38b98	mac80211: mesh: factor out common mesh path allocation code Remove duplicate code to allocate and initialize a mesh path or mesh proxy path. Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:56:31 +02:00
Bob Copeland	443954815b	mac80211: mesh: don't hash sdata in mpath tables Now that the sdata pointer is the same for all entries of a path table, hashing it is pointless, so hash only the address. Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:56:31 +02:00
Bob Copeland	2bdaf386f9	mac80211: mesh: move path tables into if_mesh The mesh path and mesh gate hashtables are global, containing all of the mpaths for every mesh interface, but the paths are all tied logically to a single interface. The common case is just a single mesh interface, so optimize for that by moving the global hashtable into the per-interface struct. Doing so allows us to drop sdata pointer comparisons inside the lookups and also saves a few bytes of BSS and data. Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:56:30 +02:00
Jouni Malinen	e345f44f2b	mac80211: Support a scan request for a specific BSSID If the cfg80211 scan trigger operation specifies a single BSSID, use that value instead of the wildcard BSSID in the Probe Request frames. Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:56:28 +02:00
Jouni Malinen	818965d391	cfg80211: Allow a scan request for a specific BSSID This allows scans for a specific BSSID to be optimized by the user space application by requesting the driver to set the Probe Request frame BSSID field (Address 3) to the specified BSSID instead of the wildcard BSSID. This prevents other APs from replying, which reduces the airtime needed and the latency of getting the response from the target AP. This is an optimization and as such, it is acceptable for some of the drivers not to support the mechanism. If not supported, the wildcard BSSID will be used and more responses may be received. Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:56:28 +02:00
Arik Nemtsov	aa507a7bc5	mac80211: recalc min_def chanctx even when chandef is identical The min_def chanctx is affected not only by the current chandef, but sometimes also by other stations on the vif. There's a valid scenario where a TDLS peer can widen its BW, thereby causing the min_def to increase. Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:51:08 +02:00
Arik Nemtsov	59021c6759	mac80211: TDLS: change BW calculation for WIDER_BW peers The previous approach simply ignored chandef restrictions when calculating the appropriate peer BW for a WIDER_BW peer. This could result in a regulatory violation if both peers indicated 80MHz support, but the regdomain forbade it. Change the approach to setting a WIDER_BW peer's BW. Don't exempt it from the chandef width at first. If during TDLS negotiation the chandef width is upgraded, update the peer's BW to match. Fixes: `0fabfaafec` ("mac80211: upgrade BW of TDLS peers when possible") Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:50:52 +02:00
Arik Nemtsov	db8d99774c	mac80211: TDLS: always downgrade invalid chandefs Even if the current chandef width is equal to the station's max-BW, it doesn't mean it's a valid width for TDLS. Make sure to always check regulatory constraints in these cases. Fixes: `0fabfaafec` ("mac80211: upgrade BW of TDLS peers when possible") Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:50:32 +02:00
Felix Fietkau	c3732a7b37	mac80211: fix AP buffered multicast frames with queue control and txq Buffered multicast frames must be passed to the driver directly via drv_tx instead of going through the txq, otherwise they cannot easily be scheduled to be sent after DTIM. Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:50:17 +02:00
Sara Sharon	f980ebc058	mac80211: allow not sending MIC up from driver for HW crypto When HW crypto is used, there's no need for the CCMP/GCMP MIC to be available to mac80211, and the hardware might have removed it already after checking. The MIC is also useless to have when the frame is already decrypted, so allow indicating that it's not present. Since we are running out of bits in mac80211_rx_flags, make the flags field a u64. Signed-off-by: Sara Sharon <sara.sharon@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:48:56 +02:00
Johannes Berg	162dd6a725	mac80211: allow drivers to report CLOCK_BOOTTIME for scan results This was requested by Android, and the appropriate cfg80211 API had been added by Dmitry. Support it in mac80211, allowing drivers to provide the timestamp. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:48:55 +02:00
Lorenzo Bianconi	646e76bb5d	mac80211: parse VHT info in injected frames Add VHT radiotap parsing support to ieee80211_parse_tx_radiotap(). That capability has been tested using a d-link dir-860l rev b1 running OpenWrt trunk and mt76 driver. Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi83@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:48:54 +02:00
João Paulo Rechi Vita	1948b2a2ec	rfkill: Use switch to demux userspace operations Using a switch to handle different ev.op values in rfkill_fop_write() makes the code easier to extend, as out-of-range values can always be handled by the default case. Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com> [roll in fix for RFKILL_OP_CHANGE from Jouni] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:48:53 +02:00
Johannes Berg	98bd147d79	wext: unregister_pernet_subsys() on notifier registration failure If register_netdevice_notifier() fails (which in practice it can't right now), we should call unregister_pernet_subsys(). Do that. Reported-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-04-05 10:48:44 +02:00
Eric Dumazet	4ce7e93cb3	tcp: rate limit ACK sent by SYN_RECV request sockets Attackers like to use SYNFLOOD targeting one 5-tuple, as they hit a single RX queue (and cpu) on the victim. If they use random sequence numbers in their SYN, we detect they do not match the expected window and send back an ACK. This patch adds a rate limitation, so that the effect of such attacks is limited to ingress only. We roughly double our ability to absorb such attacks. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Maciej Żenczykowski <maze@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 22:11:20 -04:00
Eric Dumazet	a9d6532b56	ipv4: tcp: set SOCK_USE_WRITE_QUEUE for ip_send_unicast_reply() TCP uses per-cpu 'sockets' to send some packets: - RST packets (tcp_v4_send_reset()) - ACK packets for SYN_RECV and TIMEWAIT sockets. By setting the SOCK_USE_WRITE_QUEUE flag, we tell sock_wfree() not to call sk_write_space() since these internal sockets do not care. This gives a small performance improvement, merely by allowing the cpu to properly predict the sock_wfree() conditional branch, and avoiding one atomic operation. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 22:11:20 -04:00
Eric Dumazet	9caad86415	tcp: increment sk_drops for listeners Goal: packets dropped by a listener are accounted for. This adds tcp_listendrop() helper, and clears sk_drops in sk_clone_lock() so that children do not inherit their parent drop count. Note that we no longer increment LINUX_MIB_LISTENDROPS counter when sending a SYNCOOKIE, since the SYN packet generated a SYNACK. We already have a separate LINUX_MIB_SYNCOOKIESSENT counter. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 22:11:20 -04:00
Eric Dumazet	532182cd61	tcp: increment sk_drops for dropped rx packets Now that ss can report sk_drops, we can instruct TCP to increment this per-socket counter when it drops an incoming frame, to refine monitoring and debugging. The following patch takes care of listener drops. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 22:11:20 -04:00
Eric Dumazet	15239302ed	sock_diag: add SK_MEMINFO_DROPS Reporting sk_drops to user space was available for UDP sockets using /proc interface. Add this to sock_diag, so that we can have the same information available to ss users, and we'll be able to add sk_drops indications for TCP sockets as well. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 22:11:20 -04:00
Eric Dumazet	3b24d854cb	tcp/dccp: do not touch listener sk_refcnt under synflood When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes. By letting listeners use SOCK_RCU_FREE infrastructure, we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt. Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets, only listeners are impacted by this change. Peak performance under SYNFLOOD is increased by ~33%: on my test machine, I could process 3.2 Mpps instead of 2.4 Mpps. The most consuming functions are now skb_set_owner_w() and sock_wfree() contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 22:11:20 -04:00
Eric Dumazet	2d331915a0	tcp/dccp: use rcu locking in inet_diag_find_one_icsk() RX packet processing holds rcu_read_lock(), so we can remove pairs of rcu_read_lock()/rcu_read_unlock() in lookup functions if inet_diag also holds rcu before calling them. This is needed anyway as __inet_lookup_listener() and inet6_lookup_listener() will soon no longer increment refcount on the found listener. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 22:11:19 -04:00
Eric Dumazet	ee3cf32a4a	tcp/dccp: remove BH disable/enable in lookup Since linux 2.6.29, lookups only use rcu locking. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 22:11:19 -04:00
Eric Dumazet	ca065d0cf8	udp: no longer use SLAB_DESTROY_BY_RCU Tom Herbert would like not touching UDP socket refcnt for encapsulated traffic. For this to happen, we need to use normal RCU rules, with a grace period before freeing a socket. UDP sockets are not short lived in the high usage case, so the added cost of call_rcu() should not be a concern. This actually removes a lot of complexity in UDP stack. Multicast receives no longer need to hold a bucket spinlock. Note that ip early demux still needs to take a reference on the socket. Same remark for functions used by xt_socket and xt_PROXY netfilter modules, but this might be changed later. Performance for a single UDP socket receiving flood traffic from many RX queues/cpus. Simple udp_rx using simple recvfrom() loop : 438 kpps instead of 374 kpps : 17 % increase of the peak rate. v2: Addressed Willem de Bruijn feedback in multicast handling - keep early demux break in __udp4_lib_demux_lookup() Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Cc: Willem de Bruijn <willemb@google.com> Tested-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 22:11:19 -04:00
Eric Dumazet	a4298e4522	net: add SOCK_RCU_FREE socket flag We want a generic way to insert an RCU grace period before socket freeing for cases where SLAB_DESTROY_BY_RCU is adding too much overhead. SLAB_DESTROY_BY_RCU strict rules force us to take a reference on the socket sk_refcnt, and it is a performance problem for UDP encapsulation, or TCP synflood behavior, as many CPUs might attempt the atomic operations on a shared sk_refcnt. UDP sockets and TCP listeners can set SOCK_RCU_FREE so that their lookup can use traditional RCU rules, without refcount changes. They can set the flag only once hashed and visible by other cpus. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Tested-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 22:11:19 -04:00
Bastien Philbert	c862cc9b70	bridge: Fix incorrect variable assignment on error path in br_sysfs_addbr This fixes the incorrect variable assignment on the error path in br_sysfs_addbr: when the call to kobject_create_and_add fails, assign -EINVAL to the returned variable err rather than incorrectly returning zero. Without this, err still holds zero from the earlier successful call to sysfs_create_group, making callers think this function has succeeded. Signed-off-by: Bastien Philbert <bastienphilbert@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 16:12:37 -04:00
Haishuang Yan	be447f3054	ipv6: l2tp: fix a potential issue in l2tp_ip6_recv pskb_may_pull() can change skb->data, so we have to load ptr/optr at the right place. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 16:00:28 -04:00
Haishuang Yan	5745b8232e	ipv4: l2tp: fix a potential issue in l2tp_ip_recv pskb_may_pull() can change skb->data, so we have to load ptr/optr at the right place. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 16:00:28 -04:00
Soheil Hassas Yeganeh	c14ac9451c	sock: enable timestamping using control messages Currently, SO_TIMESTAMPING can only be enabled using setsockopt. This is very costly when users want to sample writes to gather tx timestamps. Add support for enabling SO_TIMESTAMPING via control messages by using tsflags added in `struct sockcm_cookie` (added in the previous patches in this series) to set the tx_flags of the last skb created in a sendmsg. With this patch, the timestamp recording bits in tx_flags of the skbuff are overridden if SO_TIMESTAMPING is passed in a cmsg. Please note that this is only effective for overriding the recording timestamps flags. Users should enable timestamp reporting (e.g., SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using socket options and then should ask for SOF_TIMESTAMPING_TX_* using control messages per sendmsg to sample timestamps for each write. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 15:50:30 -04:00
Soheil Hassas Yeganeh	ad1e46a837	ipv6: process socket-level control messages in IPv6 Process socket-level control messages by invoking __sock_cmsg_send in ip6_datagram_send_ctl for control messages on the SOL_SOCKET layer. This makes sure whenever ip6_datagram_send_ctl is called for udp and raw, we also process socket-level control messages. This is a bit uglier than IPv4, since IPv6 does not have something like ipcm_cookie. Perhaps we can later create a control message cookie for IPv6? Note that this commit interprets new control messages that were ignored before. As such, this commit does not change the behavior of IPv6 control messages. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 15:50:30 -04:00
Soheil Hassas Yeganeh	24025c465f	ipv4: process socket-level control messages in IPv4 Process socket-level control messages by invoking __sock_cmsg_send in ip_cmsg_send for control messages on the SOL_SOCKET layer. This makes sure whenever ip_cmsg_send is called in udp, icmp, and raw, we also process socket-level control messages. Note that this commit interprets new control messages that were ignored before. As such, this commit does not change the behavior of IPv4 control messages. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 15:50:30 -04:00
Soheil Hassas Yeganeh	3dd17e63f5	sock: accept SO_TIMESTAMPING flags in socket cmsg Accept SO_TIMESTAMPING in control messages of the SOL_SOCKET level as a basis to accept timestamping requests per write. This implementation only accepts TX recording flags (i.e., SOF_TIMESTAMPING_TX_HARDWARE, SOF_TIMESTAMPING_TX_SOFTWARE, SOF_TIMESTAMPING_TX_SCHED, and SOF_TIMESTAMPING_TX_ACK) in control messages. Users need to set reporting flags (e.g., SOF_TIMESTAMPING_OPT_ID) per socket via socket options. This commit adds a tsflags field in sockcm_cookie which is set in __sock_cmsg_send. It only overrides the SOF_TIMESTAMPING_TX_* bits in sockcm_cookie.tsflags, allowing the control message to override the recording behavior per write while maintaining the value of other flags. This patch implements validating the control message and setting tsflags in struct sockcm_cookie. Next commits in this series will actually implement timestamping per write for different protocols. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 15:50:30 -04:00
Soheil Hassas Yeganeh	6b084928ba	tcp: use one bit in TCP_SKB_CB to mark ACK timestamps Currently, to avoid a cache line miss for accessing skb_shinfo, tcp_ack_tstamp skips sockets that do not have the SOF_TIMESTAMPING_TX_ACK bit set in sk_tsflags. This is implemented based on an implicit assumption that SOF_TIMESTAMPING_TX_ACK is set via socket options for the duration that ACK timestamps are needed. To implement per-write timestamps, this check should be removed and replaced with a per-packet alternative that quickly skips packets missing ACK timestamp marks without a cache-line miss. To enable per-packet marking without a cache line miss, use one bit in TCP_SKB_CB to mark whether an SKB might need an ACK tx timestamp or not. Further checks in tcp_ack_tstamp are not modified and work as before. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 15:50:29 -04:00
Soheil Hassas Yeganeh	6db8b963a7	tcp: accept SOF_TIMESTAMPING_OPT_ID for passive TFO SOF_TIMESTAMPING_OPT_ID is set to get data-independent IDs to associate timestamps with send calls. For TCP connections, tp->snd_una is used as the starting point to calculate relative IDs. This socket option will fail if set before the handshake on a passive TCP fast open connection with data in SYN or SYN/ACK, since setsockopt requires the connection to be in the ESTABLISHED state. To address these, instead of limiting the option to the ESTABLISHED state, accept the SOF_TIMESTAMPING_OPT_ID option as long as the connection is not in LISTEN or CLOSE states. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 15:50:29 -04:00
Willem de Bruijn	39771b127b	sock: break up sock_cmsg_snd into __sock_cmsg_snd and loop To process cmsg's of the SOL_SOCKET level in addition to cmsgs of another level, protocols can call sock_cmsg_send(). This causes a double walk on the cmsghdr list, one for SOL_SOCKET and one for the other level. Extract the inner demultiplex logic from the loop that walks the list, to allow having this called directly from a walker in the protocol specific code. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-04 15:50:29 -04:00
Linus Torvalds	4a2d057e4f	Merge branch 'PAGE_CACHE_SIZE-removal' Merge PAGE_CACHE_SIZE removal patches from Kirill Shutemov: "PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced long time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. Let's stop pretending that pages in page cache are special. They are not. The first patch with most changes has been done with coccinelle. The second is manual fixups on top. The third patch removes macros definition" [ I was planning to apply this just before rc2, but then I spaced out, so here it is right _after_ rc2 instead. As Kirill suggested as a possibility, I could have decided to only merge the first two patches, and leave the old interfaces for compatibility, but I'd rather get it all done and any out-of-tree modules and patches can trivially do the conversion while still also working with older kernels, so there is little reason to try to maintain the redundant legacy model. - Linus ] * PAGE_CACHE_SIZE-removal: mm: drop PAGE_CACHE_* and page_cache_{get,release} definition mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros	2016-04-04 10:50:24 -07:00
Kirill A. Shutemov	ea1754a084	mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage Mostly direct substitution with occasional adjustment or removing outdated comments. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-04-04 10:41:08 -07:00
Kirill A. Shutemov	09cbfeaf1a	mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced long time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE is assumed to be equal to PAGE_SIZE. And it's a constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CACHE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-04-04 10:41:08 -07:00
Haishuang Yan	7822ce73e6	netlink: use nla_get_in_addr and nla_put_in_addr for ipv4 address Now that nla_get_in_addr and nla_put_in_addr are implemented, use them where appropriate. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-02 20:15:58 -04:00
Yuchung Cheng	2349262397	tcp: remove cwnd moderation after recovery For non-SACK connections, cwnd is lowered to inflight plus 3 packets when the recovery ends. This is an optional feature in the NewReno RFC 2582 to reduce the potential burst when cwnd is "re-opened" after recovery and inflight is low. This feature is questionably effective because of PRR: when the recovery ends (i.e., snd_una == high_seq) NewReno holds the CA_Recovery state for another round trip to prevent false fast retransmits. But if the inflight is low, PRR will overwrite the moderated cwnd in tcp_cwnd_reduction() later regardless. So if a receiver responds with bogus ACKs (i.e., acking future data) to speed up transfer after recovery, it can only induce a burst up to a window worth of data packets by acking up to SND.NXT. A restart from (short) idle or receiving stretched ACKs can both cause such bursts as well. On the other hand, if the recovery ends because the sender detects the losses were spurious (e.g., reordering), this feature unconditionally lowers a reverted cwnd even though nothing was lost. In principle, the loss recovery module should not update cwnd. Furthermore, pacing is much more effective at reducing bursts. Hence this patch removes the cwnd moderation feature. v2 changes: revised commit message on bogus ACKs and burst, and missing signature Signed-off-by: Matt Mathis <mattmathis@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-02 20:11:43 -04:00
Linus Torvalds	05cf8077e5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Missing device reference in IPSEC input path results in crashes during device unregistration. From Subash Abhinov Kasiviswanathan. 2) Per-queue ISR register writes not being done properly in macb driver, from Cyrille Pitchen. 3) Stats accounting bugs in bcmgenet, from Patri Gynther. 4) Lightweight tunnel's TTL and TOS were swapped in netlink dumps, from Quentin Armitage. 5) SXGBE driver has off-by-one in probe error paths, from Rasmus Villemoes. 6) Fix race in save/swap/delete options in netfilter ipset, from Vishwanath Pai. 7) Ageing time of bridge not set properly when not operating over a switchdev device. Fix from Haishuang Yan. 8) Fix GRO regression wrt nested FOU/GUE based tunnels, from Alexander Duyck. 9) IPV6 UDP code bumps wrong stats, from Eric Dumazet. 10) FEC driver should only access registers that actually exist on the given chipset, fix from Fabio Estevam. 
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits) net: mvneta: fix changing MTU when using per-cpu processing stmmac: fix MDIO settings Revert "stmmac: Fix 'eth0: No PHY found' regression" stmmac: fix TX normal DESC net: mvneta: use cache_line_size() to get cacheline size net: mvpp2: use cache_line_size() to get cacheline size net: mvpp2: fix maybe-uninitialized warning tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter net: usb: cdc_ncm: adding Telit LE910 V2 mobile broadband card rtnl: fix msg size calculation in if_nlmsg_size() fec: Do not access unexisting register in Coldfire net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES net: mvpp2: replace MVPP2_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES net: dsa: mv88e6xxx: Clear the PDOWN bit on setup net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read, write} bpf: make padding in bpf_tunnel_key explicit ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates bnxt_en: Fix ethtool -a reporting. bnxt_en: Fix typo in bnxt_hwrm_set_pause_common(). bnxt_en: Implement proper firmware message padding. ...	2016-04-01 20:03:33 -05:00
Daniel Borkmann	5a5abb1fa3	tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter Sasha Levin reported a suspicious rcu_dereference_protected() warning found while fuzzing with trinity that is similar to this one: [ 52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage! [ 52.765688] other info that might help us debug this: [ 52.765695] rcu_scheduler_active = 1, debug_locks = 1 [ 52.765701] 1 lock held by a.out/1525: [ 52.765704] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff816a64b7>] rtnl_lock+0x17/0x20 [ 52.765721] stack backtrace: [ 52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264 [...] [ 52.765768] Call Trace: [ 52.765775] [<ffffffff813e488d>] dump_stack+0x85/0xc8 [ 52.765784] [<ffffffff810f2fa5>] lockdep_rcu_suspicious+0xd5/0x110 [ 52.765792] [<ffffffff816afdc2>] sk_detach_filter+0x82/0x90 [ 52.765801] [<ffffffffa0883425>] tun_detach_filter+0x35/0x90 [tun] [ 52.765810] [<ffffffffa0884ed4>] __tun_chr_ioctl+0x354/0x1130 [tun] [ 52.765818] [<ffffffff8136fed0>] ? selinux_file_ioctl+0x130/0x210 [ 52.765827] [<ffffffffa0885ce3>] tun_chr_ioctl+0x13/0x20 [tun] [ 52.765834] [<ffffffff81260ea6>] do_vfs_ioctl+0x96/0x690 [ 52.765843] [<ffffffff81364af3>] ? security_file_ioctl+0x43/0x60 [ 52.765850] [<ffffffff81261519>] SyS_ioctl+0x79/0x90 [ 52.765858] [<ffffffff81003ba2>] do_syscall_64+0x62/0x140 [ 52.765866] [<ffffffff817d563f>] entry_SYSCALL64_slow_path+0x25/0x25 Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH, DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices. Since the fix in `f91ff5b9ff` ("net: sk_{detach\|attach}_filter() rcu fixes") sk_attach_filter()/sk_detach_filter() now dereferences the filter with rcu_dereference_protected(), checking whether socket lock is held in control path. 
Since its introduction in `9940516259` ("tun: socket filter support"), tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the sock_owned_by_user(sk) doesn't apply in this specific case and therefore triggers the false positive. Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair that is used by tap filters and pass in lockdep_rtnl_is_held() for the rcu_dereference_protected() checks instead. Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-01 14:33:46 -04:00
Nicolas Dichtel	c57c7a95da	rtnl: fix msg size calculation in if_nlmsg_size() Size of the attribute IFLA_PHYS_PORT_NAME was missing. Fixes: `db24a9044e` ("net: add support for phys_port_name") CC: David Ahern <dsahern@gmail.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-31 16:49:54 -04:00
Daniel Borkmann	c0e760c9c6	bpf: make padding in bpf_tunnel_key explicit Make the 2 byte padding in struct bpf_tunnel_key between tunnel_ttl and tunnel_label members explicit. No issue has been observed, and gcc/llvm already pads the old struct, where tunnel_label was not yet present, so the current code works, but since it's part of uapi, make sure we don't introduce holes in structs. Therefore, add tunnel_ext that we can use generically in future (f.e. to flag OAM messages for backends, etc). Also add the offset to the compat tests to be sure, should some compilers not pad the tail of the old version of bpf_tunnel_key. Fixes: `4018ab1875` ("bpf: support flow label for bpf_skb_{set, get}_tunnel_key") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-30 19:01:33 -04:00
Eric Dumazet	2d4212261f	ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates IPv6 counters updates use a different macro than IPv4. Fixes: `36cbb2452c` ("udp: Increment UDP_MIB_IGNOREDMULTI for arriving unmatched multicasts") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Rick Jones <rick.jones2@hp.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-30 19:01:33 -04:00
Alexander Duyck	c3483384ee	gro: Allow tunnel stacking in the case of FOU/GUE This patch should fix the issues seen with a recent fix to prevent tunnel-in-tunnel frames from being generated with GRO. The fix itself is correct for now as long as we do not add any devices that support NETIF_F_GSO_GRE_CSUM. When such a device is added it could have the potential to mess things up due to the fact that the outer transport header points to the outer UDP header and not the GRE header as would be expected. Fixes: `fac8e0f579` ("tunnels: Don't apply GRO to multiple layers of encapsulation.") Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-30 16:02:33 -04:00
Marcelo Ricardo Leitner	28fd34985b	sctp: really allow using GFP_KERNEL on sctp_packet_transmit Somehow my patch for commit `cea8768f33` ("sctp: allow sctp_transmit_packet and others to use gfp") missed two important chunks, which are now added. Fixes: `cea8768f33` ("sctp: allow sctp_transmit_packet and others to use gfp") Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-By: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-30 15:41:22 -04:00
Haishuang Yan	5e263f7126	bridge: Allow set bridge ageing time when switchdev disabled When NET_SWITCHDEV=n, switchdev_port_attr_set will return -EOPNOTSUPP, we should ignore this error code and continue to set the ageing time. Fixes: `c62987bbd8` ("bridge: push bridge setting ageing_time down to switchdev") Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Acked-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-30 15:38:13 -04:00
Liping Zhang	8fef24ca90	netfilter: ip6t_SYNPROXY: remove magic number for hop_limit Replace '64' with the per-net ipv6_devconf_all's hop_limit when building the ipv6 header. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-03-29 13:31:17 +02:00
Stephane Bryant	8d45ff22f1	netfilter: bridge: nf queue verdict to use NFQA_VLAN and NFQA_L2HDR This makes nf queues use NFQA_VLAN and NFQA_L2HDR in verdict to modify the original skb Signed-off-by: Stephane Bryant <stephane.ml.bryant@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-03-29 13:29:30 +02:00
Stephane Bryant	15824ab29f	netfilter: bridge: pass L2 header and VLAN as netlink attributes in queues to userspace - This creates 2 netlink attributes, NFQA_VLAN and NFQA_L2HDR. - These are filled up for the PF_BRIDGE family on the way to userspace. - NFQA_VLAN is a nested attribute, with the NFQA_VLAN_PROTO and the NFQA_VLAN_TCI carrying the corresponding vlan_proto and vlan_tci fields from the skb using big endian ordering (and using the CFI bit as the VLAN_TAG_PRESENT flag in vlan_tci as in the skb) Signed-off-by: Stephane Bryant <stephane.ml.bryant@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-03-29 13:26:38 +02:00
Stephane Bryant	ac28634456	netfilter: bridge: add nf_afinfo to enable queuing to userspace This just adds and registers a nf_afinfo for the ethernet bridge, which enables queuing to userspace for the AF_BRIDGE family. No checksum computation is done. Signed-off-by: Stephane Bryant <stephane.ml.bryant@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-03-29 13:24:37 +02:00
David S. Miller	0c84ea17ff	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) There was a race condition between parallel save/swap and delete, which resulted in a kernel crash due to the increase ref for save, swap, wrong ref decrease operations. Reported and fixed by Vishwanath Pai. 2) OVS should call into CT NAT for packets of new expected connections only when the conntrack state is persisted with the 'commit' option to the OVS CT action. From Jarno Rajahalme. 3) Resolve kconfig dependencies with new OVS NAT support. From Arnd Bergmann. 4) Early validation of entry->target_offset to make sure it doesn't take us out from the blob, from Florian Westphal. 5) Again early validation of entry->next_offset to make sure it doesn't take us out from the blob, also from Florian. 6) Check that entry->target_offset is always of sizeof(struct xt_entry) for unconditional entries, when checking both from check_underflow() and when checking for loops in mark_source_chains(), again from Florian. 7) Fix inconsistent behaviour in nfnetlink_queue when NFQA_CFG_F_FAIL_OPEN is set and netlink_unicast() fails due to buffer overrun, we have to reinject the packet as the user expects. 8) Enforce nul-terminated table names from getsockopt GET_ENTRIES requests. 9) Don't assume skb->sk is set from nft_bridge_reject and synproxy, this fixes a recent update of the code to namespaceify ip_default_ttl, patch from Liping Zhang. This batch comes with four patches to validate x_tables blobs coming from userspace. CONFIG_USERNS exposes the x_tables interface to unprivileged users and to be honest this interface never received the attention for this move away from the CAP_NET_ADMIN domain. Florian is working on another round with more patches with more sanity checks, so expect a bit more Netfilter fixes in this development cycle than usual. 
==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-28 15:38:59 -04:00
Al Viro	2da62906b1	[net] drop 'size' argument of sock_recvmsg() all callers have it equal to msg_data_left(msg). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-03-28 13:57:51 -04:00
Liping Zhang	29421198c3	netfilter: ipv4: fix NULL dereference Commit `fa50d974d1` ("ipv4: Namespaceify ip_default_ttl sysctl knob") uses sock_net(skb->sk) to get the net namespace, but we can't assume that sk_buff->sk is always set, so when it is NULL, an oops will happen. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Reviewed-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-03-28 17:59:29 +02:00
Pablo Neira Ayuso	b301f25387	netfilter: x_tables: enforce nul-terminated table name from getsockopt GET_ENTRIES Make sure the table name passed via getsockopt GET_ENTRIES is nul-terminated in ebtables and all the x_tables variants and their respective compat code. Uncovered by KASAN. Reported-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-03-28 17:59:24 +02:00
Pablo Neira Ayuso	931401137f	netfilter: nfnetlink_queue: honor NFQA_CFG_F_FAIL_OPEN when netlink unicast fails When netlink unicast fails to deliver the message to userspace, we should also check if the NFQA_CFG_F_FAIL_OPEN flag is set so we reinject the packet back to the stack. I think the user expects no packet drops when this flag is set due to queueing to userspace errors, no matter if related to the internal queue or when sending the netlink message to userspace. The userspace application will still get the ENOBUFS error via recvmsg() so the user still knows that, with the current configuration that is in place, the userspace application is not consuming the messages at the pace that the kernel needs. Reported-by: "Yigal Reiss (yreiss)" <yreiss@cisco.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Tested-by: "Yigal Reiss (yreiss)" <yreiss@cisco.com>	2016-03-28 17:59:20 +02:00
Florian Westphal	54d83fc74a	netfilter: x_tables: fix unconditional helper Ben Hawkes says: In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it is possible for a user-supplied ipt_entry structure to have a large next_offset field. This field is not bounds checked prior to writing a counter value at the supplied offset. Problem is that mark_source_chains should not have been called -- the rule doesn't have a next entry, so it's supposed to return an absolute verdict of either ACCEPT or DROP. However, the function unconditional() doesn't work as the name implies. It only checks that the rule is using wildcard address matching. However, an unconditional rule must also not be using any matches (no -m args). The underflow validator only checked the addresses, therefore passing the 'unconditional absolute verdict' test, while mark_source_chains also tested for presence of matches, and thus proceeded to the next (non-existent) rule. Unify this so that all the callers have the same idea of 'unconditional rule'. Reported-by: Ben Hawkes <hawkes@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-03-28 17:59:15 +02:00
Florian Westphal	6e94e0cfb0	netfilter: x_tables: make sure e->next_offset covers remaining blob size Otherwise this function may read data beyond the ruleset blob. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-03-28 17:59:08 +02:00
Florian Westphal	bdf533de69	netfilter: x_tables: validate e->target_offset early We should check that e->target_offset is sane before mark_source_chains gets called since it will fetch the target entry for loop detection. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-03-28 17:59:04 +02:00
Arnd Bergmann	99b7248e2a	openvswitch: call only into reachable nf-nat code The openvswitch code has gained support for calling into the nf-nat-ipv4/ipv6 modules, however those can be loadable modules in a configuration in which openvswitch is built-in, leading to link errors: net/built-in.o: In function `__ovs_ct_lookup': :(.text+0x2cc2c8): undefined reference to `nf_nat_icmp_reply_translation' :(.text+0x2cc66c): undefined reference to `nf_nat_icmpv6_reply_translation' The dependency on (!NF_NAT || NF_NAT) prevents similar issues, but NF_NAT is set to 'y' if any of the symbols selecting it are built-in, but the link error happens when any of them are modular. A second issue is that even if CONFIG_NF_NAT_IPV6 is built-in, CONFIG_NF_NAT_IPV4 might be completely disabled. This is unlikely to be useful in practice, but the driver currently only handles IPv6 being optional. This patch improves the Kconfig dependency so that openvswitch cannot be built-in if either of the two other symbols are set to 'm', and it replaces the incorrect #ifdef in ovs_ct_nat_execute() with two "if (IS_ENABLED())" checks that should catch all corner cases and also make the code more readable. The same #ifdef exists in ovs_ct_nat_to_attr(), where it does not cause a link error, but for consistency I'm changing it the same way. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `05752523e5` ("openvswitch: Interface with NAT.") Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-03-28 17:58:59 +02:00
Jarno Rajahalme	5745b0be05	openvswitch: Fix checking for new expected connections. OVS should call into CT NAT for packets of new expected connections only when the conntrack state is persisted with the 'commit' option to the OVS CT action. The test for this condition is doubly wrong, as the CT status field is ANDed with the bit number (IPS_EXPECTED_BIT) rather than the mask (IPS_EXPECTED), and due to the wrong assumption that the expected bit would apply only for the first (i.e., 'new') packet of a connection, while in fact the expected bit remains on for the lifetime of an expected connection. The 'ctinfo' value IP_CT_RELATED derived from the ct status can be used instead, as it is only ever applicable to the 'new' packets of the expected connection. Fixes: `05752523e5` ('openvswitch: Interface with NAT.') Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-03-28 17:58:51 +02:00
Vishwanath Pai	596cf3fe58	netfilter: ipset: fix race condition in ipset save, swap and delete This fix adds a new reference counter (ref_netlink) for the struct ip_set. The other reference counter (ref) can be swapped out by ip_set_swap and we need a separate counter to keep track of references for netlink events like dump. Using the same ref counter for dump causes a race condition which can be demonstrated by the following script: ipset create hash_ip1 hash:ip family inet hashsize 1024 maxelem 500000 \ counters ipset create hash_ip2 hash:ip family inet hashsize 300000 maxelem 500000 \ counters ipset create hash_ip3 hash:ip family inet hashsize 1024 maxelem 500000 \ counters ipset save & ipset swap hash_ip3 hash_ip2 ipset destroy hash_ip3 /* will crash the machine */ Swap will exchange the values of ref so destroy will see ref = 0 instead of ref = 1. With this fix in place swap will not succeed because ipset save still has ref_netlink on the set (ip_set_swap doesn't swap ref_netlink). Both delete and swap will error out if ref_netlink != 0 on the set. Note: The changes to *_head functions is because previously we would increment ref whenever we called these functions, we don't do that anymore. Reviewed-by: Joshua Hunt <johunt@akamai.com> Signed-off-by: Vishwanath Pai <vpai@akamai.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-03-28 17:57:45 +02:00
Haishuang Yan	ac71b46efd	openvswitch: Use proper buffer size in nla_memcpy For the input parameter count, it's better to use the size of the destination buffer, as nla_memcpy would take into account the length of the source netlink attribute when data is copied from an attribute. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-28 11:37:14 -04:00
Weongyo Jeong	ccd63c20fe	netfilter: nf_conntrack: Uses pr_fmt() for logging. Uses pr_fmt() macro for debugging messages of nf_conntrack module. Signed-off-by: Weongyo Jeong <weongyo.linux@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-03-28 12:56:07 +02:00
Quentin Armitage	995096a0a4	Fix returned tc and hoplimit values for route with IPv6 encapsulation For a route with IPv6 encapsulation, the traffic class and hop limit values are interchanged when returned to userspace by the kernel. For example, see below. ># ip route add 192.168.0.1 dev eth0.2 encap ip6 dst 0x50 tc 0x50 hoplimit 100 table 1000 ># ip route show table 1000 192.168.0.1 encap ip6 id 0 src :: dst fe83::1 hoplimit 80 tc 100 dev eth0.2 scope link Signed-off-by: Quentin Armitage <quentin@armitage.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-03-27 22:35:02 -04:00
Linus Torvalds	d5a38f6e46	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "There is quite a bit here, including some overdue refactoring and cleanup on the mon_client and osd_client code from Ilya, scattered writeback support for CephFS and a pile of bug fixes from Zheng, and a few random cleanups and fixes from others" [ I already decided not to pull this because of it having been rebased recently, but ended up changing my mind after all. Next time I'll really hold people to it. Oh well. - Linus ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits) libceph: use KMEM_CACHE macro ceph: use kmem_cache_zalloc rbd: use KMEM_CACHE macro ceph: use lookup request to revalidate dentry ceph: kill ceph_get_dentry_parent_inode() ceph: fix security xattr deadlock ceph: don't request vxattrs from MDS ceph: fix mounting same fs multiple times ceph: remove unnecessary NULL check ceph: avoid updating directory inode's i_size accidentally ceph: fix race during filling readdir cache libceph: use sizeof_footer() more ceph: kill ceph_empty_snapc ceph: fix a wrong comparison ceph: replace CURRENT_TIME by current_fs_time() ceph: scattered page writeback libceph: add helper that duplicates last extent operation libceph: enable large, variable-sized OSD requests libceph: osdc->req_mempool should be backed by a slab pool libceph: make r_request msg_size calculation clearer ...	2016-03-26 15:53:16 -07:00
Geliang Tang	5ee61e95b6	libceph: use KMEM_CACHE macro Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2016-03-25 18:51:57 +01:00
Ilya Dryomov	89f081730c	libceph: use sizeof_footer() more Don't open-code sizeof_footer() in read_partial_message() and ceph_msg_revoke(). Also, after switching to sizeof_footer(), it's now possible to use con_out_kvec_add() in prepare_write_message_footer(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2016-03-25 18:51:53 +01:00
Yan, Zheng	2c63f49a72	libceph: add helper that duplicates last extent operation This helper duplicates last extent operation in OSD request, then adjusts the new extent operation's offset and length. The helper is for scattered page writeback, which adds nonconsecutive dirty pages to single OSD request. Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2016-03-25 18:51:43 +01:00
Ilya Dryomov	3f1af42ad0	libceph: enable large, variable-sized OSD requests Turn r_ops into a flexible array member to enable large, consisting of up to 16 ops, OSD requests. The use case is scattered writeback in cephfs and, as far as the kernel client is concerned, 16 is just a made up number. r_ops had size 3 for copyup+hint+write, but copyup is really a special case - it can only happen once. ceph_osd_request_cache is therefore stuffed with num_ops=2 requests, anything bigger than that is allocated with kmalloc(). req_mempool is backed by ceph_osd_request_cache, which means either num_ops=1 or num_ops=2 for use_mempool=true - all existing users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with that. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2016-03-25 18:51:43 +01:00
Ilya Dryomov	9e767adbd3	libceph: osdc->req_mempool should be backed by a slab pool ceph_osd_request_cache was introduced a long time ago. Also, osd_req is about to get a flexible array member, which ceph_osd_request_cache is going to be aware of. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2016-03-25 18:51:43 +01:00
Ilya Dryomov	ae458f5a17	libceph: make r_request msg_size calculation clearer Although msg_size is calculated correctly, the terms are grouped in a misleading way - snaps appears to not have room for a u32 length. Move calculation closer to its use and regroup terms. No functional change. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2016-03-25 18:51:42 +01:00
Yan, Zheng	7665d85b73	libceph: move r_reply_op_{len,result} into struct ceph_osd_req_op This avoids defining large array of r_reply_op_{len,result} in struct ceph_osd_request. Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2016-03-25 18:51:42 +01:00
Ilya Dryomov	de2aa102ea	libceph: rename ceph_osd_req_op::payload_len to indata_len Follow userspace nomenclature on this - the next commit adds outdata_len. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2016-03-25 18:51:41 +01:00
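
The first half of the bug described in 5745b0be05 above (ANDing a status field with a bit number rather than with the corresponding mask) can be sketched in isolation. This is a minimal illustration, not the openvswitch code: the `demo_*` names and the flag layout are hypothetical and only mirror the `*_BIT` / mask naming pattern that the commit message mentions.

```c
#include <stdbool.h>

/*
 * Hypothetical flag layout for illustration only (assumed, not copied
 * from any kernel header): each flag has a bit number and a mask that
 * is derived from it.
 */
enum demo_status {
	DEMO_EXPECTED_BIT   = 0,
	DEMO_EXPECTED       = (1 << DEMO_EXPECTED_BIT),
	DEMO_SEEN_REPLY_BIT = 1,
	DEMO_SEEN_REPLY     = (1 << DEMO_SEEN_REPLY_BIT),
};

/* Buggy: ANDs with the bit number (0), so the result is always false. */
bool demo_expected_buggy(unsigned long status)
{
	return (status & DEMO_EXPECTED_BIT) != 0;
}

/* Correct: ANDs with the mask derived from the bit number. */
bool demo_expected(unsigned long status)
{
	return (status & DEMO_EXPECTED) != 0;
}

/*
 * Buggy: bit number 1 happens to equal the mask of bit 0, so this
 * silently tests the "expected" flag instead of "seen reply".
 */
bool demo_seen_reply_buggy(unsigned long status)
{
	return (status & DEMO_SEEN_REPLY_BIT) != 0;
}

/* Correct counterpart of the function above. */
bool demo_seen_reply(unsigned long status)
{
	return (status & DEMO_SEEN_REPLY) != 0;
}
```

With a status word that has only the hypothetical "expected" flag set, `demo_expected_buggy()` reports false while `demo_expected()` reports true. The sketch covers only the bit-versus-mask mistake; the commit's second point, that the expected bit stays set for the lifetime of the connection, is a semantic issue that a mask fix alone would not address.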


41601 Commits