linux

Author	SHA1	Message	Date
Anilkumar Kolli	f8252e7b5a	mac80211: implement ieee80211_tx_rate_update to update rate Current mac80211 has provision to update tx status through ieee80211_tx_status() and ieee80211_tx_status_ext(). But drivers like ath10k updates the tx status from the skb except txrate, txrate will be updated from a different path, peer stats. Using ieee80211_tx_status_ext() in two different paths (one for the stats, one for the tx rate) would duplicate the stats instead. To avoid this stats duplication, ieee80211_tx_rate_update() is implemented. Signed-off-by: Anilkumar Kolli <akolli@codeaurora.org> [minor commit message editing, use initializers in code] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-12 13:05:40 +02:00
Ankita Bajaj	0d4e14a32d	nl80211: Add per peer statistics to compute FCS error rate Add support for drivers to report the total number of MPDUs received and the number of MPDUs received with an FCS error from a specific peer. These counters will be incremented only when the TA of the frame matches the MAC address of the peer irrespective of FCS error. It should be noted that the TA field in the frame might be corrupted when there is an FCS error and TA matching logic would fail in such cases. Hence, FCS error counter might not be fully accurate, but it can provide help in detecting bad RX links in significant number of cases. This FCS error counter without full accuracy can be used, e.g., to trigger a kick-out of a connected client with a bad link in AP mode to force such a client to roam to another AP. Signed-off-by: Ankita Bajaj <bankita@codeaurora.org> Signed-off-by: Jouni Malinen <jouni@codeaurora.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-12 12:56:34 +02:00
Pradeep Kumar Chitrapu	bc847970f4	mac80211: support FTM responder configuration/statistics New bss param ftm_responder is used to notify the driver to enable fine timing request (FTM) responder role in AP mode. Plumb the new cfg80211 API for FTM responder statistics through to the driver API in mac80211. Signed-off-by: David Spinadel <david.spinadel@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Pradeep Kumar Chitrapu <pradeepc@codeaurora.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-12 12:46:09 +02:00
Greg Kroah-Hartman	90ad18418c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net David writes: "Networking 1) RXRPC receive path fixes from David Howells. 2) Re-export __skb_recv_udp(), from Jiri Kosina. 3) Fix refcounting in u32 classificer, from Al Viro. 4) Userspace netlink ABI fixes from Eugene Syromiatnikov. 5) Don't double iounmap on rmmod in ena driver, from Arthur Kiyanovski. 6) Fix devlink string attribute handling, we must pull a copy into a kernel buffer if the lifetime extends past the netlink request. From Moshe Shemesh. 7) Fix hangs in RDS, from Ka-Cheong Poon. 8) Fix recursive locking lockdep warnings in tipc, from Ying Xue. 9) Clear RX irq correctly in socionext, from Ilias Apalodimas. 10) bcm_sf2 fixes from Florian Fainelli." * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (38 commits) net: dsa: bcm_sf2: Call setup during switch resume net: dsa: bcm_sf2: Fix unbind ordering net: phy: sfp: remove sfp_mutex's definition r8169: set RX_MULTI_EN bit in RxConfig for 8168F-family chips net: socionext: clear rx irq correctly net/mlx4_core: Fix warnings during boot on driverinit param set failures tipc: eliminate possible recursive locking detected by LOCKDEP selftests: udpgso_bench.sh explicitly requires bash selftests: rtnetlink.sh explicitly requires bash. qmi_wwan: Added support for Gemalto's Cinterion ALASxx WWAN interface tipc: queue socket protocol error messages into socket receive buffer tipc: set link tolerance correctly in broadcast link net: ipv4: don't let PMTU updates increase route MTU net: ipv4: update fnhe_pmtu when first hop's MTU changes net/ipv6: stop leaking percpu memory in fib6 info rds: RDS (tcp) hangs on sendto() to unresponding address net: make skb_partial_csum_set() more robust against overflows devlink: Add helper function for safely copy string param devlink: Fix param cmode driverinit for string type devlink: Fix param set handling for string type ...	2018-10-12 09:01:59 +02:00
Ying Xue	a1f8dd34e6	tipc: eliminate possible recursive locking detected by LOCKDEP When booting kernel with LOCKDEP option, below warning info was found: WARNING: possible recursive locking detected 4.19.0-rc7+ #14 Not tainted -------------------------------------------- swapper/0/1 is trying to acquire lock: 00000000dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh include/linux/spinlock.h:334 [inline] 00000000dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at: tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850 but task is already holding lock: 00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh include/linux/spinlock.h:334 [inline] 00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at: tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&list->lock)->rlock#4); lock(&(&list->lock)->rlock#4); * DEADLOCK * May be due to missing lock nesting notation 2 locks held by swapper/0/1: #0: 00000000f7539d34 (pernet_ops_rwsem){+.+.}, at: register_pernet_subsys+0x19/0x40 net/core/net_namespace.c:1051 #1: 00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh include/linux/spinlock.h:334 [inline] #1: 00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at: tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849 stack backtrace: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0-rc7+ #14 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1af/0x295 lib/dump_stack.c:113 print_deadlock_bug kernel/locking/lockdep.c:1759 [inline] check_deadlock kernel/locking/lockdep.c:1803 [inline] validate_chain kernel/locking/lockdep.c:2399 [inline] __lock_acquire+0xf1e/0x3c60 kernel/locking/lockdep.c:3411 lock_acquire+0x1db/0x520 kernel/locking/lockdep.c:3900 __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline] _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168 spin_lock_bh include/linux/spinlock.h:334 [inline] tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850 tipc_link_bc_create+0xb5/0x1f0 net/tipc/link.c:526 tipc_bcast_init+0x59b/0xab0 net/tipc/bcast.c:521 tipc_init_net+0x472/0x610 net/tipc/core.c:82 ops_init+0xf7/0x520 net/core/net_namespace.c:129 __register_pernet_operations net/core/net_namespace.c:940 [inline] register_pernet_operations+0x453/0xac0 net/core/net_namespace.c:1011 register_pernet_subsys+0x28/0x40 net/core/net_namespace.c:1052 tipc_init+0x83/0x104 net/tipc/core.c:140 do_one_initcall+0x109/0x70a init/main.c:885 do_initcall_level init/main.c:953 [inline] do_initcalls init/main.c:961 [inline] do_basic_setup init/main.c:979 [inline] kernel_init_freeable+0x4bd/0x57f init/main.c:1144 kernel_init+0x13/0x180 init/main.c:1063 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413 The reason why the noise above was complained by LOCKDEP is because we nested to hold l->wakeupq.lock and l->inputq->lock in tipc_link_reset function. In fact it's unnecessary to move skb buffer from l->wakeupq queue to l->inputq queue while holding the two locks at the same time. Instead, we can move skb buffers in l->wakeupq queue to a temporary list first and then move the buffers of the temporary list to l->inputq queue, which is also safe for us. Fixes: `3f32d0be6c` ("tipc: lock wakeup & inputq at tipc_link_reset()") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-11 10:23:48 -07:00
Greg Kroah-Hartman	834d3cd294	Fix open-coded multiplication arguments to allocators - Fixes several new open-coded multiplications added in the 4.19 merge window. -----BEGIN PGP SIGNATURE----- Comment: Kees Cook <kees@outflux.net> iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlu/fokWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsB/EACgKV77Sad5Luyr3rCmUtGcQ7az yLIrqvGcxC55ZEoZwHmSjxiN+5X2kDF6SEFrebvDKFSbiRoC0a1IWRC4pWTpBhTs +i1qHVTlOrwBZFTwOn2uklvgkkUfjatG/6zWc7l/Ye070Hekk0SnbMozlggCOJRm yKglXaBx9MKmj/T60Vpfve4ubBLM0zSuRPlsBON2qUUp2YTHbEqHOoYawfSK4RuF y2hzZc5A0/F7TionkHjrkdEJ8jRkwii2x4iM9KSdhNRxBT0lZkk3xpD6PjRaXCzt N2BMU17kftI5498QyKHXdTYCuVPqTpm+Z3d/q+YTbjdpXre1xcZU06ZT9Bqa+LwB pRaN4eqd7nLFKvCQYnUp0GuDj5pxd3Xz2dpC0IkaliEM8xYad1+NZRq7SkRJYOpM /y05GRdln9ULJF/pet5IS6LtXY+FSn4z+9e+ztVIPQ/kJUqvmyKfWPpdp6TPtwjC vb9cbKD7LRPoBfrY0efPXe4aixCwmc4Ob4kljCZtkyrpV+iImYQn9XqTblU7sbHa Om8FxGxdX7Xu9HUoT7uHeb8ZNg1g0/XWAEhs7pY22fzHT14T+0fYRz8njmlrw3ed dRdzydOxkJMcCVKLitoiw2X1yNRRHtGbXq/UhrHMNbEkOzf73/3fYZK68849FaEK 1oFOX/N/OI5kp7pNAQ== =NS8/ -----END PGP SIGNATURE----- Merge tag 'alloc-args-v4.19-rc8' of https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Kees writes: "Fix open-coded multiplication arguments to allocators - Fixes several new open-coded multiplications added in the 4.19 merge window." * tag 'alloc-args-v4.19-rc8' of https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: treewide: Replace more open-coded allocation size multiplications	2018-10-11 19:10:30 +02:00
Jouni Malinen	efb543e61c	mac80211: Extend SAE authentication in infra BSS STA mode Previous implementation of SAE authentication in infrastructure BSS was somewhat restricting and not exactly clean way of handling the two auth() operations. This ended up removing and re-adding the STA entry for the AP in the middle of authentication and also messing up authentication state tracking through the sequence of four Authentication frames. Furthermore, this did not work if the AP ended up sending out SAE Confirm (auth trans #2) immediately after SAE Commit (auth trans #1) before the station had time to transmit its SAE Confirm. Clean up authentication state handling for the SAE case to allow two rounds of auth() calls without dropping all state between those operations. Track peer Confirmed status and mark authentication completed only once both ends have confirmed. ieee80211_mgd_auth() check for EBUSY cases is now handling only the pending association (ifmgd->assoc_data) while all pending authentication (ifmgd->auth_data) cases are allowed to proceed to allow user space to start a new connection attempt from scratch even if the previously requested authentication is still waiting completion. This is needed to avoid making SAE error cases with retries take excessive amount of time with no means for the user space to stop that (apart from setting the netdev down). As an extra bonus, the end of ieee80211_rx_mgmt_auth() can be cleaned up to avoid the extra copy of the cfg80211_rx_mlme_mgmt() call for ongoing SAE authentication since the new ieee80211_mark_sta_auth() helper function can handle both completion of authentication and updates to the STA entry under the same condition and there is no need to return from the function between those operations. Signed-off-by: Jouni Malinen <jouni@codeaurora.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-11 16:01:08 +02:00
Jouni Malinen	8d7432a2f5	mac80211: Move ieee80211_mgd_auth() EBUSY check to be before allocation This makes it easier to conditionally replace full allocation of auth_data to use reallocation for the case of continuing SAE authentication. Furthermore, there was not really any point in having this check done so late in the function after having already completed number of steps that cannot be used anyway in the error case. Signed-off-by: Jouni Malinen <jouni@codeaurora.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-11 16:01:07 +02:00
Jouni Malinen	fc107a9330	mac80211: Helper function for marking STA authenticated Authentication exchange can be completed in both TX and RX paths for SAE, so move this common functionality into a helper function to avoid having to implement practically the same operations in two places when extending SAE implementation in the following commits. Signed-off-by: Jouni Malinen <jouni@codeaurora.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-11 16:01:05 +02:00
Felix Fietkau	506dbf90c1	mac80211: rc80211_minstrel: remove variance / stddev calculation When there are few packets (e.g. for sampling attempts), the exponentially weighted variance is usually vastly overestimated, making the resulting data essentially useless. As far as I know, there has not been any practical use for this, so let's not waste any cycles on it. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-11 16:01:05 +02:00
Felix Fietkau	f4ec7cb0f9	mac80211: minstrel: do not sample rates 3 times slower than max_prob_rate These rates are highly unlikely to be used quickly, even if the link deteriorates rapidly. This improves throughput in cases where CCK rates are not reliable enough to be skipped entirely during sampling. Sampling these rates regularly can cost a lot of airtime. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-11 16:01:04 +02:00
Felix Fietkau	972b66b86f	mac80211: minstrel: fix sampling/reporting of CCK rates in HT mode Long/short preamble selection cannot be sampled separately, since it depends on the BSS state. Because of that, sampling attempts to currently not used preamble modes are not counted in the statistics, which leads to CCK rates being sampled too often. Fix statistics accounting for long/short preamble by increasing the index where necessary. Fix excessive CCK rate sampling by dropping unsupported sample attempts. This improves throughput on 2.4 GHz channels Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-11 16:01:03 +02:00
Felix Fietkau	80df9be67c	mac80211: minstrel: fix CCK rate group streams value Fixes a harmless underflow issue when CCK rates are actively being used Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-11 16:01:03 +02:00
Felix Fietkau	37439f2d6e	mac80211: minstrel: fix using short preamble CCK rates on HT clients mi->supported[MINSTREL_CCK_GROUP] needs to be updated short preamble rates need to be marked as supported regardless of whether it's currently enabled. Its state can change at any time without a rate_update call. Fixes: `782dda00ab` ("mac80211: minstrel_ht: move short preamble check out of get_rate") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-11 16:01:02 +02:00
Felix Fietkau	202df504d7	mac80211: minstrel: reduce minstrel_mcs_groups size By storing a shift value for all duration values of a group, we can reduce precision by a neglegible amount to make it fit into a u16 value. This improves cache footprint and reduces size: Before: text data bss dec hex filename 10024 116 0 10140 279c rc80211_minstrel_ht.o After: text data bss dec hex filename 9368 116 0 9484 250c rc80211_minstrel_ht.o Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-11 16:01:02 +02:00
Felix Fietkau	b1c4f68337	mac80211: minstrel: merge with minstrel_ht, always enable VHT support Legacy-only devices are not very common and the overhead of the extra code for HT and VHT rates is not big enough to justify all those extra lines of code to make it optional. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-11 16:01:01 +02:00
Felix Fietkau	5b5e87314e	mac80211: minstrel: remove unnecessary debugfs cleanup code debugfs entries are cleaned up by debugfs_remove_recursive already. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-11 16:01:00 +02:00
Chaitanya T K	f458e832ba	mac80211: minstrel: Enable STBC and LDPC for VHT Rates If peer support reception of STBC and LDPC, enable them for better performance. Signed-off-by: Chaitanya TK <chaitanya.mgit@gmail.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-11 16:01:00 +02:00
Johannes Berg	42dca5ef24	mac80211: avoid reflecting frames back to the client I'm not really sure exactly _why_ I've been carrying a note for what's probably _years_ to check that we don't do this, but we clearly do reflect frames back to the station itself if it sends such. One way or the other, it's useless since the station doesn't really need the AP to talk to itself, so suppress it. While at it, clarify some of the logic by removing skb->data references in favour of the destination address (pointer) we already have separately. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-11 16:00:59 +02:00
Johannes Berg	3d7af87835	nl80211: use netlink policy validation function for elements Instead of open-coding a lot of calls to is_valid_ie_attr(), add this validation directly to the policy, now that we can. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-11 16:00:59 +02:00
Johannes Berg	ab0d76f682	nl80211: use policy range validation where applicable Many range checks can be done in the policy, move them there. A few in mesh are added in the code (taken out of the macros) because they don't fit into the s16 range in the policy validation. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-11 16:00:58 +02:00
Florian Westphal	9dffff200f	xfrm: policy: use hlist rcu variants on insert bydst table/list lookups use rcu, so insertions must use rcu versions. Fixes: `a7c44247f7` ("xfrm: policy: make xfrm_policy_lookup_bytype lockless") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2018-10-11 13:24:46 +02:00
Alexei Starovoitov	9f7e43da6a	net/xfrm: fix out-of-bounds packet access BUG: KASAN: slab-out-of-bounds in _decode_session6+0x1331/0x14e0 net/ipv6/xfrm6_policy.c:161 Read of size 1 at addr ffff8801d882eec7 by task syz-executor1/6667 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 print_address_description+0x6c/0x20b mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.7+0x242/0x30d mm/kasan/report.c:412 __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430 _decode_session6+0x1331/0x14e0 net/ipv6/xfrm6_policy.c:161 __xfrm_decode_session+0x71/0x140 net/xfrm/xfrm_policy.c:2299 xfrm_decode_session include/net/xfrm.h:1232 [inline] vti6_tnl_xmit+0x3c3/0x1bc1 net/ipv6/ip6_vti.c:542 __netdev_start_xmit include/linux/netdevice.h:4313 [inline] netdev_start_xmit include/linux/netdevice.h:4322 [inline] xmit_one net/core/dev.c:3217 [inline] dev_hard_start_xmit+0x272/0xc10 net/core/dev.c:3233 __dev_queue_xmit+0x2ab2/0x3870 net/core/dev.c:3803 dev_queue_xmit+0x17/0x20 net/core/dev.c:3836 Reported-by: syzbot+acffccec848dc13fe459@syzkaller.appspotmail.com Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2018-10-11 13:24:46 +02:00
Pablo Neira Ayuso	d701d81172	netfilter: nft_compat: do not dump private area Zero pad private area, otherwise we expose private kernel pointer to userspace. This patch also zeroes the tail area after the ->matchsize and ->targetsize that results from XT_ALIGN(). Fixes: `0ca743a559` ("netfilter: nf_tables: add compatibility layer for x_tables") Reported-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-10-11 11:29:53 +02:00
Taehee Yoo	18c0ab8736	netfilter: xt_TEE: add missing code to get interface index in checkentry. checkentry(tee_tg_check) should initialize priv->oif from dev if possible. But only netdevice notifier handler can set that. Hence priv->oif is always -1 until notifier handler is called. Fixes: `9e2f6c5d78` ("netfilter: Rework xt_TEE netdevice notifier") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-10-11 11:29:14 +02:00
Taehee Yoo	f24d2d4f95	netfilter: xt_TEE: fix wrong interface selection TEE netdevice notifier handler checks only interface name. however each netns can have same interface name. hence other netns's interface could be selected. test commands: %ip netns add vm1 %iptables -I INPUT -p icmp -j TEE --gateway 192.168.1.1 --oif enp2s0 %ip link set enp2s0 netns vm1 Above rule is in the root netns. but that rule could get enp2s0 ifindex of vm1 by notifier handler. After this patch, TEE rule is added to the per-netns list. Fixes: `9e2f6c5d78` ("netfilter: Rework xt_TEE netdevice notifier") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-10-11 11:29:14 +02:00
Fernando Fernandez Mancera	4a3e71b7b7	netfilter: nft_osf: usage from output path is not valid The nft_osf extension, like xt_osf, is not supported from the output path. Fixes: `b96af92d6e` ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf") Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-10-11 11:29:14 +02:00
Pablo Neira Ayuso	3b18d5eba4	netfilter: nft_set_rbtree: allow loose matching of closing element in interval Allow to find closest matching for the right side of an interval (end flag set on) so we allow lookups in inner ranges, eg. 10-20 in 5-25. Fixes: `ba0e4d9917` ("netfilter: nf_tables: get set elements via netlink") Reported-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-10-11 11:29:14 +02:00
Björn Töpel	cee271678d	xsk: do not call synchronize_net() under RCU read lock The XSKMAP update and delete functions called synchronize_net(), which can sleep. It is not allowed to sleep during an RCU read section. Instead we need to make sure that the sock sk_destruct (xsk_destruct) function is asynchronously called after an RCU grace period. Setting the SOCK_RCU_FREE flag for XDP sockets takes care of this. Fixes: `fbfc504a24` ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-10-11 10:19:01 +02:00
Parthasarathy Bhuvaragan	e7eb058238	tipc: queue socket protocol error messages into socket receive buffer In tipc_sk_filter_rcv(), when we detect protocol messages with error we call tipc_sk_conn_proto_rcv() and let it reset the connection and notify the socket by calling sk->sk_state_change(). However, tipc_sk_filter_rcv() may have been called from the function tipc_backlog_rcv(), in which case the socket lock is held and the socket already awake. This means that the sk_state_change() call is ignored and the error notification lost. Now the receive queue will remain empty and the socket sleeps forever. In this commit, we convert the protocol message into a connection abort message and enqueue it into the socket's receive queue. By this addition to the above state change we cover all conditions. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 22:56:07 -07:00
Jon Maloy	047491ea33	tipc: set link tolerance correctly in broadcast link In the patch referred to below we added link tolerance as an additional criteria for declaring broadcast transmission "stale" and resetting the affected links. However, the 'tolerance' field of the broadcast link is never set, and remains at zero. This renders the whole commit without the intended improving effect, but luckily also with no negative effect. In this commit we add the missing initialization. Fixes: `a4dc70d46c` ("tipc: extend link reset criteria for stale packet retransmission") Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 22:56:07 -07:00
Eric Dumazet	f98ebd47fd	net: sched: avoid writing on noop_qdisc While noop_qdisc.gso_skb and noop_qdisc.skb_bad_txq are not used in other places, it seems not correct to overwrite their fields in dev_init_scheduler_queue(). noop_qdisc is essentially a shared and read-only object, even if it is not marked as const because of some implementation detail. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 22:49:16 -07:00
David Ahern	d8a66aa254	net/mpls: Implement handler for strict data checking on dumps Without CONFIG_INET enabled compiles fail with: net/mpls/af_mpls.o: In function `mpls_dump_routes': af_mpls.c:(.text+0xed0): undefined reference to `ip_valid_fib_dump_req' The preference is for MPLS to use the same handler as ipv4 and ipv6 to allow consistency when doing a dump for AF_UNSPEC which walks all address families invoking the route dump handler. If INET is disabled then fallback to an MPLS version which can be tighter on the data checks. Fixes: `e8ba330ac0` ("rtnetlink: Update fib dumps for strict data checking") Reported-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 22:46:17 -07:00
Sabrina Dubroca	28d35bcdd3	net: ipv4: don't let PMTU updates increase route MTU When an MTU update with PMTU smaller than net.ipv4.route.min_pmtu is received, we must clamp its value. However, we can receive a PMTU exception with PMTU < old_mtu < ip_rt_min_pmtu, which would lead to an increase in PMTU. To fix this, take the smallest of the old MTU and ip_rt_min_pmtu. Before this patch, in case of an update, the exception's MTU would always change. Now, an exception can have only its lock flag updated, but not the MTU, so we need to add a check on locking to the following "is this exception getting updated, or close to expiring?" test. Fixes: `d52e5a7e7c` ("ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 22:44:46 -07:00
Sabrina Dubroca	af7d6cce53	net: ipv4: update fnhe_pmtu when first hop's MTU changes Since commit `5aad1de5ea` ("ipv4: use separate genid for next hop exceptions"), exceptions get deprecated separately from cached routes. In particular, administrative changes don't clear PMTU anymore. As Stefano described in commit `e9fa1495d7` ("ipv6: Reflect MTU changes on PMTU of exceptions for MTU-less routes"), the PMTU discovered before the local MTU change can become stale: - if the local MTU is now lower than the PMTU, that PMTU is now incorrect - if the local MTU was the lowest value in the path, and is increased, we might discover a higher PMTU Similarly to what commit `e9fa1495d7` did for IPv6, update PMTU in those cases. If the exception was locked, the discovered PMTU was smaller than the minimal accepted PMTU. In that case, if the new local MTU is smaller than the current PMTU, let PMTU discovery figure out if locking of the exception is still needed. To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU notifier. By the time the notifier is called, dev->mtu has been changed. This patch adds the old MTU as additional information in the notifier structure, and a new call_netdevice_notifiers_u32() function. Fixes: `5aad1de5ea` ("ipv4: use separate genid for next hop exceptions") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 22:44:46 -07:00
Mike Rapoport	7abab7b9b4	net/ipv6: stop leaking percpu memory in fib6 info The fib6_info_alloc() function allocates percpu memory to hold per CPU pointers to rt6_info, but this memory is never freed. Fix it. Fixes: `a64efe142f` ("net/ipv6: introduce fib6_info struct and helpers") Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 22:42:54 -07:00
David S. Miller	49b538e79b	RxRPC fixes -----BEGIN PGP SIGNATURE----- iQIVAwUAW7vWy/u3V2unywtrAQJkeA//UTmUmv9bL8k38XvnnHQ82iTyIjvZ40md 9XbNagwjkxuDv2eqW1rTxk5+Po596Nb4s/cBvpyAABjCA+zoT0/fklK5PqsSKFla ibAHlrBiv/1FlSUSTf4dvAfVVszRAItCP4OhKTU+0fkX3Hq75T7Lty1h92Yk/ijx ccqSV9K/nB4WN9bbg+sDngu3RAVR3VJ5RhVSmrnE9xd3M9ihz2RHacJ6GWNQf5PV j8cKuP3qY/2K245hb4sGn1i9OO6/WVVQhnojYA+jR8Vg4+8mY0Y++MLsW8ywlsuo xlrDCVPBaR6YqjnlDoOXIXYZ8EbEsmy5FP51zfEZJqeKJcVPLavQyoku0xJ8fHgr oK/3DBUkwgi0hiq4dC1CrmXZRSbBQqGTEJMA6AKh2ApXb9BkmhmqRGMx+2nqocwF TjAQmCu0FJTAYdwJuqzv5SRFjocjrHGcYDpQ27CgNqSiiDlh8CZ1ej/mbV+o8pUQ tOrpnRCLY8Nl55AgJbl2260mKcpcKQcRUwggKAOzqRjXwcy9q3KKm8H149qejJSl AQihDkPVqBZrv5ODKR0z88e5cVF9ad+vBPd+2TrRVGtTWhm/q8N/0Hh0M2sYDSc6 vLklYfwI4U5U1/naJC9WCkW+ybj5ekVNHgW91ICvQVNG0TH2zfE7sUGMQn6lRZL8 jfpLTVbtKfI= =q6E6 -----END PGP SIGNATURE----- Merge tag 'rxrpc-fixes-20181008' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Fix packet reception code Here are a set of patches that prepares for and fix problems in rxrpc's package reception code. There serious problems are: (A) There's a window between binding the socket and setting the data_ready hook in which packets can find their way into the UDP socket's receive queues. (B) The skb_recv_udp() will return an error (and clear the error state) if there was an error on the Tx side. rxrpc doesn't handle this. (C) The rxrpc data_ready handler doesn't fully drain the UDP receive queue. (D) The rxrpc data_ready handler assumes it is called in a non-reentrant state. The second patch fixes (A) - (C); the third patch renders (B) and (C) non-issues by using the recap_rcv hook instead of data_ready - and the final patch fixes (D). That last is the most complex. The preparatory patches are: (1) Fix some places that are doing things in the wrong net namespace. (2) Stop taking the rcu read lock as it's held by the IP input routine in the call chain. (3) Only end the Tx phase if we rotated the final packet out of the Tx buffer. (4) Don't assume that the call state won't change after dropping the call_state lock. (5) Only take receive window and MTU suze parameters from an ACK packet if it's the latest ACK packet. (6) Record connection-level abort information correctly. (7) Fix a trace line. And then there are three main patches - note that these are mixed in with the preparatory patches somewhat: (1) Fix the setup window (A), skb_recv_udp() error check (B) and packet drainage (C). (2) Switch to using the encap_rcv instead of data_ready to cut out the effects of the UDP read queues and get the packets delivered directly. (3) Add more locking into the various packet input paths to defend against re-entrance (D). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 22:27:38 -07:00
Yuchung Cheng	ffd177dea5	tcp: refactor DCTCP ECN ACK handling DCTCP has two parts - a new ECN signalling mechanism and the response function to it. The first part can be used by other congestion control for DCTCP-ECN deployed networks. This patch moves that part into a separate tcp_dctcp.h to be used by other congestion control module (like how Yeah uses Vegas algorithmas). For example, BBR is experimenting such ECN signal currently https://tinyurl.com/ietf-102-iccrg-bbr2 Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Yousuk Seung <ysseung@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 22:26:00 -07:00
David Ahern	ed792e28c4	net/ipv6: Make ipv6_route_table_template static ipv6_route_table_template is exported but there are no users outside of route.c. Make it static. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 22:25:10 -07:00
David Ahern	e75fa0735c	rtnetlink: Update comment in rtnl_stats_dump regarding strict data checking The NLM_F_DUMP_PROPER_HDR netlink flag was replaced by a setsockopt. Update the comment in rtnl_stats_dump. Fixes: `841891ec0c` ("rtnetlink: Update rtnl_stats_dump for strict data checking") Reported-by: Christian Brauner <christian@brauner.io> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 22:24:51 -07:00
David Ahern	4565d7e5a3	rtnetlink: Move ifm in valid_fdb_dump_legacy to closer to use Move setting of local variable ifm to after the message parsing in valid_fdb_dump_legacy. Avoid potential future use of unchecked variable. Fixes: `8dfbda19a2` ("rtnetlink: Move input checking for rtnl_fdb_dump to helper") Reported-by: Christian Brauner <christian@brauner.io> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 22:24:33 -07:00
Ka-Cheong Poon	9a4890bd6d	rds: RDS (tcp) hangs on sendto() to unresponding address In rds_send_mprds_hash(), if the calculated hash value is non-zero and the MPRDS connections are not yet up, it will wait. But it should not wait if the send is non-blocking. In this case, it should just use the base c_path for sending the message. Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 22:19:52 -07:00
Eric Dumazet	52b5d6f5dc	net: make skb_partial_csum_set() more robust against overflows syzbot managed to crash in skb_checksum_help() [1] : BUG_ON(offset + sizeof(__sum16) > skb_headlen(skb)); Root cause is the following check in skb_partial_csum_set() if (unlikely(start > skb_headlen(skb)) \|\| unlikely((int)start + off > skb_headlen(skb) - 2)) return false; If skb_headlen(skb) is 1, then (skb_headlen(skb) - 2) becomes 0xffffffff and the check fails to detect that ((int)start + off) is off the limit, since the compare is unsigned. When we fix that, then the first condition (start > skb_headlen(skb)) becomes obsolete. Then we should also check that (skb_headroom(skb) + start) wont overflow 16bit field. [1] kernel BUG at net/core/dev.c:2880! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 7330 Comm: syz-executor4 Not tainted 4.19.0-rc6+ #253 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:skb_checksum_help+0x9e3/0xbb0 net/core/dev.c:2880 Code: 85 00 ff ff ff 48 c1 e8 03 42 80 3c 28 00 0f 84 09 fb ff ff 48 8b bd 00 ff ff ff e8 97 a8 b9 fb e9 f8 fa ff ff e8 2d 09 76 fb <0f> 0b 48 8b bd 28 ff ff ff e8 1f a8 b9 fb e9 b1 f6 ff ff 48 89 cf RSP: 0018:ffff8801d83a6f60 EFLAGS: 00010293 RAX: ffff8801b9834380 RBX: ffff8801b9f8d8c0 RCX: ffffffff8608c6d7 RDX: 0000000000000000 RSI: ffffffff8608cc63 RDI: 0000000000000006 RBP: ffff8801d83a7068 R08: ffff8801b9834380 R09: 0000000000000000 R10: ffff8801d83a76d8 R11: 0000000000000000 R12: 0000000000000001 R13: 0000000000010001 R14: 000000000000ffff R15: 00000000000000a8 FS: 00007f1a66db5700(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7d77f091b0 CR3: 00000001ba252000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: skb_csum_hwoffload_help+0x8f/0xe0 net/core/dev.c:3269 validate_xmit_skb+0xa2a/0xf30 net/core/dev.c:3312 __dev_queue_xmit+0xc2f/0x3950 net/core/dev.c:3797 dev_queue_xmit+0x17/0x20 net/core/dev.c:3838 packet_snd net/packet/af_packet.c:2928 [inline] packet_sendmsg+0x422d/0x64c0 net/packet/af_packet.c:2953 Fixes: `5ff8dda303` ("net: Ensure partial checksum offset is inside the skb head") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 10:21:31 -07:00
Moshe Shemesh	bde74ad10e	devlink: Add helper function for safely copy string param Devlink string param buffer is allocated at the size of DEVLINK_PARAM_MAX_STRING_VALUE. Add helper function which makes sure this size is not exceeded. Renamed DEVLINK_PARAM_MAX_STRING_VALUE to __DEVLINK_PARAM_MAX_STRING_VALUE to emphasize that it should be used by devlink only. The driver should use the helper function instead to verify it doesn't exceed the allowed length. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 10:19:10 -07:00
Moshe Shemesh	1276534c98	devlink: Fix param cmode driverinit for string type Driverinit configuration mode value is held by devlink to enable the driver fetch the value after reload command. In case the param type is string devlink should copy the value from driver string buffer to devlink string buffer on devlink_param_driverinit_value_set() and vice-versa on devlink_param_driverinit_value_get(). Fixes: `ec01aeb180` ("devlink: Add support for get/set driverinit value") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 10:19:10 -07:00
Moshe Shemesh	f355cfcdb2	devlink: Fix param set handling for string type In case devlink param type is string, it needs to copy the string value it got from the input to devlink_param_value. Fixes: `e3b7ca18ad` ("devlink: Add param set command") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-10 10:19:10 -07:00
Johannes Berg	b802a5d6f3	lib80211: don't use skcipher Using skcipher just makes the code longer, and mac80211 also "open-codes" the WEP encrypt/decrypt. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-10 14:44:16 +02:00
Jesper Dangaard Brouer	2972495699	net: fix generic XDP to handle if eth header was mangled XDP can modify (and resize) the Ethernet header in the packet. There is a bug in generic-XDP, because skb->protocol and skb->pkt_type are setup before reaching (netif_receive_)generic_xdp. This bug was hit when XDP were popping VLAN headers (changing eth->h_proto), as skb->protocol still contains VLAN-indication (ETH_P_8021Q) causing invocation of skb_vlan_untag(skb), which corrupt the packet (basically popping the VLAN again). This patch catch if XDP changed eth header in such a way, that SKB fields needs to be updated. V2: on request from Song Liu, use ETH_HLEN instead of mac_len, in __skb_push() as eth_type_trans() use ETH_HLEN in paired skb_pull_inline(). Fixes: `d445516966` ("net: xdp: support xdp generic on virtual devices") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-10-09 21:59:09 -07:00
Dominique Martinet	fb488fc1f2	9p/trans_fd: put worker reqs on destroy p9_read_work/p9_write_work might still hold references to a req after having been cancelled; make sure we put any of these to avoid potential request leak on disconnect. Fixes: `728356dede` ("9p: Add refcount to p9_req_t") Link: http://lkml.kernel.org/r/1539057956-23741-2-git-send-email-asmadeus@codewreck.org Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Reviewed-by: Tomas Bortoli <tomasbortoli@gmail.com>	2018-10-10 09:14:34 +09:00
Dominique Martinet	e4ca13f7d0	9p/trans_fd: abort p9_read_work if req status changed p9_read_work would try to handle an errored req even if it got put to error state by another thread between the lookup (that worked) and the time it had been fully read. The request itself is safe to use because we hold a ref to it from the lookup (for m->rreq, so it was safe to read into the request data buffer until this point), but the req_list has been deleted at the same time status changed, and client_cb already has been called as well, so we should not do either. Link: http://lkml.kernel.org/r/1539057956-23741-1-git-send-email-asmadeus@codewreck.org Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr> Reported-by: syzbot+2222c34dc40b515f30dc@syzkaller.appspotmail.com Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net>	2018-10-10 09:13:30 +09:00
Dan Carpenter	72ea032108	9p: potential NULL dereference p9_tag_alloc() is supposed to return error pointers, but we accidentally return a NULL here. It would cause a NULL dereference in the caller. Link: http://lkml.kernel.org/m/20180926103934.GA14535@mwanda Fixes: `996d5b4db4` ("9p: Use a slab for allocating requests") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>	2018-10-10 09:13:20 +09:00
David S. Miller	071a234ad7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2018-10-08 The following pull-request contains BPF updates for your net-next tree. The main changes are: 1) sk_lookup_[tcp\|udp] and sk_release helpers from Joe Stringer which allow BPF programs to perform lookups for sockets in a network namespace. This would allow programs to determine early on in processing whether the stack is expecting to receive the packet, and perform some action (eg drop, forward somewhere) based on this information. 2) per-cpu cgroup local storage from Roman Gushchin. Per-cpu cgroup local storage is very similar to simple cgroup storage except all the data is per-cpu. The main goal of per-cpu variant is to implement super fast counters (e.g. packet counters), which don't require neither lookups, neither atomic operations in a fast path. The example of these hybrid counters is in selftests/bpf/netcnt_prog.c 3) allow HW offload of programs with BPF-to-BPF function calls from Quentin Monnet 4) support more than 64-byte key/value in HW offloaded BPF maps from Jakub Kicinski 5) rename of libbpf interfaces from Andrey Ignatov. libbpf is maturing as a library and should follow good practices in library design and implementation to play well with other libraries. This patch set brings consistent naming convention to global symbols. 6) relicense libbpf as LGPL-2.1 OR BSD-2-Clause from Alexei Starovoitov to let Apache2 projects use libbpf 7) various AF_XDP fixes from Björn and Magnus ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 23:42:44 -07:00
David S. Miller	9000a457a0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree: 1) Support for matching on ipsec policy already set in the route, from Florian Westphal. 2) Split set destruction into deactivate and destroy phase to make it fit better into the transaction infrastructure, also from Florian. This includes a patch to warn on imbalance when setting the new activate and deactivate interfaces. 3) Release transaction list from the workqueue to remove expensive synchronize_rcu() from configuration plane path. This speeds up configuration plane quite a bit. From Florian Westphal. 4) Add new xfrm/ipsec extension, this new extension allows you to match for ipsec tunnel keys such as source and destination address, spi and reqid. From Máté Eckl and Florian Westphal. 5) Add secmark support, this includes connsecmark too, patches from Christian Gottsche. 6) Allow to specify remaining bytes in xt_quota, from Chenbo Feng. One follow up patch to calm a clang warning for this one, from Nathan Chancellor. 7) Flush conntrack entries based on layer 3 family, from Kristian Evensen. 8) New revision for cgroups2 to shrink the path field. 9) Get rid of obsolete need_conntrack(), as a result from recent demodularization works. 10) Use WARN_ON instead of BUG_ON, from Florian Westphal. 11) Unused exported symbol in nf_nat_ipv4_fn(), from Florian. 12) Remove superfluous check for timeout netlink parser and dump functions in layer 4 conntrack helpers. 13) Unnecessary redundant rcu read side locks in NAT redirect, from Taehee Yoo. 14) Pass nf_hook_state structure to error handlers, patch from Florian Westphal. 15) Remove ->new() interface from layer 4 protocol trackers. Place them in the ->packet() interface. From Florian. 16) Place conntrack ->error() handling in the ->packet() interface. Patches from Florian Westphal. 17) Remove unused parameter in the pernet initialization path, also from Florian. 18) Remove additional parameter to specify layer 3 protocol when looking up for protocol tracker. From Florian. 19) Shrink array of layer 4 protocol trackers, from Florian. 20) Check for linear skb only once from the ALG NAT mangling codebase, from Taehee Yoo. 21) Use rhashtable_walk_enter() instead of deprecated rhashtable_walk_init(), also from Taehee. 22) No need to flush all conntracks when only one single address is gone, from Tan Hu. 23) Remove redundant check for NAT flags in flowtable code, from Taehee Yoo. 24) Use rhashtable_lookup() instead of rhashtable_lookup_fast() from netfilter codebase, since rcu read lock side is already assumed in this path. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 21:28:55 -07:00
Arnd Bergmann	df3f94a0bb	bpf: fix building without CONFIG_INET The newly added TCP and UDP handling fails to link when CONFIG_INET is disabled: net/core/filter.o: In function `sk_lookup': filter.c:(.text+0x7ff8): undefined reference to `tcp_hashinfo' filter.c:(.text+0x7ffc): undefined reference to `tcp_hashinfo' filter.c:(.text+0x8020): undefined reference to `__inet_lookup_established' filter.c:(.text+0x8058): undefined reference to `__inet_lookup_listener' filter.c:(.text+0x8068): undefined reference to `udp_table' filter.c:(.text+0x8070): undefined reference to `udp_table' filter.c:(.text+0x808c): undefined reference to `__udp4_lib_lookup' net/core/filter.o: In function `bpf_sk_release': filter.c:(.text+0x82e8): undefined reference to `sock_gen_put' Wrap the related sections of code in #ifdefs for the config option. Furthermore, sk_lookup() should always have been marked 'static', this also avoids a warning about a missing prototype when building with 'make W=1'. Fixes: `6acc9b432e` ("bpf: Add helper to retrieve socket in BPF") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Joe Stringer <joe@wand.net.nz> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-10-09 00:49:44 +02:00
Nathan Chancellor	ffa0a9a590	netfilter: xt_quota: Don't use aligned attribute in sizeof Clang warns: net/netfilter/xt_quota.c:47:44: warning: 'aligned' attribute ignored when parsing type [-Wignored-attributes] BUILD_BUG_ON(sizeof(atomic64_t) != sizeof(__aligned_u64)); ^~~~~~~~~~~~~ Use 'sizeof(__u64)' instead, as the alignment doesn't affect the size of the type. Fixes: `e9837e55b0` ("netfilter: xt_quota: fix the behavior of xt_quota module") Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-10-09 00:19:25 +02:00
David Howells	c1e15b4944	rxrpc: Fix the packet reception routine The rxrpc_input_packet() function and its call tree was built around the assumption that data_ready() handler called from UDP to inform a kernel service that there is data to be had was non-reentrant. This means that certain locking could be dispensed with. This, however, turns out not to be the case with a multi-queue network card that can deliver packets to multiple cpus simultaneously. Each of those cpus can be in the rxrpc_input_packet() function at the same time. Fix by adding or changing some structure members: (1) Add peer->rtt_input_lock to serialise access to the RTT buffer. (2) Make conn->service_id into a 32-bit variable so that it can be cmpxchg'd on all arches. (3) Add call->input_lock to serialise access to the Rx/Tx state. Note that although the Rx and Tx states are (almost) entirely separate, there's no point completing the separation and having separate locks since it's a bi-phasal RPC protocol rather than a bi-direction streaming protocol. Data transmission and data reception do not take place simultaneously on any particular call. and making the following functional changes: (1) In rxrpc_input_data(), hold call->input_lock around the core to prevent simultaneous producing of packets into the Rx ring and updating of tracking state for a particular call. (2) In rxrpc_input_ping_response(), only read call->ping_serial once, and check it before checking RXRPC_CALL_PINGING as that's a cheaper test. The bit test and bit clear can then be combined. No further locking is needed here. (3) In rxrpc_input_ack(), take call->input_lock after we've parsed much of the ACK packet. The superseded ACK check is then done both before and after the lock is taken. The handing of ackinfo data is split, parsing before the lock is taken and processing with it held. This is keyed on rxMTU being non-zero. Congestion management is also done within the locked section. (4) In rxrpc_input_ackall(), take call->input_lock around the Tx window rotation. The ACKALL packet carries no information and is only really useful after all packets have been transmitted since it's imprecise. (5) In rxrpc_input_implicit_end_call(), we use rx->incoming_lock to prevent calls being simultaneously implicitly ended on two cpus and also to prevent any races with incoming call setup. (6) In rxrpc_input_packet(), use cmpxchg() to effect the service upgrade on a connection. It is only permitted to happen once for a connection. (7) In rxrpc_new_incoming_call(), we have to recheck the routing inside rx->incoming_lock to see if someone else set up the call, connection or peer whilst we were getting there. We can't trust the values from the earlier routing check unless we pin refs on them - which we want to avoid. Further, we need to allow for an incoming call to have its state changed on another CPU between us making it live and us adjusting it because the conn is now in the RXRPC_CONN_SERVICE state. (8) In rxrpc_peer_add_rtt(), take peer->rtt_input_lock around the access to the RTT buffer. Don't need to lock around setting peer->rtt. For reference, the inventory of state-accessing or state-altering functions used by the packet input procedure is: > rxrpc_input_packet() * PACKET CHECKING * ROUTING > rxrpc_post_packet_to_local() > rxrpc_find_connection_rcu() - uses RCU > rxrpc_lookup_peer_rcu() - uses RCU > rxrpc_find_service_conn_rcu() - uses RCU > idr_find() - uses RCU * CONNECTION-LEVEL PROCESSING - Service upgrade - Can only happen once per conn ! Changed to use cmpxchg > rxrpc_post_packet_to_conn() - Setting conn->hi_serial - Probably safe not using locks - Maybe use cmpxchg * CALL-LEVEL PROCESSING > Old-call checking > rxrpc_input_implicit_end_call() > rxrpc_call_completed() > rxrpc_queue_call() ! Need to take rx->incoming_lock > __rxrpc_disconnect_call() > rxrpc_notify_socket() > rxrpc_new_incoming_call() - Uses rx->incoming_lock for the entire process - Might be able to drop this earlier in favour of the call lock > rxrpc_incoming_call() ! Conflicts with rxrpc_input_implicit_end_call() > rxrpc_send_ping() - Don't need locks to check rtt state > rxrpc_propose_ACK * PACKET DISTRIBUTION > rxrpc_input_call_packet() > rxrpc_input_data() * QUEUE DATA PACKET ON CALL > rxrpc_reduce_call_timer() - Uses timer_reduce() ! Needs call->input_lock() > rxrpc_receiving_reply() ! Needs locking around ack state > rxrpc_rotate_tx_window() > rxrpc_end_tx_phase() > rxrpc_proto_abort() > rxrpc_input_dup_data() - Fills the Rx buffer - rxrpc_propose_ACK() - rxrpc_notify_socket() > rxrpc_input_ack() * APPLY ACK PACKET TO CALL AND DISCARD PACKET > rxrpc_input_ping_response() - Probably doesn't need any extra locking ! Need READ_ONCE() on call->ping_serial > rxrpc_input_check_for_lost_ack() - Takes call->lock to consult Tx buffer > rxrpc_peer_add_rtt() ! Needs to take a lock (peer->rtt_input_lock) ! Could perhaps manage with cmpxchg() and xadd() instead > rxrpc_input_requested_ack - Consults Tx buffer ! Probably needs a lock > rxrpc_peer_add_rtt() > rxrpc_propose_ack() > rxrpc_input_ackinfo() - Changes call->tx_winsize ! Use cmpxchg to handle change ! Should perhaps track serial number - Uses peer->lock to record MTU specification changes > rxrpc_proto_abort() ! Need to take call->input_lock > rxrpc_rotate_tx_window() > rxrpc_end_tx_phase() > rxrpc_input_soft_acks() - Consults the Tx buffer > rxrpc_congestion_management() - Modifies the Tx annotations ! Needs call->input_lock() > rxrpc_queue_call() > rxrpc_input_abort() * APPLY ABORT PACKET TO CALL AND DISCARD PACKET > rxrpc_set_call_completion() > rxrpc_notify_socket() > rxrpc_input_ackall() * APPLY ACKALL PACKET TO CALL AND DISCARD PACKET ! Need to take call->input_lock > rxrpc_rotate_tx_window() > rxrpc_end_tx_phase() > rxrpc_reject_packet() There are some functions used by the above that queue the packet, after which the procedure is terminated: - rxrpc_post_packet_to_local() - local->event_queue is an sk_buff_head - local->processor is a work_struct - rxrpc_post_packet_to_conn() - conn->rx_queue is an sk_buff_head - conn->processor is a work_struct - rxrpc_reject_packet() - local->reject_queue is an sk_buff_head - local->processor is a work_struct And some that offload processing to process context: - rxrpc_notify_socket() - Uses RCU lock - Uses call->notify_lock to call call->notify_rx - Uses call->recvmsg_lock to queue recvmsg side - rxrpc_queue_call() - call->processor is a work_struct - rxrpc_propose_ACK() - Uses call->lock to wrap __rxrpc_propose_ACK() And a bunch that complete a call, all of which use call->state_lock to protect the call state: - rxrpc_call_completed() - rxrpc_set_call_completion() - rxrpc_abort_call() - rxrpc_proto_abort() - Also uses rxrpc_queue_call() Fixes: `17926a7932` ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-08 22:42:04 +01:00
David Howells	647530924f	rxrpc: Fix connection-level abort handling Fix connection-level abort handling to cache the abort and error codes properly so that a new incoming call can be properly aborted if it races with the parent connection being aborted by another CPU. The abort_code and error parameters can then be dropped from rxrpc_abort_calls(). Fixes: `f5c17aaeb2` ("rxrpc: Calls should only have one terminal state") Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-08 22:42:04 +01:00
David Howells	298bc15b20	rxrpc: Only take the rwind and mtu values from latest ACK Move the out-of-order and duplicate ACK packet check to before the call to rxrpc_input_ackinfo() so that the receive window size and MTU size are only checked in the latest ACK packet and don't regress. Fixes: `248f219cb8` ("rxrpc: Rewrite the data and ack handling code") Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-08 22:42:04 +01:00
David Ahern	8c6e137fbc	rtnetlink: Update rtnl_fdb_dump for strict data checking Update rtnl_fdb_dump for strict data checking. If the flag is set, the dump request is expected to have an ndmsg struct as the header potentially followed by one or more attributes. Any data passed in the header or as an attribute is taken as a request to influence the data returned. Only values supported by the dump handler are allowed to be non-0 or set in the request. At the moment only the NDA_IFINDEX and NDA_MASTER attributes are supported. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:05 -07:00
David Ahern	8dfbda19a2	rtnetlink: Move input checking for rtnl_fdb_dump to helper Move the existing input checking for rtnl_fdb_dump into a helper, valid_fdb_dump_legacy. This function will retain the current logic that works around the 2 headers that userspace has been allowed to send up to this point. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:05 -07:00
David Ahern	c77b93641e	net/bridge: Update br_mdb_dump for strict data checking Update br_mdb_dump for strict data checking. If the flag is set, the dump request is expected to have a br_port_msg struct as the header. All elements of the struct are expected to be 0 and no attributes can be appended. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:05 -07:00
David Ahern	addd383f5a	net: Update netconf dump handlers for strict data checking Update inet_netconf_dump_devconf, inet6_netconf_dump_devconf, and mpls_netconf_dump_devconf for strict data checking. If the flag is set, the dump request is expected to have an netconfmsg struct as the header. The struct only has the family member and no attributes can be appended. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:05 -07:00
David Ahern	f2ae64bb6b	net/ipv6: Update ip6addrlbl_dump for strict data checking Update ip6addrlbl_dump for strict data checking. If the flag is set, the dump request is expected to have an ifaddrlblmsg struct as the header. All elements of the struct are expected to be 0 and no attributes can be appended. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:05 -07:00
David Ahern	4a73e5e56d	net/fib_rules: Update fib_nl_dumprule for strict data checking Update fib_nl_dumprule for strict data checking. If the flag is set, the dump request is expected to have fib_rule_hdr struct as the header. All elements of the struct are expected to be 0 and no attributes can be appended. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:05 -07:00
David Ahern	f80f14c364	net/namespace: Update rtnl_net_dumpid for strict data checking Update rtnl_net_dumpid for strict data checking. If the flag is set, the dump request is expected to have an rtgenmsg struct as the header which has the family as the only element. No data may be appended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:05 -07:00
David Ahern	9632d47f6a	net/neighbor: Update neightbl_dump_info for strict data checking Update neightbl_dump_info for strict data checking. If the flag is set, the dump request is expected to have an ndtmsg struct as the header. All elements of the struct are expected to be 0 and no attributes can be appended. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:05 -07:00
David Ahern	51183d233b	net/neighbor: Update neigh_dump_info for strict data checking Update neigh_dump_info for strict data checking. If the flag is set, the dump request is expected to have an ndmsg struct as the header potentially followed by one or more attributes. Any data passed in the header or as an attribute is taken as a request to influence the data returned. Only values supported by the dump handler are allowed to be non-0 or set in the request. At the moment only the NDA_IFINDEX and NDA_MASTER attributes are supported. Existing code does not fail the dump if nlmsg_parse fails. That behavior is kept for non-strict checking. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:05 -07:00
David Ahern	e8ba330ac0	rtnetlink: Update fib dumps for strict data checking Add helper to check netlink message for route dumps. If the strict flag is set the dump request is expected to have an rtmsg struct as the header. All elements of the struct are expected to be 0 with the exception of rtm_flags (which is used by both ipv4 and ipv6 dumps) and no attributes can be appended. rtm_flags can only have RTM_F_CLONED and RTM_F_PREFIX set. Update inet_dump_fib, inet6_dump_fib, mpls_dump_routes, ipmr_rtm_dumproute, and ip6mr_rtm_dumproute to call this helper if strict data checking is enabled. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:05 -07:00
David Ahern	14fc5bb29f	rtnetlink: Update ipmr_rtm_dumplink for strict data checking Update ipmr_rtm_dumplink for strict data checking. If the flag is set, the dump request is expected to have an ifinfomsg struct as the header. All elements of the struct are expected to be 0 and no attributes can be appended. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:04 -07:00
David Ahern	786e0007e2	rtnetlink: Update inet6_dump_ifinfo for strict data checking Update inet6_dump_ifinfo for strict data checking. If the flag is set, the dump request is expected to have an ifinfomsg struct as the header. All elements of the struct are expected to be 0 and no attributes can be appended. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:04 -07:00
David Ahern	841891ec0c	rtnetlink: Update rtnl_stats_dump for strict data checking Update rtnl_stats_dump for strict data checking. If the flag is set, the dump request is expected to have an if_stats_msg struct as the header. All elements of the struct are expected to be 0 except filter_mask which must be non-0 (legacy behavior). No attributes are supported. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:04 -07:00
David Ahern	2d011be8c0	rtnetlink: Update rtnl_bridge_getlink for strict data checking Update rtnl_bridge_getlink for strict data checking. If the flag is set, the dump request is expected to have an ifinfomsg struct as the header potentially followed by one or more attributes. Any data passed in the header or as an attribute is taken as a request to influence the data returned. Only values supported by the dump handler are allowed to be non-0 or set in the request. At the moment only the IFLA_EXT_MASK attribute is supported. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:04 -07:00
David Ahern	905cf0abe8	rtnetlink: Update rtnl_dump_ifinfo for strict data checking Update rtnl_dump_ifinfo for strict data checking. If the flag is set, the dump request is expected to have an ifinfomsg struct as the header potentially followed by one or more attributes. Any data passed in the header or as an attribute is taken as a request to influence the data returned. Only values supported by the dump handler are allowed to be non-0 or set in the request. At the moment only the IFA_TARGET_NETNSID, IFLA_EXT_MASK, IFLA_MASTER, and IFLA_LINKINFO attributes are supported. Existing code does not fail the dump if nlmsg_parse fails. That behavior is kept for non-strict checking. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:04 -07:00
David Ahern	ed6eff1179	net/ipv6: Update inet6_dump_addr for strict data checking Update inet6_dump_addr for strict data checking. If the flag is set, the dump request is expected to have an ifaddrmsg struct as the header potentially followed by one or more attributes. Any data passed in the header or as an attribute is taken as a request to influence the data returned. Only values suppored by the dump handler are allowed to be non-0 or set in the request. At the moment only the IFA_TARGET_NETNSID attribute is supported. Follow on patches can add support for other fields (e.g., honor ifa_index and only return data for the given device index). Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:04 -07:00
David Ahern	c33078e3df	net/ipv4: Update inet_dump_ifaddr for strict data checking Update inet_dump_ifaddr for strict data checking. If the flag is set, the dump request is expected to have an ifaddrmsg struct as the header potentially followed by one or more attributes. Any data passed in the header or as an attribute is taken as a request to influence the data returned. Only values supported by the dump handler are allowed to be non-0 or set in the request. At the moment only the IFA_TARGET_NETNSID attribute is supported. Follow on patches can support for other fields (e.g., honor ifa_index and only return data for the given device index). Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:04 -07:00
David Ahern	89d35528d1	netlink: Add new socket option to enable strict checking on dumps Add a new socket option, NETLINK_DUMP_STRICT_CHK, that userspace can use via setsockopt to request strict checking of headers and attributes on dump requests. To get dump features such as kernel side filtering based on data in the header or attributes appended to the dump request, userspace must call setsockopt() for NETLINK_DUMP_STRICT_CHK and a non-zero value. Since the netlink sock and its flags are private to the af_netlink code, the strict checking flag is passed to dump handlers via a flag in the netlink_callback struct. For old userspace on new kernel there is no impact as all of the data checks in later patches are wrapped in a check on the new strict flag. For new userspace on old kernel, the setsockopt will fail and even if new userspace sets data in the headers and appended attributes the kernel will silently ignore it. Moving forward when the setsockopt succeeds, the new userspace on old kernel means the dump request can pass an attribute the kernel does not understand. The dump will then fail as the older kernel does not understand it. New userspace on new kernel setting the socket option gets the benefit of the improved data dump. Kernel side the NETLINK_DUMP_STRICT_CHK uapi is converted to a generic NETLINK_F_STRICT_CHK flag which can potentially be leveraged for tighter checking on the NEW, DEL, and SET commands. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:04 -07:00
David Ahern	6ba1e6e856	net/ipv6: Refactor address dump to push inet6_fill_args to in6_dump_addrs Pull the inet6_fill_args arg up to in6_dump_addrs and move netnsid into it. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:04 -07:00
David Ahern	dac9c9790e	net: Add extack to nlmsg_parse Make sure extack is passed to nlmsg_parse where easy to do so. Most of these are dump handlers and leveraging the extack in the netlink_callback. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:04 -07:00
David Ahern	4a19edb60d	netlink: Pass extack to dump handlers Declare extack in netlink_dump and pass to dump handlers via netlink_callback. Add any extack message after the dump_done_errno allowing error messages to be returned. This will be useful when strict checking is done on dump requests, returning why the dump fails EINVAL. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:39:03 -07:00
Al Viro	a030598690	net: sched: cls_u32: simplify the hell out u32_delete() emptiness check Now that we have the knode count, we can instantly check if any hnodes are non-empty. And that kills the check for extra references to root hnode - those could happen only if there was a knode to carry such a link. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:33:35 -07:00
Al Viro	b245d32c99	net: sched: cls_u32: keep track of knodes count in tc_u_common allows to simplify u32_delete() considerably Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:33:35 -07:00
Al Viro	8a8065f683	net: sched: cls_u32: get rid of tp_c Both hnode ->tp_c and tp_c argument of u32_set_parms() the latter is redundant, the former - never read... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:33:35 -07:00
Al Viro	db04ff4863	net: sched: cls_u32: the tp_c argument of u32_set_parms() is always tp->data It must be tc_u_common associated with that tp (i.e. tp->data). Proof: * both ->ht_up and ->tp_c are assign-once * ->tp_c of anything inserted into tp_c->hlist is tp_c * hnodes never get reinserted into the lists or moved between those, so anything found by u32_lookup_ht(tp->data, ...) will have ->tp_c equal to tp->data. * tp->root->tp_c == tp->data. * ->ht_up of anything inserted into hnode->ht[...] is equal to hnode. * knodes never get reinserted into hash chains or moved between those, so anything returned by u32_lookup_key(ht, ...) will have ->ht_up equal to ht. * any knode returned by u32_get(tp, ...) will have ->ht_up->tp_c point to tp->data Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:33:35 -07:00
Al Viro	18512f5c25	net: sched: cls_u32: pass tc_u_common to u32_set_parms() instead of tc_u_hnode the only thing we used ht for was ht->tp_c and callers can get that without going through ->tp_c at all; start with lifting that into the callers, next commits will massage those, eventually removing ->tp_c altogether. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:33:35 -07:00
Al Viro	4895c42f62	net: sched: cls_u32: clean tc_u_common hashtable * calculate key once, not for each hash chain element * let tc_u_hash() return the pointer to chain head rather than index - callers are cleaner that way. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:33:35 -07:00
Al Viro	07743ca5c9	net: sched: cls_u32: get rid of tc_u_common ->rcu unused Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:33:35 -07:00
Al Viro	ec17caf078	net: sched: cls_u32: get rid of tc_u_knode ->tp not used anymore Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:33:35 -07:00
Al Viro	dc07c57363	net: sched: cls_u32: get rid of unused argument of u32_destroy_key() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:33:35 -07:00
Al Viro	2f0c982df7	net: sched: cls_u32: make sure that divisor is a power of 2 Tested by modifying iproute2 to allow sending a divisor > 255 Tested-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:33:35 -07:00
Al Viro	27594ec4b6	net: sched: cls_u32: disallow linking to root hnode Operation makes no sense. Nothing will actually break if we do so (depth limit in u32_classify() will prevent infinite loops), but according to maintainers it's best prohibited outright. NOTE: doing so guarantees that u32_destroy() will trigger the call of u32_destroy_hnode(); we might want to make that unconditional. Test: tc qdisc add dev eth0 ingress tc filter add dev eth0 parent ffff: protocol ip prio 100 u32 \ link 800: offset at 0 mask 0f00 shift 6 plus 0 eat match ip protocol 6 ff should fail with Error: cls_u32: Not linking to root node Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:33:34 -07:00
Al Viro	b44ef84542	net: sched: cls_u32: mark root hnode explicitly ... and produce consistent error on attempt to delete such. Existing check in u32_delete() is inconsistent - after tc qdisc add dev eth0 ingress tc filter add dev eth0 parent ffff: protocol ip prio 100 handle 1: u32 \ divisor 1 tc filter add dev eth0 parent ffff: protocol ip prio 200 handle 2: u32 \ divisor 1 both tc filter delete dev eth0 parent ffff: protocol ip prio 100 handle 801: u32 and tc filter delete dev eth0 parent ffff: protocol ip prio 100 handle 800: u32 will fail (at least with refcounting fixes), but the former will complain about an attempt to remove a busy table, while the latter will recognize it as root and yield "Not allowed to delete root node" instead. The problem with the existing check is that several tcf_proto instances might share the same tp->data and handle-to-hnode lookup will be the same for all of them. So comparing an hnode to be deleted with tp->root won't catch the case when one tp is used to try deleting the root of another. Solution is trivial - mark the root hnodes explicitly upon allocation and check for that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-08 10:33:34 -07:00
David Howells	dfe9952246	rxrpc: Carry call state out of locked section in rxrpc_rotate_tx_window() Carry the call state out of the locked section in rxrpc_rotate_tx_window() rather than sampling it afterwards. This is only used to select tracepoint data, but could have changed by the time we do the tracepoint. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-08 15:57:36 +01:00
David Howells	c479d5f2c2	rxrpc: Don't check RXRPC_CALL_TX_LAST after calling rxrpc_rotate_tx_window() We should only call the function to end a call's Tx phase if we rotated the marked-last packet out of the transmission buffer. Make rxrpc_rotate_tx_window() return an indication of whether it just rotated the packet marked as the last out of the transmit buffer, carrying the information out of the locked section in that function. We can then check the return value instead of examining RXRPC_CALL_TX_LAST. Fixes: `70790dbe3f` ("rxrpc: Pass the last Tx packet marker in the annotation buffer") Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-08 15:55:20 +01:00
David Howells	bfd2821117	rxrpc: Don't need to take the RCU read lock in the packet receiver We don't need to take the RCU read lock in the rxrpc packet receive function because it's held further up the stack in the IP input routine around the UDP receive routines. Fix this by dropping the RCU read lock calls from rxrpc_input_packet(). This simplifies the code. Fixes: `70790dbe3f` ("rxrpc: Pass the last Tx packet marker in the annotation buffer") Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-08 15:45:56 +01:00
David Howells	5271953cad	rxrpc: Use the UDP encap_rcv hook Use the UDP encap_rcv hook to cut the bit out of the rxrpc packet reception in which a packet is placed onto the UDP receive queue and then immediately removed again by rxrpc. Going via the queue in this manner seems like it should be unnecessary. This does, however, require the invention of a value to place in encap_type as that's one of the conditions to switch packets out to the encap_rcv hook. Possibly the value doesn't actually matter for anything other than sockopts on the UDP socket, which aren't accessible outside of rxrpc anyway. This seems to cut a bit of time out of the time elapsed between each sk_buff being timestamped and turning up in rxrpc (the final number in the following trace excerpts). I measured this by making the rxrpc_rx_packet trace point print the time elapsed between the skb being timestamped and the current time (in ns), e.g.: ... 424.278721: rxrpc_rx_packet: ... ACK 25026 So doing a 512MiB DIO read from my test server, with an unmodified kernel: N min max sum mean stddev 27605 2626 7581 7.83992e+07 2840.04 181.029 and with the patch applied: N min max sum mean stddev 27547 1895 12165 6.77461e+07 2459.29 255.02 Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-08 15:45:18 +01:00
Björn Töpel	541d7fdd76	xsk: proper AF_XDP socket teardown ordering The AF_XDP socket struct can exist in three different, implicit states: setup, bound and released. Setup is prior the socket has been bound to a device. Bound is when the socket is active for receive and send. Released is when the process/userspace side of the socket is released, but the sock object is still lingering, e.g. when there is a reference to the socket in an XSKMAP after process termination. The Rx fast-path code uses the "dev" member of struct xdp_sock to check whether a socket is bound or relased, and the Tx code uses the struct xdp_umem "xsk_list" member in conjunction with "dev" to determine the state of a socket. However, the transition from bound to released did not tear the socket down in correct order. On the Rx side "dev" was cleared after synchronize_net() making the synchronization useless. On the Tx side, the internal queues were destroyed prior removing them from the "xsk_list". This commit corrects the cleanup order, and by doing so xdp_del_sk_umem() can be simplified and one synchronize_net() can be removed. Fixes: `965a990984` ("xsk: add support for bind for Rx") Fixes: `ac98d8aab6` ("xsk: wire upp Tx zero-copy functions") Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-10-08 10:09:22 +02:00
Johannes Berg	188de5dd80	Merge remote-tracking branch 'net-next/master' into mac80211-next Merge net-next, which pulled in net, so I can merge a few more patches that would otherwise conflict. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-08 09:48:36 +02:00
Li RongQing	f1193e9157	xfrm: use correct size to initialise sp->ovec This place should want to initialize array, not a element, so it should be sizeof(array) instead of sizeof(element) but now this array only has one element, so no error in this condition that XFRM_MAX_OFFLOAD_DEPTH is 1 Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2018-10-08 08:15:55 +02:00
Li RongQing	b7138fddd6	xfrm: remove unnecessary check in xfrmi_get_stats64 if tstats of a device is not allocated, this device is not registered correctly and can not be used. Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2018-10-08 08:15:55 +02:00
Al Viro	6d4c407744	net: sched: cls_u32: fix hnode refcounting cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references via ->hlist and via ->tp_root together. u32_destroy() drops the former and, in case when there had been links, leaves the sucker on the list. As the result, there's nothing to protect it from getting freed once links are dropped. That also makes the "is it busy" check incapable of catching the root hnode - it is busy (there's a reference from tp), but we don't see it as something separate. "Is it our root?" check partially covers that, but the problem exists for others' roots as well. AFAICS, the minimal fix preserving the existing behaviour (where it doesn't include oopsen, that is) would be this: * count tp->root and tp_c->hlist as separate references. I.e. have u32_init() set refcount to 2, not 1. * in u32_destroy() we always drop the former; in u32_destroy_hnode() - the latter. That way we have all references contributing to refcount. List removal happens in u32_destroy_hnode() (called only when ->refcnt is 1) an in u32_destroy() in case of tc_u_common going away, along with everything reachable from it. IOW, that way we know that u32_destroy_key() won't free something still on the list (or pointed to by someone's ->root). Reproducer: tc qdisc add dev eth0 ingress tc filter add dev eth0 parent ffff: protocol ip prio 100 handle 1: \ u32 divisor 1 tc filter add dev eth0 parent ffff: protocol ip prio 200 handle 2: \ u32 divisor 1 tc filter add dev eth0 parent ffff: protocol ip prio 100 \ handle 1:0:11 u32 ht 1: link 801: offset at 0 mask 0f00 shift 6 \ plus 0 eat match ip protocol 6 ff tc filter delete dev eth0 parent ffff: protocol ip prio 200 tc filter change dev eth0 parent ffff: protocol ip prio 100 \ handle 1:0:11 u32 ht 1: link 0: offset at 0 mask 0f00 shift 6 plus 0 \ eat match ip protocol 6 ff tc filter delete dev eth0 parent ffff: protocol ip prio 100 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-07 21:02:37 -07:00
Leslie Monis	ac4a02c5ab	net: sched: pie: fix coding style issues Fix 5 warnings and 14 checks issued by checkpatch.pl: CHECK: Logical continuations should be on the previous line + if ((q->vars.qdelay < q->params.target / 2) + && (q->vars.prob < MAX_PROB / 5)) WARNING: line over 80 characters + q->params.tupdate = usecs_to_jiffies(nla_get_u32(tb[TCA_PIE_TUPDATE])); CHECK: Blank lines aren't necessary after an open brace '{' +{ + CHECK: braces {} should be used on all arms of this statement + if (qlen < QUEUE_THRESHOLD) [...] + else { [...] CHECK: Unbalanced braces around else statement + else { CHECK: No space is necessary after a cast + if (delta > (s32) (MAX_PROB / (100 / 2)) && CHECK: Unnecessary parentheses around 'qdelay == 0' + if ((qdelay == 0) && (qdelay_old == 0) && update_prob) CHECK: Unnecessary parentheses around 'qdelay_old == 0' + if ((qdelay == 0) && (qdelay_old == 0) && update_prob) CHECK: Unnecessary parentheses around 'q->vars.prob == 0' + if ((q->vars.qdelay < q->params.target / 2) && + (q->vars.qdelay_old < q->params.target / 2) && + (q->vars.prob == 0) && + (q->vars.avg_dq_rate > 0)) CHECK: Unnecessary parentheses around 'q->vars.avg_dq_rate > 0' + if ((q->vars.qdelay < q->params.target / 2) && + (q->vars.qdelay_old < q->params.target / 2) && + (q->vars.prob == 0) && + (q->vars.avg_dq_rate > 0)) CHECK: Blank lines aren't necessary before a close brace '}' + +} CHECK: Comparison to NULL could be written "!opts" + if (opts == NULL) CHECK: No space is necessary after a cast + ((u32) PSCHED_TICKS2NS(q->params.target)) / WARNING: line over 80 characters + nla_put_u32(skb, TCA_PIE_TUPDATE, jiffies_to_usecs(q->params.tupdate)) \|\| CHECK: Blank lines aren't necessary before a close brace '}' + +} CHECK: No space is necessary after a cast + .delay = ((u32) PSCHED_TICKS2NS(q->vars.qdelay)) / WARNING: Missing a blank line after declarations + struct sk_buff skb; + skb = qdisc_dequeue_head(sch); WARNING: Missing a blank line after declarations + struct pie_sched_data q = qdisc_priv(sch); + qdisc_reset_queue(sch); WARNING: Missing a blank line after declarations + struct pie_sched_data *q = qdisc_priv(sch); + q->params.tupdate = 0; Signed-off-by: Leslie Monis <lesliemonis@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-07 20:39:01 -07:00
Jiri Kosina	7e823644b6	udp: Unbreak modules that rely on external __skb_recv_udp() availability Commit `2276f58ac5` ("udp: use a separate rx queue for packet reception") turned static inline __skb_recv_udp() from being a trivial helper around __skb_recv_datagram() into a UDP specific implementaion, making it EXPORT_SYMBOL_GPL() at the same time. There are external modules that got broken by __skb_recv_udp() not being visible to them. Let's unbreak them by making __skb_recv_udp EXPORT_SYMBOL(). Rationale (one of those) why this is actually "technically correct" thing to do: __skb_recv_udp() used to be an inline wrapper around __skb_recv_datagram(), which itself (still, and correctly so, I believe) is EXPORT_SYMBOL(). Cc: Paolo Abeni <pabeni@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Fixes: `2276f58ac5` ("udp: use a separate rx queue for packet reception") Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-07 20:33:10 -07:00
David S. Miller	72438f8cef	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2018-10-06 14:43:42 -07:00
Kees Cook	329e098939	treewide: Replace more open-coded allocation size multiplications As done treewide earlier, this catches several more open-coded allocation size calculations that were added to the kernel during the merge window. This performs the following mechanical transformations using Coccinelle: kvmalloc(a * b, ...) -> kvmalloc_array(a, b, ...) kvzalloc(a * b, ...) -> kvcalloc(a, b, ...) devm_kzalloc(..., a * b, ...) -> devm_kcalloc(..., a, b, ...) Signed-off-by: Kees Cook <keescook@chromium.org>	2018-10-05 18:06:30 -07:00
Vijay Khemka	fb4ee67529	net/ncsi: Add NCSI OEM command support This patch adds OEM commands and response handling. It also defines OEM command and response structure as per NCSI specification along with its handlers. ncsi_cmd_handler_oem: This is a generic command request handler for OEM commands ncsi_rsp_handler_oem: This is a generic response handler for OEM commands Signed-off-by: Vijay Khemka <vijaykhemka@fb.com> Reviewed-by: Justin Lee <justin.lee1@dell.com> Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-05 14:54:47 -07:00
Wei Wang	a688caa34b	ipv6: take rcu lock in rawv6_send_hdrinc() In rawv6_send_hdrinc(), in order to avoid an extra dst_hold(), we directly assign the dst to skb and set passed in dst to NULL to avoid double free. However, in error case, we free skb and then do stats update with the dst pointer passed in. This causes use-after-free on the dst. Fix it by taking rcu read lock right before dst could get released to make sure dst does not get freed until the stats update is done. Note: we don't have this issue in ipv4 cause dst is not used for stats update in v4. Syzkaller reported following crash: BUG: KASAN: use-after-free in rawv6_send_hdrinc net/ipv6/raw.c:692 [inline] BUG: KASAN: use-after-free in rawv6_sendmsg+0x4421/0x4630 net/ipv6/raw.c:921 Read of size 8 at addr ffff8801d95ba730 by task syz-executor0/32088 CPU: 1 PID: 32088 Comm: syz-executor0 Not tainted 4.19.0-rc2+ #93 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113 print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433 rawv6_send_hdrinc net/ipv6/raw.c:692 [inline] rawv6_sendmsg+0x4421/0x4630 net/ipv6/raw.c:921 inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798 sock_sendmsg_nosec net/socket.c:621 [inline] sock_sendmsg+0xd5/0x120 net/socket.c:631 ___sys_sendmsg+0x7fd/0x930 net/socket.c:2114 __sys_sendmsg+0x11d/0x280 net/socket.c:2152 __do_sys_sendmsg net/socket.c:2161 [inline] __se_sys_sendmsg net/socket.c:2159 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2159 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x457099 Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f83756edc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f83756ee6d4 RCX: 0000000000457099 RDX: 0000000000000000 RSI: 0000000020003840 RDI: 0000000000000004 RBP: 00000000009300a0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff R13: 00000000004d4b30 R14: 00000000004c90b1 R15: 0000000000000000 Allocated by task 32088: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490 kmem_cache_alloc+0x12e/0x730 mm/slab.c:3554 dst_alloc+0xbb/0x1d0 net/core/dst.c:105 ip6_dst_alloc+0x35/0xa0 net/ipv6/route.c:353 ip6_rt_cache_alloc+0x247/0x7b0 net/ipv6/route.c:1186 ip6_pol_route+0x8f8/0xd90 net/ipv6/route.c:1895 ip6_pol_route_output+0x54/0x70 net/ipv6/route.c:2093 fib6_rule_lookup+0x277/0x860 net/ipv6/fib6_rules.c:122 ip6_route_output_flags+0x2c5/0x350 net/ipv6/route.c:2121 ip6_route_output include/net/ip6_route.h:88 [inline] ip6_dst_lookup_tail+0xe27/0x1d60 net/ipv6/ip6_output.c:951 ip6_dst_lookup_flow+0xc8/0x270 net/ipv6/ip6_output.c:1079 rawv6_sendmsg+0x12d9/0x4630 net/ipv6/raw.c:905 inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798 sock_sendmsg_nosec net/socket.c:621 [inline] sock_sendmsg+0xd5/0x120 net/socket.c:631 ___sys_sendmsg+0x7fd/0x930 net/socket.c:2114 __sys_sendmsg+0x11d/0x280 net/socket.c:2152 __do_sys_sendmsg net/socket.c:2161 [inline] __se_sys_sendmsg net/socket.c:2159 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2159 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 5356: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528 __cache_free mm/slab.c:3498 [inline] kmem_cache_free+0x83/0x290 mm/slab.c:3756 dst_destroy+0x267/0x3c0 net/core/dst.c:141 dst_destroy_rcu+0x16/0x19 net/core/dst.c:154 __rcu_reclaim kernel/rcu/rcu.h:236 [inline] rcu_do_batch kernel/rcu/tree.c:2576 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2880 [inline] __rcu_process_callbacks kernel/rcu/tree.c:2847 [inline] rcu_process_callbacks+0xf23/0x2670 kernel/rcu/tree.c:2864 __do_softirq+0x30b/0xad8 kernel/softirq.c:292 Fixes: `1789a640f5` ("raw: avoid two atomics in xmit") Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-05 14:45:34 -07:00
Jakub Sitnicki	068b88cc17	socket: Tighten no-error check in bind() move_addr_to_kernel() returns only negative values on error, or zero on success. Rewrite the error check to an idiomatic form to avoid confusing the reader. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-05 14:34:45 -07:00
David Ahern	8b4c3cdd9d	net: sched: Add policy validation for tc attributes A number of TC attributes are processed without proper validation (e.g., length checks). Add a tca policy for all input attributes and use when invoking nlmsg_parse. The 2 Fixes tags below cover the latest additions. The other attributes are a string (KIND), nested attribute (OPTIONS which does seem to have validation in most cases), for dumps only or a flag. Fixes: `5bc1701881` ("net: sched: introduce multichain support for filters") Fixes: `d47a6b0e7c` ("net: sched: introduce ingress/egress block index attributes for qdisc") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-05 14:26:44 -07:00
Mauricio Faria de Oliveira	bd961c9bc6	rtnetlink: fix rtnl_fdb_dump() for ndmsg header Currently, rtnl_fdb_dump() assumes the family header is 'struct ifinfomsg', which is not always true -- 'struct ndmsg' is used by iproute2 ('ip neigh'). The problem is, the function bails out early if nlmsg_parse() fails, which does occur for iproute2 usage of 'struct ndmsg' because the payload length is shorter than the family header alone (as 'struct ifinfomsg' is assumed). This breaks backward compatibility with userspace -- nothing is sent back. Some examples with iproute2 and netlink library for go [1]: 1) $ bridge fdb show 33:33:00:00:00:01 dev ens3 self permanent 01:00:5e:00:00:01 dev ens3 self permanent 33:33:ff:15:98:30 dev ens3 self permanent This one works, as it uses 'struct ifinfomsg'. fdb_show() @ iproute2/bridge/fdb.c """ .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)), ... if (rtnl_dump_request(&rth, RTM_GETNEIGH, [...] """ 2) $ ip --family bridge neigh RTNETLINK answers: Invalid argument Dump terminated This one fails, as it uses 'struct ndmsg'. do_show_or_flush() @ iproute2/ip/ipneigh.c """ .n.nlmsg_type = RTM_GETNEIGH, .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ndmsg)), """ 3) $ ./neighlist < no output > This one fails, as it uses 'struct ndmsg'-based. neighList() @ netlink/neigh_linux.go """ req := h.newNetlinkRequest(unix.RTM_GETNEIGH, [...] msg := Ndmsg{ """ The actual breakage was introduced by commit `0ff50e83b5` ("net: rtnetlink: bail out from rtnl_fdb_dump() on parse error"), because nlmsg_parse() fails if the payload length (with the _actual_ family header) is less than the family header length alone (which is assumed, in parameter 'hdrlen'). This is true in the examples above with struct ndmsg, with size and payload length shorter than struct ifinfomsg. However, that commit just intends to fix something under the assumption the family header is indeed an 'struct ifinfomsg' - by preventing access to the payload as such (via 'ifm' pointer) if the payload length is not sufficient to actually contain it. The assumption was introduced by commit `5e6d243587` ("bridge: netlink dump interface at par with brctl"), to support iproute2's 'bridge fdb' command (not 'ip neigh') which indeed uses 'struct ifinfomsg', thus is not broken. So, in order to unbreak the 'struct ndmsg' family headers and still allow 'struct ifinfomsg' to continue to work, check for the known message sizes used with 'struct ndmsg' in iproute2 (with zero or one attribute which is not used in this function anyway) then do not parse the data as ifinfomsg. Same examples with this patch applied (or revert/before the original fix): $ bridge fdb show 33:33:00:00:00:01 dev ens3 self permanent 01:00:5e:00:00:01 dev ens3 self permanent 33:33:ff:15:98:30 dev ens3 self permanent $ ip --family bridge neigh dev ens3 lladdr 33:33:00:00:00:01 PERMANENT dev ens3 lladdr 01:00:5e:00:00:01 PERMANENT dev ens3 lladdr 33:33:ff:15:98:30 PERMANENT $ ./neighlist netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x33, 0x33, 0x0, 0x0, 0x0, 0x1}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0} netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x1, 0x0, 0x5e, 0x0, 0x0, 0x1}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0} netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x33, 0x33, 0xff, 0x15, 0x98, 0x30}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0} Tested on mainline (v4.19-rc6) and net-next (`3bd09b05b0`). References: [1] netlink library for go (test-case) https://github.com/vishvananda/netlink $ cat ~/go/src/neighlist/main.go package main import ("fmt"; "syscall"; "github.com/vishvananda/netlink") func main() { neighs, _ := netlink.NeighList(0, syscall.AF_BRIDGE) for _, neigh := range neighs { fmt.Printf("%#v\n", neigh) } } $ export GOPATH=~/go $ go get github.com/vishvananda/netlink $ go build neighlist $ ~/go/src/neighlist/neighlist Thanks to David Ahern for suggestions to improve this patch. Fixes: `0ff50e83b5` ("net: rtnetlink: bail out from rtnl_fdb_dump() on parse error") Fixes: `5e6d243587` ("bridge: netlink dump interface at par with brctl") Reported-by: Aidan Obley <aobley@pivotal.io> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-05 14:21:42 -07:00
Eric Dumazet	fda21d46cc	ipv6: do not leave garbage in rt->fib6_metrics In case ip_fib_metrics_init() returns an error, we better rewrite rt->fib6_metrics with &dst_default_metrics so that we do not crash later in ip_fib_metrics_put() Fixes: `767a221753` ("net: common metrics init helper for FIB entries") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-05 11:57:11 -07:00
Willem de Bruijn	f2e9de210d	udp: gro behind static key Avoid the socket lookup cost in udp_gro_receive if no socket has a udp tunnel callback configured. udp_sk(sk)->gro_receive requires a registration with setup_udp_tunnel_sock, which enables the static key. Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-05 11:52:38 -07:00
Shanthosh RK	33aa8da1f8	net: bpfilter: Fix type cast and pointer warnings Fixes the following Sparse warnings: net/bpfilter/bpfilter_kern.c:62:21: warning: cast removes address space of expression net/bpfilter/bpfilter_kern.c:101:49: warning: Using plain integer as NULL pointer Signed-off-by: Shanthosh RK <shanthosh.rk@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-05 11:51:06 -07:00
Arnd Bergmann	2192476586	SUNRPC: use cmpxchg64() in gss_seq_send64_fetch_and_inc() The newly introduced gss_seq_send64_fetch_and_inc() fails to build on 32-bit architectures: net/sunrpc/auth_gss/gss_krb5_seal.c:144:14: note: in expansion of macro 'cmpxchg' seq_send = cmpxchg(&ctx->seq_send64, old, old + 1); ^~~~~~~ arch/x86/include/asm/cmpxchg.h:128:3: error: call to '__cmpxchg_wrong_size' declared with attribute error: Bad argument size for cmpxchg __cmpxchg_wrong_size(); \ As the message tells us, cmpxchg() cannot be used on 64-bit arguments, that's what cmpxchg64() does. Fixes: `571ed1fd23` ("SUNRPC: Replace krb5_seq_lock with a lockless scheme") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-10-05 09:32:37 -04:00
David Howells	2cfa227160	rxrpc: Fix the data_ready handler Fix the rxrpc_data_ready() function to pick up all packets and to not miss any. There are two problems: (1) The sk_data_ready pointer on the UDP socket is set after it is bound. This means that it's open for business before we're ready to dequeue packets and there's a tiny window exists in which a packet can sneak onto the receive queue, but we never know about it. Fix this by setting the pointers on the socket prior to binding it. (2) skb_recv_udp() will return an error (such as ENETUNREACH) if there was an error on the transmission side, even though we set the sk_error_report hook. Because rxrpc_data_ready() returns immediately in such a case, it never actually removes its packet from the receive queue. Fix this by abstracting out the UDP dequeuing and checksumming into a separate function that keeps hammering on skb_recv_udp() until it returns -EAGAIN, passing the packets extracted to the remainder of the function. and two potential problems: (3) It might be possible in some circumstances or in the future for packets to be being added to the UDP receive queue whilst rxrpc is running consuming them, so the data_ready() handler might get called less often than once per packet. Allow for this by fully draining the queue on each call as (2). (4) If a packet fails the checksum check, the code currently returns after discarding the packet without checking for more. Allow for this by fully draining the queue on each call as (2). Fixes: `17926a7932` ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com>	2018-10-05 14:21:59 +01:00
David Howells	5e33a23ba4	rxrpc: Fix some missed refs to init_net Fix some refs to init_net that should've been changed to the appropriate network namespace. Fixes: `2baec2c3f8` ("rxrpc: Support network namespacing") Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com>	2018-10-05 14:21:59 +01:00
Cong Wang	95278ddaa1	net_sched: convert idrinfo->lock from spinlock to a mutex In commit `ec3ed293e7` ("net_sched: change tcf_del_walker() to take idrinfo->lock") we move fl_hw_destroy_tmplt() to a workqueue to avoid blocking with the spinlock held. Unfortunately, this causes a lot of troubles here: 1. tcf_chain_destroy() could be called right after we queue the work but before the work runs. This is a use-after-free. 2. The chain refcnt is already 0, we can't even just hold it again. We can check refcnt==1 but it is ugly. 3. The chain with refcnt 0 is still visible in its block, which means it could be still found and used! 4. The block has a refcnt too, we can't hold it without introducing a proper API either. We can make it working but the end result is ugly. Instead of wasting time on reviewing it, let's just convert the troubling spinlock to a mutex, which allows us to use non-atomic allocations too. Fixes: `ec3ed293e7` ("net_sched: change tcf_del_walker() to take idrinfo->lock") Reported-by: Ido Schimmel <idosch@idosch.org> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Vlad Buslov <vladbu@mellanox.com> Cc: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Tested-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-05 00:36:31 -07:00
Magnus Karlsson	a41b4f3c58	xsk: simplify xdp_clear_umem_at_qid implementation As we now do not allow ethtool to deactivate the queue id we are running an AF_XDP socket on, we can simplify the implementation of xdp_clear_umem_at_qid(). Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-10-05 09:31:01 +02:00
Jakub Kicinski	1661d34662	ethtool: don't allow disabling queues with umem installed We already check the RSS indirection table does not use queues which would be disabled by channel reconfiguration. Make sure user does not try to disable queues which have a UMEM and zero-copy AF_XDP socket installed. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-10-05 09:31:01 +02:00
Jakub Kicinski	b8c8a2e2e3	ethtool: rename local variable max -> curr ethtool_set_channels() validates the config against driver's max settings. It retrieves the current config and stores it in a variable called max. This was okay when only max settings were accessed but we will soon want to access current settings as well, so calling the entire structure max makes the code less readable. While at it drop unnecessary parenthesis. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-10-05 09:31:00 +02:00
Magnus Karlsson	c9b47cc1fa	xsk: fix bug when trying to use both copy and zero-copy on one queue id Previously, the xsk code did not record which umem was bound to a specific queue id. This was not required if all drivers were zero-copy enabled as this had to be recorded in the driver anyway. So if a user tried to bind two umems to the same queue, the driver would say no. But if copy-mode was first enabled and then zero-copy mode (or the reverse order), we mistakenly enabled both of them on the same umem leading to buggy behavior. The main culprit for this is that we did not store the association of umem to queue id in the copy case and only relied on the driver reporting this. As this relation was not stored in the driver for copy mode (it does not rely on the AF_XDP NDOs), this obviously could not work. This patch fixes the problem by always recording the umem to queue id relationship in the netdev_queue and netdev_rx_queue structs. This way we always know what kind of umem has been bound to a queue id and can act appropriately at bind time. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-10-05 09:31:00 +02:00
David Ahern	6f52f80e85	net/neigh: Extend dump filter to proxy neighbor dumps Move the attribute parsing from neigh_dump_table to neigh_dump_info, and pass the filter arguments down to neigh_dump_table in a new struct. Add the filter option to proxy neigh dumps as well to make them consistent. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-05 00:28:45 -07:00
Jianfeng Tan	9d2f67e43b	net/packet: fix packet drop as of virtio gso When we use raw socket as the vhost backend, a packet from virito with gso offloading information, cannot be sent out in later validaton at xmit path, as we did not set correct skb->protocol which is further used for looking up the gso function. To fix this, we set this field according to virito hdr information. Fixes: `e858fae2b0` ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion") Signed-off-by: Jianfeng Tan <jianfeng.tan@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-04 22:23:15 -07:00
David Ahern	1620a33695	net: Move free of dst_metrics to helper Move the refcounting and potential free of dst metrics associated for ipv4 and ipv6 to a common helper. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-04 21:54:25 -07:00
David Ahern	e1255ed4b6	net: common metrics init helper for dst_entry ipv4 and ipv6 both use refcounted metrics if FIB entries have metrics set. Move the common initialization code to a helper and use for both protocols. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-04 21:54:19 -07:00
David Ahern	cc5f0eb216	net: Move free of fib_metrics to helper Move the refcounting and potential free of dst metrics associated with a fib entry to a helper and use it in both ipv4 and ipv6. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-04 21:54:10 -07:00
David Ahern	767a221753	net: common metrics init helper for FIB entries Consolidate initialization of ipv4 and ipv6 metrics when fib entries are created into a single helper, ip_fib_metrics_init, that handles the call to ip_metrics_convert. If no metrics are defined for the fib entry, then the metrics is set to dst_default_metrics. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-04 21:54:03 -07:00
Flavio Leitner	17c357efe5	openvswitch: load NAT helper Load the respective NAT helper module if the flow uses it. Signed-off-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-04 21:45:16 -07:00
Vinicius Costa Gomes	5a781ccbd1	tc: Add support for configuring the taprio scheduler This traffic scheduler allows traffic classes states (transmission allowed/not allowed, in the simplest case) to be scheduled, according to a pre-generated time sequence. This is the basis of the IEEE 802.1Qbv specification. Example configuration: tc qdisc replace dev enp3s0 parent root handle 100 taprio \ num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \ queues 1@0 1@1 2@2 \ base-time 1528743495910289987 \ sched-entry S 01 300000 \ sched-entry S 02 300000 \ sched-entry S 04 300000 \ clockid CLOCK_TAI The configuration format is similar to mqprio. The main difference is the presence of a schedule, built by multiple "sched-entry" definitions, each entry has the following format: sched-entry <CMD> <GATE MASK> <INTERVAL> The only supported <CMD> is "S", which means "SetGateStates", following the IEEE 802.1Qbv-2015 definition (Table 8-6). <GATE MASK> is a bitmask where each bit is a associated with a traffic class, so bit 0 (the least significant bit) being "on" means that traffic class 0 is "active" for that schedule entry. <INTERVAL> is a time duration in nanoseconds that specifies for how long that state defined by <CMD> and <GATE MASK> should be held before moving to the next entry. This schedule is circular, that is, after the last entry is executed it starts from the first one, indefinitely. The other parameters can be defined as follows: - base-time: specifies the instant when the schedule starts, if 'base-time' is a time in the past, the schedule will start at base-time + (N * cycle-time) where N is the smallest integer so the resulting time is greater than "now", and "cycle-time" is the sum of all the intervals of the entries in the schedule; - clockid: specifies the reference clock to be used; The parameters should be similar to what the IEEE 802.1Q family of specification defines. Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-04 13:52:23 -07:00
Vasundhara Volam	16511789b9	devlink: Add generic parameter msix_vec_per_pf_min msix_vec_per_pf_min - This param sets the number of minimal MSIX vectors required for the device initialization. This value is set in the device which limits MSIX vectors per PF. Cc: Jiri Pirko <jiri@mellanox.com> Cc: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-04 13:49:42 -07:00
Vasundhara Volam	f61cba4291	devlink: Add generic parameter msix_vec_per_pf_max msix_vec_per_pf_max - This param sets the number of MSIX vectors that the device requests from the host on driver initialization. This value is set in the device which is applicable per PF. Cc: Jiri Pirko <jiri@mellanox.com> Cc: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-04 13:49:42 -07:00
Vasundhara Volam	e3b5106162	devlink: Add generic parameter ignore_ari ignore_ari - Device ignores ARI(Alternate Routing ID) capability, even when platforms has the support and creates same number of partitions when platform does not support ARI capability. Cc: Jiri Pirko <jiri@mellanox.com> Cc: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-04 13:49:42 -07:00
David S. Miller	9e15ff7b89	Just three small fixes: * fix use-after-free in regulatory code * fix rx-mgmt key flag in AP mode (mac80211) * fix wireless extensions compat code memory leak -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlu2bEUACgkQB8qZga/f l8RAgw/7BfRpm3Kr7XW919naGkt/pQeJxUcuF9YggBpTCrp/DSLQYsjOBE5DyS/m 728oPD8jEDehUHasWKsbG7wit1S7ImExCHTPim8C1mbABhbqdhwD4ceUvBO7RYi2 p0+yN8X8z5D0qruMrNwhtxdE8iV9bBgmY6u1jubpJFkKLPf2euZyroH40b879CIn aHqB42GNJCdwO2UFaPDH2cdx5DFWrDlfA1LGbrbuzrXMBfNGWYgen2JJH/5iDOyU 1rVXk/pUpVffp0Zde+66NtyCxxC0+hQwrTczEKXICb5qoWJpz6kugFGGO1oDQgdp AbM7KNrV712h/qwTEnC1NG0KUXgocpwWIuf/cuTow0vGUJSl+O2pLS/3GLOwH2du 1u/FF4LiBc4NFXmWBPMN3LUN+Ica0/YWSbVwcv2c4guemV1EOGinlbFc+tnoue7M fpLkQJUYCiEVFRXGWVaSl0Hr6z+zwgfa8qHYN2yq1qyB0dYHryYiVhgKLV7yisCm RNy1hmVuV7rMsL3f4iUq/2xnL3U8qK1+19Mr+i58/kU4Tx2jMkWj0kivRdgYX4EN XcBhJzWXrb3yMldACrCji6iRnrFMqg7osyEgFiMLMl4cZfs057+qsrJyR5xajVOi Ws+hj3a1LukJ1nIhou4uOLAt9D7ohuJojQq6D2GLeBwfGPNkz4E= =QyWK -----END PGP SIGNATURE----- Merge tag 'mac80211-for-davem-2018-10-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Just three small fixes: * fix use-after-free in regulatory code * fix rx-mgmt key flag in AP mode (mac80211) * fix wireless extensions compat code memory leak ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-04 13:47:39 -07:00
David S. Miller	9e50727f0e	mlx5-updates-2018-10-03 mlx5 core driver and ethernet netdev updates, please note there is a small devlink releated update to allow extack argument to eswitch operations. From Eli Britstein, 1) devlink: Add extack argument to the eswitch related operations 2) net/mlx5e: E-Switch, return extack messages for failures in the e-switch devlink callbacks 3) net/mlx5e: Add extack messages for TC offload failures From Eran Ben Elisha, 4) mlx5e: Add counter for aRFS rule insertion failures From Feras Daoud 5) Fast teardown support for mlx5 device This change introduces the enhanced version of the "Force teardown" that allows SW to perform teardown in a faster way without the need to reclaim all the FW pages. Fast teardown provides the following advantages: 1- Fix a FW race condition that could cause command timeout 2- Avoid moving to polling mode 3- Close the vport to prevent PCI ACK to be sent without been scatter to memory -----BEGIN PGP SIGNATURE----- iQEcBAABAgAGBQJbtU45AAoJEEg/ir3gV/o+/C4H/RHA4KImrb476EdB3VNYMqAN dgXb+bmh6sZP+jHWqQ4c3aVeh6/T8qm4gwiSn2nVTtHEnxtCdIYljzDC1Nswczeg pSjD1eOP7M1LpAOmBb8xdnJcX7yM7r1bTklnp2sN853WShbsDRYgZBHsBwTzx25U ZdzL4QTLuohlG/aLrbGXMntIy45ya2fVQrnK54s18nFlgsdFjEs0mi0xaUKNBC6+ P8CTohHAxuuxmL5b+6MIYLZCdgd8cLNQFdtqbckEVw7SvcRTxfraRlyqJ0YOgTGB TdSWnqZz2JYH29wSFbpFG8qX6GCv8FoiZ+fKzldbolHk442rrktHv3+Y7qQuZVs= =NVks -----END PGP SIGNATURE----- Merge tag 'mlx5-updates-2018-10-03' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2018-10-03 mlx5 core driver and ethernet netdev updates, please note there is a small devlink releated update to allow extack argument to eswitch operations. From Eli Britstein, 1) devlink: Add extack argument to the eswitch related operations 2) net/mlx5e: E-Switch, return extack messages for failures in the e-switch devlink callbacks 3) net/mlx5e: Add extack messages for TC offload failures From Eran Ben Elisha, 4) mlx5e: Add counter for aRFS rule insertion failures From Feras Daoud 5) Fast teardown support for mlx5 device This change introduces the enhanced version of the "Force teardown" that allows SW to perform teardown in a faster way without the need to reclaim all the FW pages. Fast teardown provides the following advantages: 1- Fix a FW race condition that could cause command timeout 2- Avoid moving to polling mode 3- Close the vport to prevent PCI ACK to be sent without been scatter to memory ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-04 09:48:37 -07:00
David S. Miller	f0e834e17f	RxRPC development -----BEGIN PGP SIGNATURE----- iQIVAwUAW7XYNfu3V2unywtrAQJCjhAAle39wPhGLcvO69dEwL3skLT74d2XTT+Q 0e7+5eUbyNFfW+hEphZBCtTCTOFYhzxoqATZHVpEquOZOt2B5b43W97LitHvpirf rHAOT6Y6P6t682/f7oHiXx4tW9hfEpgXyFj1aqluAcK7lOvBQf/koy4t1DAMfiTv dmc0A8OwBxfKfwSmAmRY26mpuLTzGe+gZkMzMFScicJ/GmWSIw8ujfajjIFv7c0F kYub9KkTCsc2LfI/2KFIZVgyI0acDlwMKVmZirzlZJWTUSaf71v6ApFVNbUOSKBm 64lm6h5jr5//x3rK6DdI6MV943uCwl5uFKbnTVu/Wv+bjvCuBviIvXuYWCHm6QMl uqt7Ht3qIid0VKBJc1ekrEvfDo7xi61GWV07BknRpMbUjlqadST7td7wDfuK017C OUzKdJFNTzNwAWzkJIgRpyyRjUSm3pTxEL7iTZ0+shENpKpT+SyzKgseNPUCHAPv 8kprzGD3pXNjbpgZOeLcbpGbhnFTrV6NKguIAKy/NBMVlnx4yzbNTCjEzwLx/oG4 G0KlPUzXR/ZO3CtjDGz0k/pjJOAHmLxWfaaAk+opks8mFDaKKGHAg1rzkM5KN6kk GGsO0vWFIRfbFqqWhojFxEtiNgrtHIStVJSZtnBG0WRdn8jdrYAnFBo9leGnS+5/ y6PM91VI2+8= =b6yx -----END PGP SIGNATURE----- Merge tag 'rxrpc-next-20181004' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Development Here are some development patches for AF_RXRPC. The most significant points are: (1) Change the tracepoint that indicates a packet has been transmitted into one that indicates a packet is about to be transmitted. Without this, the response tracepoint may occur first if the round trip is fast enough. (2) Sort out AFS address list handling to better enforce maximum capacity to use helper functions to fill them and to do an insertion sort to order them. This is here to make (3) easier. (3) Keep AF_INET addresses as AF_INET addresses rather than converting them to AF_INET6 in both AF_RXRPC and kAFS. I hadn't realised that a UDP6 socket would just call down into UDP4 if given an AF_INET address. (4) Allow the timestamp on the first DATA packet of a reply to be retrieved by a kernel service. This will give the kAFS a more accurate base from which to calculate the callback promise expiration. (5) Allow the rxrpc protocol epoch value to be retrieved from an incoming call. This will allow kAFS to determine if the fileserver restarted and if two addresses apparently assigned to the same fileserver actually are different boxes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-04 09:44:15 -07:00
David Howells	bbb4c4323a	dns: Allow the dns resolver to retrieve a server set Allow the DNS resolver to retrieve a set of servers and their associated addresses, ports, preference and weight ratings. In terms of communication with userspace, "srv=1" is added to the callout string (the '1' indicating the maximum data version supported by the kernel) to ask the userspace side for this. If the userspace side doesn't recognise it, it will ignore the option and return the usual text address list. If the userspace side does recognise it, it will return some binary data that begins with a zero byte that would cause the string parsers to give an error. The second byte contains the version of the data in the blob (this may be between 1 and the version specified in the callout data). The remainder of the payload is version-specific. In version 1, the payload looks like (note that this is packed): u8 Non-string marker (ie. 0) u8 Content (0 => Server list) u8 Version (ie. 1) u8 Source (eg. DNS_RECORD_FROM_DNS_SRV) u8 Status (eg. DNS_LOOKUP_GOOD) u8 Number of servers foreach-server { u16 Name length (LE) u16 Priority (as per SRV record) (LE) u16 Weight (as per SRV record) (LE) u16 Port (LE) u8 Source (eg. DNS_RECORD_FROM_NSS) u8 Status (eg. DNS_LOOKUP_GOT_NOT_FOUND) u8 Protocol (eg. DNS_SERVER_PROTOCOL_UDP) u8 Number of addresses char[] Name (not NUL-terminated) foreach-address { u8 Family (AF_INET{,6}) union { u8[4] ipv4_addr u8[16] ipv6_addr } } } This can then be used to fetch a whole cell's VL-server configuration for AFS, for example. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-04 09:40:52 -07:00
David Howells	e908bcf4f1	rxrpc: Allow the reply time to be obtained on a client call Allow the epoch value to be queried on a server connection. This is in the rxrpc header of every packet for use in routing and is derived from the client's state. It's also not supposed to change unless the client gets restarted. AFS can make use of this information to deduce whether a fileserver has been restarted because the fileserver makes client calls to the filesystem driver's cache manager to send notifications (ie. callback breaks) about conflicting changes from other clients. These convey the fileserver's own epoch value back to the filesystem. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-04 09:54:29 +01:00
David Howells	2070a3e449	rxrpc: Allow the reply time to be obtained on a client call Allow the timestamp on the sk_buff holding the first DATA packet of a reply to be queried. This can then be used as a base for the expiry time calculation on the callback promise duration indicated by an operation result. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-04 09:42:29 +01:00
Joe Stringer	d71019b54b	net: core: Fix build with CONFIG_IPV6=m Stephen Rothwell reports the following link failure with IPv6 as module: x86_64-linux-gnu-ld: net/core/filter.o: in function `sk_lookup': (.text+0x19219): undefined reference to `__udp6_lib_lookup' Fix the build by only enabling the IPv6 socket lookup if IPv6 support is compiled into the kernel. Signed-off-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-10-04 10:32:56 +02:00
David Howells	5a790b7375	rxrpc: Drop the local endpoint arg from rxrpc_extract_addr_from_skb() rxrpc_extract_addr_from_skb() doesn't use the argument that points to the local endpoint, so remove the argument. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-04 09:32:28 +01:00
David Howells	46894a1359	rxrpc: Use IPv4 addresses throught the IPv6 AF_RXRPC opens an IPv6 socket through which to send and receive network packets, both IPv6 and IPv4. It currently turns AF_INET addresses into AF_INET-as-AF_INET6 addresses based on an assumption that this was necessary; on further inspection of the code, however, it turns out that the IPv6 code just farms packets aimed at AF_INET addresses out to the IPv4 code. Fix AF_RXRPC to use AF_INET addresses directly when given them. Fixes: `7b674e390e` ("rxrpc: Fix IPv6 support") Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-04 09:32:28 +01:00
David Howells	b3cfb6f567	rxrpc: Emit the data Tx trace line before transmitting Print the data Tx trace line before transmitting so that it appears before the trace lines indicating success or failure of the transmission. This makes the trace log less confusing. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-04 09:32:27 +01:00
David Howells	d2944b1c66	rxrpc: Use rxrpc_free_skb() rather than rxrpc_lose_skb() rxrpc_lose_skb() is now exactly the same as rxrpc_free_skb(), so remove it and use the latter instead. Signed-off-by: David Howells <dhowells@redhat.com>	2018-10-04 09:32:27 +01:00
David S. Miller	6f41617bf2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Minor conflict in net/core/rtnetlink.c, David Ahern's bug fix in 'net' overlapped the renaming of a netlink attribute in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-03 21:00:17 -07:00
Eli Britstein	db7ff19e7b	devlink: Add extack for eswitch operations Add extack argument to the eswitch related operations. Signed-off-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>	2018-10-03 16:17:58 -07:00
Chuck Lever	ad0911802c	xprtrdma: Clean up xprt_rdma_disconnect_inject Clean up: Use the appropriate C macro instead of open-coding container_of() . Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-03 14:19:31 -04:00
Chuck Lever	f26c32fa5c	xprtrdma: Add documenting comments Clean up: fill in or update documenting comments for transport switch entry points. For xprt_rdma_allocate: The first paragraph is no longer true since commit `5a6d1db455` ("SUNRPC: Add a transport-specific private field in rpc_rqst"). The second paragraph is no longer true since commit `54cbd6b0c6` ("xprtrdma: Delay DMA mapping Send and Receive buffers"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-03 14:10:55 -04:00
Chuck Lever	61c208a5ca	xprtrdma: Report when there were zero posted Receives To show that a caller did attempt to allocate and post more Receive buffers, the trace point in rpcrdma_post_recvs() should report when rpcrdma_post_recvs() was invoked but no new Receive buffers were posted. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-03 14:02:01 -04:00
Chuck Lever	512ccfb61a	xprtrdma: Move rb_flags initialization Clean up: rb_flags might be used for other things besides RPCRDMA_BUF_F_EMPTY_SCQ, so initialize it in a generic spot instead of in a send-completion-queue-related helper. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-03 13:39:02 -04:00
Gustavo A. R. Silva	2cc543f5cd	sctp: fix fall-through annotation Replace "fallthru" with a proper "fall through" annotation. This fix is part of the ongoing efforts to enabling -Wimplicit-fallthrough Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-03 09:33:13 -07:00
Trond Myklebust	b92a8fabab	SUNRPC: Refactor sunrpc_cache_lookup This is a trivial split into lookup and insert functions, no change in behavior. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2018-10-03 11:54:34 -04:00
Trond Myklebust	608a0ab2f5	SUNRPC: Add lockless lookup of the server's auth domain Avoid taking the global auth_domain_lock in most lookups of the auth domain by adding an RCU protected lookup. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2018-10-03 11:32:59 -04:00
Trond Myklebust	30382d6ce5	SUNRPC: Remove the server 'authtab_lock' and just use RCU Module removal is RCU safe by design, so we really have no need to lock the 'authtab[]' array. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2018-10-03 11:32:59 -04:00
Chuck Lever	f7d4668155	xprtrdma: Don't disable BH's in backchannel server Clean up: This code was copied from xprtsock.c and backchannel_rqst.c. For rpcrdma, the backchannel server runs exclusively in process context, thus disabling bottom-halves is unnecessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-03 10:23:45 -04:00
Chuck Lever	83e301dd13	xprtrdma: Remove memory address of "ep" from an error message Clean up: Replace the hashed memory address of the target rpcrdma_ep with the server's IP address and port. The server address is more useful in an administrative error message. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-03 10:05:37 -04:00
Chuck Lever	f9521d53e8	xprtrdma: Rename rpcrdma_qp_async_error_upcall Clean up: Use a function name that is consistent with the RDMA core API and with other consumers. Because this is a function that is invoked from outside the rpcrdma.ko module, add an appropriate documenting comment. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-03 09:14:57 -04:00
Chuck Lever	31e62d25b5	xprtrdma: Simplify RPC wake-ups on connect Currently, when a connection is established, rpcrdma_conn_upcall invokes rpcrdma_conn_func and then wake_up_all(&ep->rep_connect_wait). The former wakes waiting RPCs, but the connect worker is not done yet, and that leads to races, double wakes, and difficulty understanding how this logic is supposed to work. Instead, collect all the "connection established" logic in the connect worker (xprt_rdma_connect_worker). A disconnect worker is retained to handle provider upcalls safely. Fixes: `254f91e2fa` ("xprtrdma: RPC/RDMA must invoke ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-03 08:58:41 -04:00
Chuck Lever	316a616e78	xprtrdma: Re-organize the switch() in rpcrdma_conn_upcall Clean up: Eliminate the FALLTHROUGH into the default arm to make the switch easier to understand. Also, as long as I'm here, do not display the memory address of the target rpcrdma_ep. A hashed memory address is of marginal use here. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-03 08:39:39 -04:00
Chenbo Feng	e9837e55b0	netfilter: xt_quota: fix the behavior of xt_quota module A major flaw of the current xt_quota module is that quota in a specific rule gets reset every time there is a rule change in the same table. It makes the xt_quota module not very useful in a table in which iptables rules are changed at run time. This fix introduces a new counter that is visible to userspace as the remaining quota of the current rule. When userspace restores the rules in a table, it can restore the counter to the remaining quota instead of resetting it to the full quota. Signed-off-by: Chenbo Feng <fengc@google.com> Suggested-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2018-10-03 11:32:54 +02:00
Vakul Garg	4e6d47206c	tls: Add support for inplace records encryption Presently, for non-zero copy case, separate pages are allocated for storing plaintext and encrypted text of records. These pages are stored in sg_plaintext_data and sg_encrypted_data scatterlists inside record structure. Further, sg_plaintext_data & sg_encrypted_data are passed to cryptoapis for record encryption. Allocating separate pages for plaintext and encrypted text is inefficient from both required memory and performance point of view. This patch adds support of inplace encryption of records. For non-zero copy case, we reuse the pages from sg_encrypted_data scatterlist to copy the application's plaintext data. For the movement of pages from sg_encrypted_data to sg_plaintext_data scatterlists, we introduce a new function move_to_plaintext_sg(). This function add pages into sg_plaintext_data from sg_encrypted_data scatterlists. tls_do_encryption() is modified to pass the same scatterlist as both source and destination into aead_request_set_crypt() if inplace crypto has been enabled. A new ariable 'inplace_crypto' has been introduced in record structure to signify whether the same scatterlist can be used. By default, the inplace_crypto is enabled in get_rec(). If zero-copy is used (i.e. plaintext data is not copied), inplace_crypto is set to '0'. Signed-off-by: Vakul Garg <vakul.garg@nxp.com> Reviewed-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 23:03:47 -07:00
David S. Miller	00538ba915	Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2018-09-30 Here's the first bluetooth-next pull request for the 4.20 kernel. - Fixes & cleanups to hci_qca driver - NULL dereference fix to debugfs - Improved L2CAP Connection-oriented Channel MTU & MPS handling - Added support for USB-based RTL8822C controller - Added device ID for BCM4335C0 UART-based controller - Various other smaller cleanups & fixes Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 22:34:54 -07:00
Eric Dumazet	64199fc0a4	ipv4: fix use-after-free in ip_cmsg_recv_dstaddr() Caching ip_hdr(skb) before a call to pskb_may_pull() is buggy, do not do it. Fixes: `2efd4fca70` ("ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 22:32:05 -07:00
Patrick Ruddy	e4a38c0c4b	ipv6: add vrf table handling code for ipv6 mcast The code to obtain the correct table for the incoming interface was missing for IPv6. This has been added along with the table creation notification to fib rules for the RTNL_FAMILY_IP6MR address family. Signed-off-by: Patrick Ruddy <pruddy@vyatta.att-mail.com> Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 22:29:08 -07:00
Robert Shearman	854da99173	ipv4: Allow sending multicast packets on specific i/f using VRF socket It is useful to be able to use the same socket for listening in a specific VRF, as for sending multicast packets out of a specific interface. However, the bound device on the socket currently takes precedence and results in the packets not being sent. Relax the condition on overriding the output interface to use for sending packets out of UDP, raw and ping sockets to allow multicast packets to be sent using the specified multicast interface. Signed-off-by: Robert Shearman <rshearma@vyatta.att-mail.com> Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 22:28:17 -07:00
Ido Schimmel	6919622af3	bridge: mcast: Default back to multicast enabled state Commit `13cefad2f2` ("net: bridge: convert and rename mcast disabled") converted the 'multicast_disabled' field to an option bit named 'BROPT_MULTICAST_ENABLED'. While the old field was implicitly initialized to 0, the new field is not initialized, resulting in the bridge defaulting to multicast disabled state and breaking existing applications. Fix this by explicitly initializing the option. Fixes: `13cefad2f2` ("net: bridge: convert and rename mcast disabled") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 22:27:36 -07:00
Eric Dumazet	8873c064d1	tcp: do not release socket ownership in tcp_close() syzkaller was able to hit the WARN_ON(sock_owned_by_user(sk)); in tcp_close() While a socket is being closed, it is very possible other threads find it in rtnetlink dump. tcp_get_info() will acquire the socket lock for a short amount of time (slow = lock_sock_fast(sk)/unlock_sock_fast(sk, slow);), enough to trigger the warning. Fixes: `67db3e4bfb` ("tcp: no longer hold ehash lock while calling tcp_get_info()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 22:17:35 -07:00
Joe Stringer	6acc9b432e	bpf: Add helper to retrieve socket in BPF This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and bpf_sk_lookup_udp() which allows BPF programs to find out if there is a socket listening on this host, and returns a socket pointer which the BPF program can then access to determine, for instance, whether to forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the socket, so when a BPF program makes use of this function, it must subsequently pass the returned pointer into the newly added sk_release() to return the reference. By way of example, the following pseudocode would filter inbound connections at XDP if there is no corresponding service listening for the traffic: struct bpf_sock_tuple tuple; struct bpf_sock_ops *sk; populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0); if (!sk) { // Couldn't find a socket listening for this traffic. Drop. return TC_ACT_SHOT; } bpf_sk_release(sk, 0); return TC_ACT_OK; Signed-off-by: Joe Stringer <joe@wand.net.nz> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-10-03 02:53:47 +02:00
Joe Stringer	c64b798328	bpf: Add PTR_TO_SOCKET verifier type Teach the verifier a little bit about a new type of pointer, a PTR_TO_SOCKET. This pointer type is accessed from BPF through the 'struct bpf_sock' structure. Signed-off-by: Joe Stringer <joe@wand.net.nz> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-10-03 02:53:47 +02:00
Maciej Żenczykowski	744486d426	net: inet6_rtm_getroute() - use new style struct initializer instead of memset Signed-off-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 16:12:40 -07:00
Maciej Żenczykowski	84db840715	net: rtm_to_fib6_config() - use new style struct initializer instead of memset (allows for better compiler optimization) Signed-off-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 16:12:40 -07:00
Maciej Żenczykowski	8823a3acfd	net: rtmsg_to_fib6_config() - use new style struct initializer instead of memset (allows for better compiler optimization) Signed-off-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 16:12:40 -07:00
Maciej Żenczykowski	dc92095dd9	net: ip6_update_pmtu() - use new style struct initializer instead of memset (allows for better compiler optimization) Signed-off-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 16:12:40 -07:00
Maciej Żenczykowski	d456336d16	net: remove 1 always zero parameter from ip6_redirect_no_header() (the parameter in question is mark) Signed-off-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 16:12:40 -07:00
Maciej Żenczykowski	0b26fb17ca	net: ip6_redirect_no_header() - use new style struct initializer instead of memset (allows for better compiler optimization) Signed-off-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 16:12:40 -07:00
Maciej Żenczykowski	1f7f10ac4a	net: ip6_redirect() - use new style struct initializer instead of memset (allows for better compiler optimization) Signed-off-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 16:12:39 -07:00
Maciej Żenczykowski	e8e3fbe92c	net: inet_rtm_getroute() - use new style struct initializer instead of memset Signed-off-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 16:12:39 -07:00
Maciej Żenczykowski	e351bb6227	net: ip_rt_get_source() - use new style struct initializer instead of memset (allows for better compiler optimization) Signed-off-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 16:12:39 -07:00
Eric Dumazet	0e1d6eca51	rtnl: limit IFLA_NUM_TX_QUEUES and IFLA_NUM_RX_QUEUES to 4096 We have an impressive number of syzkaller bugs that are linked to the fact that syzbot was able to create a networking device with millions of TX (or RX) queues. Let's limit the number of RX/TX queues to 4096, this really should cover all known cases. A separate patch will add various cond_resched() in the loops handling sysfs entries at device creation and dismantle. Tested: lpaa6:~# ip link add gre-4097 numtxqueues 4097 numrxqueues 4097 type ip6gretap RTNETLINK answers: Invalid argument lpaa6:~# time ip link add gre-4096 numtxqueues 4096 numrxqueues 4096 type ip6gretap real 0m0.180s user 0m0.000s sys 0m0.107s Fixes: `76ff5cc919` ("rtnl: allow to specify number of rx and tx queues on device creation") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 16:06:13 -07:00
Eric Dumazet	2ab2ddd301	inet: make sure to grab rcu_read_lock before using ireq->ireq_opt Timer handlers do not imply rcu_read_lock(), so my recent fix triggered a LOCKDEP warning when SYNACK is retransmit. Lets add rcu_read_lock()/rcu_read_unlock() pairs around ireq->ireq_opt usages instead of guessing what is done by callers, since it is not worth the pain. Get rid of ireq_opt_deref() helper since it hides the logic without real benefit, since it is now a standard rcu_dereference(). Fixes: `1ad98e9d1b` ("tcp/dccp: fix lockdep issue when SYN is backlogged") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 15:52:12 -07:00
Chuck Lever	aadc5a9448	xprtrdma: Eliminate "connstate" variable from rpcrdma_conn_upcall() Clean up. Since commit `173b8f49b3` ("xprtrdma: Demote "connect" log messages") there has been no need to initialize connstat to zero. In fact, in this code path there's now no reason not to set rep_connected directly. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-02 16:16:54 -04:00
Chuck Lever	ed97f1f79b	xprtrdma: Conventional variable names in rpcrdma_conn_upcall Clean up: The convention throughout other parts of xprtrdma is to name variables of type struct rpcrdma_xprt "r_xprt", not "xprt". This convention enables the use of the name "xprt" for a "struct rpc_xprt" type variable, as in other parts of the RPC client. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-02 16:15:36 -04:00
Chuck Lever	ae38288eb7	xprtrdma: Rename rpcrdma_conn_upcall Clean up: Use a function name that is consistent with the RDMA core API and with other consumers. Because this is a function that is invoked from outside the rpcrdma.ko module, add an appropriate documenting comment. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-02 16:14:05 -04:00
Chuck Lever	8440a88611	sunrpc: Report connect_time in seconds The way connection-oriented transports report connect_time is wrong: it's supposed to be in seconds, not in jiffies. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-02 16:11:00 -04:00
Chuck Lever	3968a8a531	sunrpc: Fix connect metrics For TCP, the logic in xprt_connect_status is currently never invoked to record a successful connection. Commit `2a4919919a` ("SUNRPC: Return EAGAIN instead of ENOTCONN when waking up xprt->pending") changed the way TCP xprt's are awoken after a connect succeeds. Instead, change connection-oriented transports to bump connect_count and compute connect_time the moment that XPRT_CONNECTED is set. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-02 16:08:12 -04:00
Chuck Lever	d379eaa838	xprtrdma: Name MR trace events consistently Clean up the names of trace events related to MRs so that it's easy to enable these with a glob. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-02 15:55:36 -04:00
Chuck Lever	61da886bf7	xprtrdma: Explicitly resetting MRs is no longer necessary When a memory operation fails, the MR's driver state might not match its hardware state. The only reliable recourse is to dereg the MR. This is done in ->ro_recover_mr, which then attempts to allocate a fresh MR to replace the released MR. Since commit `e2ac236c0b` ("xprtrdma: Allocate MRs on demand"), xprtrdma dynamically allocates MRs. It can add more MRs whenever they are needed. That makes it possible to simply release an MR when a memory operation fails, instead of "recovering" it. It will automatically be replaced by the on-demand MR allocator. This commit is a little larger than I wanted, but it replaces ->ro_recover_mr, rb_recovery_lock, rb_recovery_worker, and the rb_stale_mrs list with a generic work queue. Since MRs are no longer orphaned, the mrs_orphaned metric is no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-02 15:48:12 -04:00
Chuck Lever	c421ece68f	xprtrdma: Create more MRs at a time Some devices require more than 3 MRs to build a single 1MB I/O. Ensure that rpcrdma_mrs_create() will add enough MRs to build that I/O. In a subsequent patch I'm changing the MR recovery logic to just toss out the MRs. In that case it's possible for ->send_request to loop acquiring some MRs, not getting enough, getting called again, recycling the previous MRs, then not getting enough, lather rinse repeat. Thus first we need to ensure enough MRs are created to prevent that loop. I'm "reusing" ia->ri_max_segs. All of its accessors seem to want the maximum number of data segments plus two, so I'm going to bake that into the initial calculation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-02 15:46:35 -04:00
Chuck Lever	ef739b2175	xprtrdma: Reset credit grant properly after a disconnect On a fresh connection, an RPC/RDMA client is supposed to send only one RPC Call until it gets a credit grant in the first RPC Reply from the server [RFC 8166, Section 3.3.3]. There is a bug in the Linux client's credit accounting mechanism introduced by commit `e7ce710a88` ("xprtrdma: Avoid deadlock when credit window is reset"). On connect, it simply dumps all pending RPC Calls onto the new connection. Servers have been tolerant of this bad behavior. Currently no server implementation ever changes its credit grant over reconnects, and servers always repost enough Receives before connections are fully established. To correct this issue, ensure that the client resets both the credit grant _and_ the congestion window when handling a reconnect. Fixes: `e7ce710a88` ("xprtrdma: Avoid deadlock when credit ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@kernel.org Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-02 15:45:09 -04:00
Chuck Lever	91ca18660e	xprtrdma: xprt_release_rqst_cong is called outside of transport_lock Since commit `ce7c252a8c` ("SUNRPC: Add a separate spinlock to protect the RPC request receive list") the RPC/RDMA reply handler has been calling xprt_release_rqst_cong without holding xprt->transport_lock. I think the only way this call is ever made is if the credit grant increases and there are RPCs pending. Current server implementations do not change their credit grant during operation (except at connect time). Commit `e7ce710a88` ("xprtrdma: Avoid deadlock when credit window is reset") added the ->release_rqst call because UDP invokes xprt_adjust_cwnd(), which calls __xprt_put_cong() after adjusting xprt->cwnd. Both xprt_release() and ->xprt_release_xprt already wake another task in this case, so it is safe to remove this call from the reply handler. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2018-10-02 15:26:30 -04:00
Paolo Abeni	cc16567e5a	net: drop unused skb_append_datato_frags() This helper is unused since commit `988cf74deb` ("inet: Stop generating UFO packets.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-02 11:18:09 -07:00
Johannes Berg	5207ca554b	cfg80211: sort tracing properly There were supposed to be two blocks - one for each direction cfg80211 <-> driver, clean up the code to restore that. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-02 09:59:07 +02:00
Johannes Berg	ec8f170bc3	cfg80211: unify sending NL80211_CMD_NEW_INTERFACE There isn't really any need for us to be sending this from two different places - move cfg80211_init_wdev() later and send the notification from there, removing it from the non- netdev case. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-02 09:58:57 +02:00
Johannes Berg	85dd3da43d	cfg80211: combine wdev/netdev unregister code We currently have two places that do similar things, depending on whether it's a wdev with or without netdev. Combine the code to avoid having to duplicate all new additions. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-02 09:58:51 +02:00
Johannes Berg	49f9cf0e1b	nl80211: add error messages to nl80211_parse_chandef() Add some error messages to nl80211_parse_chandef() to make failures here - especially with disabled channels - easier to diagnose. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-02 09:58:44 +02:00
Johannes Berg	b60ad34851	cfg80211: move cookie_counter out of wiphy There's no reason for drivers to be able to access the cfg80211 internal cookie counter; move it out of the wiphy into the rdev structure. While at it, also make it never assign 0 as a cookie (we consider that invalid in some places), and warn if we manage to do that for some reason (wrapping is not likely to happen with a u64.) Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-02 09:58:36 +02:00
Johannes Berg	71e5e88680	cfg80211: regulatory: make initialization more robust Since my change to split out the regulatory init to occur later, any issues during earlier cfg80211_init() or errors during the platform device allocation would lead to crashes later. Make this more robust by checking that the earlier initialization succeeded. Fixes: `d7be102f29` ("cfg80211: initialize regulatory keys/database later") Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-02 09:58:25 +02:00
Masashi Honma	c70616bd8a	mac80211: Remove unused initialization The variable j will be initialized at trailing step. Signed-off-by: Masashi Honma <masashi.honma@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-02 09:56:30 +02:00
Johannes Berg	5297c65c1d	nl80211: remove nl80211_prepare_wdev_dump() skb argument nl80211_prepare_wdev_dump() is using the output skb to look up the network namespace, but this isn't really necessary, it can just as well use the input skb which is available as cb->skb, the sk is the same anyway. Therefore, remove the redundant argument. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-02 09:56:30 +02:00
Pradeep Kumar Chitrapu	81e54d08d9	cfg80211: support FTM responder configuration/statistics Allow userspace to enable fine timing measurement responder functionality with configurable lci/civic parameters in AP mode. This can be done at AP start or changing beacon parameters. A new EXT_FEATURE flag is introduced for drivers to advertise the capability. Also nl80211 API support for retrieving statistics is added. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Pradeep Kumar Chitrapu <pradeepc@codeaurora.org> [remove unused cfg80211_ftm_responder_params, clarify docs, move validation into policy] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-02 09:56:30 +02:00
Johannes Berg	7057f2496c	cfg80211: tracing: reuse wiphy_wdev_evt for rdev_get_txq_stats A simple cleanup, reuse the event definition that we already have. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-02 09:56:30 +02:00
Andrew Zaborowski	efdfce7270	nl80211: Fix a GET_KEY reply attribute Use the NL80211_KEY_IDX attribute inside the NL80211_ATTR_KEY in NL80211_CMD_GET_KEY responses to comply with nl80211_key_policy. This is unlikely to affect existing userspace. Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-02 09:56:30 +02:00
Wei Yongjun	48f3b9e989	mac80211: fix error handling in ieee80211_register_hw() Fix to return a negative error code -ENOMEM from the kmemdup error handling case instead of 0. Fixes: `09b4a4faf9` ("mac80211: introduce capability flags for VHT EXT NSS support") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-02 09:56:29 +02:00
Johannes Berg	e4d4216e91	cfg80211: combine duplicate wdev init code There's a bit of duplicated code to initialize a wdev, pull it out into a separate function to call from both places. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-02 09:56:29 +02:00
Erik Stromdahl	a5ae326418	mac80211: fix issue with possible txq NULL pointer Drivers that do not have the BUFF_MMPDU_TXQ flag set will not have a TXQ for the special TID = 16. In this case, the last member in the struct ieee80211_sta txq array will be NULL. We must check this in order not to get a NULL pointer dereference when iterating the txq array. Signed-off-by: Erik Stromdahl <erik.stromdahl@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-02 09:56:29 +02:00
Colin Ian King	6762696429	cfg80211: remove redundant check of !scan_plan The check for !scan_plan is redunant as this has been checked in the proceeding statement and the code returns -ENOBUFS if it is true. Remove the redundant check. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-02 09:56:29 +02:00
zhong jiang	d25d062f55	cfg80211: remove unnecessary null pointer check in cfg80211_netdev_notifier_call The iterator in list_for_each_entry_safe is never null, therefore, remove the redundant null pointer check. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-02 09:56:29 +02:00
Dave Jones	6fe9487892	bond: take rcu lock in netpoll_send_skb_on_dev The bonding driver lacks the rcu lock when it calls down into netdev_lower_get_next_private_rcu from bond_poll_controller, which results in a trace like: WARNING: CPU: 2 PID: 179 at net/core/dev.c:6567 netdev_lower_get_next_private_rcu+0x34/0x40 CPU: 2 PID: 179 Comm: kworker/u16:15 Not tainted 4.19.0-rc5-backup+ #1 Workqueue: bond0 bond_mii_monitor RIP: 0010:netdev_lower_get_next_private_rcu+0x34/0x40 Code: 48 89 fb e8 fe 29 63 ff 85 c0 74 1e 48 8b 45 00 48 81 c3 c0 00 00 00 48 8b 00 48 39 d8 74 0f 48 89 45 00 48 8b 40 f8 5b 5d c3 <0f> 0b eb de 31 c0 eb f5 0f 1f 40 00 0f 1f 44 00 00 48 8> RSP: 0018:ffffc9000087fa68 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffff880429614560 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 00000000ffffffff RDI: ffffffffa184ada0 RBP: ffffc9000087fa80 R08: 0000000000000001 R09: 0000000000000000 R10: ffffc9000087f9f0 R11: ffff880429798040 R12: ffff8804289d5980 R13: ffffffffa1511f60 R14: 00000000000000c8 R15: 00000000ffffffff FS: 0000000000000000(0000) GS:ffff88042f880000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f4b78fce180 CR3: 000000018180f006 CR4: 00000000001606e0 Call Trace: bond_poll_controller+0x52/0x170 netpoll_poll_dev+0x79/0x290 netpoll_send_skb_on_dev+0x158/0x2c0 netpoll_send_udp+0x2d5/0x430 write_ext_msg+0x1e0/0x210 console_unlock+0x3c4/0x630 vprintk_emit+0xfa/0x2f0 printk+0x52/0x6e ? __netdev_printk+0x12b/0x220 netdev_info+0x64/0x80 ? bond_3ad_set_carrier+0xe9/0x180 bond_select_active_slave+0x1fc/0x310 bond_mii_monitor+0x709/0x9b0 process_one_work+0x221/0x5e0 worker_thread+0x4f/0x3b0 kthread+0x100/0x140 ? process_one_work+0x5e0/0x5e0 ? kthread_delayed_work_timer_fn+0x90/0x90 ret_from_fork+0x24/0x30 We're also doing rcu dereferences a layer up in netpoll_send_skb_on_dev before we call down into netpoll_poll_dev, so just take the lock there. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 23:25:25 -07:00
David Ahern	893626d6a3	rtnetlink: Fail dump if target netnsid is invalid Link dumps can return results from a target namespace. If the namespace id is invalid, then the dump request should fail if get_target_net fails rather than continuing with a dump of the current namespace. Fixes: `79e1ad148c` ("rtnetlink: use netnsid to query interface") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 23:22:04 -07:00
Flavio Leitner	7f6d6558ae	Revert "openvswitch: Fix template leak in error cases." This reverts commit `90c7afc96c`. When the commit was merged, the code used nf_ct_put() to free the entry, but later on commit `76644232e6` ("openvswitch: Free tmpl with tmpl_free.") replaced that with nf_ct_tmpl_free which is a more appropriate. Now the original problem is removed. Then `44d6e2f273` ("net: Replace NF_CT_ASSERT() with WARN_ON().") replaced a debug assert with a WARN_ON() which is trigged now. Signed-off-by: Flavio Leitner <fbl@redhat.com> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 23:20:13 -07:00
Eric Dumazet	fb420d5d91	tcp/fq: move back to CLOCK_MONOTONIC In the recent TCP/EDT patch series, I switched TCP and sch_fq clocks from MONOTONIC to TAI, in order to meet the choice done earlier for sch_etf packet scheduler. But sure enough, this broke some setups were the TAI clock jumps forward (by almost 50 year...), as reported by Leonard Crestez. If we want to converge later, we'll probably need to add an skb field to differentiate the clock bases, or a socket option. In the meantime, an UDP application will need to use CLOCK_MONOTONIC base for its SCM_TXTIME timestamps if using fq packet scheduler. Fixes: `72b0094f91` ("tcp: switch tcp_clock_ns() to CLOCK_TAI base") Fixes: `142537e419` ("net_sched: sch_fq: switch to CLOCK_TAI") Fixes: `fd2bca2aa7` ("tcp: switch internal pacing timer to CLOCK_TAI") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Leonard Crestez <leonard.crestez@nxp.com> Tested-by: Leonard Crestez <leonard.crestez@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 23:18:51 -07:00
Li RongQing	4da402597c	xfrm: fix gro_cells leak when remove virtual xfrm interfaces The device gro_cells has been initialized, it should be freed, otherwise it will be leaked Fixes: `f203b76d78` ("xfrm: Add virtual xfrm interfaces") Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2018-10-02 08:11:45 +02:00
Cong Wang	460b360104	net_sched: fix a crash in tc_new_tfilter() When tcf_block_find() fails, it already rollbacks the qdisc refcnt, so its caller doesn't need to clean up this again. Avoid calling qdisc_put() again by resetting qdisc to NULL for callers. Reported-by: syzbot+37b8770e6d5a8220a039@syzkaller.appspotmail.com Fixes: `e368fdb61d` ("net: sched: use Qdisc rcu API instead of relying on rtnl lock") Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 23:09:41 -07:00
David S. Miller	92d7c74b6f	Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Johan Hedberg says: ==================== pull request: bluetooth 2018-09-27 Here's one more Bluetooth fix for 4.19, fixing the handling of an attempt to unpair a device while pairing is in progress. Let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 22:40:39 -07:00
Soheil Hassas Yeganeh	789762ceec	tcp: adjust rcv zerocopy hints based on frag sizes When SKBs are coalesced, we can have SKBs with different frag sizes. Some with PAGE_SIZE and some not with PAGE_SIZE. Since recv_skip_hint is always set to the full SKB size, it can overestimate the amount that should be read using normal read for coalesced packets. Change the recv_skip_hint so that it only includes the first frags that are not of PAGE_SIZE. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 22:36:56 -07:00
Soheil Hassas Yeganeh	8f2b029311	tcp: set recv_skip_hint when tcp_inq is less than PAGE_SIZE When we have less than PAGE_SIZE of data on receive queue, we set recv_skip_hint to 0. Instead, set it to the actual number of bytes available. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 22:36:56 -07:00
LUU Duc Canh	d949cfedbc	tipc: ignore STATE_MSG on wrong link session The initial session number when a link is created is based on a random value, taken from struct tipc_net->random. It is then incremented for each link reset to avoid mixing protocol messages from different link sessions. However, when a bearer is reset all its links are deleted, and will later be re-created using the same random value as the first time. This means that if the link never went down between creation and deletion we will still sometimes have two subsequent sessions with the same session number. In virtual environments with potentially long transmission times this has turned out to be a real problem. We now fix this by randomizing the session number each time a link is created. With a session number size of 16 bits this gives a risk of session collision of 1/64k. To reduce this further, we also introduce a sanity check on the very first STATE message arriving at a link. If this has an acknowledge value differing from 0, which is logically impossible, we ignore the message. The final risk for session collision is hence reduced to 1/4G, which should be sufficient. Signed-off-by: LUU Duc Canh <canh.d.luu@dektech.com.au> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 22:35:30 -07:00
Dan Carpenter	aeadd93f2b	net: sched: act_ipt: check for underflow in __tcf_ipt_init() If "td->u.target_size" is larger than sizeof(struct xt_entry_target) we return -EINVAL. But we don't check whether it's smaller than sizeof(struct xt_entry_target) and that could lead to an out of bounds read. Fixes: `7ba699c604` ("[NET_SCHED]: Convert actions from rtnetlink to new netlink API") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 22:34:14 -07:00
David S. Miller	2240c12d7d	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2018-10-01 1) Make xfrmi_get_link_net() static to silence a sparse warning. From Wei Yongjun. 2) Remove a unused esph pointer definition in esp_input(). From Haishuang Yan. 3) Allow the NIC driver to quietly refuse xfrm offload in case it does not support it, the SA is created without offload in this case. From Shannon Nelson. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 22:31:17 -07:00
David S. Miller	ee0b6f4834	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2018-10-01 1) Validate address prefix lengths in the xfrm selector, otherwise we may hit undefined behaviour in the address matching functions if the prefix is too big for the given address family. 2) Fix skb leak on local message size errors. From Thadeu Lima de Souza Cascardo. 3) We currently reset the transport header back to the network header after a transport mode transformation is applied. This leads to an incorrect transport header when multiple transport mode transformations are applied. Reset the transport header only after all transformations are already applied to fix this. From Sowmini Varadhan. 4) We only support one offloaded xfrm, so reset crypto_done after the first transformation in xfrm_input(). Otherwise we may call the wrong input method for subsequent transformations. From Sowmini Varadhan. 5) Fix NULL pointer dereference when skb_dst_force clears the dst_entry. skb_dst_force does not really force a dst refcount anymore, it might clear it instead. xfrm code did not expect this, add a check to not dereference skb_dst() if it was cleared by skb_dst_force. 6) Validate xfrm template mode, otherwise we can get a stack-out-of-bounds read in xfrm_state_find. From Sean Tranchetti. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 22:29:25 -07:00
Yuchung Cheng	041a14d267	tcp: start receiver buffer autotuning sooner Previously receiver buffer auto-tuning starts after receiving one advertised window amount of data. After the initial receiver buffer was raised by patch `a337531b94` ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB"), the reciver buffer may take too long to start raising. To address this issue, this patch lowers the initial bytes expected to receive roughly the expected sender's initial window. Fixes: `a337531b94` ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 15:45:26 -07:00
Eric Dumazet	1ad98e9d1b	tcp/dccp: fix lockdep issue when SYN is backlogged In normal SYN processing, packets are handled without listener lock and in RCU protected ingress path. But syzkaller is known to be able to trick us and SYN packets might be processed in process context, after being queued into socket backlog. In commit `06f877d613` ("tcp/dccp: fix other lockdep splats accessing ireq_opt") I made a very stupid fix, that happened to work mostly because of the regular path being RCU protected. Really the thing protecting ireq->ireq_opt is RCU read lock, and the pseudo request refcnt is not relevant. This patch extends what I did in commit `449809a66c` ("tcp/dccp: block BH for SYN processing") by adding an extra rcu_read_{lock\|unlock} pair in the paths that might be taken when processing SYN from socket backlog (thus possibly in process context) Fixes: `06f877d613` ("tcp/dccp: fix other lockdep splats accessing ireq_opt") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 15:42:13 -07:00
David S. Miller	c8424ddd97	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree: 1) Skip ip_sabotage_in() for packet making into the VRF driver, otherwise packets are dropped, from David Ahern. 2) Clang compilation warning uncovering typo in the nft_validate_register_store() call from nft_osf, from Stefan Agner. 3) Double sizeof netlink message length calculations in ctnetlink, from zhong jiang. 4) Missing rb_erase() on batch full in rbtree garbage collector, from Taehee Yoo. 5) Calm down compilation warning in nf_hook(), from Florian Westphal. 6) Missing check for non-null sk in xt_socket before validating netns procedence, from Flavio Leitner. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-01 15:41:01 -07:00
Roman Gushchin	8bad74f984	bpf: extend cgroup bpf core to allow multiple cgroup storage types In order to introduce per-cpu cgroup storage, let's generalize bpf cgroup core to support multiple cgroup storage types. Potentially, per-node cgroup storage can be added later. This commit is mostly a formal change that replaces cgroup_storage pointer with a array of cgroup_storage pointers. It doesn't actually introduce a new storage type, it will be done later. Each bpf program is now able to have one cgroup storage of each type. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-10-01 16:18:32 +02:00
Yu Zhao	1db5852945	cfg80211: fix use-after-free in reg_process_hint() reg_process_hint_country_ie() can free regulatory_request and return REG_REQ_ALREADY_SET. We shouldn't use regulatory_request after it's called. KASAN error was observed when this happens. BUG: KASAN: use-after-free in reg_process_hint+0x839/0x8aa [cfg80211] Read of size 4 at addr ffff8800c430d434 by task kworker/1:3/89 <snipped> Workqueue: events reg_todo [cfg80211] Call Trace: dump_stack+0xc1/0x10c ? _atomic_dec_and_lock+0x1ad/0x1ad ? _raw_spin_lock_irqsave+0xa0/0xd2 print_address_description+0x86/0x26f ? reg_process_hint+0x839/0x8aa [cfg80211] kasan_report+0x241/0x29b reg_process_hint+0x839/0x8aa [cfg80211] reg_todo+0x204/0x5b9 [cfg80211] process_one_work+0x55f/0x8d0 ? worker_detach_from_pool+0x1b5/0x1b5 ? _raw_spin_unlock_irq+0x65/0xdd ? _raw_spin_unlock_irqrestore+0xf3/0xf3 worker_thread+0x5dd/0x841 ? kthread_parkme+0x1d/0x1d kthread+0x270/0x285 ? pr_cont_work+0xe3/0xe3 ? rcu_read_unlock_sched_notrace+0xca/0xca ret_from_fork+0x22/0x40 Allocated by task 2718: set_track+0x63/0xfa __kmalloc+0x119/0x1ac regulatory_hint_country_ie+0x38/0x329 [cfg80211] __cfg80211_connect_result+0x854/0xadd [cfg80211] cfg80211_rx_assoc_resp+0x3bc/0x4f0 [cfg80211] smsc95xx v1.0.6 ieee80211_sta_rx_queued_mgmt+0x1803/0x7ed5 [mac80211] ieee80211_iface_work+0x411/0x696 [mac80211] process_one_work+0x55f/0x8d0 worker_thread+0x5dd/0x841 kthread+0x270/0x285 ret_from_fork+0x22/0x40 Freed by task 89: set_track+0x63/0xfa kasan_slab_free+0x6a/0x87 kfree+0xdc/0x470 reg_process_hint+0x31e/0x8aa [cfg80211] reg_todo+0x204/0x5b9 [cfg80211] process_one_work+0x55f/0x8d0 worker_thread+0x5dd/0x841 kthread+0x270/0x285 ret_from_fork+0x22/0x40 <snipped> Signed-off-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-01 09:14:03 +02:00
Felix Fietkau	211710ca74	mac80211: fix setting IEEE80211_KEY_FLAG_RX_MGMT for AP mode keys key->sta is only valid after ieee80211_key_link, which is called later in this function. Because of that, the IEEE80211_KEY_FLAG_RX_MGMT is never set when management frame protection is enabled. Fixes: `e548c49e6d` ("mac80211: add key flag for management keys") Cc: stable@vger.kernel.org Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-01 09:13:48 +02:00
Stefan Seyfried	848e616e66	cfg80211: fix wext-compat memory leak cfg80211_wext_giwrate and sinfo.pertid might allocate sinfo.pertid via rdev_get_station(), but never release it. Fix that. Fixes: `8689c051a2` ("cfg80211: dynamically allocate per-tid stats for station info") Signed-off-by: Stefan Seyfried <seife+kernel@b1-systems.com> [johannes: fix error path, use cfg80211_sinfo_release_content(), add Fixes] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2018-10-01 09:11:36 +02:00
Trond Myklebust	571ed1fd23	SUNRPC: Replace krb5_seq_lock with a lockless scheme Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:18 -04:00
Trond Myklebust	0c1c19f46e	SUNRPC: Lockless lookup of RPCSEC_GSS mechanisms Use RCU protected lookups for discovering the supported mechanisms. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:17 -04:00
Trond Myklebust	4e4c3bef44	SUNRPC: Remove rpc_authflavor_lock in favour of RCU locking Module removal is RCU safe by design, so we really have no need to lock the auth_flavors[] array. Substitute a lockless scheme to add/remove entries in the array, and then use rcu. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:17 -04:00
Trond Myklebust	ec846469ba	SUNRPC: Unexport xdr_partial_copy_from_skb() It is no longer used outside of net/sunrpc/socklib.c Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:16 -04:00
Trond Myklebust	4f54614975	SUNRPC: Clean up xs_udp_data_receive() Simplify the retry logic. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:16 -04:00
Trond Myklebust	550aebfe1c	SUNRPC: Allow AF_LOCAL sockets to use the generic stream receive Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:16 -04:00
Trond Myklebust	c50b8ee02f	SUNRPC: Clean up - rename xs_tcp_data_receive() to xs_stream_data_receive() In preparation for sharing with AF_LOCAL. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:16 -04:00
Trond Myklebust	277e4ab7d5	SUNRPC: Simplify TCP receive code by switching to using iterators Most of this code should also be reusable with other socket types. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:16 -04:00
Trond Myklebust	9d96acbc7f	SUNRPC: Add a bvec array to struct xdr_buf for use with iovec_iter() Add a bvec array to struct xdr_buf, and have the client allocate it when we need to receive data into pages. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:16 -04:00
Trond Myklebust	431f6eb357	SUNRPC: Add a label for RPC calls that require allocation on receive If the RPC call relies on the receive call allocating pages as buffers, then let's label it so that we a) Don't leak memory by allocating pages for requests that do not expect this behaviour b) Can optimise for the common case where calls do not require allocation. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:16 -04:00
Trond Myklebust	79c99152a3	SUNRPC: Convert the xprt->sending queue back to an ordinary wait queue We no longer need priority semantics on the xprt->sending queue, because the order in which tasks are sent is now dictated by their position in the send queue. Note that the backlog queue remains a priority queue, meaning that slot resources are still managed in order of task priority. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:16 -04:00
Trond Myklebust	f42f7c2830	SUNRPC: Fix priority queue fairness Fix up the priority queue to not batch by owner, but by queue, so that we allow '1 << priority' elements to be dequeued before switching to the next priority queue. The owner field is still used to wake up requests in round robin order by owner to avoid single processes hogging the RPC layer by loading the queues. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:16 -04:00
Trond Myklebust	95f7691daa	SUNRPC: Convert xprt receive queue to use an rbtree If the server is slow, we can find ourselves with quite a lot of entries on the receive queue. Converting the search from an O(n) to O(log(n)) can make a significant difference, particularly since we have to hold a number of locks while searching. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:16 -04:00
Trond Myklebust	bd79bc579c	SUNRPC: Don't take transport->lock unnecessarily when taking XPRT_LOCK Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:16 -04:00
Trond Myklebust	adfa71446d	SUNRPC: Cleanup: remove the unused 'task' argument from the request_send() Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:16 -04:00
Trond Myklebust	c544577dad	SUNRPC: Clean up transport write space handling Treat socket write space handling in the same way we now treat transport congestion: by denying the XPRT_LOCK until the transport signals that it has free buffer space. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:15 -04:00
Trond Myklebust	36bd7de949	SUNRPC: Turn off throttling of RPC slots for TCP sockets The theory was that we would need to grab the socket lock anyway, so we might as well use it to gate the allocation of RPC slots for a TCP socket. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:15 -04:00
Trond Myklebust	f05d54ecf6	SUNRPC: Allow soft RPC calls to time out when waiting for the XPRT_LOCK This no longer causes them to lose their place in the transmission queue. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:15 -04:00
Trond Myklebust	89f90fe1ad	SUNRPC: Allow calls to xprt_transmit() to drain the entire transmit queue Rather than forcing each and every RPC task to grab the socket write lock in order to send itself, we allow whichever task is holding the write lock to attempt to drain the entire transmit queue. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:15 -04:00
Trond Myklebust	86aeee0eb6	SUNRPC: Enqueue swapper tagged RPCs at the head of the transmit queue Avoid memory starvation by giving RPCs that are tagged with the RPC_TASK_SWAPPER flag the highest priority. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:15 -04:00
Trond Myklebust	75891f502f	SUNRPC: Support for congestion control when queuing is enabled Both RDMA and UDP transports require the request to get a "congestion control" credit before they can be transmitted. Right now, this is done when the request locks the socket. We'd like it to happen when a request attempts to be transmitted for the first time. In order to support retransmission of requests that already hold such credits, we also want to ensure that they get queued first, so that we don't deadlock with requests that have yet to obtain a credit. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:15 -04:00
Trond Myklebust	918f3c1fe8	SUNRPC: Improve latency for interactive tasks One of the intentions with the priority queues was to ensure that no single process can hog the transport. The field task->tk_owner therefore identifies the RPC call's origin, and is intended to allow the RPC layer to organise queues for fairness. This commit therefore modifies the transmit queue to group requests by task->tk_owner, and ensures that we round robin among those groups. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:15 -04:00
Trond Myklebust	dcbbeda836	SUNRPC: Move RPC retransmission stat counter to xprt_transmit() Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:15 -04:00
Trond Myklebust	5f2f6bd987	SUNRPC: Simplify xprt_prepare_transmit() Remove the checks for whether or not we need to transmit, and whether or not a reply has been received. Those are already handled in call_transmit() itself. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:15 -04:00
Trond Myklebust	04b3b88fbf	SUNRPC: Don't reset the request 'bytes_sent' counter when releasing XPRT_LOCK If the request is still on the queue, this will be incorrect behaviour. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-09-30 15:35:15 -04:00

... 3 4 5 6 7 ...

53625 Commits