linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-11 13:41:55 +00:00

Author	SHA1	Message	Date
Dotan Barak	d64bf89a75	net/rds: Fix error handling in rds_ib_add_one() rds_ibdev:ipaddr_list and rds_ibdev:conn_list are initialized after allocation some resources such as protection domain. If allocation of such resources fail, then these uninitialized variables are accessed in rds_ib_dev_free() in failure path. This can potentially crash the system. The code has been updated to initialize these variables very early in the function. Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il> Signed-off-by: Sudhakar Dindukurti <sudhakar.dindukurti@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-02 12:16:57 -04:00
Linus Walleij	e8521e53cc	net: dsa: rtl8366: Check VLAN ID and not ports There has been some confusion between the port number and the VLAN ID in this driver. What we need to check for validity is the VLAN ID, nothing else. The current confusion came from assigning a few default VLANs for default routing and we need to rewrite that properly. Instead of checking if the port number is a valid VLAN ID, check the actual VLAN IDs passed in to the callback one by one as expected. Fixes: `d8652956cf` ("net: dsa: realtek-smi: Add Realtek SMI driver") Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-02 12:09:23 -04:00
Michal Kubecek	8b6b82ad16	mlx5: avoid 64-bit division in dr_icm_pool_mr_create() Recently added code introduces 64-bit division in dr_icm_pool_mr_create() so that build on 32-bit architectures fails with ERROR: "__umoddi3" [drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko] undefined! As the divisor is always a power of 2, we can use bitwise operation instead. Fixes: `29cf8febd1` ("net/mlx5: DR, ICM pool memory allocator") Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-02 11:08:20 -04:00
Tuong Lien	e95584a889	tipc: fix unlimited bundling of small messages We have identified a problem with the "oversubscription" policy in the link transmission code. When small messages are transmitted, and the sending link has reached the transmit window limit, those messages will be bundled and put into the link backlog queue. However, bundles of data messages are counted at the 'CRITICAL' level, so that the counter for that level, instead of the counter for the real, bundled message's level is the one being increased. Subsequent, to-be-bundled data messages at non-CRITICAL levels continue to be tested against the unchanged counter for their own level, while contributing to an unrestrained increase at the CRITICAL backlog level. This leaves a gap in congestion control algorithm for small messages that can result in starvation for other users or a "real" CRITICAL user. Even that eventually can lead to buffer exhaustion & link reset. We fix this by keeping a 'target_bskb' buffer pointer at each levels, then when bundling, we only bundle messages at the same importance level only. This way, we know exactly how many slots a certain level have occupied in the queue, so can manage level congestion accurately. By bundling messages at the same level, we even have more benefits. Let consider this: - One socket sends 64-byte messages at the 'CRITICAL' level; - Another sends 4096-byte messages at the 'LOW' level; When a 64-byte message comes and is bundled the first time, we put the overhead of message bundle to it (+ 40-byte header, data copy, etc.) for later use, but the next message can be a 4096-byte one that cannot be bundled to the previous one. This means the last bundle carries only one payload message which is totally inefficient, as for the receiver also! Later on, another 64-byte message comes, now we make a new bundle and the same story repeats... With the new bundling algorithm, this will not happen, the 64-byte messages will be bundled together even when the 4096-byte message(s) comes in between. However, if the 4096-byte messages are sent at the same level i.e. 'CRITICAL', the bundling algorithm will again cause the same overhead. Also, the same will happen even with only one socket sending small messages at a rate close to the link transmit's one, so that, when one message is bundled, it's transmitted shortly. Then, another message comes, a new bundle is created and so on... We will solve this issue radically by another patch. Fixes: `365ad353c2` ("tipc: reduce risk of user starvation during link congestion") Reported-by: Hoang Le <hoang.h.le@dektech.com.au> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-02 11:02:05 -04:00
Dongli Zhang	a761129e36	xen-netfront: do not use ~0U as error return value for xennet_fill_frags() xennet_fill_frags() uses ~0U as return value when the sk_buff is not able to cache extra fragments. This is incorrect because the return type of xennet_fill_frags() is RING_IDX and 0xffffffff is an expected value for ring buffer index. In the situation when the rsp_cons is approaching 0xffffffff, the return value of xennet_fill_frags() may become 0xffffffff which xennet_poll() (the caller) would regard as error. As a result, queue->rx.rsp_cons is set incorrectly because it is updated only when there is error. If there is no error, xennet_poll() would be responsible to update queue->rx.rsp_cons. Finally, queue->rx.rsp_cons would point to the rx ring buffer entries whose queue->rx_skbs[i] and queue->grant_rx_ref[i] are already cleared to NULL. This leads to NULL pointer access in the next iteration to process rx ring buffer entries. The symptom is similar to the one fixed in commit `00b368502d` ("xen-netfront: do not assume sk_buff_head list is empty in error handling"). This patch changes the return type of xennet_fill_frags() to indicate whether it is successful or failed. The queue->rx.rsp_cons will be always updated inside this function. Fixes: `ad4f15dc2c` ("xen/netfront: don't bug in case of too many frags") Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 21:49:51 -04:00
David Ahern	a3ce2a21bb	ipv6: Handle race in addrconf_dad_work Rajendra reported a kernel panic when a link was taken down: [ 6870.263084] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8 [ 6870.271856] IP: [<ffffffff8efc5764>] __ipv6_ifa_notify+0x154/0x290 <snip> [ 6870.570501] Call Trace: [ 6870.573238] [<ffffffff8efc58c6>] ? ipv6_ifa_notify+0x26/0x40 [ 6870.579665] [<ffffffff8efc98ec>] ? addrconf_dad_completed+0x4c/0x2c0 [ 6870.586869] [<ffffffff8efe70c6>] ? ipv6_dev_mc_inc+0x196/0x260 [ 6870.593491] [<ffffffff8efc9c6a>] ? addrconf_dad_work+0x10a/0x430 [ 6870.600305] [<ffffffff8f01ade4>] ? __switch_to_asm+0x34/0x70 [ 6870.606732] [<ffffffff8ea93a7a>] ? process_one_work+0x18a/0x430 [ 6870.613449] [<ffffffff8ea93d6d>] ? worker_thread+0x4d/0x490 [ 6870.619778] [<ffffffff8ea93d20>] ? process_one_work+0x430/0x430 [ 6870.626495] [<ffffffff8ea99dd9>] ? kthread+0xd9/0xf0 [ 6870.632145] [<ffffffff8f01ade4>] ? __switch_to_asm+0x34/0x70 [ 6870.638573] [<ffffffff8ea99d00>] ? kthread_park+0x60/0x60 [ 6870.644707] [<ffffffff8f01ae77>] ? ret_from_fork+0x57/0x70 [ 6870.650936] Code: 31 c0 31 d2 41 b9 20 00 08 02 b9 09 00 00 0 addrconf_dad_work is kicked to be scheduled when a device is brought up. There is a race between addrcond_dad_work getting scheduled and taking the rtnl lock and a process taking the link down (under rtnl). The latter removes the host route from the inet6_addr as part of addrconf_ifdown which is run for NETDEV_DOWN. The former attempts to use the host route in ipv6_ifa_notify. If the down event removes the host route due to the race to the rtnl, then the BUG listed above occurs. This scenario does not occur when the ipv6 address is not kept (net.ipv6.conf.all.keep_addr_on_down = 0) as addrconf_ifdown sets the state of the ifp to DEAD. Handle when the addresses are kept by checking IF_READY which is reset by addrconf_ifdown. The 'dead' flag for an inet6_addr is set only under rtnl, in addrconf_ifdown and it means the device is getting removed (or IPv6 is disabled). The interesting cases for changing the idev flag are addrconf_notify (NETDEV_UP and NETDEV_CHANGE) and addrconf_ifdown (reset the flag). The former does not have the idev lock - only rtnl; the latter has both. Based on that the existing dead + IF_READY check can be moved to right after the rtnl_lock in addrconf_dad_work. Fixes: `f1705ec197` ("net: ipv6: Make address flushing on ifdown optional") Reported-by: Rajendra Dendukuri <rajendra.dendukuri@broadcom.com> Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 21:43:41 -04:00
Eric Dumazet	3256a2d6ab	tcp: adjust rto_base in retransmits_timed_out() The cited commit exposed an old retransmits_timed_out() bug which assumed it could call tcp_model_timeout() with TCP_RTO_MIN as rto_base for all states. But flows in SYN_SENT or SYN_RECV state uses a different RTO base (1 sec instead of 200 ms, unless BPF choses another value) This caused a reduction of SYN retransmits from 6 to 4 with the default /proc/sys/net/ipv4/tcp_syn_retries value. Fixes: `a41e8a88b0` ("tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Marek Majkowski <marek@cloudflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 21:40:49 -04:00
Dexuan Cui	0d9138ffac	vsock: Fix a lockdep warning in __vsock_release() Lockdep is unhappy if two locks from the same class are held. Fix the below warning for hyperv and virtio sockets (vmci socket code doesn't have the issue) by using lock_sock_nested() when __vsock_release() is called recursively: ============================================ WARNING: possible recursive locking detected 5.3.0+ #1 Not tainted -------------------------------------------- server/1795 is trying to acquire lock: ffff8880c5158990 (sk_lock-AF_VSOCK){+.+.}, at: hvs_release+0x10/0x120 [hv_sock] but task is already holding lock: ffff8880c5158150 (sk_lock-AF_VSOCK){+.+.}, at: __vsock_release+0x2e/0xf0 [vsock] other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(sk_lock-AF_VSOCK); lock(sk_lock-AF_VSOCK); * DEADLOCK * May be due to missing lock nesting notation 2 locks held by server/1795: #0: ffff8880c5d05ff8 (&sb->s_type->i_mutex_key#10){+.+.}, at: __sock_release+0x2d/0xa0 #1: ffff8880c5158150 (sk_lock-AF_VSOCK){+.+.}, at: __vsock_release+0x2e/0xf0 [vsock] stack backtrace: CPU: 5 PID: 1795 Comm: server Not tainted 5.3.0+ #1 Call Trace: dump_stack+0x67/0x90 __lock_acquire.cold.67+0xd2/0x20b lock_acquire+0xb5/0x1c0 lock_sock_nested+0x6d/0x90 hvs_release+0x10/0x120 [hv_sock] __vsock_release+0x24/0xf0 [vsock] __vsock_release+0xa0/0xf0 [vsock] vsock_release+0x12/0x30 [vsock] __sock_release+0x37/0xa0 sock_close+0x14/0x20 __fput+0xc1/0x250 task_work_run+0x98/0xc0 do_exit+0x344/0xc60 do_group_exit+0x47/0xb0 get_signal+0x15c/0xc50 do_signal+0x30/0x720 exit_to_usermode_loop+0x50/0xa0 do_syscall_64+0x24e/0x270 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f4184e85f31 Tested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 21:23:35 -04:00
Johan Hovold	8353da9fa6	hso: fix NULL-deref on tty open Fix NULL-pointer dereference on tty open due to a failure to handle a missing interrupt-in endpoint when probing modem ports: BUG: kernel NULL pointer dereference, address: 0000000000000006 ... RIP: 0010:tiocmget_submit_urb+0x1c/0xe0 [hso] ... Call Trace: hso_start_serial_device+0xdc/0x140 [hso] hso_serial_open+0x118/0x1b0 [hso] tty_open+0xf1/0x490 Fixes: `542f548236` ("tty: Modem functions for the HSO driver") Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 21:21:10 -04:00
Oleksij Rempel	569aad4fcd	net: ag71xx: fix mdio subnode support This patch is syncing driver with actual devicetree documentation: Documentation/devicetree/bindings/net/qca,ar71xx.txt \|Optional subnodes: \|- mdio : specifies the mdio bus, used as a container for phy nodes \| according to phy.txt in the same directory The driver was working with fixed phy without any noticeable issues. This bug was uncovered by introducing dsa ar9331-switch driver. Since no one reported this bug until now, I assume no body is using it and this patch should not brake existing system. Fixes: `d51b6ce441` ("net: ethernet: add ag71xx driver") Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 10:19:19 -07:00
David S. Miller	b33210e379	Merge branch 'stmmac-fixes' Jose Abreu says: ==================== net: stmmac: Fixes for -net Misc fixes for -net tree. More info in commit logs. v2 is just a rebase of v1 against -net and we added a new patch (09/09) to fix RSS feature. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 10:12:38 -07:00
Jose Abreu	56627336b6	net: stmmac: xgmac: Fix RSS writing wrong keys Commit `b6b6cc9acd`, changed the call to dwxgmac2_rss_write_reg() passing it the variable cfg->key[i]. As key is an u8 but we write 32 bits at a time we need to cast it into an u32 so that the correct key values are written. Notice that the for loop already takes this into account so we don't try to write past the keys size. Fixes: `b6b6cc9acd` ("net: stmmac: selftest: avoid large stack usage") Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 10:12:38 -07:00
Jose Abreu	3c72d4d330	net: stmmac: xgmac: Fix RSS not writing all Keys to HW The sizeof(cfg->key) is != ARRAY_SIZE(cfg->key). Fix it. This warning is triggered when running with cc flag -Wsizeof-array-div. Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Nick Desaulniers <ndesaulniers@google.com> Reported-by: Nathan Chancellor <natechancellor@gmail.com> Fixes: `76067459c6` ("net: stmmac: Implement RSS and enable it in XGMAC core") Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 10:12:38 -07:00
Jose Abreu	30300d9f91	net: stmmac: xgmac: Disable the Timestamp interrupt by default We don't use it anyway as XGMAC only supports polling for timestamp (in current SW implementation). This greatly reduces the system load by reducing the number of interrupts. Fixes: `2142754f8b` ("net: stmmac: Add MAC related callbacks for XGMAC2") Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 10:12:37 -07:00
Jose Abreu	3e2bf04fb0	net: stmmac: Do not stop PHY if WoL is enabled If WoL is enabled we can't really stop the PHY, otherwise we will not receive the WoL packet. Fix this by telling phylink that only the MAC is down and only stop the PHY if WoL is not enabled. Fixes: `74371272f9` ("net: stmmac: Convert to phylink and remove phylib logic") Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 10:12:37 -07:00
Jose Abreu	14f347334b	net: stmmac: Correctly take timestamp for PTPv2 The case for PTPV2_EVENT requires event packets to be captured so add this setting to the list of enabled captures. Fixes: `891434b18e` ("stmmac: add IEEE PTPv1 and PTPv2 support.") Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 10:12:37 -07:00
Jose Abreu	f79bfda375	net: stmmac: dwmac4: Always update the MAC Hash Filter We need to always update the MAC Hash Filter so that previous entries are invalidated. Found out while running stmmac selftests. Fixes: `b8ef7020d6` ("net: stmmac: add support for hash table size 128/256 in dwmac4") Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 10:12:37 -07:00
Jose Abreu	432439fef9	net: stmmac: selftests: Always use max DMA size in Jumbo Test Although some XGMAC setups support frames larger than DMA size, some of them may not. As we can't know before-hand which ones support let's use the maximum DMA buffer size in the Jumbo Tests. User can always reconfigure the MTU to achieve larger frames. Fixes: `427849e8c3` ("net: stmmac: selftests: Add Jumbo Frame tests") Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 10:12:37 -07:00
Jose Abreu	c11986b9fa	net: stmmac: xgmac: Detect Hash Table size dinamically Since commit `b8ef7020d6` ("net: stmmac: add support for hash table size 128/256 in dwmac4"), we can detect the Hash Table dinamically. Let's implement this feature in XGMAC cores and fix possible setups that don't support the maximum size for Hash Table. Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 10:12:37 -07:00
Jose Abreu	9a2ae7b396	net: stmmac: xgmac: Not all Unicast addresses may be available Some setups may not have all Unicast addresses filters available. Let's check this before trying to setup filters. Fixes: `0efedbf11f` ("net: stmmac: xgmac: Fix XGMAC selftests") Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 10:12:37 -07:00
Vasundhara Volam	93c2fcb01a	devlink: Fix error handling in param and info_get dumpit cb If any of the param or info_get op returns error, dumpit cb is skipping to dump remaining params or info_get ops for all the drivers. Fix to not return if any of the param/info_get op returns error as not supported and continue to dump remaining information. v2: Modify the patch to return error, except for params/info_get op that return -EOPNOTSUPP as suggested by Andrew Lunn. Also, modify commit message to reflect the same. Cc: Andrew Lunn <andrew@lunn.ch> Cc: Jiri Pirko <jiri@mellanox.com> Cc: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 10:08:30 -07:00
Wen Yang	f32eb9d804	net: dsa: rtl8366rb: add missing of_node_put after calling of_get_child_by_name of_node_put needs to be called when the device node which is got from of_get_child_by_name finished using. irq_domain_add_linear() also calls of_node_get() to increase refcount, so irq_domain will not be affected when it is released. Fixes: `d8652956cf` ("net: dsa: realtek-smi: Add Realtek SMI driver") Signed-off-by: Wen Yang <wenyang@linux.alibaba.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Vivien Didelot <vivien.didelot@gmail.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 10:02:56 -07:00
Wen Yang	d2c50b1cd9	net: mscc: ocelot: add missing of_node_put after calling of_get_child_by_name of_node_put needs to be called when the device node which is got from of_get_child_by_name finished using. In both cases of success and failure, we need to release 'ports', so clean up the code using goto. fixes: `a556c76adc` ("net: mscc: Add initial Ocelot switch support") Signed-off-by: Wen Yang <wenyang@linux.alibaba.com> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: Microchip Linux Driver Support <UNGLinuxDriver@microchip.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 10:01:58 -07:00
Vladimir Oltean	83c8c3cf45	net: sched: cbs: Avoid division by zero when calculating the port rate As explained in the "net: sched: taprio: Avoid division by zero on invalid link speed" commit, it is legal for the ethtool API to return zero as a link speed. So guard against it to ensure we don't perform a division by zero in kernel. Fixes: `e0a7683d30` ("net/sched: cbs: fix port_rate miscalculation") Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 09:51:39 -07:00
Vladimir Oltean	9a9251a353	net: sched: taprio: Avoid division by zero on invalid link speed The check in taprio_set_picos_per_byte is currently not robust enough and will trigger this division by zero, due to e.g. PHYLINK not setting kset->base.speed when there is no PHY connected: [ 27.109992] Division by zero in kernel. [ 27.113842] CPU: 1 PID: 198 Comm: tc Not tainted 5.3.0-rc5-01246-gc4006b8c2637-dirty #212 [ 27.121974] Hardware name: Freescale LS1021A [ 27.126234] [<c03132e0>] (unwind_backtrace) from [<c030d8b8>] (show_stack+0x10/0x14) [ 27.133938] [<c030d8b8>] (show_stack) from [<c10b21b0>] (dump_stack+0xb0/0xc4) [ 27.141124] [<c10b21b0>] (dump_stack) from [<c10af97c>] (Ldiv0_64+0x8/0x18) [ 27.148052] [<c10af97c>] (Ldiv0_64) from [<c0700260>] (div64_u64+0xcc/0xf0) [ 27.154978] [<c0700260>] (div64_u64) from [<c07002d0>] (div64_s64+0x4c/0x68) [ 27.161993] [<c07002d0>] (div64_s64) from [<c0f3d890>] (taprio_set_picos_per_byte+0xe8/0xf4) [ 27.170388] [<c0f3d890>] (taprio_set_picos_per_byte) from [<c0f3f614>] (taprio_change+0x668/0xcec) [ 27.179302] [<c0f3f614>] (taprio_change) from [<c0f2bc24>] (qdisc_create+0x1fc/0x4f4) [ 27.187091] [<c0f2bc24>] (qdisc_create) from [<c0f2c0c8>] (tc_modify_qdisc+0x1ac/0x6f8) [ 27.195055] [<c0f2c0c8>] (tc_modify_qdisc) from [<c0ee9604>] (rtnetlink_rcv_msg+0x268/0x2dc) [ 27.203449] [<c0ee9604>] (rtnetlink_rcv_msg) from [<c0f4fef0>] (netlink_rcv_skb+0xe0/0x114) [ 27.211756] [<c0f4fef0>] (netlink_rcv_skb) from [<c0f4f6cc>] (netlink_unicast+0x1b4/0x22c) [ 27.219977] [<c0f4f6cc>] (netlink_unicast) from [<c0f4fa84>] (netlink_sendmsg+0x284/0x340) [ 27.228198] [<c0f4fa84>] (netlink_sendmsg) from [<c0eae5fc>] (sock_sendmsg+0x14/0x24) [ 27.235988] [<c0eae5fc>] (sock_sendmsg) from [<c0eaedf8>] (___sys_sendmsg+0x214/0x228) [ 27.243863] [<c0eaedf8>] (___sys_sendmsg) from [<c0eb015c>] (__sys_sendmsg+0x50/0x8c) [ 27.251652] [<c0eb015c>] (__sys_sendmsg) from [<c0301000>] (ret_fast_syscall+0x0/0x54) [ 27.259524] Exception stack(0xe8045fa8 to 0xe8045ff0) [ 27.264546] 5fa0: b6f608c8 000000f8 00000003 bed7e2f0 00000000 00000000 [ 27.272681] 5fc0: b6f608c8 000000f8 004ce54c 00000128 5d3ce8c7 00000000 00000026 00505c9c [ 27.280812] 5fe0: 00000070 bed7e298 004ddd64 b6dd1e64 Russell King points out that the ethtool API says zero is a valid return value of __ethtool_get_link_ksettings: * If it is enabled then they are read-only; if the link * is up they represent the negotiated link mode; if the link is down, * the speed is 0, %SPEED_UNKNOWN or the highest enabled speed and * @duplex is %DUPLEX_UNKNOWN or the best enabled duplex mode. So, it seems that taprio is not following the API... I'd suggest either fixing taprio, or getting agreement to change the ethtool API. The chosen path was to fix taprio. Fixes: `7b9eba7ba0` ("net/sched: taprio: fix picos_per_byte miscalculation") Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 09:44:03 -07:00
David S. Miller	9cfc370240	A small list of fixes this time: * two null pointer dereference fixes * a fix for preempt-enabled/BHs-enabled (lockdep) splats (that correctly pointed out a bug) * a fix for multi-BSSID ordering assumptions * a fix for the EDMG support, on-stack chandefs need to be initialized properly (now that they're bigger) * beacon (head) data from userspace should be validated -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl2Td78ACgkQB8qZga/f l8SnIA/9GatOHIbmXb0BE/ojm3FEQlQOfzdO2VgC40Z8oR0qMFIwkWPzbsqy2Qhl xjzhh35q6iZunwp49LXRH1kDQn8xqo+RKpYDvrBSPvJW7jQj8l3UUK6tGaPL55RN NN5Tk/nWQVun70qPF/JIFeA/S7GpWJuyAj28hVgyukzNfksaYHqoAZQ1yU1otuou OmzsrXXzGVO9Xu0DU6U5b6UxcTUHiILLywr0kdE35oUATct7AijrU1E4f94/wmXG O4S3BMgLG4Ggxqdn+GPNdHLstEH/z0nyoon3LeautkOSEDgAeZoXNAgRGHSzaLdn YsTZ9mD1uKopSlro0obtyQPYswejnJ1dcEhMV6gpNSUqlf8hrzwzvh1ZxvZwZXpZ bislxnLcA+t10tkRApYQ0JhpvNm2O2lHlXqWz8tug0szoR/GKawrpPEJXrb/9yxF PFVI8TzXA0bLkO6clNV3vWWAf2Hg9My/hmPpbuORWdIw3KbpcMPfoWHkhAspTdpO CmpHurDt1u0Oh/8NawrqUTYXKZkGfseoDXQvQCDOfDCfGl8RPrdthzfPJhh8w4rd NCJa+WYNbFrvYcwi4FLCdRuO2dQjHLTclmZ/yXcVp8mxG5e8eihEIAfm30pyUxov uH29GwctmoA9CBYOAfHsEFWJNeGcpLSa9hzogmBygrPh61eoP14= =nhNe -----END PGP SIGNATURE----- Merge tag 'mac80211-for-davem-2019-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== A small list of fixes this time: * two null pointer dereference fixes * a fix for preempt-enabled/BHs-enabled (lockdep) splats (that correctly pointed out a bug) * a fix for multi-BSSID ordering assumptions * a fix for the EDMG support, on-stack chandefs need to be initialized properly (now that they're bigger) * beacon (head) data from userspace should be validated ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 09:28:56 -07:00
Arnd Bergmann	6de6d18591	ionic: select CONFIG_NET_DEVLINK When no other driver selects the devlink library code, ionic produces a link failure: drivers/net/ethernet/pensando/ionic/ionic_devlink.o: In function `ionic_devlink_alloc': ionic_devlink.c:(.text+0xd): undefined reference to `devlink_alloc' drivers/net/ethernet/pensando/ionic/ionic_devlink.o: In function `ionic_devlink_register': ionic_devlink.c:(.text+0x71): undefined reference to `devlink_register' Add the same 'select' statement that the other drivers use here. Fixes: `fbfb803153` ("ionic: Add hardware init and device commands") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Shannon Nelson <snelson@pensando.io> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 09:24:05 -07:00
Adam Zerella	c5f75a14a0	docs: networking: Add title caret and missing doc Resolving a couple of Sphinx documentation warnings that are generated in the networking section. - WARNING: document isn't included in any toctree - WARNING: Title underline too short. Signed-off-by: Adam Zerella <adam.zerella@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 09:19:49 -07:00
Lorenzo Bianconi	55131dec2b	net: socionext: netsec: always grab descriptor lock Always acquire tx descriptor spinlock even if a xdp program is not loaded on the netsec device since ndo_xdp_xmit can run concurrently with netsec_netdev_start_xmit and netsec_clean_tx_dring. This can happen loading a xdp program on a different device (e.g virtio-net) and xdp_do_redirect_map/xdp_do_redirect_slow can redirect to netsec even if we do not have a xdp program on it. Fixes: `ba2b232108` ("net: netsec: add XDP support") Tested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-01 09:07:47 -07:00
Johannes Berg	d8dec42b5c	mac80211: keep BHs disabled while calling drv_tx_wake_queue() Drivers typically expect this, as it's the case for almost all cases where this is called (i.e. from the TX path). Also, the code in mac80211 itself (if the driver calls ieee80211_tx_dequeue()) expects this as it uses this_cpu_ptr() without additional protection. This should fix various reports of the problem: https://bugzilla.kernel.org/show_bug.cgi?id=204127 https://lore.kernel.org/linux-wireless/CAN5HydrWb3o_FE6A1XDnP1E+xS66d5kiEuhHfiGKkLNQokx13Q@mail.gmail.com/ https://lore.kernel.org/lkml/nycvar.YFH.7.76.1909111238470.473@cbobk.fhfr.pm/ Cc: stable@vger.kernel.org Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz> Reported-by: Aaron Hill <aa1ronham@gmail.com> Reported-by: Lukas Redlinger <rel+kernel@agilox.net> Reported-by: Oleksii Shevchuk <alxchk@gmail.com> Fixes: `21a5d4c3a4` ("mac80211: add stop/start logic for software TXQs") Link: https://lore.kernel.org/r/1569928763-I3e8838c5ecad878e59d4a94eb069a90f6641461a@changeid Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2019-10-01 17:56:19 +02:00
Miaoqing Pan	8ed31a2640	mac80211: fix txq null pointer dereference If the interface type is P2P_DEVICE or NAN, read the file of '/sys/kernel/debug/ieee80211/phyx/netdev:wlanx/aqm' will get a NULL pointer dereference. As for those interface type, the pointer sdata->vif.txq is NULL. Unable to handle kernel NULL pointer dereference at virtual address 00000011 CPU: 1 PID: 30936 Comm: cat Not tainted 4.14.104 #1 task: ffffffc0337e4880 task.stack: ffffff800cd20000 PC is at ieee80211_if_fmt_aqm+0x34/0xa0 [mac80211] LR is at ieee80211_if_fmt_aqm+0x34/0xa0 [mac80211] [...] Process cat (pid: 30936, stack limit = 0xffffff800cd20000) [...] [<ffffff8000b7cd00>] ieee80211_if_fmt_aqm+0x34/0xa0 [mac80211] [<ffffff8000b7c414>] ieee80211_if_read+0x60/0xbc [mac80211] [<ffffff8000b7ccc4>] ieee80211_if_read_aqm+0x28/0x30 [mac80211] [<ffffff80082eff94>] full_proxy_read+0x2c/0x48 [<ffffff80081eef00>] __vfs_read+0x2c/0xd4 [<ffffff80081ef084>] vfs_read+0x8c/0x108 [<ffffff80081ef494>] SyS_read+0x40/0x7c Signed-off-by: Miaoqing Pan <miaoqing@codeaurora.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/1569549796-8223-1-git-send-email-miaoqing@codeaurora.org [trim useless data from commit message] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2019-10-01 17:56:19 +02:00
Miaoqing Pan	b501426cf8	nl80211: fix null pointer dereference If the interface is not in MESH mode, the command 'iw wlanx mpath del' will cause kernel panic. The root cause is null pointer access in mpp_flush_by_proxy(), as the pointer 'sdata->u.mesh.mpp_paths' is NULL for non MESH interface. Unable to handle kernel NULL pointer dereference at virtual address 00000068 [...] PC is at _raw_spin_lock_bh+0x20/0x5c LR is at mesh_path_del+0x1c/0x17c [mac80211] [...] Process iw (pid: 4537, stack limit = 0xd83e0238) [...] [<c021211c>] (_raw_spin_lock_bh) from [<bf8c7648>] (mesh_path_del+0x1c/0x17c [mac80211]) [<bf8c7648>] (mesh_path_del [mac80211]) from [<bf6cdb7c>] (extack_doit+0x20/0x68 [compat]) [<bf6cdb7c>] (extack_doit [compat]) from [<c05c309c>] (genl_rcv_msg+0x274/0x30c) [<c05c309c>] (genl_rcv_msg) from [<c05c25d8>] (netlink_rcv_skb+0x58/0xac) [<c05c25d8>] (netlink_rcv_skb) from [<c05c2e14>] (genl_rcv+0x20/0x34) [<c05c2e14>] (genl_rcv) from [<c05c1f90>] (netlink_unicast+0x11c/0x204) [<c05c1f90>] (netlink_unicast) from [<c05c2420>] (netlink_sendmsg+0x30c/0x370) [<c05c2420>] (netlink_sendmsg) from [<c05886d0>] (sock_sendmsg+0x70/0x84) [<c05886d0>] (sock_sendmsg) from [<c0589f4c>] (___sys_sendmsg.part.3+0x188/0x228) [<c0589f4c>] (___sys_sendmsg.part.3) from [<c058add4>] (__sys_sendmsg+0x4c/0x70) [<c058add4>] (__sys_sendmsg) from [<c0208c80>] (ret_fast_syscall+0x0/0x44) Code: e2822c02 e2822001 e5832004 f590f000 (e1902f9f) ---[ end trace bbd717600f8f884d ]--- Signed-off-by: Miaoqing Pan <miaoqing@codeaurora.org> Link: https://lore.kernel.org/r/1569485810-761-1-git-send-email-miaoqing@codeaurora.org [trim useless data from commit message] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2019-10-01 17:56:19 +02:00
Johannes Berg	f43e5210c7	cfg80211: initialize on-stack chandefs In a few places we don't properly initialize on-stack chandefs, resulting in EDMG data to be non-zero, which broke things. Additionally, in a few places we rely on the driver to init the data completely, but perhaps we shouldn't as non-EDMG drivers may not initialize the EDMG data, also initialize it there. Cc: stable@vger.kernel.org Fixes: `2a38075cd0` ("nl80211: Add support for EDMG channels") Reported-by: Dmitry Osipenko <digetx@gmail.com> Tested-by: Dmitry Osipenko <digetx@gmail.com> Link: https://lore.kernel.org/r/1569239475-I2dcce394ecf873376c386a78f31c2ec8b538fa25@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2019-10-01 17:56:18 +02:00
Johannes Berg	242b0931c1	cfg80211: validate SSID/MBSSID element ordering assumption The code copying the data assumes that the SSID element is before the MBSSID element, but since the data is untrusted from the AP, this cannot be guaranteed. Validate that this is indeed the case and ignore the MBSSID otherwise, to avoid having to deal with both cases for the copy of data that should be between them. Cc: stable@vger.kernel.org Fixes: `0b8fb8235b` ("cfg80211: Parsing of Multiple BSSID information in scanning") Link: https://lore.kernel.org/r/1569009255-I1673911f5eae02964e21bdc11b2bf58e5e207e59@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2019-10-01 17:56:18 +02:00
Johannes Berg	f88eb7c0d0	nl80211: validate beacon head We currently don't validate the beacon head, i.e. the header, fixed part and elements that are to go in front of the TIM element. This means that the variable elements there can be malformed, e.g. have a length exceeding the buffer size, but most downstream code from this assumes that this has already been checked. Add the necessary checks to the netlink policy. Cc: stable@vger.kernel.org Fixes: `ed1b6cc7f8` ("cfg80211/nl80211: add beacon settings") Link: https://lore.kernel.org/r/1569009255-I7ac7fbe9436e9d8733439eab8acbbd35e55c74ef@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2019-10-01 17:56:18 +02:00
Vladimir Oltean	68ce6688a5	net: sched: taprio: Fix potential integer overflow in taprio_set_picos_per_byte The speed divisor is used in a context expecting an s64, but it is evaluated using 32-bit arithmetic. To avoid that happening, instead of multiplying by 1,000,000 in the first place, simplify the fraction and do a standard 32 bit division instead. Fixes: `f04b514c0c` ("taprio: Set default link speed to 10 Mbps in taprio_set_picos_per_byte") Reported-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-09-30 18:32:20 -07:00
Navid Emamdoost	68501df92d	net: dsa: sja1105: Prevent leaking memory In sja1105_static_config_upload, in two cases memory is leaked: when static_config_buf_prepare_for_upload fails and when sja1105_inhibit_tx fails. In both cases config_buf should be released. Fixes: `8aa9ebccae` ("net: dsa: Introduce driver for NXP SJA1105 5-port L2 switch") Fixes: `1a4c69406c` ("net: dsa: sja1105: Prevent PHY jabbering during switch reset") Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-09-30 17:24:43 -07:00
Vladimir Oltean	b6f2494d31	net: dsa: sja1105: Ensure PTP time for rxtstamp reconstruction is not in the past Sometimes the PTP synchronization on the switch 'jumps': ptp4l[11241.155]: rms 8 max 16 freq -21732 +/- 11 delay 742 +/- 0 ptp4l[11243.157]: rms 7 max 17 freq -21731 +/- 10 delay 744 +/- 0 ptp4l[11245.160]: rms 33592410 max 134217731 freq +192422 +/- 8530253 delay 743 +/- 0 ptp4l[11247.163]: rms 811631 max 964131 freq +10326 +/- 557785 delay 743 +/- 0 ptp4l[11249.166]: rms 261936 max 533876 freq -304323 +/- 126371 delay 744 +/- 0 ptp4l[11251.169]: rms 48700 max 57740 freq -20218 +/- 30532 delay 744 +/- 0 ptp4l[11253.171]: rms 14570 max 30163 freq -5568 +/- 7563 delay 742 +/- 0 ptp4l[11255.174]: rms 2914 max 3440 freq -22001 +/- 1667 delay 744 +/- 1 ptp4l[11257.177]: rms 811 max 1710 freq -22653 +/- 451 delay 744 +/- 1 ptp4l[11259.180]: rms 177 max 218 freq -21695 +/- 89 delay 741 +/- 0 ptp4l[11261.182]: rms 45 max 92 freq -21677 +/- 32 delay 742 +/- 0 ptp4l[11263.186]: rms 14 max 32 freq -21733 +/- 11 delay 742 +/- 0 ptp4l[11265.188]: rms 9 max 14 freq -21725 +/- 12 delay 742 +/- 0 ptp4l[11267.191]: rms 9 max 16 freq -21727 +/- 13 delay 742 +/- 0 ptp4l[11269.194]: rms 6 max 15 freq -21726 +/- 9 delay 743 +/- 0 ptp4l[11271.197]: rms 8 max 15 freq -21728 +/- 11 delay 743 +/- 0 ptp4l[11273.200]: rms 6 max 12 freq -21727 +/- 8 delay 743 +/- 0 ptp4l[11275.202]: rms 9 max 17 freq -21720 +/- 11 delay 742 +/- 0 ptp4l[11277.205]: rms 9 max 18 freq -21725 +/- 12 delay 742 +/- 0 Background: the switch only offers partial RX timestamps (24 bits) and it is up to the driver to read the PTP clock to fill those timestamps up to 64 bits. But the PTP clock readout needs to happen quickly enough (in 0.135 seconds, in fact), otherwise the PTP clock will wrap around 24 bits, condition which cannot be detected. Looking at the 'max 134217731' value on output line 3, one can see that in hex it is 0x8000003. Because the PTP clock resolution is 8 ns, that means 0x1000000 in ticks, which is exactly 2^24. So indeed this is a PTP clock wraparound, but the reason might be surprising. What is going on is that sja1105_tstamp_reconstruct(priv, now, ts) expects a "now" time that is later than the "ts" was snapshotted at. This, of course, is obvious: we read the PTP time _after_ the partial RX timestamp was received. However, the workqueue is processing frames from a skb queue and reuses the same PTP time, read once at the beginning. Normally the skb queue only contains one frame and all goes well. But when the skb queue contains two frames, the second frame that gets dequeued might have been partially timestamped by the RX MAC _after_ we had read our PTP time initially. The code was originally like that due to concerns that SPI access for PTP time readout is a slow process, and we are time-constrained anyway (aka: premature optimization). But some timing analysis reveals that the time spent until the RX timestamp is completely reconstructed is 1 order of magnitude lower than the 0.135 s deadline even under worst-case conditions. So we can afford to read the PTP time for each frame in the RX timestamping queue, which of course ensures that the full PTP time is in the partial timestamp's future. Fixes: `f3097be21b` ("net: dsa: sja1105: Add a state machine for RX timestamping") Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-09-30 17:20:50 -07:00
David S. Miller	3755ee2257	Merge tag 'ieee802154-for-davem-2019-09-28' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan Stefan Schmidt says: ==================== pull-request: ieee802154 for net 2019-09-28 An update from ieee802154 for your net tree. Three driver fixes. Navid Emamdoost fixed a memory leak on an error path in the ca8210 driver, Johan Hovold fixed a use-after-free found by syzbot in the atusb driver and Christophe JAILLET makes sure __skb_put_data is used instead of memcpy in the mcr20a driver I switched from branches to tags here to be pulled from. So far not annotated and not signed. Once I fixed my scripts it should contain this messages as annotations. If you want it signed as well just tell me. If there are any problems let me know. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-09-30 17:14:45 -07:00
Martin KaFai Lau	8c7138b33e	net: Unpublish sk from sk_reuseport_cb before call_rcu The "reuse->sock[]" array is shared by multiple sockets. The going away sk must unpublish itself from "reuse->sock[]" before making call_rcu() call. However, this unpublish-action is currently done after a grace period and it may cause use-after-free. The fix is to move reuseport_detach_sock() to sk_destruct(). Due to the above reason, any socket with sk_reuseport_cb has to go through the rcu grace period before freeing it. It is a rather old bug (~3 yrs). The Fixes tag is not necessary the right commit but it is the one that introduced the SOCK_RCU_FREE logic and this fix is depending on it. Fixes: `a4298e4522` ("net: add SOCK_RCU_FREE socket flag") Cc: Eric Dumazet <eric.dumazet@gmail.com> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-09-30 17:13:43 -07:00
Haishuang Yan	0e141f757b	erspan: remove the incorrect mtu limit for erspan erspan driver calls ether_setup(), after commit `61e84623ac` ("net: centralize net_device min/max MTU checking"), the range of mtu is [min_mtu, max_mtu], which is [68, 1500] by default. It causes the dev mtu of the erspan device to not be greater than 1500, this limit value is not correct for ipgre tap device. Tested: Before patch: # ip link set erspan0 mtu 1600 Error: mtu greater than device maximum. After patch: # ip link set erspan0 mtu 1600 # ip -d link show erspan0 21: erspan0@NONE: <BROADCAST,MULTICAST> mtu 1600 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 0 Fixes: `61e84623ac` ("net: centralize net_device min/max MTU checking") Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-09-30 11:10:07 -07:00
Eric Dumazet	e9789c7cc1	sch_cbq: validate TCA_CBQ_WRROPT to avoid crash syzbot reported a crash in cbq_normalize_quanta() caused by an out of range cl->priority. iproute2 enforces this check, but malicious users do not. kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN PTI Modules linked in: CPU: 1 PID: 26447 Comm: syz-executor.1 Not tainted 5.3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:cbq_normalize_quanta.part.0+0x1fd/0x430 net/sched/sch_cbq.c:902 RSP: 0018:ffff8801a5c333b0 EFLAGS: 00010206 RAX: 0000000020000003 RBX: 00000000fffffff8 RCX: ffffc9000712f000 RDX: 00000000000043bf RSI: ffffffff83be8962 RDI: 0000000100000018 RBP: ffff8801a5c33420 R08: 000000000000003a R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000002ef R13: ffff88018da95188 R14: dffffc0000000000 R15: 0000000000000015 FS: 00007f37d26b1700(0000) GS:ffff8801dad00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004c7cec CR3: 00000001bcd0a006 CR4: 00000000001626f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: [<ffffffff83be9d57>] cbq_normalize_quanta include/net/pkt_sched.h:27 [inline] [<ffffffff83be9d57>] cbq_addprio net/sched/sch_cbq.c:1097 [inline] [<ffffffff83be9d57>] cbq_set_wrr+0x2d7/0x450 net/sched/sch_cbq.c:1115 [<ffffffff83bee8a7>] cbq_change_class+0x987/0x225b net/sched/sch_cbq.c:1537 [<ffffffff83b96985>] tc_ctl_tclass+0x555/0xcd0 net/sched/sch_api.c:2329 [<ffffffff83a84655>] rtnetlink_rcv_msg+0x485/0xc10 net/core/rtnetlink.c:5248 [<ffffffff83cadf0a>] netlink_rcv_skb+0x17a/0x460 net/netlink/af_netlink.c:2510 [<ffffffff83a7db6d>] rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5266 [<ffffffff83cac2c6>] netlink_unicast_kernel net/netlink/af_netlink.c:1324 [inline] [<ffffffff83cac2c6>] netlink_unicast+0x536/0x720 net/netlink/af_netlink.c:1350 [<ffffffff83cacd4a>] netlink_sendmsg+0x89a/0xd50 net/netlink/af_netlink.c:1939 [<ffffffff8399d46e>] sock_sendmsg_nosec net/socket.c:673 [inline] [<ffffffff8399d46e>] sock_sendmsg+0x12e/0x170 net/socket.c:684 [<ffffffff8399f1fd>] ___sys_sendmsg+0x81d/0x960 net/socket.c:2359 [<ffffffff839a2d05>] __sys_sendmsg+0x105/0x1d0 net/socket.c:2397 [<ffffffff839a2df9>] SYSC_sendmsg net/socket.c:2406 [inline] [<ffffffff839a2df9>] SyS_sendmsg+0x29/0x30 net/socket.c:2404 [<ffffffff8101ccc8>] do_syscall_64+0x528/0x770 arch/x86/entry/common.c:305 [<ffffffff84400091>] entry_SYSCALL_64_after_hwframe+0x42/0xb7 Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-09-30 11:07:46 -07:00
Michal Vokáč	7ae6d93c8f	net: dsa: qca8k: Use up to 7 ports for all operations The QCA8K family supports up to 7 ports. So use the existing QCA8K_NUM_PORTS define to allocate the switch structure and limit all operations with the switch ports. This was not an issue until commit `0394a63acf` ("net: dsa: enable and disable all ports") disabled all unused ports. Since the unused ports 7-11 are outside of the correct register range on this switch some registers were rewritten with invalid content. Fixes: `6b93fb4648` ("net-next: dsa: add new driver for qca8xxx family") Fixes: `a0c02161ec` ("net: dsa: variable number of ports") Fixes: `0394a63acf` ("net: dsa: enable and disable all ports") Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-09-30 11:06:38 -07:00
Linus Torvalds	02dc96ef6c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from David Miller: 1) Sanity check URB networking device parameters to avoid divide by zero, from Oliver Neukum. 2) Disable global multicast filter in NCSI, otherwise LLDP and IPV6 don't work properly. Longer term this needs a better fix tho. From Vijay Khemka. 3) Small fixes to selftests (use ping when ping6 is not present, etc.) from David Ahern. 4) Bring back rt_uses_gateway member of struct rtable, it's semantics were not well understood and trying to remove it broke things. From David Ahern. 5) Move usbnet snaity checking, ignore endpoints with invalid wMaxPacketSize. From Bjørn Mork. 6) Missing Kconfig deps for sja1105 driver, from Mao Wenan. 7) Various small fixes to the mlx5 DR steering code, from Alaa Hleihel, Alex Vesker, and Yevgeny Kliteynik 8) Missing CAP_NET_RAW checks in various places, from Ori Nimron. 9) Fix crash when removing sch_cbs entry while offloading is enabled, from Vinicius Costa Gomes. 10) Signedness bug fixes, generally in looking at the result given by of_get_phy_mode() and friends. From Dan Crapenter. 11) Disable preemption around BPF_PROG_RUN() calls, from Eric Dumazet. 12) Don't create VRF ipv6 rules if ipv6 is disabled, from David Ahern. 13) Fix quantization code in tcp_bbr, from Kevin Yang. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (127 commits) net: tap: clean up an indentation issue nfp: abm: fix memory leak in nfp_abm_u32_knode_replace tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state sk_buff: drop all skb extensions on free and skb scrubbing tcp_bbr: fix quantization code to not raise cwnd if not probing bandwidth mlxsw: spectrum_flower: Fail in case user specifies multiple mirror actions Documentation: Clarify trap's description mlxsw: spectrum: Clear VLAN filters during port initialization net: ena: clean up indentation issue NFC: st95hf: clean up indentation issue net: phy: micrel: add Asym Pause workaround for KSZ9021 net: socionext: ave: Avoid using netdev_err() before calling register_netdev() ptp: correctly disable flags on old ioctls lib: dimlib: fix help text typos net: dsa: microchip: Always set regmap stride to 1 nfp: flower: fix memory leak in nfp_flower_spawn_vnic_reprs nfp: flower: prevent memory leak in nfp_flower_spawn_phy_reprs net/sched: Set default of CONFIG_NET_TC_SKB_EXT to N vrf: Do not attempt to create IPv6 mcast rule if IPv6 is disabled net: sched: sch_sfb: don't call qdisc_put() while holding tree lock ...	2019-09-28 17:47:33 -07:00
Linus Torvalds	edf445ad7c	Merge branch 'hugepage-fallbacks' (hugepatch patches from David Rientjes) Merge hugepage allocation updates from David Rientjes: "We (mostly Linus, Andrea, and myself) have been discussing offlist how to implement a sane default allocation strategy for hugepages on NUMA platforms. With these reverts in place, the page allocator will happily allocate a remote hugepage immediately rather than try to make a local hugepage available. This incurs a substantial performance degradation when memory compaction would have otherwise made a local hugepage available. This series reverts those reverts and attempts to propose a more sane default allocation strategy specifically for hugepages. Andrea acknowledges this is likely to fix the swap storms that he originally reported that resulted in the patches that removed __GFP_THISNODE from hugepage allocations. The immediate goal is to return 5.3 to the behavior the kernel has implemented over the past several years so that remote hugepages are not immediately allocated when local hugepages could have been made available because the increased access latency is untenable. The next goal is to introduce a sane default allocation strategy for hugepages allocations in general regardless of the configuration of the system so that we prevent thrashing of local memory when compaction is unlikely to succeed and can prefer remote hugepages over remote native pages when the local node is low on memory." Note on timing: this reverts the hugepage VM behavior changes that got introduced fairly late in the 5.3 cycle, and that fixed a huge performance regression for certain loads that had been around since 4.18. Andrea had this note: "The regression of 4.18 was that it was taking hours to start a VM where 3.10 was only taking a few seconds, I reported all the details on lkml when it was finally tracked down in August 2018. https://lore.kernel.org/linux-mm/20180820032640.9896-2-aarcange@redhat.com/ __GFP_THISNODE in MADV_HUGEPAGE made the above enterprise vfio workload degrade like in the "current upstream" above. And it still would have been that bad as above until 5.3-rc5" where the bad behavior ends up happening as you fill up a local node, and without that change, you'd get into the nasty swap storm behavior due to compaction working overtime to make room for more memory on the nodes. As a result 5.3 got the two performance fix reverts in rc5. However, David Rientjes then noted that those performance fixes in turn regressed performance for other loads - although not quite to the same degree. He suggested reverting the reverts and instead replacing them with two small changes to how hugepage allocations are done (patch descriptions rephrased by me): - "avoid expensive reclaim when compaction may not succeed": just admit that the allocation failed when you're trying to allocate a huge-page and compaction wasn't successful. - "allow hugepage fallback to remote nodes when madvised": when that node-local huge-page allocation failed, retry without forcing the local node. but by then I judged it too late to replace the fixes for a 5.3 release. So 5.3 was released with behavior that harked back to the pre-4.18 logic. But now we're in the merge window for 5.4, and we can see if this alternate model fixes not just the horrendous swap storm behavior, but also restores the performance regression that the late reverts caused. Fingers crossed. * emailed patches from David Rientjes <rientjes@google.com>: mm, page_alloc: allow hugepage fallback to remote nodes when madvised mm, page_alloc: avoid expensive reclaim when compaction may not succeed Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"" Revert "Revert "mm, thp: restore node-local hugepage allocations""	2019-09-28 14:26:47 -07:00
David Rientjes	76e654cc91	mm, page_alloc: allow hugepage fallback to remote nodes when madvised For systems configured to always try hard to allocate transparent hugepages (thp defrag setting of "always") or for memory that has been explicitly madvised to MADV_HUGEPAGE, it is often better to fallback to remote memory to allocate the hugepage if the local allocation fails first. The point is to allow the initial call to __alloc_pages_node() to attempt to defragment local memory to make a hugepage available, if possible, rather than immediately fallback to remote memory. Local hugepages will always have a better access latency than remote (huge)pages, so an attempt to make a hugepage available locally is always preferred. If memory compaction cannot be successful locally, however, it is likely better to fallback to remote memory. This could take on two forms: either allow immediate fallback to remote memory or do per-zone watermark checks. It would be possible to fallback only when per-zone watermarks fail for order-0 memory, since that would require local reclaim for all subsequent faults so remote huge allocation is likely better than thrashing the local zone for large workloads. In this case, it is assumed that because the system is configured to try hard to allocate hugepages or the vma is advised to explicitly want to try hard for hugepages that remote allocation is better when local allocation and memory compaction have both failed. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-09-28 14:05:38 -07:00
David Rientjes	b39d0ee263	mm, page_alloc: avoid expensive reclaim when compaction may not succeed Memory compaction has a couple significant drawbacks as the allocation order increases, specifically: - isolate_freepages() is responsible for finding free pages to use as migration targets and is implemented as a linear scan of memory starting at the end of a zone, - failing order-0 watermark checks in memory compaction does not account for how far below the watermarks the zone actually is: to enable migration, there must be some free memory available. Per the above, watermarks are not always suffficient if isolate_freepages() cannot find the free memory but it could require hundreds of MBs of reclaim to even reach this threshold (read: potentially very expensive reclaim with no indication compaction can be successful), and - if compaction at this order has failed recently so that it does not even run as a result of deferred compaction, looping through reclaim can often be pointless. For hugepage allocations, these are quite substantial drawbacks because these are very high order allocations (order-9 on x86) and falling back to doing reclaim can potentially be very expensive without any indication that compaction would even be successful. Reclaim itself is unlikely to free entire pageblocks and certainly no reliance should be put on it to do so in isolation (recall lumpy reclaim). This means we should avoid reclaim and simply fail hugepage allocation if compaction is deferred. It is also not helpful to thrash a zone by doing excessive reclaim if compaction may not be able to access that memory. If order-0 watermarks fail and the allocation order is sufficiently large, it is likely better to fail the allocation rather than thrashing the zone. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-09-28 14:05:38 -07:00
David Rientjes	19deb7695e	Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"" This reverts commit `92717d429b`. Since commit `a8282608c8` ("Revert "mm, thp: restore node-local hugepage allocations"") is reverted in this series, it is better to restore the previous 5.2 behavior between the thp allocation and the page allocator rather than to attempt any consolidation or cleanup for a policy that is now reverted. It's less risky during an rc cycle and subsequent patches in this series further modify the same policy that the pre-5.3 behavior implements. Consolidation and cleanup can be done subsequent to a sane default page allocation strategy, so this patch reverts a cleanup done on a strategy that is now reverted and thus is the least risky option. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-09-28 14:05:38 -07:00
David Rientjes	ac79f78dab	Revert "Revert "mm, thp: restore node-local hugepage allocations"" This reverts commit `a8282608c8`. The commit references the original intended semantic for MADV_HUGEPAGE which has subsequently taken on three unique purposes: - enables or disables thp for a range of memory depending on the system's config (is thp "enabled" set to "always" or "madvise"), - determines the synchronous compaction behavior for thp allocations at fault (is thp "defrag" set to "always", "defer+madvise", or "madvise"), and - reverts a previous MADV_NOHUGEPAGE (there is no madvise mode to only clear previous hugepage advice). These are the three purposes that currently exist in 5.2 and over the past several years that userspace has been written around. Adding a NUMA locality preference adds a fourth dimension to an already conflated advice mode. Based on the semantic that MADV_HUGEPAGE has provided over the past several years, there exist workloads that use the tunable based on these principles: specifically that the allocation should attempt to defragment a local node before falling back. It is agreed that remote hugepages typically (but not always) have a better access latency than remote native pages, although on Naples this is at parity for intersocket. The revert commit that this patch reverts allows hugepage allocation to immediately allocate remotely when local memory is fragmented. This is contrary to the semantic of MADV_HUGEPAGE over the past several years: that is, memory compaction should be attempted locally before falling back. The performance degradation of remote hugepages over local hugepages on Rome, for example, is 53.5% increased access latency. For this reason, the goal is to revert back to the 5.2 and previous behavior that would attempt local defragmentation before falling back. With the patch that is reverted by this patch, we see performance degradations at the tail because the allocator happily allocates the remote hugepage rather than even attempting to make a local hugepage available. zone_reclaim_mode is not a solution to this problem since it does not only impact hugepage allocations but rather changes the memory allocation strategy for all page allocations. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-09-28 14:05:38 -07:00
Linus Torvalds	a2953204b5	powerpc fixes for 5.4 #2 An assortment of fixes that were either missed by me, or didn't arrive quite in time for the first v5.4 pull. Most notable is a fix for an issue with tlbie (broadcast TLB invalidation) on Power9, when using the Radix MMU. The tlbie can race with an mtpid (move to PID register, essentially MMU context switch) on another thread of the core, which can cause stores to continue to go to a page after it's unmapped. A fix in our KVM code to add a missing barrier, the lack of which has been observed to cause missed IPIs and subsequently stuck CPUs in the host. A change to the way we initialise PCR (Processor Compatibility Register) to make it forward compatible with future CPUs. On some older PowerVM systems our H_BLOCK_REMOVE support could oops, fix it to detect such systems and fallback to the old invalidation method. A fix for an oops seen on some machines when using KASAN on 32-bit. A handful of other minor fixes, and two new selftests. Thanks to: Alistair Popple, Aneesh Kumar K.V, Christophe Leroy, Gustavo Romero, Joel Stanley, Jordan Niethe, Laurent Dufour, Michael Roth, Oliver O'Halloran. -----BEGIN PGP SIGNATURE----- iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl2PRGITHG1wZUBlbGxl cm1hbi5pZC5hdQAKCRBR6+o8yOGlgClRD/9jKIT6GjVpRMc+Dg9zHB5/Pir7gePk ztXKI+u15GrrXgjtWEZ1PaaXvtNIfs/IZHDQm5gyJjiBAKcGl2v+9ETaMzO5sjZ7 GSe1F8VX/MwzRnET8Jph8w/b0cy0Q8xndkEOjcJqJ7+TF+SSWqmJEdmBfkU23jWD B3kW4W1x2xt/XGsX25l1HpUgpcJqzukCeYUSCdqUu2j+sXAEZmfgTRG8uD4HffzZ 3As76TrBiJsDnkyH0qi2G1BuLXrQbAMdjTeSGi+cb0gTIunCr190gI4+Tjdu2/z7 ywWR2ZUkueCNDcLsqXaqpZx50utPJ44//uY750sk72vixJJOVuOWM6+5HKVW83se /v0zkOcI9+ywNHe0vLfP3Jm/OMMHYxkIwz6kVu2NSR6sE79B9AZpBFU+Nynq7kKl +Hc6md/HATvR+NK6LtQKtEGydRhvxU5n3KBmjq3SQj+B/ZlU6IdgerfhUWrNvg0B zzHeT35X6UBpswonhkQLgqJuaWpkClK9wsUy85MuA7aub1EP8S6/X7paKoiOtAHK NjlXM2JYV5OKwhjGgdCiI94Bdune7yudKPdsXV3Gr8Iw7wf2bQk1p7VH+LcruyE9 YJdXwCgN0PaoFUQh3AR4CqzzFwqDya8FQqdkFN3kqhRLVGAMq/PsV8/Tn+myTgQP rZnWnbfZh9BMjw== =dF42 -----END PGP SIGNATURE----- Merge tag 'powerpc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "An assortment of fixes that were either missed by me, or didn't arrive quite in time for the first v5.4 pull. - Most notable is a fix for an issue with tlbie (broadcast TLB invalidation) on Power9, when using the Radix MMU. The tlbie can race with an mtpid (move to PID register, essentially MMU context switch) on another thread of the core, which can cause stores to continue to go to a page after it's unmapped. - A fix in our KVM code to add a missing barrier, the lack of which has been observed to cause missed IPIs and subsequently stuck CPUs in the host. - A change to the way we initialise PCR (Processor Compatibility Register) to make it forward compatible with future CPUs. - On some older PowerVM systems our H_BLOCK_REMOVE support could oops, fix it to detect such systems and fallback to the old invalidation method. - A fix for an oops seen on some machines when using KASAN on 32-bit. - A handful of other minor fixes, and two new selftests. Thanks to: Alistair Popple, Aneesh Kumar K.V, Christophe Leroy, Gustavo Romero, Joel Stanley, Jordan Niethe, Laurent Dufour, Michael Roth, Oliver O'Halloran" * tag 'powerpc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/eeh: Fix eeh eeh_debugfs_break_device() with SRIOV devices powerpc/nvdimm: use H_SCM_QUERY hcall on H_OVERLAP error powerpc/nvdimm: Use HCALL error as the return value selftests/powerpc: Add test case for tlbie vs mtpidr ordering issue powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9 powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag powerpc/book3s64/mm: Don't do tlbie fixup for some hardware revisions powerpc/pseries: Call H_BLOCK_REMOVE when supported powerpc/pseries: Read TLB Block Invalidate Characteristics KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag powerpc/mm: Fix an Oops in kasan_mmu_init() powerpc/mm: Add a helper to select PAGE_KERNEL_RO or PAGE_READONLY powerpc/64s: Set reserved PCR bits powerpc: Fix definition of PCR bits to work with old binutils powerpc/book3s64/radix: Remove WARN_ON in destroy_context() powerpc/tm: Add tm-poison test	2019-09-28 13:43:00 -07:00

1 2 3 4 5 ...

870909 Commits