Commit Graph

Takashi Iwai
1225675ca7 ALSA: PCM: Allow resume only for suspended streams
snd_pcm_resume() should bail out if the stream isn't in a suspended
state.  Otherwise it would allow a double resume.
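
A minimal sketch of the kind of guard described above (hypothetical
shape, not the exact patch; the locked helper is an assumption):

  static int snd_pcm_resume(struct snd_pcm_substream *substream)
  {
          int res;

          snd_pcm_stream_lock_irq(substream);
          /* Bail out unless the stream was actually suspended. */
          if (substream->runtime->state != SNDRV_PCM_STATE_SUSPENDED)
                  res = -EBADFD;
          else
                  res = snd_pcm_do_resume_locked(substream); /* assumed helper */
          snd_pcm_stream_unlock_irq(substream);

          return res;
  }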

Link: https://patch.msgid.link/20240624125443.27808-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2024-06-25 11:53:35 +02:00
Paolo Abeni
1d70687592 Merge branch 'net-macb-wol-enhancements'
Vineeth Karumanchi says:

====================
net: macb: WOL enhancements

- Add provisioning for queue tie-off and queue disable during suspend.
- Add support for ARP packet types to WoL.
- Advertise WoL attributes by default.
- Extend MACB supported WoL modes to the PHY supported WoL modes.
- Deprecate magic-packet property.

v6: https://lore.kernel.org/netdev/20240617070413.2291511-1-vineeth.karumanchi@amd.com/
v5: https://lore.kernel.org/netdev/20240611162827.887162-1-vineeth.karumanchi@amd.com/
v4: https://lore.kernel.org/lkml/20240610053936.622237-1-vineeth.karumanchi@amd.com/
v3: https://lore.kernel.org/netdev/20240605102457.4050539-1-vineeth.karumanchi@amd.com/
v2: https://lore.kernel.org/netdev/20240222153848.2374782-1-vineeth.karumanchi@amd.com/
v1: https://lore.kernel.org/lkml/20240130104845.3995341-1-vineeth.karumanchi@amd.com/#t
====================

Link: https://lore.kernel.org/r/20240621045735.3031357-1-vineeth.karumanchi@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 11:53:09 +02:00
Vineeth Karumanchi
783bfe279e dt-bindings: net: cdns,macb: Deprecate magic-packet property
WOL modes such as magic-packet should be an OS policy.
Advertise the supported modes by default and use ethtool to activate
the required mode.

Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 11:53:07 +02:00
Vineeth Karumanchi
0cb8de39a7 net: macb: Add ARP support to WOL
Extend wake-on-LAN support with an ARP packet type.

Currently, if the PHY supports WOL, ethtool ignores the modes supported
by the MACB. This change extends the WOL modes with the MACB-supported
modes.

Advertise the supported wake-on-LAN modes by default without relying on
the dt node. By default, wake-on-LAN is disabled. Using ethtool, users
can enable/disable it or choose packet types.

For wake-on-LAN via ARP, ensure an IP address is assigned and
report an error otherwise.
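
A sketch of the ARP-mode validity check described above (function name
and placement are assumptions, not the exact patch):

  /* Reject WAKE_ARP when no IPv4 address is assigned: the MAC has no
   * IP to match incoming ARP requests against. Caller holds RTNL. */
  static int macb_check_wol_arp(struct net_device *netdev, u32 wolopts)
  {
          struct in_device *in_dev = __in_dev_get_rtnl(netdev);

          if ((wolopts & WAKE_ARP) && (!in_dev || !in_dev->ifa_list))
                  return -EOPNOTSUPP;

          return 0;
  }

With the modes advertised by default, something like
"ethtool -s eth0 wol a" would then select ARP wake-up at runtime.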

Co-developed-by: Harini Katakam <harini.katakam@amd.com>
Signed-off-by: Harini Katakam <harini.katakam@amd.com>
Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
Tested-by: Claudiu Beznea <claudiu.beznea@tuxon.dev> # on SAMA7G5
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 11:53:07 +02:00
Vineeth Karumanchi
3650a8cc5b net: macb: Enable queue disable
Enable queue disable for Versal devices.

Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 11:53:07 +02:00
Vineeth Karumanchi
759cc793eb net: macb: queue tie-off or disable during WOL suspend
When GEM is used as a wake device, it is not mandatory for the RX DMA
to be active. The RX engine in the IP only needs to receive and identify
a wake packet and signal it through an interrupt. The wake packet is of
no further significance; hence, it is not required to be copied into
memory. By disabling RX DMA during suspend, we can avoid unnecessary
DMA processing of any incoming traffic.

During suspend, perform either of the below operations:

- tie-off/dummy descriptor: Disable unused queues by connecting
  them to a looped descriptor chain without free slots (see the
  sketch below).

- queue disable: The newer IP version allows disabling individual queues.
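
A sketch of the tie-off idea (field names follow the macb driver
loosely and are assumptions here):

  /* Park an unused queue on a single "used" descriptor that wraps back
   * onto itself, so RX DMA has no free slot to write into. */
  static void macb_queue_tieoff(struct macb_queue *queue)
  {
          struct macb_dma_desc *desc = queue->tieoff_desc;     /* one dummy desc */

          desc->addr = MACB_BIT(RX_USED) | MACB_BIT(RX_WRAP);  /* used + wrap */
          desc->ctrl = 0;

          /* Point the queue's RX ring base at the dummy descriptor. */
          queue_writel(queue, RBQP, queue->tieoff_desc_dma);
  }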

Co-developed-by: Harini Katakam <harini.katakam@amd.com>
Signed-off-by: Harini Katakam <harini.katakam@amd.com>
Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
Tested-by: Claudiu Beznea <claudiu.beznea@tuxon.dev> # on SAMA7G5
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 11:53:07 +02:00
Takashi Iwai
c5ab94ea28 ALSA: seq: Fix missing channel at encoding RPN/NRPN MIDI2 messages
The conversion from the legacy event to MIDI2 UMP for RPN and NRPN
missed setting the channel number, so the channel was always 0.
Fix it.

Fixes: e9e02819a9 ("ALSA: seq: Automatic conversion of UMP events")
Link: https://patch.msgid.link/20240625095200.25745-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2024-06-25 11:53:00 +02:00
luoxuanqiang
ff46e3b442 Fix race for duplicate reqsk on identical SYN
When bonding is configured in BOND_MODE_BROADCAST mode, if two identical
SYN packets are received at the same time and processed on different CPUs,
it can potentially create the same sk (sock) but two different reqsk
(request_sock) in tcp_conn_request().

These two different reqsk will respond with two SYNACK packets, and since
the generation of the seq (ISN) incorporates a timestamp, the final two
SYNACK packets will have different seq values.

The consequence is that when the client receives the earlier SYNACK
packet and replies with an ACK, we reset (RST) the connection.

========================================================================

This behavior is consistently reproducible in my local setup,
which comprises:

                  | NETA1 ------ NETB1 |
PC_A --- bond --- |                    | --- bond --- PC_B
                  | NETA2 ------ NETB2 |

- PC_A is the Server and has two network cards, NETA1 and NETA2. I have
  bonded these two cards using BOND_MODE_BROADCAST mode and configured
  them to be handled by different CPUs.

- PC_B is the Client, also equipped with two network cards, NETB1 and
  NETB2, which are also bonded and configured in BOND_MODE_BROADCAST mode.

If the client attempts a TCP connection to the server, it might encounter
a failure. Capturing packets from the server side reveals:

10.10.10.10.45182 > localhost: Flags [S], seq 320236027,
10.10.10.10.45182 > localhost: Flags [S], seq 320236027,
localhost > 10.10.10.10.45182: Flags [S.], seq 2967855116,
localhost > 10.10.10.10.45182: Flags [S.], seq 2967855123, <==
10.10.10.10.45182 > localhost: Flags [.], ack 4294967290,
10.10.10.10.45182 > localhost: Flags [.], ack 4294967290,
localhost > 10.10.10.10.45182: Flags [R], seq 2967855117, <==
localhost > 10.10.10.10.45182: Flags [R], seq 2967855117,

Two SYNACKs with different seq numbers are sent by localhost,
resulting in an anomaly.

========================================================================

The attempted solution is as follows:
Add a return value to inet_csk_reqsk_queue_hash_add() to confirm if the
ehash insertion is successful (Up to now, the reason for unsuccessful
insertion is that a reqsk for the same connection has already been
inserted). If the insertion fails, release the reqsk.

Due to the refcnt, Kuniyuki suggests also adding a return value check
for the DCCP module; if ehash insertion fails, indicating a successful
insertion of the same connection, simply release the reqsk as well.

Additionally, in reqsk_queue_hash_req(), the start of the
req->rsk_timer is moved to after a successful insertion.
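
A sketch of the resulting flow (simplified from the description above;
exact signatures may differ):

  /* Insert into the ehash first; only a successful, non-duplicate
   * insert arms the SYNACK timer and hands out references. */
  static bool reqsk_queue_hash_req(struct request_sock *req,
                                   unsigned long timeout)
  {
          bool found_dup_sk = false;

          if (!inet_ehash_insert(req_to_sk(req), NULL, &found_dup_sk))
                  return false;  /* same connection already queued: caller frees */

          /* Arm the timer only after the successful insertion. */
          timer_setup(&req->rsk_timer, reqsk_timer_handler, TIMER_PINNED);
          mod_timer(&req->rsk_timer, jiffies + timeout);
          refcount_set(&req->rsk_refcnt, 2 + 1);
          return true;
  }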

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: luoxuanqiang <luoxuanqiang@kylinos.cn>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240621013929.1386815-1-luoxuanqiang@kylinos.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 11:37:45 +02:00
Paolo Abeni
7e7c714a36 Merge branch 'af_unix-remove-spin_lock_nested-and-convert-to-lock_cmp_fn'
Kuniyuki Iwashima says:

====================
af_unix: Remove spin_lock_nested() and convert to lock_cmp_fn.

This series removes spin_lock_nested() in AF_UNIX and instead
defines the locking orders as functions tied to each lock by
lockdep_set_lock_cmp_fn().

When the defined function returns a negative value, lockdep
considers it will not cause deadlock.  (See ->cmp_fn() in
check_deadlock() and check_prev_add().)

When we cannot define the total ordering, we return -1 for
the allowed ordering and otherwise 0 as undefined. [0]

[0]: https://lore.kernel.org/netdev/thzkgbuwuo3knevpipu4rzsh5qgmwhklihypdgziiruabvh46f@uwdkpcfxgloo/
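
The pattern, as a minimal sketch (names illustrative):

  static int foo_lock_cmp_fn(const struct lockdep_map *_a,
                             const struct lockdep_map *_b)
  {
          /* Return a negative value for an allowed A -> B order and
           * 0 for an undefined (reported) one. */
          return _a < _b ? -1 : 0;
  }

  /* Tie the order to the lock once, e.g. at init time: */
  lockdep_set_lock_cmp_fn(&foo->lock, foo_lock_cmp_fn);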

Changes:
  v4:
    * Patch 4
      * Make unix_state_lock_cmp_fn() symmetric.

  v3: https://lore.kernel.org/netdev/20240614200715.93150-1-kuniyu@amazon.com/
    * Patch 3
      * Cache sk->sk_state
      * s/unix_state_lock()/unix_state_unlock()/
    * Patch 8
      * Add embryo -> listener locking order

  v2: https://lore.kernel.org/netdev/20240611222905.34695-1-kuniyu@amazon.com/
   * Patch 1 & 2
      * Use (((l) > (r)) - ((l) < (r))) for comparison

  v1: https://lore.kernel.org/netdev/20240610223501.73191-1-kuniyu@amazon.com/
====================

Link: https://lore.kernel.org/r/20240620205623.60139-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 11:10:21 +02:00
Kuniyuki Iwashima
22e5751b05 af_unix: Don't use spin_lock_nested() in copy_peercred().
When (AF_UNIX, SOCK_STREAM) socket connect()s to a listening socket,
the listener's sk_peer_pid/sk_peer_cred are copied to the client in
copy_peercred().

Then, two sk_peer_locks are held there; one is client's and another
is listener's.

However, the latter is not needed because we hold the listener's
unix_state_lock() there and unix_listen() cannot update the cred
concurrently.

Let's drop the unnecessary spin_lock() and use the bare spin_lock()
for the client to protect concurrent read by getsockopt(SO_PEERCRED).
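
A sketch of the resulting shape (simplified; assumes the listener's
unix state lock is held by the caller):

  static void copy_peercred(struct sock *sk, struct sock *peersk)
  {
          /* The listener cannot change its cred under us ... */
          lockdep_assert_held(&unix_sk(peersk)->lock);

          /* ... so only the client side needs its sk_peer_lock, against
           * concurrent getsockopt(SO_PEERCRED) readers. */
          spin_lock(&sk->sk_peer_lock);
          sk->sk_peer_pid = get_pid(peersk->sk_peer_pid);
          sk->sk_peer_cred = get_cred(peersk->sk_peer_cred);
          spin_unlock(&sk->sk_peer_lock);
  }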

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 11:10:19 +02:00
Kuniyuki Iwashima
e4bd881d98 af_unix: Remove put_pid()/put_cred() in copy_peercred().
When (AF_UNIX, SOCK_STREAM) socket connect()s to a listening socket,
the listener's sk_peer_pid/sk_peer_cred are copied to the client in
copy_peercred().

Then, the client's sk_peer_pid and sk_peer_cred are always NULL, so
we need not call put_pid() and put_cred() there.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 11:10:18 +02:00
Kuniyuki Iwashima
faf489e689 af_unix: Set sk_peer_pid/sk_peer_cred locklessly for new socket.
init_peercred() is called in 3 places:

  1. socketpair() : both sockets
  2. connect()    : child socket
  3. listen()     : listening socket

The first two need not hold sk_peer_lock because no one can
touch the socket.

Let's set cred/pid without holding lock for the two cases and
rename the old init_peercred() to update_peercred() to properly
reflect the use case.
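
A sketch of the split (simplified):

  /* socketpair()/connect(): the new socket is not yet visible to
   * anyone, so no lock is needed. */
  static void init_peercred(struct sock *sk)
  {
          sk->sk_peer_pid = get_pid(task_tgid(current));
          sk->sk_peer_cred = get_current_cred();
  }

  /* listen(): the socket is reachable, keep the lock and drop the
   * previous references outside of it. */
  static void update_peercred(struct sock *sk)
  {
          const struct cred *old_cred;
          struct pid *old_pid;

          spin_lock(&sk->sk_peer_lock);
          old_pid = sk->sk_peer_pid;
          old_cred = sk->sk_peer_cred;
          sk->sk_peer_pid = get_pid(task_tgid(current));
          sk->sk_peer_cred = get_current_cred();
          spin_unlock(&sk->sk_peer_lock);

          put_pid(old_pid);
          put_cred(old_cred);
  }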

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 11:10:18 +02:00
Kuniyuki Iwashima
8647ece481 af_unix: Define locking order for U_RECVQ_LOCK_EMBRYO in unix_collect_skb().
While GC is cleaning up cyclic references by SCM_RIGHTS,
unix_collect_skb() collects skb in the socket's recvq.

If the socket is TCP_LISTEN, we need to collect skb in the
embryo's queue.  Then, both the listener's recvq lock and
the embroy's one are held.

The locking is always done in the listener -> embryo order.

Let's define it as unix_recvq_lock_cmp_fn() instead of using
spin_lock_nested().

Note that the reverse order is defined for consistency.
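
A sketch of the comparison function (simplified; the listener
back-pointer is per the new GC):

  static int unix_recvq_lock_cmp_fn(const struct lockdep_map *_a,
                                    const struct lockdep_map *_b)
  {
          const struct sock *a, *b;

          a = container_of(_a, struct sock, sk_receive_queue.lock.dep_map);
          b = container_of(_b, struct sock, sk_receive_queue.lock.dep_map);

          /* unix_collect_skb(): listener -> embryo */
          if (a->sk_state == TCP_LISTEN && unix_sk(b)->listener == a)
                  return -1;

          /* The reverse order, defined only for consistency. */
          if (b->sk_state == TCP_LISTEN && unix_sk(a)->listener == b)
                  return 1;

          return 0;
  }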

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 11:10:18 +02:00
Kuniyuki Iwashima
7202cb5916 af_unix: Remove U_LOCK_GC_LISTENER.
Commit 1971d13ffa ("af_unix: Suppress false-positive lockdep splat for
spin_lock() in __unix_gc().") added U_LOCK_GC_LISTENER for the old GC,
but it's no longer needed for the new GC.

Let's remove U_LOCK_GC_LISTENER and unix_state_lock_nested() as there's
no user.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 11:10:18 +02:00
Kuniyuki Iwashima
c4da4661d9 af_unix: Remove U_LOCK_DIAG.
sk_diag_dump_icons() acquires embryo's lock by unix_state_lock_nested()
to fetch its peer.

The embryo's ->peer is set to NULL only when its parent listener is
close()d.  Then, unix_release_sock() is called for each embryo after
unlinking skb by skb_dequeue().

In sk_diag_dump_icons(), we hold the parent's recvq lock, so we need
not acquire unix_state_lock_nested(), and peer is always non-NULL.

Let's remove unnecessary unix_state_lock_nested() and non-NULL test
for peer.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 11:10:18 +02:00
Kuniyuki Iwashima
b380b18102 af_unix: Don't acquire unix_state_lock() for sock_i_ino().
sk_diag_dump_peer() and sk_diag_dump() call unix_state_lock() for
sock_i_ino() which reads SOCK_INODE(sk->sk_socket)->i_ino, but it's
protected by sk->sk_callback_lock.

Let's remove unnecessary unix_state_lock().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 11:10:18 +02:00
Kuniyuki Iwashima
98f706de44 af_unix: Define locking order for U_LOCK_SECOND in unix_stream_connect().
While a SOCK_(STREAM|SEQPACKET) socket connect()s to another, we hold
two locks of them by unix_state_lock() and unix_state_lock_nested() in
unix_stream_connect().

Before unix_state_lock_nested(), the following is guaranteed by checking
sk->sk_state:

  1. The first socket is TCP_LISTEN
  2. The second socket is not the first one
  3. Simultaneous connect() must fail

So, the client state can be TCP_CLOSE or TCP_LISTEN or TCP_ESTABLISHED.

Let's define the expected states as unix_state_lock_cmp_fn() instead of
using unix_state_lock_nested().

Note that 2. is detected by debug_spin_lock_before() and 3. cannot be
expressed as lock_cmp_fn.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 11:10:18 +02:00
Kuniyuki Iwashima
1ca27e0c8c af_unix: Don't retry after unix_state_lock_nested() in unix_stream_connect().
When a SOCK_(STREAM|SEQPACKET) socket connect()s to another one, we need
to lock the two sockets to check their states in unix_stream_connect().

We use unix_state_lock() for the server and unix_state_lock_nested() for
client with tricky sk->sk_state check to avoid deadlock.

The possible deadlock scenarios are the following:

  1) Self connect()
  2) Simultaneous connect()

The former is simply an attempt to grab the same lock twice, and the
latter is an AB-BA deadlock.

After the server's unix_state_lock(), we check the server socket's state,
and if it's not TCP_LISTEN, connect() fails with -EINVAL.

Then, we avoid the former deadlock by checking the client's state before
unix_state_lock_nested().  If its state is not TCP_LISTEN, we can make
sure that the client and the server are not identical based on the state.

Also, the latter deadlock can be avoided in the same way.  Due to the
server sk->sk_state requirement, AB-BA deadlock could happen only with
TCP_LISTEN sockets.  So, if the client's state is TCP_LISTEN, we can
give up the second lock to avoid the deadlock.

  CPU 1                 CPU 2                  CPU 3
  connect(A -> B)       connect(B -> A)        listen(A)
  ---                   ---                    ---
  unix_state_lock(B)
  B->sk_state == TCP_LISTEN
  READ_ONCE(A->sk_state) == TCP_CLOSE
                            ^^^^^^^^^
                            ok, will lock A    unix_state_lock(A)
             .--------------'                  WRITE_ONCE(A->sk_state, TCP_LISTEN)
             |                                 unix_state_unlock(A)
             |
             |          unix_state_lock(A)
             |          A->sk_state == TCP_LISTEN
             |          READ_ONCE(B->sk_state) == TCP_LISTEN
             v                                    ^^^^^^^^^^
  unix_state_lock_nested(A)                       Don't lock B !!

Currently, while checking the client's state, we also check if it's
TCP_ESTABLISHED, but this is unlikely and can be checked after we know
the state is not TCP_CLOSE.

Moreover, if it happens after the second lock, we now jump to the restart
label, but it's unlikely that the server is not found during the retry,
so the jump is mostly to revisit the client state check.

Let's remove the retry logic and check the state against TCP_CLOSE first.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 11:10:18 +02:00
Kuniyuki Iwashima
ed99822817 af_unix: Define locking order for U_LOCK_SECOND in unix_state_double_lock().
unix_dgram_connect() and unix_dgram_{send,recv}msg() lock the socket
and peer in ascending order of the socket address.

Let's define the order as unix_state_lock_cmp_fn() instead of using
unix_state_lock_nested().
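
The ordering itself is the classic address-ordered double lock; a
sketch (simplified):

  static void unix_state_double_lock(struct sock *sk1, struct sock *sk2)
  {
          if (unlikely(sk1 == sk2) || !sk2) {
                  unix_state_lock(sk1);
                  return;
          }

          /* Always lock the lower-addressed socket first; the cmp_fn
           * encodes exactly this order for lockdep. */
          if (sk1 > sk2)
                  swap(sk1, sk2);

          unix_state_lock(sk1);
          unix_state_lock(sk2);
  }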

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 11:10:18 +02:00
Kuniyuki Iwashima
3955802f16 af_unix: Define locking order for unix_table_double_lock().
When created, AF_UNIX socket is put into net->unx.table.buckets[],
and the hash is stored in sk->sk_hash.

  * unbound socket  : 0 <= sk_hash <= UNIX_HASH_MOD

When bind() is called, the socket could be moved to another bucket.

  * pathname socket : 0 <= sk_hash <= UNIX_HASH_MOD
  * abstract socket : UNIX_HASH_MOD + 1 <= sk_hash <= UNIX_HASH_MOD * 2 + 1

Then, we call unix_table_double_lock() which locks a single bucket
or two.

Let's define the order as unix_table_lock_cmp_fn() instead of using
spin_lock_nested().

The locking is always done in ascending order of sk->sk_hash, which
is the index of buckets/locks array allocated by kvmalloc_array().

  sk_hash_A < sk_hash_B
  <=> &locks[sk_hash_A].dep_map < &locks[sk_hash_B].dep_map

So, the relation of two sk->sk_hash can be derived from the addresses
of dep_map in the array of locks.
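
Given that, the comparison function can work on the dep_map addresses
alone; a sketch using the helper mentioned in v2 of the series:

  #define cmp_ptr(l, r)   (((l) > (r)) - ((l) < (r)))

  static int unix_table_lock_cmp_fn(const struct lockdep_map *a,
                                    const struct lockdep_map *b)
  {
          /* ascending &locks[sk_hash].dep_map <=> ascending sk_hash */
          return cmp_ptr(a, b);
  }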

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 11:10:18 +02:00
Nick Child
0983d288ca ibmvnic: Add tx check to prevent skb leak
Below is a summary of how the driver stores a reference to an skb during
transmit:
    tx_buff[free_map[consumer_index]]->skb = new_skb;
    free_map[consumer_index] = IBMVNIC_INVALID_MAP;
    consumer_index ++;
Where variable data looks like this:
    free_map == [4, IBMVNIC_INVALID_MAP, IBMVNIC_INVALID_MAP, 0, 3]
                                               	consumer_index^
    tx_buff == [skb=null, skb=<ptr>, skb=<ptr>, skb=null, skb=null]

The driver has checks to ensure that free_map[consumer_index] points to
a valid index, but there was no check to ensure that this index points
to an unused/NULL skb address. So if, by some chance, the free_map and
tx_buff lists got out of sync, we risked an skb memory leak. This could
then cause TCP congestion control to stop sending packets, eventually
leading to ETIMEDOUT.

Therefore, add a conditional to ensure that the skb address is NULL. If
it is not, warn the user (because this is still a bug that should be
patched) and free the old skb to prevent memleak/TCP problems.
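
A distillation of the added guard (hypothetical helper, simplified):

  static void ibmvnic_store_skb(struct net_device *netdev,
                                struct ibmvnic_tx_buff *tx_buff,
                                struct sk_buff *new_skb)
  {
          /* free_map said this slot is free; verify tx_buff agrees. */
          if (WARN_ONCE(tx_buff->skb,
                        "%s: free_map and tx_buff are out of sync\n",
                        netdev->name))
                  dev_kfree_skb_any(tx_buff->skb);  /* plug the leak */

          tx_buff->skb = new_skb;
  }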

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-25 10:43:42 +02:00
Andrew Bresticker
ab1ffc86cb mm/memory: don't require head page for do_set_pmd()
The requirement that the head page be passed to do_set_pmd() was added in
commit ef37b2ea08 ("mm/memory: page_add_file_rmap() ->
folio_add_file_rmap_[pte|pmd]()") and prevents pmd-mapping in the
finish_fault() and filemap_map_pages() paths if the page to be inserted is
anything but the head page for an otherwise suitable vma and pmd-sized
page.

Matthew said:

: We're going to stop using PMDs to map large folios unless the fault is
: within the first 4KiB of the PMD.  No idea how many workloads that
: affects, but it only needs to be backported as far as v6.8, so we may
: as well backport it.
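
A sketch of the relaxation inside do_set_pmd() (assumed shape): instead
of rejecting non-head pages outright, accept any page of a PMD-sized
folio and operate on its head:

  struct folio *folio = page_folio(page);

  /* Any page of an order-matching folio will do ... */
  if (folio_order(folio) != HPAGE_PMD_ORDER)
          return ret;

  /* ... as long as we map the folio via its head page. */
  page = &folio->page;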

Link: https://lkml.kernel.org/r/20240611153216.2794513-1-abrestic@rivosinc.com
Fixes: ef37b2ea08 ("mm/memory: page_add_file_rmap() -> folio_add_file_rmap_[pte|pmd]()")
Signed-off-by: Andrew Bresticker <abrestic@rivosinc.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24 20:52:11 -07:00
yangge
bf14ed81f5 mm/page_alloc: Separate THP PCP into movable and non-movable categories
Since commit 5d0a661d80 ("mm/page_alloc: use only one PCP list for
THP-sized allocations") no longer differentiates the migration type of
pages in the THP-sized PCP list, it's possible for non-movable
allocation requests to get a CMA page from the list, which is not
acceptable in some cases.

If a large amount of CMA memory is configured in the system (for
example, the CMA memory accounts for 50% of the system memory), starting
a virtual machine with device passthrough will get stuck.  While
starting the virtual machine, it will call pin_user_pages_remote(...,
FOLL_LONGTERM, ...) to pin memory.  Normally, if a page is present and
in a CMA area, pin_user_pages_remote() will migrate the page from the
CMA area to a non-CMA area because of the FOLL_LONGTERM flag.  But if
non-movable allocation requests return CMA memory,
migrate_longterm_unpinnable_pages() will migrate a CMA page to another
CMA page, which will fail to pass the check in
check_and_migrate_movable_pages() and cause endless migration.

Call trace:
pin_user_pages_remote
--__gup_longterm_locked // endless loops in this function
----_get_user_pages_locked
----check_and_migrate_movable_pages
------migrate_longterm_unpinnable_pages
--------alloc_migration_target

This problem will also have a negative impact on CMA itself.  For example,
when CMA is borrowed by THP and we need to reclaim it through cma_alloc()
or dma_alloc_coherent(), we must move those pages out to ensure CMA's
users can retrieve that contiguous memory.  But if CMA's memory is
occupied by non-movable pages, we can't relocate them, so cma_alloc()
is more likely to fail.

To fix the problem above, we add one more PCP list for THP, which does
not introduce a new cacheline for struct per_cpu_pages.  THP will have
2 PCP lists: one used by MOVABLE allocations (those with GFP_MOVABLE
set), and the other used by UNMOVABLE and RECLAIMABLE allocations.
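
A sketch of how the list selection might look (per the description;
names follow the page allocator's pindex helpers):

  static inline unsigned int order_to_pindex(int migratetype, int order)
  {
          if (order > PAGE_ALLOC_COSTLY_ORDER) {
                  /* Two THP lists: index 0 for !movable, 1 for movable. */
                  bool movable = migratetype == MIGRATE_MOVABLE;

                  return NR_LOWORDER_PCP_LISTS + movable;
          }

          return (MIGRATE_PCPTYPES * order) + migratetype;
  }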

Link: https://lkml.kernel.org/r/1718845190-4456-1-git-send-email-yangge1116@126.com
Fixes: 5d0a661d80 ("mm/page_alloc: use only one PCP list for THP-sized allocations")
Signed-off-by: yangge <yangge1116@126.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24 20:52:11 -07:00
Christoph Hellwig
54e7d59841 nfs: drop the incorrect assertion in nfs_swap_rw()
Since commit 2282679fb2 ("mm: submit multipage write for SWP_FS_OPS
swap-space"), we can plug multiple pages and then unplug them all
together.  That means iov_iter_count(iter) can be much bigger than
PAGE_SIZE; it actually covers iov_iter_npages(iter, INT_MAX) pages.

Note this issue has nothing to do with large folios as we don't support
THP_SWPOUT to non-block devices.
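
The assertion being dropped was of roughly this shape:

  VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);   /* no longer holds */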

[v-songbaohua@oppo.com: figure out the cause and correct the commit message]
Link: https://lkml.kernel.org/r/20240618065647.21791-1-21cnbao@gmail.com
Fixes: 2282679fb2 ("mm: submit multipage write for SWP_FS_OPS swap-space")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Closes: https://lore.kernel.org/linux-mm/20240617053201.GA16852@lst.de/
Reviewed-by: Martin Wege <martin.l.wege@gmail.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Steve French <sfrench@samba.org>
Cc: Trond Myklebust <trondmy@kernel.org>
Cc: Chuanhua Han <hanchuanhua@oppo.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24 20:52:11 -07:00
Zi Yan
c640825070 mm/migrate: make migrate_pages_batch() stats consistent
As Ying pointed out in [1], stats->nr_thp_failed needs to be updated to
avoid stats inconsistency between MIGRATE_SYNC and MIGRATE_ASYNC when
calling migrate_pages_batch().

Because if not, when migrate_pages_batch() is called via
migrate_pages(MIGRATE_ASYNC), nr_thp_failed will not be increased, while
when migrate_pages_batch() is called via migrate_pages(MIGRATE_SYNC*),
nr_thp_failed will be increased in migrate_pages_sync() by
stats->nr_thp_failed += astats.nr_thp_split.

[1] https://lore.kernel.org/linux-mm/87msnq7key.fsf@yhuang6-desk2.ccr.corp.intel.com/

Link: https://lkml.kernel.org/r/20240620012712.19804-1-zi.yan@sent.com
Link: https://lkml.kernel.org/r/20240618134151.29214-1-zi.yan@sent.com
Fixes: 7262f208ca ("mm/migrate: split source folio if it is on deferred split list")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24 20:52:10 -07:00
Jarkko Sakkinen
f3228a2d4c MAINTAINERS: TPM DEVICE DRIVER: update the W-tag
Git hosting for the test suite has been migrated from Gitlab to Codeberg,
given the "less hostile environment".

Link: https://lkml.kernel.org/r/20240618133556.105604-1-jarkko@kernel.org
Link: https://codeberg.org/jarkko/linux-tpmdd-test
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24 20:52:10 -07:00
aigourensheng
8b8546d298 selftests/mm:fix test_prctl_fork_exec return failure
After calling fork() in test_prctl_fork_exec(), the global variable
ksm_full_scans_fd is initialized to 0 in the child process upon entering
the main function of ./ksm_functional_tests.

In the function call chain test_child_ksm() -> __mmap_and_merge_range()
-> ksm_merge() -> ksm_get_full_scans(), start_scans =
ksm_get_full_scans() will return an error.  Therefore, the value of
ksm_full_scans_fd needs to be initialized before calling
test_child_ksm() in the child process.
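
A sketch of the fix in the exec'd child path (constant and helper
names are assumptions):

  if (argc > 1 && !strcmp(argv[1], FORK_EXEC_CHILD_PRG_NAME)) {
          /* Fresh process image: reopen ksm_full_scans_fd & friends
           * before running the child-side tests. */
          init_global_file_handles();     /* assumed helper */
          exit(test_child_ksm());
  }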

Link: https://lkml.kernel.org/r/20240617052934.5834-1-shechenglong001@gmail.com
Signed-off-by: aigourensheng <shechenglong001@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24 20:52:10 -07:00
Stephen Brennan
ff202303c3 mm: convert page type macros to enum
Changing PG_slab from a page flag to a page type in commit 46df8e73a4
("mm: free up PG_slab") in has the unintended consequence of removing the
PG_slab constant from kernel debuginfo.  The commit does add the value to
the vmcoreinfo note, which allows debuggers to find the value without
hardcoding it.  However it's most flexible to continue representing the
constant with an enum.  To that end, convert the page type fields into an
enum.  Debuggers will now be able to detect that PG_slab's type has
changed from enum pageflags to enum pagetype.
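
A sketch of the conversion (values illustrative, per the page type
scheme at the time):

  enum pagetype {
          PG_buddy        = 0x40000000,
          PG_offline      = 0x20000000,
          PG_table        = 0x10000000,
          PG_guard        = 0x08000000,
          PG_hugetlb      = 0x04000000,
          PG_slab         = 0x02000000,
  };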

Link: https://lkml.kernel.org/r/20240607202954.1198180-1-stephen.s.brennan@oracle.com
Fixes: 46df8e73a4 ("mm: free up PG_slab")
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hao Ge <gehao@kylinos.cn>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Omar Sandoval <osandov@osandov.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24 20:52:10 -07:00
Jan Kara
be346c1a6e ocfs2: fix DIO failure due to insufficient transaction credits
The code in ocfs2_dio_end_io_write() estimates number of necessary
transaction credits using ocfs2_calc_extend_credits().  This however does
not take into account that the IO could be arbitrarily large and can
contain arbitrary number of extents.

Extent tree manipulations do often extend the current transaction but not
in all of the cases.  For example if we have only single block extents in
the tree, ocfs2_mark_extent_written() will end up calling
ocfs2_replace_extent_rec() all the time and we will never extend the
current transaction and eventually exhaust all the transaction credits if
the IO contains many single block extents.  Once that happens a
WARN_ON(jbd2_handle_buffer_credits(handle) <= 0) is triggered in
jbd2_journal_dirty_metadata() and subsequently OCFS2 aborts in response to
this error.  This was actually triggered by one of our customers on a
heavily fragmented OCFS2 filesystem.

To fix the issue make sure the transaction always has enough credits for
one extent insert before each call of ocfs2_mark_extent_written().
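
A sketch of such a helper (assumed shape):

  /* Make sure the handle has at least 'credits' left, extending the
   * transaction when it does not. */
  static int ocfs2_assure_trans_credits(handle_t *handle, int credits)
  {
          int cur = jbd2_handle_buffer_credits(handle);

          if (cur >= credits)
                  return 0;

          return ocfs2_extend_trans(handle, credits - cur);
  }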

Heming Zhao said:

------
PANIC: "Kernel panic - not syncing: OCFS2: (device dm-1): panic forced after error"

PID: xxx  TASK: xxxx  CPU: 5  COMMAND: "SubmitThread-CA"
  #0 machine_kexec at ffffffff8c069932
  #1 __crash_kexec at ffffffff8c1338fa
  #2 panic at ffffffff8c1d69b9
  #3 ocfs2_handle_error at ffffffffc0c86c0c [ocfs2]
  #4 __ocfs2_abort at ffffffffc0c88387 [ocfs2]
  #5 ocfs2_journal_dirty at ffffffffc0c51e98 [ocfs2]
  #6 ocfs2_split_extent at ffffffffc0c27ea3 [ocfs2]
  #7 ocfs2_change_extent_flag at ffffffffc0c28053 [ocfs2]
  #8 ocfs2_mark_extent_written at ffffffffc0c28347 [ocfs2]
  #9 ocfs2_dio_end_io_write at ffffffffc0c2bef9 [ocfs2]
#10 ocfs2_dio_end_io at ffffffffc0c2c0f5 [ocfs2]
#11 dio_complete at ffffffff8c2b9fa7
#12 do_blockdev_direct_IO at ffffffff8c2bc09f
#13 ocfs2_direct_IO at ffffffffc0c2b653 [ocfs2]
#14 generic_file_direct_write at ffffffff8c1dcf14
#15 __generic_file_write_iter at ffffffff8c1dd07b
#16 ocfs2_file_write_iter at ffffffffc0c49f1f [ocfs2]
#17 aio_write at ffffffff8c2cc72e
#18 kmem_cache_alloc at ffffffff8c248dde
#19 do_io_submit at ffffffff8c2ccada
#20 do_syscall_64 at ffffffff8c004984
#21 entry_SYSCALL_64_after_hwframe at ffffffff8c8000ba

Link: https://lkml.kernel.org/r/20240617095543.6971-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20240614145243.8837-1-jack@suse.cz
Fixes: c15471f795 ("ocfs2: fix sparse file & data ordering issue in direct io")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24 20:52:10 -07:00
Andrey Konovalov
1c61990d37 kasan: fix bad call to unpoison_slab_object
Commit 29d7355a9d ("kasan: save alloc stack traces for mempool") messed
up one of the calls to unpoison_slab_object: the last two arguments are
supposed to be GFP flags and whether to init the object memory.

Fix the call.

Without this fix, __kasan_mempool_unpoison_object provides the object's
size as GFP flags to unpoison_slab_object, which can cause LOCKDEP reports
(and probably other issues).
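
The shape of the fix, per the description above (sketch):

  /* before: object size passed where GFP flags belong */
  unpoison_slab_object(slab->slab_cache, ptr, size, flags);

  /* after: GFP flags, and no init of the object memory */
  unpoison_slab_object(slab->slab_cache, ptr, flags, false);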

Link: https://lkml.kernel.org/r/20240614143238.60323-1-andrey.konovalov@linux.dev
Fixes: 29d7355a9d ("kasan: save alloc stack traces for mempool")
Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com>
Reported-by: Brad Spengler <spender@grsecurity.net>
Acked-by: Marco Elver <elver@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24 20:52:09 -07:00
Suren Baghdasaryan
34a023dc88 mm: handle profiling for fake memory allocations during compaction
During compaction isolated free pages are marked allocated so that they
can be split and/or freed.  For that, post_alloc_hook() is used inside
split_map_pages() and release_free_list().  split_map_pages() marks free
pages allocated, splits the pages and then lets
alloc_contig_range_noprof() free those pages.  release_free_list() marks
free pages and immediately frees them.  This usage of post_alloc_hook()
affects memory allocation profiling because these functions might not be
called from an instrumented allocator, therefore current->alloc_tag is
NULL and when debugging is enabled (CONFIG_MEM_ALLOC_PROFILING_DEBUG=y)
that causes warnings.  To avoid that, wrap such post_alloc_hook() calls
into an instrumented function which acts as an allocator which will be
charged for these fake allocations.  Note that these allocations are very
short lived until they are freed, therefore the associated counters should
usually read 0.
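
A sketch of the wrapping pattern (per the alloc_tag conventions;
function names are assumptions):

  /* Instrumented alias: the _noprof variant does the real work and
   * the macro charges the fake allocation to the call site's tag. */
  static struct page *mark_allocated_noprof(struct page *page,
                                            unsigned int order,
                                            gfp_t gfp_flags)
  {
          post_alloc_hook(page, order, gfp_flags);
          return page;
  }
  #define mark_allocated(...)  alloc_hooks(mark_allocated_noprof(__VA_ARGS__))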

Link: https://lkml.kernel.org/r/20240614230504.3849136-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Sourav Panda <souravpanda@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24 20:52:09 -07:00
Suren Baghdasaryan
b4601d096a mm/slab: fix 'variable obj_exts set but not used' warning
slab_post_alloc_hook() uses prepare_slab_obj_exts_hook() to obtain
slabobj_ext object.  Currently the only user of slabobj_ext object in this
path is memory allocation profiling, therefore when it's not enabled this
object is not needed.  This also generates a warning when compiling with
CONFIG_MEM_ALLOC_PROFILING=n.  Move the code under this configuration to
fix the warning.  If more slabobj_ext users appear in the future, the code
will have to be changed back to call prepare_slab_obj_exts_hook().
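
A sketch of the gating (simplified):

  #ifdef CONFIG_MEM_ALLOC_PROFILING
          /* Allocation profiling is currently the only slabobj_ext
           * user on this path. */
          struct slabobj_ext *obj_exts =
                  prepare_slab_obj_exts_hook(s, flags, object);

          if (obj_exts)
                  alloc_tag_add(&obj_exts->ref, current->alloc_tag, s->size);
  #endif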

Link: https://lkml.kernel.org/r/20240614225951.3845577-1-surenb@google.com
Fixes: 4b87369646 ("mm/slab: add allocation accounting into slab allocation and free paths")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202406150444.F6neSaiy-lkp@intel.com/
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Kees Cook <keescook@chromium.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24 20:52:09 -07:00
Jeff Xu
399ab86ea5 /proc/pid/smaps: add mseal info for vma
Add "sl" to /proc/pid/smaps to indicate that a vma is sealed.
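
A sketch of the change in the smaps VmFlags mnemonics table (assumed
shape):

  static const char mnemonics[BITS_PER_LONG][2] = {
          /* existing entries elided */
  #ifdef CONFIG_64BIT
          [ilog2(VM_SEALED)] = "sl",
  #endif
  };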

Link: https://lkml.kernel.org/r/20240614232014.806352-2-jeffxu@google.com
Fixes: 8be7258aad ("mseal: add mseal syscall")
Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Röttger <sroettger@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24 20:52:09 -07:00
Zhaoyang Huang
8c61291fd8 mm: fix incorrect vbq reference in purge_fragmented_block
xa_for_each() in _vm_unmap_aliases() loops through all vbs.  However,
since commit 062eacf57a ("mm: vmalloc: remove a global vmap_blocks
xarray") the vb from xarray may not be on the corresponding CPU
vmap_block_queue.  Consequently, purge_fragmented_block() might use the
wrong vbq->lock to protect the free list, leading to vbq->free breakage.

Incorrect lock protection can exhaust all vmalloc space as follows:
CPU0                                            CPU1
+--------------------------------------------+
|    +--------------------+     +-----+      |
+--> |                    |---->|     |------+
     | CPU1:vbq free_list |     | vb1 |
+--- |                    |<----|     |<-----+
|    +--------------------+     +-----+      |
+--------------------------------------------+

_vm_unmap_aliases()                             vb_alloc()
                                                new_vmap_block()
xa_for_each(&vbq->vmap_blocks, idx, vb)
--> vb in CPU1:vbq->freelist

purge_fragmented_block(vb)
spin_lock(&vbq->lock)                           spin_lock(&vbq->lock)
--> use CPU0:vbq->lock                          --> use CPU1:vbq->lock

list_del_rcu(&vb->free_list)                    list_add_tail_rcu(&vb->free_list, &vbq->free)
    __list_del(vb->prev, vb->next)
        next->prev = prev
    +--------------------+
    |                    |
    | CPU1:vbq free_list |
+---|                    |<--+
|   +--------------------+   |
+----------------------------+
                                                __list_add(new, head->prev, head)
+--------------------------------------------+
|    +--------------------+     +-----+      |
+--> |                    |---->|     |------+
     | CPU1:vbq free_list |     | vb2 |
+--- |                    |<----|     |<-----+
|    +--------------------+     +-----+      |
+--------------------------------------------+

        prev->next = next
+--------------------------------------------+
|----------------------------+               |
|    +--------------------+  |  +-----+      |
+--> |                    |--+  |     |------+
     | CPU1:vbq free_list |     | vb2 |
+--- |                    |<----|     |<-----+
|    +--------------------+     +-----+      |
+--------------------------------------------+
Here’s a list breakdown. All vbs, which were to be added to
‘prev’, cannot be used by list_for_each_entry_rcu(vb, &vbq->free,
free_list) in vb_alloc(). Thus, vmalloc space is exhausted.

This issue affects both erofs and f2fs, the stacktrace is as follows:
erofs:
[<ffffffd4ffb93ad4>] __switch_to+0x174
[<ffffffd4ffb942f0>] __schedule+0x624
[<ffffffd4ffb946f4>] schedule+0x7c
[<ffffffd4ffb947cc>] schedule_preempt_disabled+0x24
[<ffffffd4ffb962ec>] __mutex_lock+0x374
[<ffffffd4ffb95998>] __mutex_lock_slowpath+0x14
[<ffffffd4ffb95954>] mutex_lock+0x24
[<ffffffd4fef2900c>] reclaim_and_purge_vmap_areas+0x44
[<ffffffd4fef25908>] alloc_vmap_area+0x2e0
[<ffffffd4fef24ea0>] vm_map_ram+0x1b0
[<ffffffd4ff1b46f4>] z_erofs_lz4_decompress+0x278
[<ffffffd4ff1b8ac4>] z_erofs_decompress_queue+0x650
[<ffffffd4ff1b8328>] z_erofs_runqueue+0x7f4
[<ffffffd4ff1b66a8>] z_erofs_read_folio+0x104
[<ffffffd4feeb6fec>] filemap_read_folio+0x6c
[<ffffffd4feeb68c4>] filemap_fault+0x300
[<ffffffd4fef0ecac>] __do_fault+0xc8
[<ffffffd4fef0c908>] handle_mm_fault+0xb38
[<ffffffd4ffb9f008>] do_page_fault+0x288
[<ffffffd4ffb9ed64>] do_translation_fault[jt]+0x40
[<ffffffd4fec39c78>] do_mem_abort+0x58
[<ffffffd4ffb8c3e4>] el0_ia+0x70
[<ffffffd4ffb8c260>] el0t_64_sync_handler[jt]+0xb0
[<ffffffd4fec11588>] ret_to_user[jt]+0x0

f2fs:
[<ffffffd4ffb93ad4>] __switch_to+0x174
[<ffffffd4ffb942f0>] __schedule+0x624
[<ffffffd4ffb946f4>] schedule+0x7c
[<ffffffd4ffb947cc>] schedule_preempt_disabled+0x24
[<ffffffd4ffb962ec>] __mutex_lock+0x374
[<ffffffd4ffb95998>] __mutex_lock_slowpath+0x14
[<ffffffd4ffb95954>] mutex_lock+0x24
[<ffffffd4fef2900c>] reclaim_and_purge_vmap_areas+0x44
[<ffffffd4fef25908>] alloc_vmap_area+0x2e0
[<ffffffd4fef24ea0>] vm_map_ram+0x1b0
[<ffffffd4ff1a3b60>] f2fs_prepare_decomp_mem+0x144
[<ffffffd4ff1a6c24>] f2fs_alloc_dic+0x264
[<ffffffd4ff175468>] f2fs_read_multi_pages+0x428
[<ffffffd4ff17b46c>] f2fs_mpage_readpages+0x314
[<ffffffd4ff1785c4>] f2fs_readahead+0x50
[<ffffffd4feec3384>] read_pages+0x80
[<ffffffd4feec32c0>] page_cache_ra_unbounded+0x1a0
[<ffffffd4feec39e8>] page_cache_ra_order+0x274
[<ffffffd4feeb6cec>] do_sync_mmap_readahead+0x11c
[<ffffffd4feeb6764>] filemap_fault+0x1a0
[<ffffffd4ff1423bc>] f2fs_filemap_fault+0x28
[<ffffffd4fef0ecac>] __do_fault+0xc8
[<ffffffd4fef0c908>] handle_mm_fault+0xb38
[<ffffffd4ffb9f008>] do_page_fault+0x288
[<ffffffd4ffb9ed64>] do_translation_fault[jt]+0x40
[<ffffffd4fec39c78>] do_mem_abort+0x58
[<ffffffd4ffb8c3e4>] el0_ia+0x70
[<ffffffd4ffb8c260>] el0t_64_sync_handler[jt]+0xb0
[<ffffffd4fec11588>] ret_to_user[jt]+0x0

To fix this, introduce a cpu field within vmap_block to record which
CPU this vb belongs to.
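
A sketch of the fix (simplified):

  struct vmap_block {
          spinlock_t lock;
          /* other fields elided */
          unsigned int cpu;       /* CPU whose vbq free list holds this vb */
  };

  /* Purge path: take the lock of the vbq this vb actually sits on,
   * instead of the current CPU's queue. */
  struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, vb->cpu);

  spin_lock(&vbq->lock);
  list_del_rcu(&vb->free_list);
  spin_unlock(&vbq->lock);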

Link: https://lkml.kernel.org/r/20240614021352.1822225-1-zhaoyang.huang@unisoc.com
Link: https://lkml.kernel.org/r/20240607023116.1720640-1-zhaoyang.huang@unisoc.com
Fixes: fc1e0d9800 ("mm/vmalloc: prevent stale TLBs in fully utilized blocks")
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Suggested-by: Hailong.Liu <hailong.liu@oppo.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24 20:52:08 -07:00
Jakub Kicinski
482000cf7f bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZnlmXgAKCRDbK58LschI
 g2ovAP9iynwwFEjMSxHjQVXSq1J1PMqF4966vmy30RCKJMMN/QD/SRsRRKcfsPis
 BzKOdsOVbWlDl2CUqvBrPZGT6laKoQc=
 =6/0V
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-06-24

We've added 12 non-merge commits during the last 10 day(s) which contain
a total of 10 files changed, 412 insertions(+), 16 deletions(-).

The main changes are:

1) Fix a BPF verifier issue validating may_goto with a negative offset,
   from Alexei Starovoitov.

2) Fix a BPF verifier validation bug with may_goto combined with jump to
   the first instruction, also from Alexei Starovoitov.

3) Fix a bug with overrunning reservations in BPF ring buffer,
   from Daniel Borkmann.

4) Fix a bug in BPF verifier due to missing proper var_off setting related
   to movsx instruction, from Yonghong Song.

5) Silence unnecessary syzkaller-triggered warning in __xdp_reg_mem_model(),
   from Daniil Dulov.

* tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  xdp: Remove WARN() from __xdp_reg_mem_model()
  selftests/bpf: Add tests for may_goto with negative offset.
  bpf: Fix may_goto with negative offset.
  selftests/bpf: Add more ring buffer test coverage
  bpf: Fix overrunning reservations in ringbuf
  selftests/bpf: Tests with may_goto and jumps to the 1st insn
  bpf: Fix the corner case with may_goto and jump to the 1st insn.
  bpf: Update BPF LSM maintainer list
  bpf: Fix remap of arena.
  selftests/bpf: Add a few tests to cover
  bpf: Add missed var_off setting in coerce_subreg_to_size_sx()
  bpf: Add missed var_off setting in set_sext32_default_val()
====================

Link: https://patch.msgid.link/20240624124330.8401-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 18:15:22 -07:00
Jakub Kicinski
bf2468f9af Merge branch 'locking-introduce-nested-bh-locking'
Sebastian Andrzej Siewior says:

====================
locking: Introduce nested-BH locking.

Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code
within a local_bh_disable() section remains preemptible. As a result,
high-priority tasks (or threaded interrupts) will be blocked by
long-running lower-priority tasks (or threaded interrupts), which
includes softirq sections.

The proposed way out is to introduce explicit per-CPU locks for
resources which are protected by local_bh_disable() and use those only
on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.

The series introduces the infrastructure and converts large parts of
networking, which is the largest stakeholder here. Once this is done,
the per-CPU lock taken in local_bh_disable() on PREEMPT_RT can be lifted.
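
The conversion pattern, as a minimal sketch (names illustrative):

  struct foo_pcpu {
          /* data formerly protected only by local_bh_disable() */
          int counter;
          local_lock_t bh_lock;
  };

  static DEFINE_PER_CPU(struct foo_pcpu, foo_pcpu) = {
          .bh_lock = INIT_LOCAL_LOCK(bh_lock),
  };

  static void foo_update(void)
  {
          /* Lockdep-only on !PREEMPT_RT, a real per-CPU lock on RT. */
          local_lock_nested_bh(&foo_pcpu.bh_lock);
          __this_cpu_inc(foo_pcpu.counter);
          local_unlock_nested_bh(&foo_pcpu.bh_lock);
  }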

Performance testing. Baseline is net-next as of commit 93bda33046
("Merge branch'net-constify-ctl_table-arguments-of-utility-functions'")
plus v6.10-rc1. A 10GiG link is used between two hosts. The command
   xdp-bench redirect-cpu --cpu 3 --remote-action drop eth1 -e

was invoked on the receiving side with a ixgbe. The sending side uses
pktgen_sample03_burst_single_flow.sh on i40e.

Baseline:
| eth1->?                 9,018,604 rx/s                  0 err,drop/s
|   receive total         9,018,604 pkt/s                 0 drop/s                0 error/s
|     cpu:7               9,018,604 pkt/s                 0 drop/s                0 error/s
|   enqueue to cpu 3      9,018,602 pkt/s                 0 drop/s             7.00 bulk-avg
|     cpu:7->3            9,018,602 pkt/s                 0 drop/s             7.00 bulk-avg
|   kthread total         9,018,606 pkt/s                 0 drop/s          214,698 sched
|     cpu:3               9,018,606 pkt/s                 0 drop/s          214,698 sched
|     xdp_stats                   0 pass/s        9,018,606 drop/s                0 redir/s
|       cpu:3                     0 pass/s        9,018,606 drop/s                0 redir/s
|   redirect_err                  0 error/s
|   xdp_exception                 0 hit/s

perf top --sort cpu,symbol --no-children:
|   18.14%  007  [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
|   13.29%  007  [k] ixgbe_poll
|   12.66%  003  [k] cpu_map_kthread_run
|    7.23%  003  [k] page_frag_free
|    6.76%  007  [k] xdp_do_redirect
|    3.76%  007  [k] cpu_map_redirect
|    3.13%  007  [k] bq_flush_to_queue
|    2.51%  003  [k] xdp_return_frame
|    1.93%  007  [k] try_to_wake_up
|    1.78%  007  [k] _raw_spin_lock
|    1.74%  007  [k] cpu_map_enqueue
|    1.56%  003  [k] bpf_prog_57cd311f2e27366b_cpumap_drop

With this series applied:
| eth1->?                10,329,340 rx/s                  0 err,drop/s
|   receive total        10,329,340 pkt/s                 0 drop/s                0 error/s
|     cpu:6              10,329,340 pkt/s                 0 drop/s                0 error/s
|   enqueue to cpu 3     10,329,338 pkt/s                 0 drop/s             8.00 bulk-avg
|     cpu:6->3           10,329,338 pkt/s                 0 drop/s             8.00 bulk-avg
|   kthread total        10,329,321 pkt/s                 0 drop/s           96,297 sched
|     cpu:3              10,329,321 pkt/s                 0 drop/s           96,297 sched
|     xdp_stats                   0 pass/s       10,329,321 drop/s                0 redir/s
|       cpu:3                     0 pass/s       10,329,321 drop/s                0 redir/s
|   redirect_err                  0 error/s
|   xdp_exception                 0 hit/s

perf top --sort cpu,symbol --no-children:
|   20.90%  006  [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
|   12.62%  006  [k] ixgbe_poll
|    9.82%  003  [k] page_frag_free
|    8.73%  003  [k] cpu_map_bpf_prog_run_xdp
|    6.63%  006  [k] xdp_do_redirect
|    4.94%  003  [k] cpu_map_kthread_run
|    4.28%  006  [k] cpu_map_redirect
|    4.03%  006  [k] bq_flush_to_queue
|    3.01%  003  [k] xdp_return_frame
|    1.95%  006  [k] _raw_spin_lock
|    1.94%  003  [k] bpf_prog_57cd311f2e27366b_cpumap_drop

This diff appears to be noise.

v8: https://lore.kernel.org/all/20240619072253.504963-1-bigeasy@linutronix.de
v7: https://lore.kernel.org/all/20240618072526.379909-1-bigeasy@linutronix.de
v6: https://lore.kernel.org/all/20240612170303.3896084-1-bigeasy@linutronix.de
v5: https://lore.kernel.org/all/20240607070427.1379327-1-bigeasy@linutronix.de
v4: https://lore.kernel.org/all/20240604154425.878636-1-bigeasy@linutronix.de
v3: https://lore.kernel.org/all/20240529162927.403425-1-bigeasy@linutronix.de
v2: https://lore.kernel.org/all/20240503182957.1042122-1-bigeasy@linutronix.de
v1: https://lore.kernel.org/all/20231215171020.687342-1-bigeasy@linutronix.de
====================

Link: https://patch.msgid.link/20240620132727.660738-1-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:26 -07:00
Sebastian Andrzej Siewior
3f9fe37d9e net: Move per-CPU flush-lists to bpf_net_context on PREEMPT_RT.
The flush lists, which are accessed from within the NAPI callback
(xdp_do_flush() for instance), are per-CPU. They are subject to the
same problem as struct bpf_redirect_info.

Add the per-CPU lists cpu_map_flush_list, dev_map_flush_list and
xskmap_map_flush_list to struct bpf_net_context. Add wrappers for the
access. The lists are initialized on first usage (similar to
bpf_net_ctx_get_ri()).
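
A sketch of the container and one accessor (simplified; flag names
follow the series):

  struct bpf_net_context {
          struct bpf_redirect_info ri;
          struct list_head cpu_map_flush_list;
          struct list_head dev_map_flush_list;
          struct list_head xskmap_map_flush_list;
  };

  static inline struct list_head *bpf_net_ctx_get_cpu_map_flush_list(void)
  {
          struct bpf_net_context *ctx = bpf_net_ctx_get();

          /* Lazily initialize the list on first use. */
          if (!(ctx->ri.kern_flags & BPF_RI_F_CPU_MAP_INIT)) {
                  INIT_LIST_HEAD(&ctx->cpu_map_flush_list);
                  ctx->ri.kern_flags |= BPF_RI_F_CPU_MAP_INIT;
          }

          return &ctx->cpu_map_flush_list;
  }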

Cc: "Björn Töpel" <bjorn@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-16-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:24 -07:00
Sebastian Andrzej Siewior
401cb7dae8 net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
The XDP redirect process is two staged:
- bpf_prog_run_xdp() is invoked to run an eBPF program which inspects
  the packet and makes decisions. While doing that, the per-CPU variable
  bpf_redirect_info is used.

- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
  and it may also access other per-CPU variables like xskmap_flush_list.

At the very end of the NAPI callback, xdp_do_flush() is invoked which
does not access bpf_redirect_info but will touch the individual per-CPU
lists.

The per-CPU variables are only used in the NAPI callback, hence disabling
bottom halves is the only protection mechanism. Users from preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves
for protection reasons.
Without locking in local_bh_disable() on PREEMPT_RT this data structure
requires explicit locking.

PREEMPT_RT has forced-threaded interrupts enabled and every
NAPI-callback runs in a thread. If each thread has its own data
structure then locking can be avoided.

Create a struct bpf_net_context which contains struct bpf_redirect_info.
Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
it, bpf_net_ctx_clear() removes it again.
The bpf_net_ctx_set() may nest. For instance a function can be used from
within NET_RX_SOFTIRQ/net_rx_action which uses bpf_net_ctx_set() and
NET_TX_SOFTIRQ which does not. Therefore only the first invocation
updates the pointer.
Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
bpf_redirect_info. The returned data structure is zero initialized to
ensure nothing is leaked from stack. This is done on first usage of the
struct. bpf_net_ctx_set() sets bpf_redirect_info::kern_flags to 0 to
note that initialisation is required. First invocation of
bpf_net_ctx_get_ri() will memset() the data structure and update
bpf_redirect_info::kern_flags.
bpf_redirect_info::nh is excluded from memset because it is only used
once BPF_F_NEIGH is set which also sets the nh member. The kern_flags is
moved past nh to exclude it from memset.

The pointer to the bpf_net_context is saved in the task's task_struct.
Always using the bpf_net_context approach has the advantage that there
are almost zero differences between PREEMPT_RT and non-PREEMPT_RT builds.
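
A sketch of the set/clear pair (simplified):

  static inline struct bpf_net_context *
  bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
  {
          struct task_struct *tsk = current;

          /* Nested set: only the outermost invocation installs itself. */
          if (tsk->bpf_net_context)
                  return NULL;

          /* Ask bpf_net_ctx_get_ri() to memset() the ri on first use. */
          bpf_net_ctx->ri.kern_flags = 0;
          tsk->bpf_net_context = bpf_net_ctx;

          return bpf_net_ctx;
  }

  static inline void bpf_net_ctx_clear(struct bpf_net_context *bpf_net_ctx)
  {
          if (bpf_net_ctx)
                  current->bpf_net_context = NULL;
  }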

Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-15-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:24 -07:00
Sebastian Andrzej Siewior
78f520b7bb net: Use nested-BH locking for bpf_scratchpad.
bpf_scratchpad is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. This change adds only lockdep coverage and does not alter
the functional behaviour for !PREEMPT_RT.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-14-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:23 -07:00
Sebastian Andrzej Siewior
d1542d4ae4 seg6: Use nested-BH locking for seg6_bpf_srh_states.
The access to seg6_bpf_srh_states is protected by disabling preemption.
Based on the code, the entry point is input_action_end_bpf() and
every other function (the bpf helper functions bpf_lwt_seg6_*()), that
is accessing seg6_bpf_srh_states, should be called from within
input_action_end_bpf().

input_action_end_bpf() accesses seg6_bpf_srh_states first at the top of
the function and then disables preemption. This looks wrong because if
preemption needs to be disabled as part of the locking mechanism then
the variable shouldn't be accessed beforehand.

Looking at how it is used via test_lwt_seg6local.sh then
input_action_end_bpf() is always invoked from softirq context. If this
is always the case then the preempt_disable() statement is superfluous.
If this is not always invoked from softirq then disabling only
preemption is not sufficient.

Replace the preempt_disable() statement with nested-BH locking. This is
not an equivalent replacement as it assumes that the invocation of
input_action_end_bpf() always occurs in softirq context and thus the
preempt_disable() is superfluous.
Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. Add lockdep_assert_held() to ensure the lock is held while
the per-CPU variable is referenced in the helper functions.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-13-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:23 -07:00
Sebastian Andrzej Siewior
3414adbd6a lwt: Don't disable migration prior to invoking BPF.
There is no need to explicitly disable migration if bottom halves are
also disabled. Disabling BH implies disabling migration.

Remove migrate_disable() and rely solely on disabling BH to remain on
the same CPU.
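
The resulting simplification, sketched (not the exact diff):

  /* Before: */
  migrate_disable();
  local_bh_disable();
  ret = bpf_prog_run_save_cb(lwt->prog, skb);
  local_bh_enable();
  migrate_enable();

  /* After: disabling BH already pins the task to the CPU. */
  local_bh_disable();
  ret = bpf_prog_run_save_cb(lwt->prog, skb);
  local_bh_enable();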

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-12-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:23 -07:00
Sebastian Andrzej Siewior
b22800f9d3 dev: Use nested-BH locking for softnet_data.process_queue.
softnet_data::process_queue is a per-CPU variable and relies on disabled
BH for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.

softnet_data::input_queue_head can be updated locklessly. This is fine
because this value is only updated CPU-locally by the local backlog_napi
thread.

Add a local_lock_t to softnet_data and use local_lock_nested_bh() for locking
of process_queue. This change adds only lockdep coverage and does not
alter the functional behaviour for !PREEMPT_RT.
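
Sketch of the resulting dequeue path (the lock member name,
process_queue_bh_lock, is an assumption):

    struct softnet_data {
        /* ... */
        local_lock_t        process_queue_bh_lock; /* protects process_queue */
        struct sk_buff_head process_queue;
        /* ... */
    };

    /* in process_backlog()-style code, BH already disabled */
    local_lock_nested_bh(&softnet_data.process_queue_bh_lock);
    skb = __skb_dequeue(&sd->process_queue);
    local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);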

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-11-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:23 -07:00
Sebastian Andrzej Siewior
a8760d0d14 dev: Remove PREEMPT_RT ifdefs from backlog_lock.*().
The backlog_napi locking (previously RPS) relies on explicit locking if
either RPS or backlog NAPI is enabled. If both are disabled, locking is
achieved by disabling interrupts, except on PREEMPT_RT. PREEMPT_RT is
excluded because the needed synchronisation is already provided by
local_bh_disable().

Since backlog NAPI was introduced and made mandatory for PREEMPT_RT,
the ifdef within backlog_lock.*() is obsolete and can be removed.

Remove the ifdefs in backlog_lock.*().
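
Roughly (a sketch, not the verbatim upstream helper):

    /* before */
    static inline void backlog_lock_irq_save(struct softnet_data *sd,
                                             unsigned long *flags)
    {
        if (IS_ENABLED(CONFIG_RPS) || use_backlog_threads())
            spin_lock_irqsave(&sd->input_pkt_queue.lock, *flags);
        else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
            local_irq_save(*flags);
    }

    /* after: backlog NAPI is mandatory on PREEMPT_RT, so the first
     * branch is always taken there and the extra check is dead
     */
    static inline void backlog_lock_irq_save(struct softnet_data *sd,
                                             unsigned long *flags)
    {
        if (IS_ENABLED(CONFIG_RPS) || use_backlog_threads())
            spin_lock_irqsave(&sd->input_pkt_queue.lock, *flags);
        else
            local_irq_save(*flags);
    }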

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-10-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:23 -07:00
Sebastian Andrzej Siewior
ecefbc09e8 net: softnet_data: Make xmit per task.
Softirq is preemptible on PREEMPT_RT. Without a per-CPU lock in
local_bh_disable() there is no guarantee that only one device is
transmitting at a time.
With preemption and multiple senders it is possible that the per-CPU
`recursion' counter gets incremented by different threads and exceeds
XMIT_RECURSION_LIMIT, leading to a false-positive recursion alert.
The `more' member is subject to similar problems if set by one thread
for one driver and wrongly used by another driver within another thread.

Instead of adding a lock to protect the per-CPU variable, it is simpler
to make xmit per-task. Sending and receiving skbs always happens in
thread context anyway.

Having a lock to protect the per-CPU counter would needlessly
block/serialize two sending threads. It would also require a recursive
lock so that the owner can increment the counter further.

Make softnet_data.xmit a task_struct member on PREEMPT_RT. Add the
needed wrappers.
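
A sketch of the idea (member and wrapper names are approximated):

    struct netdev_xmit {
        u16 recursion;
        u8  more;
    };

    #ifdef CONFIG_PREEMPT_RT
    /* lives in struct task_struct */
    struct netdev_xmit net_xmit;

    static inline int dev_recursion_level(void)
    {
        return current->net_xmit.recursion;
    }
    #else
    /* xmit stays in per-CPU softnet_data */
    static inline int dev_recursion_level(void)
    {
        return this_cpu_read(softnet_data.xmit.recursion);
    }
    #endif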

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-9-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:23 -07:00
Sebastian Andrzej Siewior
c67ef53a88 netfilter: br_netfilter: Use nested-BH locking for brnf_frag_data_storage.
brnf_frag_data_storage is a per-CPU variable and relies on disabled BH
for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. This change adds only lockdep coverage and does not alter
the functional behaviour for !PREEMPT_RT.
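
The same pattern as elsewhere in this series, applied to the bridge
netfilter fragment scratch area (sketch; the field layout approximates
net/bridge/br_netfilter_hooks.c):

    struct brnf_frag_data {
        local_lock_t bh_lock;   /* protects the scratch area below */
        char mac[NF_BRIDGE_MAX_MAC_HEADER_LENGTH];
        u8 encap_size;
        u8 size;
    };

    static DEFINE_PER_CPU(struct brnf_frag_data, brnf_frag_data_storage) = {
        .bh_lock = INIT_LOCAL_LOCK(bh_lock),
    };

    /* fill/consume the scratch data under the nested-BH lock */
    local_lock_nested_bh(&brnf_frag_data_storage.bh_lock);
    data = this_cpu_ptr(&brnf_frag_data_storage);
    /* ... */
    local_unlock_nested_bh(&brnf_frag_data_storage.bh_lock);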

Cc: Florian Westphal <fw@strlen.de>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: bridge@lists.linux.dev
Cc: coreteam@netfilter.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-8-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:23 -07:00
Sebastian Andrzej Siewior
ebad6d0334 net/ipv4: Use nested-BH locking for ipv4_tcp_sk.
ipv4_tcp_sk is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Make a struct with a sock member (original ipv4_tcp_sk) and a
local_lock_t and use local_lock_nested_bh() for locking. This change
adds only lockdep coverage and does not alter the functional behaviour
for !PREEMPT_RT.
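
Sketch of wrapping the bare per-CPU pointer (the wrapper struct name is
an assumption):

    struct sock_bh_locked {     /* name assumed */
        struct sock *sock;
        local_lock_t bh_lock;
    };

    static DEFINE_PER_CPU(struct sock_bh_locked, ipv4_tcp_sk) = {
        .bh_lock = INIT_LOCAL_LOCK(bh_lock),
    };

    /* in tcp_v4_send_reset()-style code, BH already disabled */
    local_lock_nested_bh(&ipv4_tcp_sk.bh_lock);
    ctl_sk = this_cpu_read(ipv4_tcp_sk.sock);
    /* ... send via ctl_sk ... */
    local_unlock_nested_bh(&ipv4_tcp_sk.bh_lock);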

Cc: David Ahern <dsahern@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-7-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:22 -07:00
Sebastian Andrzej Siewior
585aa621af net/tcp_sigpool: Use nested-BH locking for sigpool_scratch.
sigpool_scratch is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Make a struct with a pad member (original sigpool_scratch) and a
local_lock_t and use local_lock_nested_bh() for locking. This change
adds only lockdep coverage and does not alter the functional behaviour
for !PREEMPT_RT.
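
Same construction, with the original per-CPU pointer as the pad member
(sketch):

    struct sigpool_scratch {
        local_lock_t bh_lock;
        void __rcu *pad;        /* the original per-CPU pointer */
    };

    static DEFINE_PER_CPU(struct sigpool_scratch, sigpool_scratch) = {
        .bh_lock = INIT_LOCAL_LOCK(bh_lock),
    };

    /* use the scratch area only while holding the nested-BH lock */
    local_lock_nested_bh(&sigpool_scratch.bh_lock);
    scratch = rcu_dereference_bh(*this_cpu_ptr(&sigpool_scratch.pad));
    /* ... hash into scratch ... */
    local_unlock_nested_bh(&sigpool_scratch.bh_lock);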

Cc: David Ahern <dsahern@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-6-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:22 -07:00
Sebastian Andrzej Siewior
bdacf3e349 net: Use nested-BH locking for napi_alloc_cache.
napi_alloc_cache is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. This change adds only lockdep coverage and does not alter
the functional behaviour for !PREEMPT_RT.
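
Roughly (allocation path simplified):

    struct napi_alloc_cache {
        local_lock_t bh_lock;   /* new */
        struct page_frag_cache page;
        /* ... */
    };

    void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
    {
        struct napi_alloc_cache *nc;
        void *data;

        fragsz = SKB_DATA_ALIGN(fragsz);
        local_lock_nested_bh(&napi_alloc_cache.bh_lock);
        nc = this_cpu_ptr(&napi_alloc_cache);
        data = page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC,
                                     align_mask);
        local_unlock_nested_bh(&napi_alloc_cache.bh_lock);
        return data;
    }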

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-5-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:22 -07:00
Sebastian Andrzej Siewior
43d7ca2907 net: Use __napi_alloc_frag_align() instead of open coding it.
The else branch within __netdev_alloc_frag_align() is an open-coded
__napi_alloc_frag_align().

Use __napi_alloc_frag_align() instead of open coding it.
Move the fragsz assignment before the page_frag_alloc_align()
invocation because __napi_alloc_frag_align() also contains this
statement.
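
Before/after (approximated, not the verbatim upstream code):

    /* before */
    void *__netdev_alloc_frag_align(unsigned int fragsz,
                                    unsigned int align_mask)
    {
        void *data;

        fragsz = SKB_DATA_ALIGN(fragsz);
        if (in_hardirq() || irqs_disabled()) {
            struct page_frag_cache *nc;

            nc = this_cpu_ptr(&netdev_alloc_cache);
            data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask);
        } else {
            struct napi_alloc_cache *nc;

            local_bh_disable();
            nc = this_cpu_ptr(&napi_alloc_cache);
            data = page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC,
                                         align_mask);
            local_bh_enable();
        }
        return data;
    }

    /* after: the else branch reuses the helper, which aligns fragsz
     * itself, so the alignment moves into the hardirq branch
     */
    void *__netdev_alloc_frag_align(unsigned int fragsz,
                                    unsigned int align_mask)
    {
        void *data;

        if (in_hardirq() || irqs_disabled()) {
            struct page_frag_cache *nc;

            fragsz = SKB_DATA_ALIGN(fragsz);
            nc = this_cpu_ptr(&netdev_alloc_cache);
            data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask);
        } else {
            local_bh_disable();
            data = __napi_alloc_frag_align(fragsz, align_mask);
            local_bh_enable();
        }
        return data;
    }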

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-4-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:22 -07:00
Sebastian Andrzej Siewior
c5bcab7558 locking/local_lock: Add local nested BH locking infrastructure.
Add local_lock_nested_bh() locking. It is based on local_lock_t and the
naming follows the preempt_disable_nested() example.

For !PREEMPT_RT + !LOCKDEP it is a per-CPU annotation for locking
assumptions based on local_bh_disable(). The macro is optimized away
during compilation.
For !PREEMPT_RT + LOCKDEP, local_lock_nested_bh() is reduced to the
usual lock acquire plus lockdep_assert_in_softirq(), ensuring that BH
is disabled.

For PREEMPT_RT local_lock_nested_bh() acquires the specified per-CPU
lock. It does not disable CPU migration because it relies on
local_bh_disable() disabling CPU migration.
With LOCKDEP it performs the usual lockdep checks as with !PREEMPT_RT.
Due to include hell, the softirq check has been moved to spinlock.c.

The intention is to use this locking in places where locking of a per-CPU
variable relies on BH being disabled. Instead of treating disabled
bottom halves as a big per-CPU lock, PREEMPT_RT can use this to reduce
the locking scope to what actually needs protecting.
A side effect is that it also documents the protection scope of the
per-CPU variables.
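
Usage and per-configuration behaviour, as a sketch (the per-CPU
variable and its counter member are illustrative):

    #include <linux/local_lock.h>

    /* a per-CPU variable whose protection scope is "BH disabled
     * on this CPU"
     */
    struct pcpu_thing {
        local_lock_t bh_lock;
        int counter;            /* protected data */
    };
    static DEFINE_PER_CPU(struct pcpu_thing, pcpu_thing) = {
        .bh_lock = INIT_LOCAL_LOCK(bh_lock),
    };

    local_bh_disable();                     /* still required */
    local_lock_nested_bh(&pcpu_thing.bh_lock);
    /* !PREEMPT_RT, !LOCKDEP: compiles away entirely
     * !PREEMPT_RT,  LOCKDEP: lock acquire + lockdep_assert_in_softirq()
     *  PREEMPT_RT:           takes the per-CPU lock; no migrate_disable(),
     *                        since local_bh_disable() already prevents
     *                        migration
     */
    this_cpu_inc(pcpu_thing.counter);
    local_unlock_nested_bh(&pcpu_thing.bh_lock);
    local_bh_enable();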

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-3-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:22 -07:00