linux

Author	SHA1	Message	Date
Sven Schnelle	bb793562f0	entry: Rename exit_to_user_mode() In order to make this function publicly available rename it so it can still be inlined. An additional exit_to_user_mode() function will be added with a later commit. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20201201142755.31931-3-svens@linux.ibm.com	2020-12-02 15:07:57 +01:00
Sven Schnelle	6666bb714f	entry: Rename enter_from_user_mode() In order to make this function publicly available rename it so it can still be inlined. An additional enter_from_user_mode() function will be added with a later commit. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20201201142755.31931-2-svens@linux.ibm.com	2020-12-02 15:07:57 +01:00
Gabriel Krisman Bertazi	11894468e3	entry: Support Syscall User Dispatch on common syscall entry Syscall User Dispatch (SUD) must take precedence over seccomp and ptrace, since the use case is emulation (it can be invoked with a different ABI) such that seccomp filtering by syscall number doesn't make sense in the first place. In addition, either the syscall is dispatched back to userspace, in which case there is no resource for to trace, or the syscall will be executed, and seccomp/ptrace will execute next. Since SUD runs before tracepoints, it needs to be a SYSCALL_WORK_EXIT as well, just to prevent a trace exit event when dispatch was triggered. For that, the on_syscall_dispatch() examines context to skip the tracepoint, audit and other work. [ tglx: Add a comment on the exit side ] Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andy Lutomirski <luto@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20201127193238.821364-5-krisman@collabora.com	2020-12-02 15:07:56 +01:00
Gabriel Krisman Bertazi	1446e1df9e	kernel: Implement selective syscall userspace redirection Introduce a mechanism to quickly disable/enable syscall handling for a specific process and redirect to userspace via SIGSYS. This is useful for processes with parts that require syscall redirection and parts that don't, but who need to perform this boundary crossing really fast, without paying the cost of a system call to reconfigure syscall handling on each boundary transition. This is particularly important for Windows games running over Wine. The proposed interface looks like this: prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector]) The range [<offset>,<offset>+<length>) is a part of the process memory map that is allowed to by-pass the redirection code and dispatch syscalls directly, such that in fast paths a process doesn't need to disable the trap nor the kernel has to check the selector. This is essential to return from SIGSYS to a blocked area without triggering another SIGSYS from rt_sigreturn. selector is an optional pointer to a char-sized userspace memory region that has a key switch for the mechanism. This key switch is set to either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the redirection without calling the kernel. The feature is meant to be set per-thread and it is disabled on fork/clone/execv. Internally, this doesn't add overhead to the syscall hot path, and it requires very little per-architecture support. I avoided using seccomp, even though it duplicates some functionality, due to previous feedback that maybe it shouldn't mix with seccomp since it is not a security mechanism. And obviously, this should never be considered a security mechanism, since any part of the program can by-pass it by using the syscall dispatcher. For the sysinfo benchmark, which measures the overhead added to executing a native syscall that doesn't require interception, the overhead using only the direct dispatcher region to issue syscalls is pretty much irrelevant. The overhead of using the selector goes around 40ns for a native (unredirected) syscall in my system, and it is (as expected) dominated by the supervisor-mode user-address access. In fact, with SMAP off, the overhead is consistently less than 5ns on my test box. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andy Lutomirski <luto@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com	2020-12-02 15:07:56 +01:00
Steven Rostedt (VMware)	5b7be9c709	ring-buffer: Add test to validate the time stamp deltas While debugging a situation where a delta for an event was calucalted wrong, I realize there was nothing making sure that the delta of events are correct. If a single event has an incorrect delta, then all events after it will also have one. If the discrepency gets large enough, it could cause the time stamps to go backwards when crossing sub buffers, that record a full 64 bit time stamp, and the new deltas are added to that. Add a way to validate the events at most events and when crossing a buffer page. This will help make sure that the deltas are always correct. This test will detect if they are ever corrupted. The test adds a high overhead to the ring buffer recording, as it does the audit for almost every event, and should only be used for testing the ring buffer. This will catch the bug that is fixed by commit `55ea4cf403` ("ring-buffer: Update write stamp with the correct ts"), which is not applied when this commit is applied. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-12-01 22:00:46 -05:00
Jason Gunthorpe	2b0a999ba0	Merge tag 'v5.10-rc6' into rdma.git for-next For dependencies in following patches Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2020-12-01 20:40:50 -04:00
Linus Torvalds	ef6900acc8	Merge tag 'trace-v5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: - Use correct timestamp variable for ring buffer write stamp update - Fix up before stamp and write stamp when crossing ring buffer sub buffers - Keep a zero delta in ring buffer in slow path if cmpxchg fails - Fix trace_printk static buffer for archs that care - Fix ftrace record accounting for ftrace ops with trampolines - Fix DYNAMIC_FTRACE_WITH_DIRECT_CALLS dependency - Remove WARN_ON in hwlat tracer that triggers on something that is OK - Make "my_tramp" trampoline in ftrace direct sample code global - Fixes in the bootconfig tool for better alignment management * tag 'trace-v5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ring-buffer: Always check to put back before stamp when crossing pages ftrace: Fix DYNAMIC_FTRACE_WITH_DIRECT_CALLS dependency ftrace: Fix updating FTRACE_FL_TRAMP tracing: Fix alignment of static buffer tracing: Remove WARN_ON in start_thread() samples/ftrace: Mark my_tramp[12]? global ring-buffer: Set the right timestamp in the slow path of __rb_reserve_next() ring-buffer: Update write stamp with the correct ts docs: bootconfig: Update file format on initrd image tools/bootconfig: Align the bootconfig applied initrd image size to 4 tools/bootconfig: Fix to check the write failure correctly tools/bootconfig: Fix errno reference after printf()	2020-12-01 15:30:18 -08:00
Christoph Hellwig	0d02129e76	block: merge struct block_device and struct hd_struct Instead of having two structures that represent each block device with different life time rules, merge them into a single one. This also greatly simplifies the reference counting rules, as we can use the inode reference count as the main reference count for the new struct block_device, with the device model reference front ending it for device model interaction. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-01 14:53:40 -07:00
Christoph Hellwig	29ff57c610	block: move the start_sect field to struct block_device Move the start_sect field to struct block_device in preparation of killing struct hd_struct. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-01 14:53:40 -07:00
Christoph Hellwig	a782483cc1	block: remove the nr_sects field in struct hd_struct Now that the hd_struct always has a block device attached to it, there is no need for having two size field that just get out of sync. Additionally the field in hd_struct did not use proper serialization, possibly allowing for torn writes. By only using the block_device field this problem also gets fixed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-01 14:53:40 -07:00
Sami Tolvanen	a2abe7cbd8	scs: switch to vmapped shadow stacks The kernel currently uses kmem_cache to allocate shadow call stacks, which means an overflows may not be immediately detected and can potentially result in another task's shadow stack to be overwritten. This change switches SCS to use virtually mapped shadow stacks for tasks, which increases shadow stack size to a full page and provides more robust overflow detection, similarly to VMAP_STACK. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20201130233442.2562064-2-samitolvanen@google.com Signed-off-by: Will Deacon <will@kernel.org>	2020-12-01 10:30:28 +00:00
Steven Rostedt (VMware)	68e10d5ff5	ring-buffer: Always check to put back before stamp when crossing pages The current ring buffer logic checks to see if the updating of the event buffer was interrupted, and if it is, it will try to fix up the before stamp with the write stamp to make them equal again. This logic is flawed, because if it is not interrupted, the two are guaranteed to be different, as the current event just updated the before stamp before allocation. This guarantees that the next event (this one or another interrupting one) will think it interrupted the time updates of a previous event and inject an absolute time stamp to compensate. The correct logic is to always update the timestamps when traversing to a new sub buffer. Cc: stable@vger.kernel.org Fixes: `a389d86f7f` ("ring-buffer: Have nested events still record running time stamp") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-11-30 23:21:51 -05:00
Naveen N. Rao	49a962c075	ftrace: Fix DYNAMIC_FTRACE_WITH_DIRECT_CALLS dependency DYNAMIC_FTRACE_WITH_DIRECT_CALLS should depend on DYNAMIC_FTRACE_WITH_REGS since we need ftrace_regs_caller(). Link: https://lkml.kernel.org/r/fc4b257ea8689a36f086d2389a9ed989496ca63a.1606412433.git.naveen.n.rao@linux.vnet.ibm.com Cc: stable@vger.kernel.org Fixes: `763e34e74b` ("ftrace: Add register_ftrace_direct()") Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-11-30 21:43:08 -05:00
Naveen N. Rao	4c75b0ff4e	ftrace: Fix updating FTRACE_FL_TRAMP On powerpc, kprobe-direct.tc triggered FTRACE_WARN_ON() in ftrace_get_addr_new() followed by the below message: Bad trampoline accounting at: 000000004222522f (wake_up_process+0xc/0x20) (f0000001) The set of steps leading to this involved: - modprobe ftrace-direct-too - enable_probe - modprobe ftrace-direct - rmmod ftrace-direct <-- trigger The problem turned out to be that we were not updating flags in the ftrace record properly. From the above message about the trampoline accounting being bad, it can be seen that the ftrace record still has FTRACE_FL_TRAMP set though ftrace-direct module is going away. This happens because we are checking if any ftrace_ops has the FTRACE_FL_TRAMP flag set _before_ updating the filter hash. The fix for this is to look for any _other_ ftrace_ops that also needs FTRACE_FL_TRAMP. Link: https://lkml.kernel.org/r/56c113aa9c3e10c19144a36d9684c7882bf09af5.1606412433.git.naveen.n.rao@linux.vnet.ibm.com Cc: stable@vger.kernel.org Fixes: `a124692b69` ("ftrace: Enable trampoline when rec count returns back to one") Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-11-30 21:43:07 -05:00
Minchan Kim	8fa655a3a0	tracing: Fix alignment of static buffer With 5.9 kernel on ARM64, I found ftrace_dump output was broken but it had no problem with normal output "cat /sys/kernel/debug/tracing/trace". With investigation, it seems coping the data into temporal buffer seems to break the align binary printf expects if the static buffer is not aligned with 4-byte. IIUC, get_arg in bstr_printf expects that args has already right align to be decoded and seq_buf_bprintf says ``the arguments are saved in a 32bit word array that is defined by the format string constraints``. So if we don't keep the align under copy to temporal buffer, the output will be broken by shifting some bytes. This patch fixes it. Link: https://lkml.kernel.org/r/20201125225654.1618966-1-minchan@kernel.org Cc: <stable@vger.kernel.org> Fixes: `8e99cf91b9` ("tracing: Do not allocate buffer in trace_find_next_entry() in atomic") Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-11-30 21:43:07 -05:00
Vasily Averin	310e3a4b5a	tracing: Remove WARN_ON in start_thread() This patch reverts commit `978defee11` ("tracing: Do a WARN_ON() if start_thread() in hwlat is called when thread exists") .start hook can be legally called several times if according tracer is stopped screen window 1 [root@localhost ~]# echo 1 > /sys/kernel/tracing/events/kmem/kfree/enable [root@localhost ~]# echo 1 > /sys/kernel/tracing/options/pause-on-trace [root@localhost ~]# less -F /sys/kernel/tracing/trace screen window 2 [root@localhost ~]# cat /sys/kernel/debug/tracing/tracing_on 0 [root@localhost ~]# echo hwlat > /sys/kernel/debug/tracing/current_tracer [root@localhost ~]# echo 1 > /sys/kernel/debug/tracing/tracing_on [root@localhost ~]# cat /sys/kernel/debug/tracing/tracing_on 0 [root@localhost ~]# echo 2 > /sys/kernel/debug/tracing/tracing_on triggers warning in dmesg: WARNING: CPU: 3 PID: 1403 at kernel/trace/trace_hwlat.c:371 hwlat_tracer_start+0xc9/0xd0 Link: https://lkml.kernel.org/r/bd4d3e70-400d-9c82-7b73-a2d695e86b58@virtuozzo.com Cc: Ingo Molnar <mingo@redhat.com> Cc: stable@vger.kernel.org Fixes: `978defee11` ("tracing: Do a WARN_ON() if start_thread() in hwlat is called when thread exists") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-11-30 21:43:07 -05:00
Andrea Righi	8785f51a17	ring-buffer: Set the right timestamp in the slow path of __rb_reserve_next() In the slow path of __rb_reserve_next() a nested event(s) can happen between evaluating the timestamp delta of the current event and updating write_stamp via local_cmpxchg(); in this case the delta is not valid anymore and it should be set to 0 (same timestamp as the interrupting event), since the event that we are currently processing is not the last event in the buffer. Link: https://lkml.kernel.org/r/X8IVJcp1gRE+FJCJ@xps-13-7390 Cc: Ingo Molnar <mingo@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: stable@vger.kernel.org Link: https://lwn.net/Articles/831207 Fixes: `a389d86f7f` ("ring-buffer: Have nested events still record running time stamp") Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-11-30 15:22:05 -05:00
Steven Rostedt (VMware)	55ea4cf403	ring-buffer: Update write stamp with the correct ts The write stamp, used to calculate deltas between events, was updated with the stale "ts" value in the "info" structure, and not with the updated "ts" variable. This caused the deltas between events to be inaccurate, and when crossing into a new sub buffer, had time go backwards. Link: https://lkml.kernel.org/r/20201124223917.795844-1-elavila@google.com Cc: stable@vger.kernel.org Fixes: `a389d86f7f` ("ring-buffer: Have nested events still record running time stamp") Reported-by: "J. Avila" <elavila@google.com> Tested-by: Daniel Mentz <danielmentz@google.com> Tested-by: Will McVicker <willmcvicker@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-11-30 15:21:31 -05:00
Marc Zyngier	4615fbc378	genirq/irqdomain: Don't try to free an interrupt that has no mapping When an interrupt allocation fails for N interrupts, it is pretty common for the error handling code to free the same number of interrupts, no matter how many interrupts have actually been allocated. This may result in the domain freeing code to be unexpectedly called for interrupts that have no mapping in that domain. Things end pretty badly. Instead, add some checks to irq_domain_free_irqs_hierarchy() to make sure that thiss does not follow the hierarchy if no mapping exists for a given interrupt. Fixes: `6a6544e520` ("genirq/irqdomain: Remove auto-recursive hierarchy support") Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20201129135551.396777-1-maz@kernel.org	2020-11-30 14:50:21 +01:00
Laurent Vivier	bb4c6910c8	genirq/irqdomain: Add an irq_create_mapping_affinity() function There is currently no way to convey the affinity of an interrupt via irq_create_mapping(), which creates issues for devices that expect that affinity to be managed by the kernel. In order to sort this out, rename irq_create_mapping() to irq_create_mapping_affinity() with an additional affinity parameter that can be passed down to irq_domain_alloc_descs(). irq_create_mapping() is re-implemented as a wrapper around irq_create_mapping_affinity(). No functional change. Fixes: `e75eafb9b0` ("genirq/msi: Switch to new irq spreading infrastructure") Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kurz <groug@kaod.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20201126082852.1178497-2-lvivier@redhat.com	2020-11-30 12:21:31 +01:00
Linus Torvalds	f91a3aa6bc	Merge tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Thomas Gleixner: "Two more places which invoke tracing from RCU disabled regions in the idle path. Similar to the entry path the low level idle functions have to be non-instrumentable" * tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: intel_idle: Fix intel_idle() vs tracing sched/idle: Fix arch_cpu_idle() vs tracing	2020-11-29 11:19:26 -08:00
Jakub Kicinski	5c39f26e67	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Trivial conflict in CAN, keep the net-next + the byteswap wrapper. Conflicts: drivers/net/can/usb/gs_usb.c Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-11-27 18:25:27 -08:00
Arnd Bergmann	718e43b5f8	Backmerge tag 'v5.10-rc2' into arm/drivers The SCMI pull request for the arm/drivers branch requires v5.10-rc2 because of dependencies with other git trees, so merge that in here. Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-11-27 21:04:53 +01:00
Linus Torvalds	43d6ecd97c	Merge tag 'printk-for-5.10-rc6-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk fixes from Petr Mladek: - do not lose trailing newline in pr_cont() calls - two trivial fixes for a dead store and a config description * tag 'printk-for-5.10-rc6-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk: finalize records with trailing newlines printk: remove unneeded dead-store assignment init/Kconfig: Fix CPU number in LOG_CPU_MAX_BUF_SHIFT description	2020-11-27 10:38:36 -08:00
Petr Mladek	739e7116b1	Merge branch 'for-5.10-pr_cont-fixup' into for-linus	2020-11-27 13:41:23 +01:00
John Ogness	4ad9921af4	printk: finalize records with trailing newlines Any record with a trailing newline (LOG_NEWLINE flag) cannot be continued because the newline has been stripped and will not be visible if the message is appended. This was already handled correctly when committing in log_output() but was not handled correctly when committing in log_store(). Fixes: `f5f022e53b` ("printk: reimplement log_cont using record extension") Link: https://lore.kernel.org/r/20201126114836.14750-1-john.ogness@linutronix.de Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: John Ogness <john.ogness@linutronix.de> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com>	2020-11-27 11:58:54 +01:00
Ingo Molnar	a787bdaff8	Merge branch 'linus' into sched/core, to resolve semantic conflict Signed-off-by: Ingo Molnar <mingo@kernel.org>	2020-11-27 11:10:50 +01:00
Barry Song	65789daa80	dma-mapping: add benchmark support for streaming DMA APIs Nowadays, there are increasing requirements to benchmark the performance of dma_map and dma_unmap particually while the device is attached to an IOMMU. This patch enables the support. Users can run specified number of threads to do dma_map_page and dma_unmap_page on a specific NUMA node with the specified duration. Then dma_map_benchmark will calculate the average latency for map and unmap. A difficulity for this benchmark is that dma_map/unmap APIs must run on a particular device. Each device might have different backend of IOMMU or non-IOMMU. So we use the driver_override to bind dma_map_benchmark to a particual device by: For platform devices: echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override echo xxx > /sys/bus/platform/drivers/xxx/unbind echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind For PCI devices: echo dma_map_benchmark > /sys/bus/pci/devices/0000:00:01.0/driver_override echo 0000:00:01.0 > /sys/bus/pci/drivers/xxx/unbind echo 0000:00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind Cc: Will Deacon <will@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> [hch: folded in two fixes from Colin Ian King <colin.king@canonical.com>] Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-11-27 10:33:42 +01:00
tangjianqiang	819b70ad62	dma-contiguous: fix a typo error in a comment Fix a typo error in cma description comment: "then" -> "than". Signed-off-by: tangjianqiang <tangjianqiang@xiaomi.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-11-27 10:33:42 +01:00
Tiezhu Yang	94035edcb4	dma-pool: no need to check return value of debugfs_create functions When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-11-27 10:33:42 +01:00
Alexey Kardashevskiy	8d8d53cf8f	dma-mapping: Allow mixing bypass and mapped DMA operation At the moment we allow bypassing DMA ops only when we can do this for the entire RAM. However there are configs with mixed type memory where we could still allow bypassing IOMMU in most cases; POWERPC with persistent memory is one example. This adds an arch hook to determine where bypass can still work and we invoke direct DMA API. The following patch checks the bus limit on POWERPC to allow or disallow direct mapping. This adds a ARCH_HAS_DMA_MAP_DIRECT config option to make the arch_xxxx hooks no-op by default. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-11-27 10:33:40 +01:00
Nicholas Piggin	8ff00399b1	kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handling powerpc/64s keeps a counter in the mm which counts bits set in mm_cpumask as well as other things. This means it can't use generic code to clear bits out of the mask and doesn't adjust the arch specific counter. Add an arch override that allows powerpc/64s to use clear_tasks_mm_cpumask(). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201126102530.691335-4-npiggin@gmail.com	2020-11-27 00:10:39 +11:00
Peter Zijlstra	20c7775aec	Merge remote-tracking branch 'origin/master' into perf/core Further perf/core patches will depend on: `d3f7b1bb20` ("mm/gup: fix gup_fast with dynamic page table folding") which is already in Linus' tree.	2020-11-26 13:16:55 +01:00
KP Singh	27672f0d28	bpf: Add a BPF helper for getting the IMA hash of an inode Provide a wrapper function to get the IMA hash of an inode. This helper is useful in fingerprinting files (e.g executables on execution) and using these fingerprints in detections like an executable unlinking itself. Since the ima_inode_hash can sleep, it's only allowed for sleepable LSM hooks. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20201124151210.1081188-3-kpsingh@chromium.org	2020-11-26 00:04:04 +01:00
Hui Su	5a7b5f32c5	cgroup/cgroup.c: replace 'of->kn->priv' with of_cft() we have supplied the inline function: of_cft() in cgroup.h. So replace the direct use 'of->kn->priv' with inline func of_cft(), which is more readable. Signed-off-by: Hui Su <sh_def@163.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-11-25 17:13:53 -05:00
Bhaskar Chowdhury	58315c9665	kernel: cgroup: Mundane spelling fixes throughout the file Few spelling fixes throughout the file. Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-11-25 17:13:43 -05:00
Yunfeng Ye	01341fbd0d	workqueue: Kick a worker based on the actual activation of delayed works In realtime scenario, We do not want to have interference on the isolated cpu cores. but when invoking alloc_workqueue() for percpu wq on the housekeeping cpu, it kick a kworker on the isolated cpu. alloc_workqueue pwq_adjust_max_active wake_up_worker The comment in pwq_adjust_max_active() said: "Need to kick a worker after thawed or an unbound wq's max_active is bumped" So it is unnecessary to kick a kworker for percpu's wq when invoking alloc_workqueue(). this patch only kick a worker based on the actual activation of delayed works. Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2020-11-25 17:10:28 -05:00
Andy Shevchenko	b87e745945	resource: provide meaningful MODULE_LICENSE() in test suite modpost complains that module has no licence provided. Provide it via meaningful MODULE_LICENSE(). Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-11-25 18:52:35 +01:00
Johan Hovold	b112082c89	module: simplify version-attribute handling Instead of using the array-of-pointers trick to avoid having gcc mess up the built-in module-version array stride, specify type alignment when declaring entries to prevent gcc from increasing alignment. This is essentially an alternative (one-line) fix to the problem addressed by commit `b4bc842802` ("module: deal with alignment issues in built-in module versions"). gcc can increase the alignment of larger objects with static extent as an optimisation, but this can be suppressed by using the aligned attribute when declaring variables. Note that we have been relying on this behaviour for kernel parameters for 16 years and it indeed hasn't changed since the introduction of the aligned attribute in gcc-3.1. Link: https://lore.kernel.org/lkml/20201103175711.10731-1-johan@kernel.org Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Jessica Yu <jeyu@kernel.org>	2020-11-25 15:44:46 +01:00
Wedson Almeida Filho	59e2e27d22	bpf: Refactor check_cfg to use a structured loop. The current implementation uses a number of gotos to implement a loop and different paths within the loop, which makes the code less readable than it would be with an explicit while-loop. This patch also replaces a chain of if/if-elses keyed on the same expression with a switch statement. No change in behaviour is intended. Signed-off-by: Wedson Almeida Filho <wedsonaf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201121015509.3594191-1-wedsonaf@google.com	2020-11-24 20:29:26 -08:00
Alex Shi	ba59eae723	audit: fix macros warnings Some unused macros could cause gcc warning: kernel/audit.c:68:0: warning: macro "AUDIT_UNINITIALIZED" is not used [-Wunused-macros] kernel/auditsc.c:104:0: warning: macro "AUDIT_AUX_IPCPERM" is not used [-Wunused-macros] kernel/auditsc.c:82:0: warning: macro "AUDITSC_INVALID" is not used [-Wunused-macros] AUDIT_UNINITIALIZED and AUDITSC_INVALID are still meaningful and should be in incorporated. Just remove AUDIT_AUX_IPCPERM. Thanks comments from Richard Guy Briggs and Paul Moore. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Richard Guy Briggs <rgb@redhat.com> Cc: Eric Paris <eparis@redhat.com> Cc: linux-audit@redhat.com Cc: linux-kernel@vger.kernel.org Signed-off-by: Paul Moore <paul@paul-moore.com>	2020-11-24 23:28:02 -05:00
Andrii Nakryiko	607c543f93	bpf: Sanitize BTF data pointer after module is loaded Given .BTF section is not allocatable, it will get trimmed after module is loaded. BPF system handles that properly by creating an independent copy of data. But prevent any accidental misused by resetting the pointer to BTF data. Fixes: `36e68442d1` ("bpf: Load and verify kernel module BTFs") Suggested-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jessica Yu <jeyu@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/bpf/20201121070829.2612884-2-andrii@kernel.org	2020-11-25 00:05:21 +01:00
Peter Zijlstra	2914b0ba61	irq_work: Optimize irq_work_single() Trade one atomic op for a full memory barrier. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>	2020-11-24 16:47:49 +01:00
Peter Zijlstra	545b8c8df4	smp: Cleanup smp_call_function*() Get rid of the __call_single_node union and cleanup the API a little to avoid external code relying on the structure layout as much. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>	2020-11-24 16:47:49 +01:00
Peter Zijlstra	7a9f50a058	irq_work: Cleanup Get rid of the __call_single_node union and clean up the API a little to avoid external code relying on the structure layout as much. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>	2020-11-24 16:47:48 +01:00
Mel Gorman	23e6082a52	sched: Limit the amount of NUMA imbalance that can exist at fork time At fork time currently, a local node can be allowed to fill completely and allow the periodic load balancer to fix the problem. This can be problematic in cases where a task creates lots of threads that idle until woken as part of a worker poll causing a memory bandwidth problem. However, a "real" workload suffers badly from this behaviour. The workload in question is mostly NUMA aware but spawns large numbers of threads that act as a worker pool that can be called from anywhere. These need to spread early to get reasonable behaviour. This patch limits how much a local node can fill before spilling over to another node and it will not be a universal win. Specifically, very short-lived workloads that fit within a NUMA node would prefer the memory bandwidth. As I cannot describe the "real" workload, the best proxy measure I found for illustration was a page fault microbenchmark. It's not representative of the workload but demonstrates the hazard of the current behaviour. pft timings 5.10.0-rc2 5.10.0-rc2 imbalancefloat-v2 forkspread-v2 Amean elapsed-1 46.37 ( 0.00%) 46.05 * 0.69%* Amean elapsed-4 12.43 ( 0.00%) 12.49 * -0.47%* Amean elapsed-7 7.61 ( 0.00%) 7.55 * 0.81%* Amean elapsed-12 4.79 ( 0.00%) 4.80 ( -0.17%) Amean elapsed-21 3.13 ( 0.00%) 2.89 * 7.74%* Amean elapsed-30 3.65 ( 0.00%) 2.27 * 37.62%* Amean elapsed-48 3.08 ( 0.00%) 2.13 * 30.69%* Amean elapsed-79 2.00 ( 0.00%) 1.90 * 4.95%* Amean elapsed-80 2.00 ( 0.00%) 1.90 * 4.70%* This is showing the time to fault regions belonging to threads. The target machine has 80 logical CPUs and two nodes. Note the ~30% gain when the machine is approximately the point where one node becomes fully utilised. The slower results are borderline noise. Kernel building shows similar benefits around the same balance point. Generally performance was either neutral or better in the tests conducted. The main consideration with this patch is the point where fork stops spreading a task so some workloads may benefit from different balance points but it would be a risky tuning parameter. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20201120090630.3286-5-mgorman@techsingularity.net	2020-11-24 16:47:48 +01:00
Mel Gorman	7d2b5dd0bc	sched/numa: Allow a floating imbalance between NUMA nodes Currently, an imbalance is only allowed when a destination node is almost completely idle. This solved one basic class of problems and was the cautious approach. This patch revisits the possibility that NUMA nodes can be imbalanced until 25% of the CPUs are occupied. The reasoning behind 25% is somewhat superficial -- it's half the cores when HT is enabled. At higher utilisations, balancing should continue as normal and keep things even until scheduler domains are fully busy or over utilised. Note that this is not expected to be a universal win. Any benchmark that prefers spreading as wide as possible with limited communication will favour the old behaviour as there is more memory bandwidth. Workloads that communicate heavily in pairs such as netperf or tbench benefit. For the tests I ran, the vast majority of workloads saw a benefit so it seems to be a worthwhile trade-off. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20201120090630.3286-4-mgorman@techsingularity.net	2020-11-24 16:47:47 +01:00
Mel Gorman	5c339005f8	sched: Avoid unnecessary calculation of load imbalance at clone time In find_idlest_group(), the load imbalance is only relevant when the group is either overloaded or fully busy but it is calculated unconditionally. This patch moves the imbalance calculation to the context it is required. Technically, it is a micro-optimisation but really the benefit is avoiding confusing one type of imbalance with another depending on the group_type in the next patch. No functional change. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20201120090630.3286-3-mgorman@techsingularity.net	2020-11-24 16:47:47 +01:00
Mel Gorman	abeae76a47	sched/numa: Rename nr_running and break out the magic number This is simply a preparation patch to make the following patches easier to read. No functional change. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20201120090630.3286-2-mgorman@techsingularity.net	2020-11-24 16:47:47 +01:00
Peter Zijlstra	58c644ba51	sched/idle: Fix arch_cpu_idle() vs tracing We call arch_cpu_idle() with RCU disabled, but then use local_irq_{en,dis}able(), which invokes tracing, which relies on RCU. Switch all arch_cpu_idle() implementations to use raw_local_irq_{en,dis}able() and carefully manage the lockdep,rcu,tracing state like we do in entry. (XXX: we really should change arch_cpu_idle() to not return with interrupts enabled) Reported-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Link: https://lkml.kernel.org/r/20201120114925.594122626@infradead.org	2020-11-24 16:47:35 +01:00

... 24 25 26 27 28 ...

36280 Commits