The setup_profiling_timer() is mostly un-implemented by many
architectures. In many places it isn't guarded by CONFIG_PROFILE which is
needed for it to be used. Make it a weak symbol in kernel/profile.c and
remove the 'return -EINVAL' implementations from the kenrel.
There are a couple of architectures which do return 0 from the
setup_profiling_timer() function but they don't seem to do anything else
with it. To keep the /proc compatibility for now, leave these for a
future update or removal.
On ARM, this fixes the following sparse warning:
arch/arm/kernel/smp.c:793:5: warning: symbol 'setup_profiling_timer' was not declared. Should it be static?
Link: https://lkml.kernel.org/r/20220721195509.418205-1-ben-linux@fluff.org
Signed-off-by: Ben Dooks <ben-linux@fluff.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
s/heartbaet/heartbeat
Link: https://lkml.kernel.org/r/4d4a6786e8ad522bfad6d2401b7f6634f8af0e5d.1658436259.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use bitmap_zero() instead of hand-writing it. It is less verbose.
While at it, add an explicit #include <linux/bitmap.h>.
Link: https://lkml.kernel.org/r/86d2a027c319db12055c98f00c65f7d01e703722.1658436259.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "ocfs2: A few clean_ups", v2.
__ocfs2_node_map_set_bit() and __ocfs2_node_map_clear_bit() are just
wrapper around set_bit() and clear_bit().
The leading __ also makes think that these functions are non-atomic just
like __set_bit() and __clear_bit().
So, just remove these wrappers and call set_bit() and clear_bit()
directly.
Link: https://lkml.kernel.org/r/cover.1658436259.git.christophe.jaillet@wanadoo.fr
Link: https://lkml.kernel.org/r/bd1429c84ec7d174c96dbb67a2b42b1b456d9394.1658436259.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Replace 'the the' with 'the' in the comment.
Link: https://lkml.kernel.org/r/20220722101922.81126-1-slark_xiao@163.com
Signed-off-by: Slark Xiao <slark_xiao@163.com>
Cc: Hongbo Li <herberthbli@tencent.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* /proc/${pid}/net status
* removing PDE vs last close stuff (again!)
* random small stuff
Link: https://lkml.kernel.org/r/YtwrM6sDC0OQ53YB@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
enum wb_congested_state and the member 'congested' in bdi_writeback are
useless since commit a88f2096d5 ("remove congestion tracking
framework"), so remove it.
Link: https://lkml.kernel.org/r/20220719083349.87547-1-xiujianfeng@huawei.com
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The proc_dohung_task_timeout_secs() function is incorrectly marked
as having a __user buffer as argument 3. However this is not the
case and it is casing multiple sparse warnings. Fix the following
warnings by removing __user from the argument:
kernel/hung_task.c:237:52: warning: incorrect type in argument 3 (different address spaces)
kernel/hung_task.c:237:52: expected void *
kernel/hung_task.c:237:52: got void [noderef] __user *buffer
kernel/hung_task.c:287:35: warning: incorrect type in initializer (incompatible argument 3 (different address spaces))
kernel/hung_task.c:287:35: expected int ( [usertype] *proc_handler )( ... )
kernel/hung_task.c:287:35: got int ( * )( ... )
kernel/hung_task.c:295:35: warning: incorrect type in initializer (incompatible argument 3 (different address spaces))
kernel/hung_task.c:295:35: expected int ( [usertype] *proc_handler )( ... )
kernel/hung_task.c:295:35: got int ( * )( ... )
Link: https://lkml.kernel.org/r/20220714074744.189017-1-ben.dooks@sifive.com
Signed-off-by: Ben Dooks <ben.dooks@sifive.com>
Cc: <Conor.Dooley@microchip.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fix the following coccicheck warning:
lib/lzo/lzo1x_compress.c:54: WARNING opportunity for min().
lib/lzo/lzo1x_compress.c:329: WARNING opportunity for min().
min() and min_t() macro is defined in include/linux/minmax.h. It avoids
multiple evaluations of the arguments when non-constant and performs
strict type-checking.
Link: https://lkml.kernel.org/r/20220714015441.1313036-1-13667453960@163.com
Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn>
Tested-by: Dave Rodgman <dave.rodgman@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a function which can be used to read fragments in the readahead call.
This function is necessary because filesystems built with the -tailends
(or -always-use-fragments) option may have fragments present which cannot
be currently handled.
Link: https://lkml.kernel.org/r/20220617083810.337573-5-hsinyi@chromium.org
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Hou Tao <houtao1@huawei.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Xiongwei Song <Xiongwei.Song@windriver.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Zheng Liang <zhengliang6@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Implement readahead callback for squashfs. It will read datablocks which
cover pages in readahead request. For a few cases it will not mark page
as uptodate, including:
- file end is 0.
- zero filled blocks.
- current batch of pages isn't in the same datablock.
- decompressor error.
Otherwise pages will be marked as uptodate. The unhandled pages will be
updated by readpage later.
Link: https://lkml.kernel.org/r/20220617083810.337573-4-hsinyi@chromium.org
Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reported-by: Matthew Wilcox <willy@infradead.org>
Reported-by: Phillip Lougher <phillip@squashfs.org.uk>
Reported-by: Xiongwei Song <Xiongwei.Song@windriver.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hou Tao <houtao1@huawei.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Zheng Liang <zhengliang6@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Squashfs_readahead uses the "file direct" version of the page actor, and
so build it unconditionally.
Link: https://lkml.kernel.org/r/20220617083810.337573-3-hsinyi@chromium.org
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Hou Tao <houtao1@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Xiongwei Song <Xiongwei.Song@windriver.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Zheng Liang <zhengliang6@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Implement readahead for squashfs", v7.
Commit 9eec1d897139("squashfs: provide backing_dev_info in order to
disable read-ahead") mitigates the performance drop issue for squashfs by
closing readahead for it.
This series implements readahead callback for squashfs.
This patch (of 4):
This reverts 9eec1d8971 ("squashfs: provide backing_dev_info in order
to disable read-ahead").
Revert closing the readahead to squashfs since the readahead callback for
squashfs is implemented.
Link: https://lkml.kernel.org/r/20220617083810.337573-1-hsinyi@chromium.org
Link: https://lkml.kernel.org/r/20220617083810.337573-2-hsinyi@chromium.org
Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org>
Suggested-by: Xiongwei Song <Xiongwei.Song@windriver.com>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Zheng Liang <zhengliang6@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Hou Tao <houtao1@huawei.com>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kernel test robot throws below warning ->
arch/ia64/include/asm/mmu_context.h: In function 'reload_context':
arch/ia64/include/asm/mmu_context.h:127:48: warning: variable
'old_rr4' set but not used [-Wunused-but-set-variable]
127 | unsigned long rr0, rr1, rr2, rr3, rr4, old_rr4;
Add it under CONFIG_HUGETLB_PAGE
Link: https://lkml.kernel.org/r/20220626022114.4020-1-jrdr.linux@gmail.com
Signed-off-by: Souptick Joarder (HPE) <jrdr.linux@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Booting with vsyscall=xonly results in the following vsyscall VMA:
ffffffffff600000-ffffffffff601000 --xp ... [vsyscall]
Test does read from fixed vsyscall address to determine if kernel
supports vsyscall page but it doesn't work because, well, vsyscall
page is execute only.
Fix test by trying to execute from the first byte of the page which
contains gettimeofday() stub. This should work because vsyscall
entry points have stable addresses by design.
Alexey, avoiding parsing .config, /proc/config.gz and
/proc/cmdline at all costs.
Link: https://lkml.kernel.org/r/Ys2KgeiEMboU8Ytu@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: <dylanbhatch@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 7bc3e6e55a ("proc: Use a list of inodes to flush from proc")
moved proc_flush_task() behind __exit_signal(). Then, process systemd can
take long period high cpu usage during releasing task in following
concurrent processes:
systemd ps
kernel_waitid stat(/proc/tgid)
do_wait filename_lookup
wait_consider_task lookup_fast
release_task
__exit_signal
__unhash_process
detach_pid
__change_pid // remove task->pid_links
d_revalidate -> pid_revalidate // 0
d_invalidate(/proc/tgid)
shrink_dcache_parent(/proc/tgid)
d_walk(/proc/tgid)
spin_lock_nested(/proc/tgid/fd)
// iterating opened fd
proc_flush_pid |
d_invalidate (/proc/tgid/fd) |
shrink_dcache_parent(/proc/tgid/fd) |
shrink_dentry_list(subdirs) ↓
shrink_lock_dentry(/proc/tgid/fd) --> race on dentry lock
Function d_invalidate() will remove dentry from hash firstly, but why does
proc_flush_pid() process dentry '/proc/tgid/fd' before dentry
'/proc/tgid'? That's because proc_pid_make_inode() adds proc inode in
reverse order by invoking hlist_add_head_rcu(). But proc should not add
any inodes under '/proc/tgid' except '/proc/tgid/task/pid', fix it by
adding inode into 'pid->inodes' only if the inode is /proc/tgid or
/proc/tgid/task/pid.
Performance regression:
Create 200 tasks, each task open one file for 50,000 times. Kill all
tasks when opened files exceed 10,000,000 (cat /proc/sys/fs/file-nr).
Before fix:
$ time killall -wq aa
real 4m40.946s # During this period, we can see 'ps' and 'systemd'
taking high cpu usage.
After fix:
$ time killall -wq aa
real 1m20.732s # During this period, we can see 'systemd' taking
high cpu usage.
Link: https://lkml.kernel.org/r/20220713130029.4133533-1-chengzhihao1@huawei.com
Fixes: 7bc3e6e55a ("proc: Use a list of inodes to flush from proc")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216054
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Suggested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove the unused inode field of the autofs dentry info structure.
Link: https://lkml.kernel.org/r/165724460393.30914.6511330213821246793.stgit@donald.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The function autofs_mountpoint_changed() is unusual, add a comment about
two cases for which it is needed.
Link: https://lkml.kernel.org/r/165724459804.30914.10974834416046555127.stgit@donald.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The dentry info. field count is used to check if a dentry is in use
during expire. But, to be used for this the count field must account for
the presence of child dentries in a directory dentry.
Therefore it can also be used to check for an empty directory dentry which
can be done without having to to take an additional lock or account for
the presence of a readdir cursor dentry as is done by simple_empty().
Link: https://lkml.kernel.org/r/165724459238.30914.1504611159945950108.stgit@donald.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If an autofs dentry is a mount root directory there's no ->mkdir() call to
set its count to one.
To make the dentry info count consistent for all autofs dentries set count
to one when the dentry info struct is allocated.
Link: https://lkml.kernel.org/r/165724458671.30914.2902424437132835325.stgit@donald.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Allocate device resource from local node memory when the numa locality of
the device is specified.
Link: https://lkml.kernel.org/r/20220708131952.14500-1-mark-pk.tsai@mediatek.com
Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: YJ Chiang <yj.chiang@mediatek.com>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently instrumentation_end() won't be called if printk_ratelimit()
returned false.
Link: https://lkml.kernel.org/r/a636d8e0-ad32-5888-acac-671f7f553bb3@I-love.SAKURA.ne.jp
Fixes: 126f21f0e8 ("lib/smp_processor_id: Move it into noinstr section")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The extern specifier is not needed for this declaration, so drop it. The
function also depends only on the input parameters, and has no side
effects, so it can be marked __pure like other functions in cpumask.h.
Link: https://lkml.kernel.org/r/72ab755695b74bb5fbaa756ae4c0edd708d172f1.1656777646.git.sander@svanheule.net
Signed-off-by: Sander Vanheule <sander@svanheule.net>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On uniprocessor builds, any CPU mask is assumed to contain exactly one CPU
(cpu0). This assumption ignores the existence of empty masks, resulting
in incorrect behaviour.
cpumask_first_zero(), cpumask_next_zero(), and for_each_cpu_not() don't
provide behaviour matching the assumption that a UP mask is always "1",
and instead provide behaviour matching the empty mask.
Drop the incorrectly optimised code and use the generic implementations in
all cases.
Link: https://lkml.kernel.org/r/86bf3f005abba2d92120ddd0809235cab4f759a6.1656777646.git.sander@svanheule.net
Signed-off-by: Sander Vanheule <sander@svanheule.net>
Suggested-by: Yury Norov <yury.norov@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On uniprocessor builds, the following loops will always run over a mask
that contains one enabled CPU (cpu0):
- for_each_possible_cpu
- for_each_online_cpu
- for_each_present_cpu
Provide uniprocessor-specific macros for these loops, that always run
exactly once.
Link: https://lkml.kernel.org/r/3a92869b902a075b97be5d1452c9c6badbbff0df.1656777646.git.sander@svanheule.net
Signed-off-by: Sander Vanheule <sander@svanheule.net>
Acked-by: Yury Norov <yury.norov@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "cpumask: Fix invalid uniprocessor assumptions", v4.
On uniprocessor builds, it is currently assumed that any cpumask will
contain the single CPU: cpu0. This assumption is used to provide
optimised implementations.
The current assumption also appears to be wrong, by ignoring the fact that
users can provide empty cpumasks. This can result in bugs as explained in
[1] - for_each_cpu() will run one iteration of the loop even when passed
an empty cpumask.
This series introduces some basic tests, and updates the optimisations for
uniprocessor builds.
The x86 patch was written after the kernel test robot [2] ran into a
failed build. I have tried to list the files potentially affected by the
changes to cpumask.h, in an attempt to find any other cases that fail on
!SMP. I've gone through some of the files manually, and ran a few cross
builds, but nothing else popped up. I (build) checked about half of the
potientally affected files, but I do not have the resources to do them
all. I hope we can fix other issues if/when they pop up later.
[1] https://lore.kernel.org/all/20220530082552.46113-1-sander@svanheule.net/
[2] https://lore.kernel.org/all/202206060858.wA0FOzRy-lkp@intel.com/
This patch (of 5):
The maps to keep track of shared caches between CPUs on SMP systems are
declared in asm/smp.h, among them specifically cpu_llc_shared_map. These
maps are externally defined in cpu/smpboot.c. The latter is only compiled
on CONFIG_SMP=y, which means the declared extern symbols from asm/smp.h do
not have a corresponding definition on uniprocessor builds.
The inline cpu_llc_shared_mask() function from asm/smp.h refers to the map
declaration mentioned above. This function is referenced in cacheinfo.c
inside for_each_cpu() loop macros, to provide cpumask for the loop. On
uniprocessor builds, the symbol for the cpu_llc_shared_map does not exist.
However, the current implementation of for_each_cpu() also (wrongly)
ignores the provided mask.
By sheer luck, the compiler thus optimises out this unused reference to
cpu_llc_shared_map, and the linker therefore does not require the
cpu_llc_shared_mask to actually exist on uniprocessor builds. Only on SMP
bulids does smpboot.o exist to provide the required symbols.
To no longer rely on compiler optimisations for successful uniprocessor
builds, move the definitions of cpu_llc_shared_map and cpu_l2c_shared_map
from smpboot.c to cacheinfo.c.
Link: https://lkml.kernel.org/r/cover.1656777646.git.sander@svanheule.net
Link: https://lkml.kernel.org/r/e8167ddb570f56744a3dc12c2149a660a324d969.1656777646.git.sander@svanheule.net
Signed-off-by: Sander Vanheule <sander@svanheule.net>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Marco Elver <elver@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Yury Norov <yury.norov@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When doing cross platform development on a machine sometimes it might be
useful to invoke bloat-o-meter for files which haven't been build with the
native toolchain. In cases when the host nm doesn't support the target
one then a toolchain-specific nm could be used. Add this ability by
adding the -p allowing invocations as:
./scripts/bloat-o-meter -p riscv64-unknown-linux-gnu- file1.o file2.o
Link: https://lkml.kernel.org/r/20220701113513.1938008-2-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This will facilitate further extension to the arguments the script takes.
As an added benefit it also produces saner usage output, where mutual
exclusivity of the c|d|t parameters is clearly visible:
./scripts/bloat-o-meter -h
usage: bloat-o-meter [-h] [-c | -d | -t] file1 file2
Simple script used to compare the symbol sizes of 2 object files
positional arguments:
file1 First file to compare
file2 Second file to compare
optional arguments:
-h, --help show this help message and exit
-c categorize output based on symbol type
-d Show delta of Data Section
-t Show delta of text Section
Link: https://lkml.kernel.org/r/20220701113513.1938008-1-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If a process is killed or otherwise exits while having active network
connections and many threads waiting on epoll_wait, the threads will all
be woken immediately, but not removed from ep->wq. Then when network
traffic scans ep->wq in wake_up, every wakeup attempt will fail, and will
not remove the entries from the list.
This means that the cost of the wakeup attempt is far higher than usual,
does not decrease, and this also competes with the dying threads trying to
actually make progress and remove themselves from the wq.
Handle this by removing visited epoll wq entries unconditionally, rather
than only when the wakeup succeeds - the structure of ep_poll means that
the only potential loss is the timed_out->eavail heuristic, which now can
race and result in a redundant ep_send_events attempt. (But only when
incoming data and a timeout actually race, not on every timeout)
Shakeel added:
: We are seeing this issue in production with real workloads and it has
: caused hard lockups. Particularly network heavy workloads with a lot
: of threads in epoll_wait() can easily trigger this issue if they get
: killed (oom-killed in our case).
Link: https://lkml.kernel.org/r/xm26fsjotqda.fsf@google.com
Signed-off-by: Ben Segall <bsegall@google.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Roman Penyaev <rpenyaev@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Khazhismel Kumykov <khazhy@google.com>
Cc: Heiher <r@hev.cc>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The total memory size we get in kernel is usually slightly less than the
actual memory size because BIOS/firmware will reserve some memory region.
So it won't export all memory as usable.
E.g, on my x86_64 kvm guest with 1G memory, the total_mem value shows:
UEFI boot with ovmf: 0x3faef000 Legacy boot kvm guest: 0x3ff7ec00
When specifying crashkernel=1G-2G:128M, if we have a 1G memory machine, we
get total size 1023M from firmware. Then it will not fall into 1G-2G,
thus no memory reserved. User will never know this, it is hard to let
user know the exact total value in kernel.
One way is to use dmi/smbios to get physical memory size, but it's not
reliable as well. According to Prarit hardware vendors sometimes screw
this up. Thus round up total size to 128M to work around this problem.
This patch is a resend of [1] and rebased onto v5.19-rc2, and the
original credit goes to Dave Young.
[1]: http://lists.infradead.org/pipermail/kexec/2018-April/020568.html
Link: https://lkml.kernel.org/r/20220627074440.187222-1-ltao@redhat.com
Signed-off-by: Tao Liu <ltao@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The internal kallsyms tables contain information which could be quite
useful to a debugging tool in the absence of other debuginfo. If kallsyms
is enabled, then a debugging tool could parse it and use it as a fallback
symbol table. Combined with BTF data, live & post-mortem debuggers can
support basic operations without needing a large DWARF debuginfo file
available. As many as five symbols are necessary to properly parse
kallsyms names and addresses. Add these to the vmcoreinfo note.
CONFIG_KALLSYMS_ABSOLUTE_PERCPU does impact the computation of symbol
addresses. However, a debugger can infer this configuration value by
comparing the address of _stext in the vmcoreinfo with the address
computed via kallsyms. So there's no need to include information about
this config value in the vmcoreinfo note.
To verify that we're still well below the maximum of 4096 bytes, I created
a script[1] to compute a rough upper bound on the possible size of
vmcoreinfo. On v5.18-rc7, the script reports 3106 bytes, and with this
patch, the maximum become 3370 bytes.
[1]: https://github.com/brenns10/kernel_stuff/blob/master/vmcoreinfosize/
Link: https://lkml.kernel.org/r/20220517000508.777145-3-stephen.s.brennan@oracle.com
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Vernet <void@manifault.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Stephen Boyd <swboyd@chromium.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Expose kallsyms data in vmcoreinfo note".
The kernel can be configured to contain a lot of introspection or
debugging information built-in, such as ORC for unwinding stack traces,
BTF for type information, and of course kallsyms. Debuggers could use
this information to navigate a core dump or live system, but they need to
be able to find it.
This patch series adds the necessary symbols into vmcoreinfo, which would
allow a debugger to find and interpret the kallsyms table. Using the
kallsyms data, the debugger can then lookup any symbol, allowing it to
find ORC, BTF, or any other useful data.
This would allow a live kernel, or core dump, to be debugged without any
DWARF debuginfo. This is useful for many cases: the debuginfo may not
have been generated, or you may not want to deploy the large files
everywhere you need them.
I've demonstrated a proof of concept for this at LSF/MM+BPF during a
lighting talk. Using a work-in-progress branch of the drgn debugger, and
an extended set of BTF generated by a patched version of dwarves, I've
been able to open a core dump without any DWARF info and do basic tasks
such as enumerating slab caches, block devices, tasks, and doing
backtraces. I hope this series can be a first step toward a new
possibility of "DWARFless debugging".
Related discussion around the BTF side of this:
https://lore.kernel.org/bpf/586a6288-704a-f7a7-b256-e18a675927df@oracle.com/T/#u
Some work-in-progress branches using this feature:
https://github.com/brenns10/dwarves/tree/remove_percpu_restriction_1https://github.com/brenns10/drgn/tree/kallsyms_plus_btf
This patch (of 2):
To include kallsyms data in the vmcoreinfo note, we must make the symbol
declarations visible outside of kallsyms.c. Move these to a new internal
header file.
Link: https://lkml.kernel.org/r/20220517000508.777145-1-stephen.s.brennan@oracle.com
Link: https://lkml.kernel.org/r/20220517000508.777145-2-stephen.s.brennan@oracle.com
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Stephen Boyd <swboyd@chromium.org>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: David Vernet <void@manifault.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There is no need to store the result of the addition back to variable
consumed after the addition. The store is redundant, replace += with just
+
Cleans up clang scan build warning: lib/ts_bm.c:83:11: warning: Although
the value stored to 'consumed' is used in the enclosing expression, the
value is never actually read from 'consumed' [deadcode.DeadStores]
Link: https://lkml.kernel.org/r/20220704215325.600993-1-colin.i.king@gmail.com
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
commit 4635873c56 ("scsi: lib/sg_pool.c: improve APIs for allocating sg
pool") changeed @(bool)skip_first_chunk of __sg_free_table() to @(unsigned
int)nents_first_chunk, so use unsigend int type instead of bool type
(false -> 0) when calling the function in sg_free_append_table() and
sg_free_table().
Link: https://lkml.kernel.org/r/20220629030241.84559-1-wuchi.zero@gmail.com
Signed-off-by: wuchi <wuchi.zero@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Maor Gottlieb <maorg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
insert_entries() doesn't use the 'bool replace' argument, and the function
is only used locally, remove the argument.
The historical context of the unused argument is as follow:
2: commit <3a08cd52c37c79> (radix tree: Remove multiorder support)
Remove the code related to macro CONFIG_RADIX_TREE_MULTIORDER
to convert to the xArray.
Without the macro, there is no need to retain the argument.
1: commit <175542f575723e> (radix-tree: add radix_tree_join)
Add insert_entries(..., bool replace) function, depending on the
macro CONFIG_RADIX_TREE_MULTIORDER definition, the implementation
is different. Notice that the implementation without the macro doesn't
use the argument.
[Matthew Wilcox: add historical context for argument]
Link: https://lkml.kernel.org/r/20220625135324.72574-1-wuchi.zero@gmail.com
Signed-off-by: wuchi <wuchi.zero@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The kfifo_to_user() macro is supposed to return zero for success or
negative error codes. Unfortunately, there is a signedness bug so it
returns unsigned int. This only affects callers which try to save the
result in ssize_t and as far as I can see the only place which does that
is line6_hwdep_read().
TL;DR: s/_uint/_int/.
Link: https://lkml.kernel.org/r/YrVL3OJVLlNhIMFs@kili
Fixes: 144ecf310e ("kfifo: fix kfifo_alloc() to return a signed int value")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The workaround for 'asm goto' miscompilation introduces a compiler barrier
quirk that inhibits many useful compiler optimizations. For example,
__try_cmpxchg_user compiles to:
11375: 41 8b 4d 00 mov 0x0(%r13),%ecx
11379: 41 8b 02 mov (%r10),%eax
1137c: f0 0f b1 0a lock cmpxchg %ecx,(%rdx)
11380: 0f 94 c2 sete %dl
11383: 84 d2 test %dl,%dl
11385: 75 c4 jne 1134b <...>
11387: 41 89 02 mov %eax,(%r10)
where the barrier inhibits flags propagation from asm when compiled with
gcc-12.
When the mentioned quirk is removed, the following code is generated:
11553: 41 8b 4d 00 mov 0x0(%r13),%ecx
11557: 41 8b 02 mov (%r10),%eax
1155a: f0 0f b1 0a lock cmpxchg %ecx,(%rdx)
1155e: 74 c9 je 11529 <...>
11560: 41 89 02 mov %eax,(%r10)
The refered compiler bug:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58670
was fixed for gcc-4.8.2.
Current minimum required version of GCC is version 5.1 which has the above
'asm goto' miscompilation fixed, so remove the workaround.
Link: https://lkml.kernel.org/r/20220624141412.72274-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Traversing list without mutex in get_injectable_error_type will
race with the following code:
list_del_init(&ent->list)
kfree(ent)
in module_unload_ei_list. So fix that.
Link: https://lkml.kernel.org/r/20220620100244.82896-1-wuchi.zero@gmail.com
Signed-off-by: wuchi <wuchi.zero@gmail.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
As Linus explained [1], setting the stackdepot hash table size as a config
option is suboptimal, especially as stackdepot becomes a dependency of
less "expert" subsystems than initially (e.g. DRM, networking,
SLUB_DEBUG):
: (a) it introduces a new compile-time question that isn't sane to ask
: a regular user, but is now exposed to regular users.
: (b) this by default uses 1MB of memory for a feature that didn't in
: the past, so now if you have small machines you need to make sure you
: make a special kernel config for them.
Ideally we would employ rhashtable for fully automatic resizing, which
should be feasible for many of the new users, but problematic for the
original users with restricted context that call __stack_depot_save() with
can_alloc == false, i.e. KASAN.
However we can easily remove the config option and scale the hash table
automatically with system memory. The STACK_HASH_MASK constant becomes
stack_hash_mask variable and is used only in one mask operation, so the
overhead should be negligible to none. For early allocation we can employ
the existing alloc_large_system_hash() function and perform similar
scaling for the late allocation.
The existing limits of the config option (between 4k and 1M buckets) are
preserved, and scaling factor is set to one bucket per 16kB memory so on
64bit the max 1M buckets (8MB memory) is achieved with 16GB system, while
a 1GB system will use 512kB.
Because KASAN is reported to need the maximum number of buckets even with
smaller amounts of memory [2], set it as such when kasan_enabled().
If needed, the automatic scaling could be complemented with a boot-time
kernel parameter, but it feels pointless to add it without a specific use
case.
[1] https://lore.kernel.org/all/CAHk-=wjC5nS+fnf6EzRD9yQRJApAhxx7gRB87ZV+pAWo9oVrTg@mail.gmail.com/
[2] https://lore.kernel.org/all/CACT4Y+Y4GZfXOru2z5tFPzFdaSUd+GFc6KVL=bsa0+1m197cQQ@mail.gmail.com/
Link: https://lkml.kernel.org/r/20220620150249.16814-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DO_ONCE(func, ...) will call func with spinlock which acquired by
spin_lock_irqsave in __do_once_start. But the get_random_once_wait will
sleep in get_random_bytes_wait -> wait_for_random_bytes.
Fortunately, there is no place to use {net_}get_random_once_wait, so we
could remove them simply.
Link: https://lkml.kernel.org/r/20220619074641.40916-1-wuchi.zero@gmail.com
Signed-off-by: wuchi <wuchi.zero@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When kmem_cache_alloc in function lc_create returns null, we will
free the memory already allocated. The loop of kmem_cache_free
is wrong, especially:
i = 0 ==> do wrong loop
i > 0 ==> do not free element[0]
Link: https://lkml.kernel.org/r/20220618082521.7082-1-wuchi.zero@gmail.com
Signed-off-by: wuchi <wuchi.zero@gmail.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Christoph Bhmwalder <christoph.boehmwalder@linbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The gethostname system call returns the hostname for the current machine.
However, the kernel has no mechanism to initially set the current
machine's name in such a way as to guarantee that the first userspace
process to call gethostname will receive a meaningful result. It relies
on some unspecified userspace process to first call sethostname before
gethostname can produce a meaningful name.
Traditionally the machine's hostname is set from userspace by the init
system. The init system, in turn, often relies on a configuration file
(say, /etc/hostname) to provide the value that it will supply in the call
to sethostname. Consequently, the file system containing /etc/hostname
usually must be available before the hostname will be set. There may,
however, be earlier userspace processes that could call gethostname before
the file system containing /etc/hostname is mounted. Such a process will
get some other, likely meaningless, name from gethostname (such as
"(none)", "localhost", or "darkstar").
A real-world example where this can happen, and lead to undesirable
results, is with mdadm. When assembling arrays, mdadm distinguishes
between "local" arrays and "foreign" arrays. A local array is one that
properly belongs to the current machine, and a foreign array is one that
is (possibly temporarily) attached to the current machine, but properly
belongs to some other machine. To determine if an array is local or
foreign, mdadm may compare the "homehost" recorded on the array with the
current hostname. If mdadm is run before the root file system is mounted,
perhaps because the root file system itself resides on an md-raid array,
then /etc/hostname isn't yet available and the init system will not yet
have called sethostname, causing mdadm to incorrectly conclude that all of
the local arrays are foreign.
Solving this problem *could* be delegated to the init system. It could be
left up to the init system (including any init system that starts within
an initramfs, if one is in use) to ensure that sethostname is called
before any other userspace process could possibly call gethostname.
However, it may not always be obvious which processes could call
gethostname (for example, udev itself might not call gethostname, but it
could via udev rules invoke processes that do). Additionally, the init
system has to ensure that the hostname configuration value is stored in
some place where it will be readily accessible during early boot.
Unfortunately, every init system will attempt to (or has already attempted
to) solve this problem in a different, possibly incorrect, way. This
makes getting consistently working configurations harder for users.
I believe it is better for the kernel to provide the means by which the
hostname may be set early, rather than making this a problem for the init
system to solve. The option to set the hostname during early startup, via
a kernel parameter, provides a simple, reliable way to solve this problem.
It also could make system configuration easier for some embedded systems.
[dmoulding@me.com: v2]
Link: https://lkml.kernel.org/r/20220506060310.7495-2-dmoulding@me.com
Link: https://lkml.kernel.org/r/20220505180651.22849-2-dmoulding@me.com
Signed-off-by: Dan Moulding <dmoulding@me.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>