Commit Graph

681 Commits

Author SHA1 Message Date
Linus Torvalds
9ea925c806 Updates for timers and timekeeping:
- Core:
 
 	- Overhaul of posix-timers in preparation of removing the
 	  workaround for periodic timers which have signal delivery
 	  ignored.
 
         - Remove the historical extra jiffie in msleep()
 
 	  msleep() adds an extra jiffie to the timeout value to ensure
 	  minimal sleep time. The timer wheel ensures minimal sleep
 	  time since the large rewrite to a non-cascading wheel, but the
 	  extra jiffie in msleep() remained unnoticed. Remove it.
 
         - Make the timer slack handling correct for realtime tasks.
 
 	  The procfs interface is inconsistent and does neither reflect
 	  reality nor conforms to the man page. Show the correct 0 slack
 	  for real time tasks and enforce it at the core level instead of
 	  having inconsistent individual checks in various timer setup
 	  functions.
 
         - The usual set of updates and enhancements all over the place.
 
   - Drivers:
 
         - Allow the ACPI PM timer to be turned off during suspend
 
 	- No new drivers
 
 	- The usual updates and enhancements in various drivers
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn7jQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYobqnD/9COlU0nwsulABI/aNIrsh6iYvnCC9v
 14CcNta7Qn+157Wfw9BWOyHdNhR1/fPCXE8jJ71zTyIOeW27HV2JyTtxTwe9ZcdK
 ViHAaj7YcIjcVUEC3StCoRCPnvLslEw4qJA5AOQuDyMivdQn+YVa2c0baJxKaXZt
 xk4HZdMj4NAS0jRKnoZSwtKW/+Oz6rR4GAWrZo+Zs1/8ur3HfqnQfi8lJ1hJtLLW
 V7XDCVRvamVi6Ah3ocYPPp/1P6yeQDA1ge9aMddqaza5STWISXRtSnFMUmYP3rbS
 FaL8TyL+ilfny8pkGB2WlG6nLuSbtvogtdEh1gG1k1RmZt44kAtk8ba/KiWFPBSb
 zK9cjojRMBS71f9G4kmb5F4rnXoLsg1YbD1Nzhz3wq2Cs1Z90dc2QwMren0zoQ1x
 Fn56ueRyAiagBlnrSaKyso/2RvqJTNoSdi3RkpjYeAph0UoDCqvTvKjGAf1mWiw1
 T/1lUWSVqWHnzZbM7XXzzajIN9bl6A7bbqlcAJ2O9vZIDt7273DG+bQym9Vh6Why
 0LTGGERHxzKBsG7WRg+2Gmvv6S18UPKRo8tLtlA758rHlFuPTZCShWrIriwSNl1K
 Hxon+d4BparSnm1h9W/NHPKJA574UbWRCBjdk58IkAj8DxZZY4ORD9SMP+ggkV7G
 F6p9cgoDNP9KFg==
 =jE0N
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "Core:

   - Overhaul of posix-timers in preparation of removing the workaround
     for periodic timers which have signal delivery ignored.

   - Remove the historical extra jiffie in msleep()

     msleep() adds an extra jiffie to the timeout value to ensure
     minimal sleep time. The timer wheel ensures minimal sleep time
     since the large rewrite to a non-cascading wheel, but the extra
     jiffie in msleep() remained unnoticed. Remove it.

   - Make the timer slack handling correct for realtime tasks.

     The procfs interface is inconsistent and does neither reflect
     reality nor conforms to the man page. Show the correct 0 slack for
     real time tasks and enforce it at the core level instead of having
     inconsistent individual checks in various timer setup functions.

   - The usual set of updates and enhancements all over the place.

  Drivers:

   - Allow the ACPI PM timer to be turned off during suspend

   - No new drivers

   - The usual updates and enhancements in various drivers"

* tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  ntp: Make sure RTC is synchronized when time goes backwards
  treewide: Fix wrong singular form of jiffies in comments
  cpu: Use already existing usleep_range()
  timers: Rename next_expiry_recalc() to be unique
  platform/x86:intel/pmc: Fix comment for the pmc_core_acpi_pm_timer_suspend_resume function
  clocksource/drivers/jcore: Use request_percpu_irq()
  clocksource/drivers/cadence-ttc: Add missing clk_disable_unprepare in ttc_setup_clockevent
  clocksource/drivers/asm9260: Add missing clk_disable_unprepare in asm9260_timer_init
  clocksource/drivers/qcom: Add missing iounmap() on errors in msm_dt_timer_init()
  clocksource/drivers/ingenic: Use devm_clk_get_enabled() helpers
  platform/x86:intel/pmc: Enable the ACPI PM Timer to be turned off when suspended
  clocksource: acpi_pm: Add external callback for suspend/resume
  clocksource/drivers/arm_arch_timer: Using for_each_available_child_of_node_scoped()
  dt-bindings: timer: rockchip: Add rk3576 compatible
  timers: Annotate possible non critical data race of next_expiry
  timers: Remove historical extra jiffie for timeout in msleep()
  hrtimer: Use and report correct timerslack values for realtime tasks
  hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse.
  timers: Add sparse annotation for timer_sync_wait_running().
  signal: Replace BUG_ON()s
  ...
2024-09-17 07:25:37 +02:00
Linus Torvalds
e8fc317dfc vfs-6.12.procfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEwAAKCRCRxhvAZXjc
 onI2AQDXa5XhIx0VpLWE9uVImVy3QuUKc/5pI1e1DKMgxLhKCgEAh15a4ETqmVaw
 Zp3ZSzoLD8Ez1WwWb6cWQuHFYRSjtwU=
 =+LKG
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull procfs updates from Christian Brauner:
 "This contains the following changes for procfs:

   - Add config options and parameters to block forcing memory writes.

     This adds a Kconfig option and boot param to allow removing the
     FOLL_FORCE flag from /proc/<pid>/mem write calls as this can be
     used in various attacks.

     The traditional forcing behavior is kept as default because it can
     break GDB and some other use cases.

     This is the simpler version that you had requested.

   - Restrict overmounting of ephemeral entities.

     It is currently possible to mount on top of various ephemeral
     entities in procfs. This specifically includes magic links. To
     recap, magic links are links of the form /proc/<pid>/fd/<nr>. They
     serve as references to a target file and during path lookup they
     cause a jump to the target path. Such magic links disappear if the
     corresponding file descriptor is closed.

     Currently it is possible to overmount such magic links. This is
     mostly interesting for an attacker that wants to somehow trick a
     process into e.g., reopening something that it didn't intend to
     reopen or to hide a malicious file descriptor.

     But also it risks leaking mounts for long-running processes. When
     overmounting a magic link like above, the mount will not be
     detached when the file descriptor is closed. Only the target
     mountpoint will disappear. Which has the consequence of making it
     impossible to unmount that mount afterwards. So the mount will
     stick around until the process exits and the /proc/<pid>/ directory
     is cleaned up during proc_flush_pid() when the dentries are pruned
     and invalidated.

     That in turn means it's possible for a program to accidentally leak
     mounts and it's also possible to make a task leak mounts without
     it's knowledge if the attacker just keeps overmounting things under
     /proc/<pid>/fd/<nr>.

     Disallow overmounting of such ephemeral entities.

   - Cleanup the readdir method naming in some procfs file operations.

   - Replace kmalloc() and strcpy() with a simple kmemdup() call"

* tag 'vfs-6.12.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  proc: fold kmalloc() + strcpy() into kmemdup()
  proc: block mounting on top of /proc/<pid>/fdinfo/*
  proc: block mounting on top of /proc/<pid>/fd/*
  proc: block mounting on top of /proc/<pid>/map_files/*
  proc: add proc_splice_unmountable()
  proc: proc_readfdinfo() -> proc_fdinfo_iterate()
  proc: proc_readfd() -> proc_fd_iterate()
  proc: add config & param to block forcing mem writes
2024-09-16 09:36:59 +02:00
Linus Torvalds
3352633ce6 vfs-6.12.file
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEwAAKCRCRxhvAZXjc
 osS0AQCgIpvey9oW5DMyMw6Bv0hFMRv95gbNQZfHy09iK+NMNAD9GALhb/4cMIVB
 7YrZGXEz454lpgcs8AnrOVjVNfctOQg=
 =e9s9
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs file updates from Christian Brauner:
 "This is the work to cleanup and shrink struct file significantly.

  Right now, (focusing on x86) struct file is 232 bytes. After this
  series struct file will be 184 bytes aka 3 cacheline and a spare 8
  bytes for future extensions at the end of the struct.

  With struct file being as ubiquitous as it is this should make a
  difference for file heavy workloads and allow further optimizations in
  the future.

   - struct fown_struct was embedded into struct file letting it take up
     32 bytes in total when really it shouldn't even be embedded in
     struct file in the first place. Instead, actual users of struct
     fown_struct now allocate the struct on demand. This frees up 24
     bytes.

   - Move struct file_ra_state into the union containg the cleanup hooks
     and move f_iocb_flags out of the union. This closes a 4 byte hole
     we created earlier and brings struct file to 192 bytes. Which means
     struct file is 3 cachelines and we managed to shrink it by 40
     bytes.

   - Reorder struct file so that nothing crosses a cacheline.

     I suspect that in the future we will end up reordering some members
     to mitigate false sharing issues or just because someone does
     actually provide really good perf data.

   - Shrinking struct file to 192 bytes is only part of the work.

     Files use a slab that is SLAB_TYPESAFE_BY_RCU and when a kmem cache
     is created with SLAB_TYPESAFE_BY_RCU the free pointer must be
     located outside of the object because the cache doesn't know what
     part of the memory can safely be overwritten as it may be needed to
     prevent object recycling.

     That has the consequence that SLAB_TYPESAFE_BY_RCU may end up
     adding a new cacheline.

     So this also contains work to add a new kmem_cache_create_rcu()
     function that allows the caller to specify an offset where the
     freelist pointer is supposed to be placed. Thus avoiding the
     implicit addition of a fourth cacheline.

   - And finally this removes the f_version member in struct file.

     The f_version member isn't particularly well-defined. It is mainly
     used as a cookie to detect concurrent seeks when iterating
     directories. But it is also abused by some subsystems for
     completely unrelated things.

     It is mostly a directory and filesystem specific thing that doesn't
     really need to live in struct file and with its wonky semantics it
     really lacks a specific function.

     For pipes, f_version is (ab)used to defer poll notifications until
     a write has happened. And struct pipe_inode_info is used by
     multiple struct files in their ->private_data so there's no chance
     of pushing that down into file->private_data without introducing
     another pointer indirection.

     But pipes don't rely on f_pos_lock so this adds a union into struct
     file encompassing f_pos_lock and a pipe specific f_pipe member that
     pipes can use. This union of course can be extended to other file
     types and is similar to what we do in struct inode already"

* tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
  fs: remove f_version
  pipe: use f_pipe
  fs: add f_pipe
  ubifs: store cookie in private data
  ufs: store cookie in private data
  udf: store cookie in private data
  proc: store cookie in private data
  ocfs2: store cookie in private data
  input: remove f_version abuse
  ext4: store cookie in private data
  ext2: store cookie in private data
  affs: store cookie in private data
  fs: add generic_llseek_cookie()
  fs: use must_set_pos()
  fs: add must_set_pos()
  fs: add vfs_setpos_cookie()
  s390: remove unused f_version
  ceph: remove unused f_version
  adi: remove unused f_version
  mm: Removed @freeptr_offset to prevent doc warning
  ...
2024-09-16 09:14:02 +02:00
Christian Brauner
b4dba2efa8
proc: store cookie in private data
Store the cookie to detect concurrent seeks on directories in
file->private_data.

Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-14-6d3e4816aa7b@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 11:58:44 +02:00
Christian Brauner
641bb4394f fs: move FMODE_UNSIGNED_OFFSET to fop_flags
This is another flag that is statically set and doesn't need to use up
an FMODE_* bit. Move it to ->fop_flags and free up another FMODE_* bit.

(1) mem_open() used from proc_mem_operations
(2) adi_open() used from adi_fops
(3) drm_open_helper():
    (3.1) accel_open() used from DRM_ACCEL_FOPS
    (3.2) drm_open() used from
    (3.2.1) amdgpu_driver_kms_fops
    (3.2.2) psb_gem_fops
    (3.2.3) i915_driver_fops
    (3.2.4) nouveau_driver_fops
    (3.2.5) panthor_drm_driver_fops
    (3.2.6) radeon_driver_kms_fops
    (3.2.7) tegra_drm_fops
    (3.2.8) vmwgfx_driver_fops
    (3.2.9) xe_driver_fops
    (3.2.10) DRM_GEM_FOPS
    (3.2.11) DEFINE_DRM_GEM_DMA_FOPS
(4) struct memdev sets fmode flags based on type of device opened. For
    devices using struct mem_fops unsigned offset is used.

Mark all these file operations as FOP_UNSIGNED_OFFSET and add asserts
into the open helper to ensure that the flag is always set.

Link: https://lore.kernel.org/r/20240809-work-fop_unsigned-v1-1-658e054d893e@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-30 08:22:36 +02:00
Christian Brauner
d80b065bb1
Merge patch series "proc: restrict overmounting of ephemeral entities"
Christian Brauner <brauner@kernel.org> says:

It is currently possible to mount on top of various ephemeral entities
in procfs. This specifically includes magic links. To recap, magic links
are links of the form /proc/<pid>/fd/<nr>. They serve as references to
a target file and during path lookup they cause a jump to the target
path. Such magic links disappear if the corresponding file descriptor is
closed.

Currently it is possible to overmount such magic links:

int fd = open("/mnt/foo", O_RDONLY);
sprintf(path, "/proc/%d/fd/%d", getpid(), fd);
int fd2 = openat(AT_FDCWD, path, O_PATH | O_NOFOLLOW);
mount("/mnt/bar", path, "", MS_BIND, 0);

Arguably, this is nonsensical and is mostly interesting for an attacker
that wants to somehow trick a process into e.g., reopening something
that they didn't intend to reopen or to hide a malicious file
descriptor.

But also it risks leaking mounts for long-running processes. When
overmounting a magic link like above, the mount will not be detached
when the file descriptor is closed. Only the target mountpoint will
disappear. Which has the consequence of making it impossible to unmount
that mount afterwards. So the mount will stick around until the process
exits and the /proc/<pid>/ directory is cleaned up during
proc_flush_pid() when the dentries are pruned and invalidated.

That in turn means it's possible for a program to accidentally leak
mounts and it's also possible to make a task leak mounts without it's
knowledge if the attacker just keeps overmounting things under
/proc/<pid>/fd/<nr>.

I think it's wrong to try and fix this by us starting to play games with
close() or somewhere else to undo these mounts when the file descriptor
is closed. The fact that we allow overmounting of such magic links is
simply a bug and one that we need to fix.

Similar things can be said about entries under fdinfo/ and map_files/ so
those are restricted as well.

I have a further more aggressive patch that gets out the big hammer and
makes everything under /proc/<pid>/*, as well as immediate symlinks such
as /proc/self, /proc/thread-self, /proc/mounts, /proc/net that point
into /proc/<pid>/ not overmountable. Imho, all of this should be blocked
if we can get away with it. It's only useful to hide exploits such as in [1].

And again, overmounting of any global procfs files remains unaffected
and is an existing and supported use-case.

Link: https://righteousit.com/2024/07/24/hiding-linux-processes-with-bind-mounts [1]

// Note that repro uses the traditional way of just mounting over
// /proc/<pid>/fd/<nr>. This could also all be achieved just based on
// file descriptors using move_mount(). So /proc/<pid>/fd/<nr> isn't the
// only entry vector here. It's also possible to e.g., mount directly
// onto /proc/<pid>/map_files/* without going over /proc/<pid>/fd/<nr>.
int main(int argc, char *argv[])
{
        char path[PATH_MAX];

        creat("/mnt/foo", 0777);
        creat("/mnt/bar", 0777);

        /*
         * For illustration use a bunch of file descriptors in the upper
         * range that are unused.
         */
        for (int i = 10000; i >= 256; i--) {
                printf("I'm: /proc/%d/\n", getpid());

                int fd2 = open("/mnt/foo", O_RDONLY);
                if (fd2 < 0) {
                        printf("%m - Failed to open\n");
                        _exit(1);
                }

                int newfd = dup2(fd2, i);
                if (newfd < 0) {
                        printf("%m - Failed to dup\n");
                        _exit(1);
                }
                close(fd2);

                sprintf(path, "/proc/%d/fd/%d", getpid(), newfd);
                int fd = openat(AT_FDCWD, path, O_PATH | O_NOFOLLOW);
                if (fd < 0) {
                        printf("%m - Failed to open\n");
                        _exit(3);
                }

                sprintf(path, "/proc/%d/fd/%d", getpid(), fd);
                printf("Mounting on top of %s\n", path);
                if (mount("/mnt/bar", path, "", MS_BIND, 0)) {
                        printf("%m - Failed to mount\n");
                        _exit(4);
                }

                close(newfd);
                close(fd2);
        }

        /*
         * Give some time to look at things. The mounts now linger until
         * the process exits.
         */
        sleep(10000);
        _exit(0);
}

* patches from https://lore.kernel.org/r/20240806-work-procfs-v1-0-fb04e1d09f0c@kernel.org:
  proc: block mounting on top of /proc/<pid>/fdinfo/*
  proc: block mounting on top of /proc/<pid>/fd/*
  proc: block mounting on top of /proc/<pid>/map_files/*
  proc: add proc_splice_unmountable()
  proc: proc_readfdinfo() -> proc_fdinfo_iterate()
  proc: proc_readfd() -> proc_fd_iterate()

Link: https://lore.kernel.org/r/20240806-work-procfs-v1-0-fb04e1d09f0c@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-30 08:22:13 +02:00
Christian Brauner
3836b31c3e
proc: block mounting on top of /proc/<pid>/map_files/*
Entries under /proc/<pid>/map_files/* are ephemeral and may go away
before the process dies. As such allowing them to be used as mount
points creates the ability to leak mounts that linger until the process
dies with no ability to unmount them until then. Don't allow using them
as mountpoints.

Link: https://lore.kernel.org/r/20240806-work-procfs-v1-4-fb04e1d09f0c@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-30 08:22:12 +02:00
Adrian Ratiu
41e8149c88
proc: add config & param to block forcing mem writes
This adds a Kconfig option and boot param to allow removing
the FOLL_FORCE flag from /proc/pid/mem write calls because
it can be abused.

The traditional forcing behavior is kept as default because
it can break GDB and some other use cases.

Previously we tried a more sophisticated approach allowing
distributions to fine-tune /proc/pid/mem behavior, however
that got NAK-ed by Linus [1], who prefers this simpler
approach with semantics also easier to understand for users.

Link: https://lore.kernel.org/lkml/CAHk-=wiGWLChxYmUA5HrT5aopZrB7_2VTa0NLZcxORgkUe5tEQ@mail.gmail.com/ [1]
Cc: Doug Anderson <dianders@chromium.org>
Cc: Jeff Xu <jeffxu@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
Link: https://lore.kernel.org/r/20240802080225.89408-1-adrian.ratiu@collabora.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-30 08:19:43 +02:00
Felix Moessbauer
ed4fb6d7ef hrtimer: Use and report correct timerslack values for realtime tasks
The timerslack_ns setting is used to specify how much the hardware
timers should be delayed, to potentially dispatch multiple timers in a
single interrupt. This is a performance optimization. Timers of
realtime tasks (having a realtime scheduling policy) should not be
delayed.

This logic was inconsitently applied to the hrtimers, leading to delays
of realtime tasks which used timed waits for events (e.g. condition
variables). Due to the downstream override of the slack for rt tasks,
the procfs reported incorrect (non-zero) timerslack_ns values.

This is changed by setting the timer_slack_ns task attribute to 0 for
all tasks with a rt policy. By that, downstream users do not need to
specially handle rt tasks (w.r.t. the slack), and the procfs entry
shows the correct value of "0". Setting non-zero slack values (either
via procfs or PR_SET_TIMERSLACK) on tasks with a rt policy is ignored,
as stated in "man 2 PR_SET_TIMERSLACK":

  Timer slack is not applied to threads that are scheduled under a
  real-time scheduling policy (see sched_setscheduler(2)).

The special handling of timerslack on rt tasks in downstream users
is removed as well.

Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240814121032.368444-2-felix.moessbauer@siemens.com
2024-08-23 20:13:02 +02:00
Thomas Gleixner
52dea0a15c posix-timers: Convert timer list to hlist
No requirement for a real list. Spare a few bytes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29 21:57:35 +02:00
Chengming Zhou
c2dc78b86e mm/ksm: fix ksm_zero_pages accounting
We normally ksm_zero_pages++ in ksmd when page is merged with zero page,
but ksm_zero_pages-- is done from page tables side, where there is no any
accessing protection of ksm_zero_pages.

So we can read very exceptional value of ksm_zero_pages in rare cases,
such as -1, which is very confusing to users.

Fix it by changing to use atomic_long_t, and the same case with the
mm->ksm_zero_pages.

Link: https://lkml.kernel.org/r/20240528-b4-ksm-counters-v3-2-34bb358fdc13@linux.dev
Fixes: e2942062e0 ("ksm: count all zero pages placed by KSM")
Fixes: 6080d19f07 ("ksm: add ksm zero pages for each process")
Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-05 19:19:26 -07:00
Al Viro
47458802f6 procfs: move dropping pde and pid from ->evict_inode() to ->free_inode()
that keeps both around until struct inode is freed, making access
to them safe from rcu-pathwalk

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:32 -05:00
Casey Schaufler
267c068e5f proc: Use lsmids instead of lsm names for attrs
Use the LSM ID number instead of the LSM name to identify which
security module's attibute data should be shown in /proc/self/attr.
The security_[gs]etprocattr() functions have been changed to expect
the LSM ID. The change from a string comparison to an integer comparison
in these functions will provide a minor performance improvement.

Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Mickael Salaun <mic@digikod.net>
Reviewed-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-11-12 22:54:42 -05:00
Linus Torvalds
8f6f76a6a2 As usual, lots of singleton and doubleton patches all over the tree and
there's little I can say which isn't in the individual changelogs.
 
 The lengthier patch series are
 
 - "kdump: use generic functions to simplify crashkernel reservation in
   arch", from Baoquan He.  This is mainly cleanups and consolidation of
   the "crashkernel=" kernel parameter handling.
 
 - After much discussion, David Laight's "minmax: Relax type checks in
   min() and max()" is here.  Hopefully reduces some typecasting and the
   use of min_t() and max_t().
 
 - A group of patches from Oleg Nesterov which clean up and slightly fix
   our handling of reads from /proc/PID/task/...  and which remove
   task_struct.therad_group.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZUQP9wAKCRDdBJ7gKXxA
 jmOAAQDh8sxagQYocoVsSm28ICqXFeaY9Co1jzBIDdNesAvYVwD/c2DHRqJHEiS4
 63BNcG3+hM9nwGJHb5lyh5m79nBMRg0=
 =On4u
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "As usual, lots of singleton and doubleton patches all over the tree
  and there's little I can say which isn't in the individual changelogs.

  The lengthier patch series are

   - 'kdump: use generic functions to simplify crashkernel reservation
     in arch', from Baoquan He. This is mainly cleanups and
     consolidation of the 'crashkernel=' kernel parameter handling

   - After much discussion, David Laight's 'minmax: Relax type checks in
     min() and max()' is here. Hopefully reduces some typecasting and
     the use of min_t() and max_t()

   - A group of patches from Oleg Nesterov which clean up and slightly
     fix our handling of reads from /proc/PID/task/... and which remove
     task_struct.thread_group"

* tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits)
  scripts/gdb/vmalloc: disable on no-MMU
  scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n
  .mailmap: add address mapping for Tomeu Vizoso
  mailmap: update email address for Claudiu Beznea
  tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions
  .mailmap: map Benjamin Poirier's address
  scripts/gdb: add lx_current support for riscv
  ocfs2: fix a spelling typo in comment
  proc: test ProtectionKey in proc-empty-vm test
  proc: fix proc-empty-vm test with vsyscall
  fs/proc/base.c: remove unneeded semicolon
  do_io_accounting: use sig->stats_lock
  do_io_accounting: use __for_each_thread()
  ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error()
  ocfs2: fix a typo in a comment
  scripts/show_delta: add __main__ judgement before main code
  treewide: mark stuff as __ro_after_init
  fs: ocfs2: check status values
  proc: test /proc/${pid}/statm
  compiler.h: move __is_constexpr() to compiler.h
  ...
2023-11-02 20:53:31 -10:00
Yang Li
639931020e fs/proc/base.c: remove unneeded semicolon
./fs/proc/base.c:3829:2-3: Unneeded semicolon

Link: https://lkml.kernel.org/r/20231026005634.6581-1-yang.lee@linux.alibaba.com
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=7057
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-01 12:46:59 -07:00
Oleg Nesterov
1df4bd83cd do_io_accounting: use sig->stats_lock
Rather than lock_task_sighand(), sig->stats_lock was specifically designed
for this type of use.

This way the "if (whole)" branch runs lockless in the likely case.

Link: https://lkml.kernel.org/r/20231023153405.GA4639@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-01 12:46:59 -07:00
Oleg Nesterov
2320222067 do_io_accounting: use __for_each_thread()
Rather than while_each_thread() which should be avoided when possible.

This makes the code more clear and allows the next change.

Link: https://lkml.kernel.org/r/20231023153343.GA4629@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-01 12:46:58 -07:00
Linus Torvalds
14ab6d425e vfs-6.7.ctime
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppYgAKCRCRxhvAZXjc
 okIHAP9anLz1QDyMLH12ASuHjgBc0Of3jcB6NB97IWGpL4O21gEA46ohaD+vcJuC
 YkBLU3lXqQ87nfu28ExFAzh10hG2jwM=
 =m4pB
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs inode time accessor updates from Christian Brauner:
 "This finishes the conversion of all inode time fields to accessor
  functions as discussed on list. Changing timestamps manually as we
  used to do before is error prone. Using accessors function makes this
  robust.

  It does not contain the switch of the time fields to discrete 64 bit
  integers to replace struct timespec and free up space in struct inode.
  But after this, the switch can be trivially made and the patch should
  only affect the vfs if we decide to do it"

* tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits)
  fs: rename inode i_atime and i_mtime fields
  security: convert to new timestamp accessors
  selinux: convert to new timestamp accessors
  apparmor: convert to new timestamp accessors
  sunrpc: convert to new timestamp accessors
  mm: convert to new timestamp accessors
  bpf: convert to new timestamp accessors
  ipc: convert to new timestamp accessors
  linux: convert to new timestamp accessors
  zonefs: convert to new timestamp accessors
  xfs: convert to new timestamp accessors
  vboxsf: convert to new timestamp accessors
  ufs: convert to new timestamp accessors
  udf: convert to new timestamp accessors
  ubifs: convert to new timestamp accessors
  tracefs: convert to new timestamp accessors
  sysv: convert to new timestamp accessors
  squashfs: convert to new timestamp accessors
  server: convert to new timestamp accessors
  client: convert to new timestamp accessors
  ...
2023-10-30 09:47:13 -10:00
Amir Goldstein
08582d678f
fs: create helper file_user_path() for user displayed mapped file path
Overlayfs uses backing files with "fake" overlayfs f_path and "real"
underlying f_inode, in order to use underlying inode aops for mapped
files and to display the overlayfs path in /proc/<pid>/maps.

In preparation for storing the overlayfs "fake" path instead of the
underlying "real" path in struct backing_file, define a noop helper
file_user_path() that returns f_path for now.

Use the new helper in procfs and kernel logs whenever a path of a
mapped file is displayed to users.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231009153712.1566422-3-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-19 11:03:15 +02:00
Alexey Dobriyan
860a2e7fa4 proc: use initializer for clearing some buffers
Save LOC by using dark magic of initialisation instead of memset().

Those buffer aren't passed to userspace directly so padding is not
an issue.

Link: https://lkml.kernel.org/r/3821d3a2-6e10-4629-b0d5-9519d828ab72@p183
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18 14:43:22 -07:00
Jeff Layton
200d942170
proc: convert to new timestamp accessors
Convert to using the new inode timestamp accessor functions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20231004185347.80880-59-jlayton@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18 14:08:26 +02:00
Oleg Nesterov
33a9813825 introduce __next_thread(), fix next_tid() vs exec() race
Patch series "introduce __next_thread(), change next_thread()".

After commit dce8f8ed1d ("document while_each_thread(), change
first_tid() to use for_each_thread()") + this series

1. We have only one lockless user of next_thread(), task_group_seq_get_next().
   I think it should be changed too.

2. We have only one user of task_struct->thread_group, thread_group_empty().
   The next patches will change thread_group_empty() and kill ->thread_group.


This patch (of 2):

next_tid(start) does:

	rcu_read_lock();
	if (pid_alive(start)) {
		pos = next_thread(start);
		if (thread_group_leader(pos))
			pos = NULL;
		else
			get_task_struct(pos);

it should return pos = NULL when next_thread() wraps to the 1st thread
in the thread group, group leader, and the thread_group_leader() check
tries to detect this case.

But this can race with exec. To simplify, suppose we have a main thread
M and a single sub-thread T, next_tid(T) should return NULL.

Now suppose that T execs. If next_tid(T) is called after T changes the
leadership and before it does release_task() which removes the old leader
from list, then next_thread() returns M and thread_group_leader(M) = F.

Lockless use of next_thread() should be avoided. After this change only
task_group_seq_get_next() does this, and I believe it should be changed
as well.

Link: https://lkml.kernel.org/r/20230824143112.GA31208@redhat.com
Link: https://lkml.kernel.org/r/20230824143142.GA31222@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:41:56 -07:00
Linus Torvalds
d68b4b6f30 - An extensive rework of kexec and crash Kconfig from Eric DeVolder
("refactor Kconfig to consolidate KEXEC and CRASH options").
 
 - kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a
   couple of macros to args.h").
 
 - gdb feature work from Kuan-Ying Lee ("Add GDB memory helper
   commands").
 
 - vsprintf inclusion rationalization from Andy Shevchenko
   ("lib/vsprintf: Rework header inclusions").
 
 - Switch the handling of kdump from a udev scheme to in-kernel handling,
   by Eric DeVolder ("crash: Kernel handling of CPU and memory hot
   un/plug").
 
 - Many singleton patches to various parts of the tree
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO2GpAAKCRDdBJ7gKXxA
 juW3AQD1moHzlSN6x9I3tjm5TWWNYFoFL8af7wXDJspp/DWH/AD/TO0XlWWhhbYy
 QHy7lL0Syha38kKLMXTM+bN6YQHi9AU=
 =WJQa
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - An extensive rework of kexec and crash Kconfig from Eric DeVolder
   ("refactor Kconfig to consolidate KEXEC and CRASH options")

 - kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a
   couple of macros to args.h")

 - gdb feature work from Kuan-Ying Lee ("Add GDB memory helper
   commands")

 - vsprintf inclusion rationalization from Andy Shevchenko
   ("lib/vsprintf: Rework header inclusions")

 - Switch the handling of kdump from a udev scheme to in-kernel
   handling, by Eric DeVolder ("crash: Kernel handling of CPU and memory
   hot un/plug")

 - Many singleton patches to various parts of the tree

* tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (81 commits)
  document while_each_thread(), change first_tid() to use for_each_thread()
  drivers/char/mem.c: shrink character device's devlist[] array
  x86/crash: optimize CPU changes
  crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
  crash: hotplug support for kexec_load()
  x86/crash: add x86 crash hotplug support
  crash: memory and CPU hotplug sysfs attributes
  kexec: exclude elfcorehdr from the segment digest
  crash: add generic infrastructure for crash hotplug support
  crash: move a few code bits to setup support of crash hotplug
  kstrtox: consistently use _tolower()
  kill do_each_thread()
  nilfs2: fix WARNING in mark_buffer_dirty due to discarded buffer reuse
  scripts/bloat-o-meter: count weak symbol sizes
  treewide: drop CONFIG_EMBEDDED
  lockdep: fix static memory detection even more
  lib/vsprintf: declare no_hash_pointers in sprintf.h
  lib/vsprintf: split out sprintf() and friends
  kernel/fork: stop playing lockless games for exe_file replacement
  adfs: delete unused "union adfs_dirtail" definition
  ...
2023-08-29 14:53:51 -07:00
Linus Torvalds
b96a3e9142 - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list")
- Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
   reduces the special-case code for handling hugetlb pages in GUP.  It
   also speeds up GUP handling of transparent hugepages.
 
 - Peng Zhang provides some maple tree speedups ("Optimize the fast path
   of mas_store()").
 
 - Sergey Senozhatsky has improved te performance of zsmalloc during
   compaction (zsmalloc: small compaction improvements").
 
 - Domenico Cerasuolo has developed additional selftest code for zswap
   ("selftests: cgroup: add zswap test program").
 
 - xu xin has doe some work on KSM's handling of zero pages.  These
   changes are mainly to enable the user to better understand the
   effectiveness of KSM's treatment of zero pages ("ksm: support tracking
   KSM-placed zero-pages").
 
 - Jeff Xu has fixes the behaviour of memfd's
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").
 
 - David Howells has fixed an fscache optimization ("mm, netfs, fscache:
   Stop read optimisation when folio removed from pagecache").
 
 - Axel Rasmussen has given userfaultfd the ability to simulate memory
   poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD").
 
 - Miaohe Lin has contributed some routine maintenance work on the
   memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
   check").
 
 - Peng Zhang has contributed some maintenance work on the maple tree
   code ("Improve the validation for maple tree and some cleanup").
 
 - Hugh Dickins has optimized the collapsing of shmem or file pages into
   THPs ("mm: free retracted page table by RCU").
 
 - Jiaqi Yan has a patch series which permits us to use the healthy
   subpages within a hardware poisoned huge page for general purposes
   ("Improve hugetlbfs read on HWPOISON hugepages").
 
 - Kemeng Shi has done some maintenance work on the pagetable-check code
   ("Remove unused parameters in page_table_check").
 
 - More folioification work from Matthew Wilcox ("More filesystem folio
   conversions for 6.6"), ("Followup folio conversions for zswap").  And
   from ZhangPeng ("Convert several functions in page_io.c to use a
   folio").
 
 - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").
 
 - Baoquan He has converted some architectures to use the GENERIC_IOREMAP
   ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take
   GENERIC_IOREMAP way").
 
 - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
   batched/deferred tlb shootdown during page reclamation/migration").
 
 - Better maple tree lockdep checking from Liam Howlett ("More strict
   maple tree lockdep").  Liam also developed some efficiency improvements
   ("Reduce preallocations for maple tree").
 
 - Cleanup and optimization to the secondary IOMMU TLB invalidation, from
   Alistair Popple ("Invalidate secondary IOMMU TLB on permission
   upgrade").
 
 - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
   for arm64").
 
 - Kemeng Shi provides some maintenance work on the compaction code ("Two
   minor cleanups for compaction").
 
 - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most
   file-backed faults under the VMA lock").
 
 - Aneesh Kumar contributes code to use the vmemmap optimization for DAX
   on ppc64, under some circumstances ("Add support for DAX vmemmap
   optimization for ppc64").
 
 - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
   data in page_ext"), ("minor cleanups to page_ext header").
 
 - Some zswap cleanups from Johannes Weiner ("mm: zswap: three
   cleanups").
 
 - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").
 
 - VMA handling cleanups from Kefeng Wang ("mm: convert to
   vma_is_initial_heap/stack()").
 
 - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
   implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
   address ranges and DAMON monitoring targets").
 
 - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").
 
 - Liam Howlett has improved the maple tree node replacement code
   ("maple_tree: Change replacement strategy").
 
 - ZhangPeng has a general code cleanup - use the K() macro more widely
   ("cleanup with helper macro K()").
 
 - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap
   on memory feature on ppc64").
 
 - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
   in page_alloc"), ("Two minor cleanups for get pageblock migratetype").
 
 - Vishal Moola introduces a memory descriptor for page table tracking,
   "struct ptdesc" ("Split ptdesc from struct page").
 
 - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
   for vm.memfd_noexec").
 
 - MM include file rationalization from Hugh Dickins ("arch: include
   asm/cacheflush.h in asm/hugetlb.h").
 
 - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
   output").
 
 - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
   object_cache instead of kmemleak_initialized").
 
 - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
   and _folio_order").
 
 - A VMA locking scalability improvement from Suren Baghdasaryan
   ("Per-VMA lock support for swap and userfaults").
 
 - pagetable handling cleanups from Matthew Wilcox ("New page table range
   API").
 
 - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
   using page->private on tail pages for THP_SWAP + cleanups").
 
 - Cleanups and speedups to the hugetlb fault handling from Matthew
   Wilcox ("Change calling convention for ->huge_fault").
 
 - Matthew Wilcox has also done some maintenance work on the MM subsystem
   documentation ("Improve mm documentation").
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO1JUQAKCRDdBJ7gKXxA
 jrMwAP47r/fS8vAVT3zp/7fXmxaJYTK27CTAM881Gw1SDhFM/wEAv8o84mDenCg6
 Nfio7afS1ncD+hPYT8947UnLxTgn+ww=
 =Afws
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Some swap cleanups from Ma Wupeng ("fix WARN_ON in
   add_to_avail_list")

 - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
   reduces the special-case code for handling hugetlb pages in GUP. It
   also speeds up GUP handling of transparent hugepages.

 - Peng Zhang provides some maple tree speedups ("Optimize the fast path
   of mas_store()").

 - Sergey Senozhatsky has improved te performance of zsmalloc during
   compaction (zsmalloc: small compaction improvements").

 - Domenico Cerasuolo has developed additional selftest code for zswap
   ("selftests: cgroup: add zswap test program").

 - xu xin has doe some work on KSM's handling of zero pages. These
   changes are mainly to enable the user to better understand the
   effectiveness of KSM's treatment of zero pages ("ksm: support
   tracking KSM-placed zero-pages").

 - Jeff Xu has fixes the behaviour of memfd's
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").

 - David Howells has fixed an fscache optimization ("mm, netfs, fscache:
   Stop read optimisation when folio removed from pagecache").

 - Axel Rasmussen has given userfaultfd the ability to simulate memory
   poisoning ("add UFFDIO_POISON to simulate memory poisoning with
   UFFD").

 - Miaohe Lin has contributed some routine maintenance work on the
   memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
   check").

 - Peng Zhang has contributed some maintenance work on the maple tree
   code ("Improve the validation for maple tree and some cleanup").

 - Hugh Dickins has optimized the collapsing of shmem or file pages into
   THPs ("mm: free retracted page table by RCU").

 - Jiaqi Yan has a patch series which permits us to use the healthy
   subpages within a hardware poisoned huge page for general purposes
   ("Improve hugetlbfs read on HWPOISON hugepages").

 - Kemeng Shi has done some maintenance work on the pagetable-check code
   ("Remove unused parameters in page_table_check").

 - More folioification work from Matthew Wilcox ("More filesystem folio
   conversions for 6.6"), ("Followup folio conversions for zswap"). And
   from ZhangPeng ("Convert several functions in page_io.c to use a
   folio").

 - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").

 - Baoquan He has converted some architectures to use the
   GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
   architectures to take GENERIC_IOREMAP way").

 - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
   batched/deferred tlb shootdown during page reclamation/migration").

 - Better maple tree lockdep checking from Liam Howlett ("More strict
   maple tree lockdep"). Liam also developed some efficiency
   improvements ("Reduce preallocations for maple tree").

 - Cleanup and optimization to the secondary IOMMU TLB invalidation,
   from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
   upgrade").

 - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
   for arm64").

 - Kemeng Shi provides some maintenance work on the compaction code
   ("Two minor cleanups for compaction").

 - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
   most file-backed faults under the VMA lock").

 - Aneesh Kumar contributes code to use the vmemmap optimization for DAX
   on ppc64, under some circumstances ("Add support for DAX vmemmap
   optimization for ppc64").

 - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
   data in page_ext"), ("minor cleanups to page_ext header").

 - Some zswap cleanups from Johannes Weiner ("mm: zswap: three
   cleanups").

 - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").

 - VMA handling cleanups from Kefeng Wang ("mm: convert to
   vma_is_initial_heap/stack()").

 - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
   implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
   address ranges and DAMON monitoring targets").

 - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").

 - Liam Howlett has improved the maple tree node replacement code
   ("maple_tree: Change replacement strategy").

 - ZhangPeng has a general code cleanup - use the K() macro more widely
   ("cleanup with helper macro K()").

 - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
   memmap on memory feature on ppc64").

 - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
   in page_alloc"), ("Two minor cleanups for get pageblock
   migratetype").

 - Vishal Moola introduces a memory descriptor for page table tracking,
   "struct ptdesc" ("Split ptdesc from struct page").

 - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
   for vm.memfd_noexec").

 - MM include file rationalization from Hugh Dickins ("arch: include
   asm/cacheflush.h in asm/hugetlb.h").

 - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
   output").

 - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
   object_cache instead of kmemleak_initialized").

 - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
   and _folio_order").

 - A VMA locking scalability improvement from Suren Baghdasaryan
   ("Per-VMA lock support for swap and userfaults").

 - pagetable handling cleanups from Matthew Wilcox ("New page table
   range API").

 - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
   using page->private on tail pages for THP_SWAP + cleanups").

 - Cleanups and speedups to the hugetlb fault handling from Matthew
   Wilcox ("Change calling convention for ->huge_fault").

 - Matthew Wilcox has also done some maintenance work on the MM
   subsystem documentation ("Improve mm documentation").

* tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
  maple_tree: shrink struct maple_tree
  maple_tree: clean up mas_wr_append()
  secretmem: convert page_is_secretmem() to folio_is_secretmem()
  nios2: fix flush_dcache_page() for usage from irq context
  hugetlb: add documentation for vma_kernel_pagesize()
  mm: add orphaned kernel-doc to the rst files.
  mm: fix clean_record_shared_mapping_range kernel-doc
  mm: fix get_mctgt_type() kernel-doc
  mm: fix kernel-doc warning from tlb_flush_rmaps()
  mm: remove enum page_entry_size
  mm: allow ->huge_fault() to be called without the mmap_lock held
  mm: move PMD_ORDER to pgtable.h
  mm: remove checks for pte_index
  memcg: remove duplication detection for mem_cgroup_uncharge_swap
  mm/huge_memory: work on folio->swap instead of page->private when splitting folio
  mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
  mm/swap: use dedicated entry for swap in folio
  mm/swap: stop using page->private on tail pages for THP_SWAP
  selftests/mm: fix WARNING comparing pointer to 0
  selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
  ...
2023-08-29 14:25:26 -07:00
Linus Torvalds
b4a04f92a4 v6.6-fs.proc.uapi
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXT2QAKCRCRxhvAZXjc
 olkFAQCT4nRkRTpBvbiv4DgvCIy+URqLNfHGxCxdAX1B09o3UwEAyepf1tz7aFpB
 wB67V265JFDMWtvQkSx4ORNpAjZ9Kg0=
 =Opqi
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-fs.proc.uapi' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull procfs fixes from Christian Brauner:
 "Mode changes to files under /proc/<pid>/ aren't supported ever since
  commit 6d76fa58b0 ("Don't allow chmod() on the /proc/<pid>/ files").

  Due to an oversight in commit 1b3044e39a ("procfs: fix pthread
  cross-thread naming if !PR_DUMPABLE") in switching from REG to NOD,
  mode changes on /proc/thread-self/comm were accidently allowed.

  Similar, mode changes for all files beneath /proc/<pid>/net/ are
  blocked but mode changes on /proc/<pid>/net itself were accidently
  allowed.

  Both issues come down to not using the generic proc_setattr() helper
  which blocks all mode changes. This is rectified with this pull
  request.

  This also removes a strange nolibc test that abused /proc/<pid>/net
  for testing mode changes. Using procfs for this test never made a lot
  of sense given procfs has special semantics for almost everything
  anway.

  Both changes are minor user-visible changes. It is however very
  unlikely that mode changes on proc/<pid>/net and
  /proc/thread-self/comm are something that userspace relies on"

* tag 'v6.6-fs.proc.uapi' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  procfs: block chmod on /proc/thread-self/comm
  proc: use generic setattr() for /proc/$PID/net
  selftests/nolibc: drop test chmod_net
2023-08-28 11:43:19 -07:00
Linus Torvalds
615e95831e v6.6-vfs.ctime
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTKAAKCRCRxhvAZXjc
 oifJAQCzi/p+AdQu8LA/0XvR7fTwaq64ZDCibU4BISuLGT2kEgEAuGbuoFZa0rs2
 XYD/s4+gi64p9Z01MmXm2XO1pu3GPg0=
 =eJz5
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs timestamp updates from Christian Brauner:
 "This adds VFS support for multi-grain timestamps and converts tmpfs,
  xfs, ext4, and btrfs to use them. This carries acks from all relevant
  filesystems.

  The VFS always uses coarse-grained timestamps when updating the ctime
  and mtime after a change. This has the benefit of allowing filesystems
  to optimize away a lot of metadata updates, down to around 1 per
  jiffy, even when a file is under heavy writes.

  Unfortunately, this has always been an issue when we're exporting via
  NFSv3, which relies on timestamps to validate caches. A lot of changes
  can happen in a jiffy, so timestamps aren't sufficient to help the
  client decide to invalidate the cache.

  Even with NFSv4, a lot of exported filesystems don't properly support
  a change attribute and are subject to the same problems with timestamp
  granularity. Other applications have similar issues with timestamps
  (e.g., backup applications).

  If we were to always use fine-grained timestamps, that would improve
  the situation, but that becomes rather expensive, as the underlying
  filesystem would have to log a lot more metadata updates.

  This introduces fine-grained timestamps that are used when they are
  actively queried.

  This uses the 31st bit of the ctime tv_nsec field to indicate that
  something has queried the inode for the mtime or ctime. When this flag
  is set, on the next mtime or ctime update, the kernel will fetch a
  fine-grained timestamp instead of the usual coarse-grained one.

  As POSIX generally mandates that when the mtime changes, the ctime
  must also change the kernel always stores normalized ctime values, so
  only the first 30 bits of the tv_nsec field are ever used.

  Filesytems can opt into this behavior by setting the FS_MGTIME flag in
  the fstype. Filesystems that don't set this flag will continue to use
  coarse-grained timestamps.

  Various preparatory changes, fixes and cleanups are included:

   - Fixup all relevant places where POSIX requires updating ctime
     together with mtime. This is a wide-range of places and all
     maintainers provided necessary Acks.

   - Add new accessors for inode->i_ctime directly and change all
     callers to rely on them. Plain accesses to inode->i_ctime are now
     gone and it is accordingly rename to inode->__i_ctime and commented
     as requiring accessors.

   - Extend generic_fillattr() to pass in a request mask mirroring in a
     sense the statx() uapi. This allows callers to pass in a request
     mask to only get a subset of attributes filled in.

   - Rework timestamp updates so it's possible to drop the @now
     parameter the update_time() inode operation and associated helpers.

   - Add inode_update_timestamps() and convert all filesystems to it
     removing a bunch of open-coding"

* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
  btrfs: convert to multigrain timestamps
  ext4: switch to multigrain timestamps
  xfs: switch to multigrain timestamps
  tmpfs: add support for multigrain timestamps
  fs: add infrastructure for multigrain timestamps
  fs: drop the timespec64 argument from update_time
  xfs: have xfs_vn_update_time gets its own timestamp
  fat: make fat_update_time get its own timestamp
  fat: remove i_version handling from fat_update_time
  ubifs: have ubifs_update_time use inode_update_timestamps
  btrfs: have it use inode_update_timestamps
  fs: drop the timespec64 arg from generic_update_time
  fs: pass the request_mask to generic_fillattr
  fs: remove silly warning from current_time
  gfs2: fix timestamp handling on quota inodes
  fs: rename i_ctime field to __i_ctime
  selinux: convert to ctime accessor functions
  security: convert to ctime accessor functions
  apparmor: convert to ctime accessor functions
  sunrpc: convert to ctime accessor functions
  ...
2023-08-28 09:31:32 -07:00
Oleg Nesterov
dce8f8ed1d document while_each_thread(), change first_tid() to use for_each_thread()
Add the comment to explain that while_each_thread(g,t) is not rcu-safe
unless g is stable (e.g.  current).  Even if g is a group leader and thus
can't exit before t, t or another sub-thread can exec and remove g from
the thread_group list.

The only lockless user of while_each_thread() is first_tid() and it is
fine in that it can't loop forever, yet for_each_thread() looks better and
I am going to change while_each_thread/next_thread.

Link: https://lkml.kernel.org/r/20230823170806.GA11724@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24 16:25:15 -07:00
xu xin
6080d19f07 ksm: add ksm zero pages for each process
As the number of ksm zero pages is not included in ksm_merging_pages per
process when enabling use_zero_pages, it's unclear of how many actual
pages are merged by KSM. To let users accurately estimate their memory
demands when unsharing KSM zero-pages, it's necessary to show KSM zero-
pages per process. In addition, it help users to know the actual KSM
profit because KSM-placed zero pages are also benefit from KSM.

since unsharing zero pages placed by KSM accurately is achieved, then
tracking empty pages merging and unmerging is not a difficult thing any
longer.

Since we already have /proc/<pid>/ksm_stat, just add the information of
'ksm_zero_pages' in it.

Link: https://lkml.kernel.org/r/20230613030938.185993-1-yang.yang29@zte.com.cn
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:12:10 -07:00
Jeff Layton
0d72b92883 fs: pass the request_mask to generic_fillattr
generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.

Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.

Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-09 08:56:36 +02:00
Linus Torvalds
0a2c2baafa
proc: fix missing conversion to 'iterate_shared'
I'm looking at the directory handling due to the discussion about f_pos
locking (see commit 797964253d: "file: reinstate f_pos locking
optimization for regular files"), and wanting to clean that up.

And one source of ugliness is how we were supposed to move filesystems
over to the '->iterate_shared()' function that only takes the inode lock
for reading many many years ago, but several filesystems still use the
bad old '->iterate()' that takes the inode lock for exclusive access.

See commit 6192269444 ("introduce a parallel variant of ->iterate()")
that also added some documentation stating

      Old method is only used if the new one is absent; eventually it will
      be removed.  Switch while you still can; the old one won't stay.

and that was back in April 2016.  Here we are, many years later, and the
old version is still clearly sadly alive and well.

Now, some of those old style iterators are probably just because the
filesystem may end up having per-inode mutable data that it uses for
iterating a directory, but at least one case is just a mistake.

Al switched over most filesystems to use '->iterate_shared()' back when
it was introduced.  In particular, the /proc filesystem was converted as
one of the first ones in commit f50752eaa0 ("switch all procfs
directories ->iterate_shared()").

But then later one new user of '->iterate()' was then re-introduced by
commit 6d9c939dbe ("procfs: add smack subdir to attrs").

And that's clearly not what we wanted, since that new case just uses the
same 'proc_pident_readdir()' and 'proc_pident_lookup()' helper functions
that other /proc pident directories use, and they are most definitely
safe to use with the inode lock held shared.

So just fix it.

This still leaves a fair number of oddball filesystems using the
old-style directory iterator (ceph, coda, exfat, jfs, ntfs, ocfs2,
overlayfs, and vboxsf), but at least we don't have any remaining in the
core filesystems.

I'm going to add a wrapper function that just drops the read-lock and
takes it as a write lock, so that we can clean up the core vfs layer and
make all the ugly 'this filesystem needs exclusive inode locking' be
just filesystem-internal warts.

I just didn't want to make that conversion when we still had a core user
left.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-06 15:08:35 +02:00
Jeff Layton
e9d7d3cb9f procfs: convert to ctime accessor functions
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-65-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-24 10:30:03 +02:00
Aleksa Sarai
ccf61486fe
procfs: block chmod on /proc/thread-self/comm
Due to an oversight in commit 1b3044e39a ("procfs: fix pthread
cross-thread naming if !PR_DUMPABLE") in switching from REG to NOD,
chmod operations on /proc/thread-self/comm were no longer blocked as
they are on almost all other procfs files.

A very similar situation with /proc/self/environ was used to as a root
exploit a long time ago, but procfs has SB_I_NOEXEC so this is simply a
correctness issue.

Ref: https://lwn.net/Articles/191954/
Ref: 6d76fa58b0 ("Don't allow chmod() on the /proc/<pid>/ files")
Fixes: 1b3044e39a ("procfs: fix pthread cross-thread naming if !PR_DUMPABLE")
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Message-Id: <20230713141001.27046-1-cyphar@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-13 16:30:52 +02:00
Linus Torvalds
33afd4b763 Mainly singleton patches all over the place. Series of note are:
- updates to scripts/gdb from Glenn Washburn
 
 - kexec cleanups from Bjorn Helgaas
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr+6wAKCRDdBJ7gKXxA
 jn4NAP4u/hj/kR2dxYehcVLuQqJspCRZZBZlAReFJyHNQO6voAEAk0NN9rtG2+/E
 r0G29CJhK+YL0W6mOs8O1yo9J1rZnAM=
 =2CUV
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2023-04-27-16-01' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "Mainly singleton patches all over the place.

  Series of note are:

   - updates to scripts/gdb from Glenn Washburn

   - kexec cleanups from Bjorn Helgaas"

* tag 'mm-nonmm-stable-2023-04-27-16-01' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (50 commits)
  mailmap: add entries for Paul Mackerras
  libgcc: add forward declarations for generic library routines
  mailmap: add entry for Oleksandr
  ocfs2: reduce ioctl stack usage
  fs/proc: add Kthread flag to /proc/$pid/status
  ia64: fix an addr to taddr in huge_pte_offset()
  checkpatch: introduce proper bindings license check
  epoll: rename global epmutex
  scripts/gdb: add GDB convenience functions $lx_dentry_name() and $lx_i_dentry()
  scripts/gdb: create linux/vfs.py for VFS related GDB helpers
  uapi/linux/const.h: prefer ISO-friendly __typeof__
  delayacct: track delays from IRQ/SOFTIRQ
  scripts/gdb: timerlist: convert int chunks to str
  scripts/gdb: print interrupts
  scripts/gdb: raise error with reduced debugging information
  scripts/gdb: add a Radix Tree Parser
  lib/rbtree: use '+' instead of '|' for setting color.
  proc/stat: remove arch_idle_time()
  checkpatch: check for misuse of the link tags
  checkpatch: allow Closes tags with links
  ...
2023-04-27 19:57:00 -07:00
Stefan Roesch
d21077fbc2 mm: add new KSM process and sysfs knobs
This adds the general_profit KSM sysfs knob and the process profit metric
knobs to ksm_stat.

1) expose general_profit metric

   The documentation mentions a general profit metric, however this
   metric is not calculated.  In addition the formula depends on the size
   of internal structures, which makes it more difficult for an
   administrator to make the calculation.  Adding the metric for a better
   user experience.

2) document general_profit sysfs knob

3) calculate ksm process profit metric

   The ksm documentation mentions the process profit metric and how to
   calculate it.  This adds the calculation of the metric.

4) mm: expose ksm process profit metric in ksm_stat

   This exposes the ksm process profit metric in /proc/<pid>/ksm_stat.
   The documentation mentions the formula for the ksm process profit
   metric, however it does not calculate it.  In addition the formula
   depends on the size of internal structures.  So it makes sense to
   expose it.

5) document new procfs ksm knobs

Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21 14:52:03 -07:00
Chao Yu
b99b258899 proc: remove mark_inode_dirty() in .setattr()
procfs' .setattr() has updated i_uid, i_gid and i_mode into proc dirent,
we don't need to call mark_inode_dirty() for delayed update, remove it.

Link: https://lkml.kernel.org/r/20230131150840.34726-1-chao@kernel.org
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-08 13:45:36 -07:00
Christian Brauner
4609e1f18e
fs: port ->permission() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:28 +01:00
Christian Brauner
b74d24f7a7
fs: port ->getattr() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:25 +01:00
Christian Brauner
c1632a0f11
fs: port ->setattr() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:02 +01:00
Linus Torvalds
27bc50fc90 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any negative
   reports (or any positive ones, come to that).
 
 - Also the Maple Tree from Liam R.  Howlett.  An overlapping range-based
   tree for vmas.  It it apparently slight more efficient in its own right,
   but is mainly targeted at enabling work to reduce mmap_lock contention.
 
   Liam has identified a number of other tree users in the kernel which
   could be beneficially onverted to mapletrees.
 
   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   (https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com).
   This has yet to be addressed due to Liam's unfortunately timed
   vacation.  He is now back and we'll get this fixed up.
 
 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer.  It uses
   clang-generated instrumentation to detect used-unintialized bugs down to
   the single bit level.
 
   KMSAN keeps finding bugs.  New ones, as well as the legacy ones.
 
 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.
 
 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support
   file/shmem-backed pages.
 
 - userfaultfd updates from Axel Rasmussen
 
 - zsmalloc cleanups from Alexey Romanov
 
 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure
 
 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.
 
 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.
 
 - memcg cleanups from Kairui Song.
 
 - memcg fixes and cleanups from Johannes Weiner.
 
 - Vishal Moola provides more folio conversions
 
 - Zhang Yi removed ll_rw_block() :(
 
 - migration enhancements from Peter Xu
 
 - migration error-path bugfixes from Huang Ying
 
 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths.  For optimizations by PMEM drivers, DRM
   drivers, etc.
 
 - vma merging improvements from Jakub Matěn.
 
 - NUMA hinting cleanups from David Hildenbrand.
 
 - xu xin added aditional userspace visibility into KSM merging activity.
 
 - THP & KSM code consolidation from Qi Zheng.
 
 - more folio work from Matthew Wilcox.
 
 - KASAN updates from Andrey Konovalov.
 
 - DAMON cleanups from Kaixu Xia.
 
 - DAMON work from SeongJae Park: fixes, cleanups.
 
 - hugetlb sysfs cleanups from Muchun Song.
 
 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA
 joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf
 bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU=
 =xfWx
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
   linux-next for a couple of months without, to my knowledge, any
   negative reports (or any positive ones, come to that).

 - Also the Maple Tree from Liam Howlett. An overlapping range-based
   tree for vmas. It it apparently slightly more efficient in its own
   right, but is mainly targeted at enabling work to reduce mmap_lock
   contention.

   Liam has identified a number of other tree users in the kernel which
   could be beneficially onverted to mapletrees.

   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   at [1]. This has yet to be addressed due to Liam's unfortunately
   timed vacation. He is now back and we'll get this fixed up.

 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
   clang-generated instrumentation to detect used-unintialized bugs down
   to the single bit level.

   KMSAN keeps finding bugs. New ones, as well as the legacy ones.

 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.

 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
   support file/shmem-backed pages.

 - userfaultfd updates from Axel Rasmussen

 - zsmalloc cleanups from Alexey Romanov

 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
   memory-failure

 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.

 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.

 - memcg cleanups from Kairui Song.

 - memcg fixes and cleanups from Johannes Weiner.

 - Vishal Moola provides more folio conversions

 - Zhang Yi removed ll_rw_block() :(

 - migration enhancements from Peter Xu

 - migration error-path bugfixes from Huang Ying

 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths. For optimizations by PMEM drivers, DRM
   drivers, etc.

 - vma merging improvements from Jakub Matěn.

 - NUMA hinting cleanups from David Hildenbrand.

 - xu xin added aditional userspace visibility into KSM merging
   activity.

 - THP & KSM code consolidation from Qi Zheng.

 - more folio work from Matthew Wilcox.

 - KASAN updates from Andrey Konovalov.

 - DAMON cleanups from Kaixu Xia.

 - DAMON work from SeongJae Park: fixes, cleanups.

 - hugetlb sysfs cleanups from Muchun Song.

 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.

Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]

* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
  hugetlb: allocate vma lock for all sharable vmas
  hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
  hugetlb: fix vma lock handling during split vma and range unmapping
  mglru: mm/vmscan.c: fix imprecise comments
  mm/mglru: don't sync disk for each aging cycle
  mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
  mm: memcontrol: use do_memsw_account() in a few more places
  mm: memcontrol: deprecate swapaccounting=0 mode
  mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
  mm/secretmem: remove reduntant return value
  mm/hugetlb: add available_huge_pages() func
  mm: remove unused inline functions from include/linux/mm_inline.h
  selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
  selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
  selftests/vm: add thp collapse shmem testing
  selftests/vm: add thp collapse file and tmpfs testing
  selftests/vm: modularize thp collapse memory operations
  selftests/vm: dedup THP helpers
  mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  mm/madvise: add file and shmem support to MADV_COLLAPSE
  ...
2022-10-10 17:53:04 -07:00
xu xin
cb4df4cae4 ksm: count allocated ksm rmap_items for each process
Patch series "ksm: count allocated rmap_items and update documentation",
v5.

KSM can save memory by merging identical pages, but also can consume
additional memory, because it needs to generate rmap_items to save each
scanned page's brief rmap information.

To determine how beneficial the ksm-policy (like madvise), they are using
brings, so we add a new interface /proc/<pid>/ksm_stat for each process
The value "ksm_rmap_items" in it indicates the total allocated ksm
rmap_items of this process.

The detailed description can be seen in the following patches' commit
message.


This patch (of 2):

KSM can save memory by merging identical pages, but also can consume
additional memory, because it needs to generate rmap_items to save each
scanned page's brief rmap information.  Some of these pages may be merged,
but some may not be abled to be merged after being checked several times,
which are unprofitable memory consumed.

The information about whether KSM save memory or consume memory in
system-wide range can be determined by the comprehensive calculation of
pages_sharing, pages_shared, pages_unshared and pages_volatile.  A simple
approximate calculation:

	profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
	         sizeof(rmap_item);

where all_rmap_items equals to the sum of pages_sharing, pages_shared,
pages_unshared and pages_volatile.

But we cannot calculate this kind of ksm profit inner single-process wide
because the information of ksm rmap_item's number of a process is lacked. 
For user applications, if this kind of information could be obtained, it
helps upper users know how beneficial the ksm-policy (like madvise) they
are using brings, and then optimize their app code.  For example, one
application madvise 1000 pages as MERGEABLE, while only a few pages are
really merged, then it's not cost-efficient.

So we add a new interface /proc/<pid>/ksm_stat for each process in which
the value of ksm_rmap_itmes is only shown now and so more values can be
added in future.

So similarly, we can calculate the ksm profit approximately for a single
process by:

	profit =~ ksm_merging_pages * sizeof(page) - ksm_rmap_items *
		 sizeof(rmap_item);

where ksm_merging_pages is shown at /proc/<pid>/ksm_merging_pages, and
ksm_rmap_items is shown in /proc/<pid>/ksm_stat.

Link: https://lkml.kernel.org/r/20220830143731.299702-1-xu.xin16@zte.com.cn
Link: https://lkml.kernel.org/r/20220830143838.299758-1-xu.xin16@zte.com.cn
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:29 -07:00
Liam R. Howlett
5f14b9246e fs/proc/base: use the vma iterators in place of linked list
Use the vma iterator instead of a for loop across the linked list.  The
link list of vmas will be removed in this patch set.

Link: https://lkml.kernel.org/r/20220906194824.2110408-43-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:21 -07:00
Al Viro
5b09c9fec0 do_proc_readlink(): constify path
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-09-01 17:38:37 -04:00
Al Viro
c8e477c649 ->getprocattr(): attribute name is const char *, TYVM...
cast of ->d_name.name to char * is completely wrong - nothing is
allowed to modify its contents.

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-09-01 17:34:39 -04:00
Zhihao Cheng
d919a1e79b proc: fix a dentry lock race between release_task and lookup
Commit 7bc3e6e55a ("proc: Use a list of inodes to flush from proc")
moved proc_flush_task() behind __exit_signal().  Then, process systemd can
take long period high cpu usage during releasing task in following
concurrent processes:

  systemd                                 ps
kernel_waitid                 stat(/proc/tgid)
  do_wait                       filename_lookup
    wait_consider_task            lookup_fast
      release_task
        __exit_signal
          __unhash_process
            detach_pid
              __change_pid // remove task->pid_links
                                     d_revalidate -> pid_revalidate  // 0
                                     d_invalidate(/proc/tgid)
                                       shrink_dcache_parent(/proc/tgid)
                                         d_walk(/proc/tgid)
                                           spin_lock_nested(/proc/tgid/fd)
                                           // iterating opened fd
        proc_flush_pid                                    |
           d_invalidate (/proc/tgid/fd)                   |
              shrink_dcache_parent(/proc/tgid/fd)         |
                shrink_dentry_list(subdirs)               ↓
                  shrink_lock_dentry(/proc/tgid/fd) --> race on dentry lock

Function d_invalidate() will remove dentry from hash firstly, but why does
proc_flush_pid() process dentry '/proc/tgid/fd' before dentry
'/proc/tgid'?  That's because proc_pid_make_inode() adds proc inode in
reverse order by invoking hlist_add_head_rcu().  But proc should not add
any inodes under '/proc/tgid' except '/proc/tgid/task/pid', fix it by
adding inode into 'pid->inodes' only if the inode is /proc/tgid or
/proc/tgid/task/pid.

Performance regression:
Create 200 tasks, each task open one file for 50,000 times. Kill all
tasks when opened files exceed 10,000,000 (cat /proc/sys/fs/file-nr).

Before fix:
$ time killall -wq aa
  real    4m40.946s   # During this period, we can see 'ps' and 'systemd'
			taking high cpu usage.

After fix:
$ time killall -wq aa
  real    1m20.732s   # During this period, we can see 'systemd' taking
			high cpu usage.

Link: https://lkml.kernel.org/r/20220713130029.4133533-1-chengzhihao1@huawei.com
Fixes: 7bc3e6e55a ("proc: Use a list of inodes to flush from proc")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216054
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Suggested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17 17:31:43 -07:00
xu xin
7609385337 ksm: count ksm merging pages for each process
Some applications or containers want to use KSM by calling madvise() to
advise areas of address space to be MERGEABLE.  But they may not know
which applications are more likely to cause real merges in the
deployment.  If this patch is applied, it helps them know their
corresponding number of merged pages, and then optimize their app code.

As current KSM only counts the number of KSM merging pages(e.g. 
ksm_pages_sharing and ksm_pages_shared) of the whole system, we cannot see
the more fine-grained KSM merging, for the upper application optimization,
the merging area cannot be set easily according to the KSM page merging
probability of each process.  Therefore, it is necessary to add extra
statistical means so that the upper level users can know the detailed KSM
merging information of each process.

We add a new proc file named as ksm_merging_pages under /proc/<pid>/ to
indicate the involved ksm merging pages of this process.

[akpm@linux-foundation.org: fix comment typo, remove BUG_ON()s]
Link: https://lkml.kernel.org/r/20220325082318.2352853-1-xu.xin16@zte.com.cn
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Ohhoon Kwon <ohoono.kwon@samsung.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Stephen Brennan <stephen.s.brennan@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Cc: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28 23:16:16 -07:00
Linus Torvalds
1930a6e739 ptrace: Cleanups for v5.18
This set of changes removes tracehook.h, moves modification of all of
 the ptrace fields inside of siglock to remove races, adds a missing
 permission check to ptrace.c
 
 The removal of tracehook.h is quite significant as it has been a major
 source of confusion in recent years.  Much of that confusion was
 around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled
 making the semantics clearer).
 
 For people who don't know tracehook.h is a vestiage of an attempt to
 implement uprobes like functionality that was never fully merged, and
 was later superseeded by uprobes when uprobes was merged.  For many
 years now we have been removing what tracehook functionaly a little
 bit at a time.  To the point where now anything left in tracehook.h is
 some weird strange thing that is difficult to understand.
 
 Eric W. Biederman (15):
       ptrace: Move ptrace_report_syscall into ptrace.h
       ptrace/arm: Rename tracehook_report_syscall report_syscall
       ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
       ptrace: Remove arch_syscall_{enter,exit}_tracehook
       ptrace: Remove tracehook_signal_handler
       task_work: Remove unnecessary include from posix_timers.h
       task_work: Introduce task_work_pending
       task_work: Call tracehook_notify_signal from get_signal on all architectures
       task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
       signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
       resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
       resume_user_mode: Move to resume_user_mode.h
       tracehook: Remove tracehook.h
       ptrace: Move setting/clearing ptrace_message into ptrace_stop
       ptrace: Return the signal to continue with from ptrace_stop
 
 Jann Horn (1):
       ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
 
 Yang Li (1):
       ptrace: Remove duplicated include in ptrace.c
 
  MAINTAINERS                          |   1 -
  arch/Kconfig                         |   5 +-
  arch/alpha/kernel/ptrace.c           |   5 +-
  arch/alpha/kernel/signal.c           |   4 +-
  arch/arc/kernel/ptrace.c             |   5 +-
  arch/arc/kernel/signal.c             |   4 +-
  arch/arm/kernel/ptrace.c             |  12 +-
  arch/arm/kernel/signal.c             |   4 +-
  arch/arm64/kernel/ptrace.c           |  14 +--
  arch/arm64/kernel/signal.c           |   4 +-
  arch/csky/kernel/ptrace.c            |   5 +-
  arch/csky/kernel/signal.c            |   4 +-
  arch/h8300/kernel/ptrace.c           |   5 +-
  arch/h8300/kernel/signal.c           |   4 +-
  arch/hexagon/kernel/process.c        |   4 +-
  arch/hexagon/kernel/signal.c         |   1 -
  arch/hexagon/kernel/traps.c          |   6 +-
  arch/ia64/kernel/process.c           |   4 +-
  arch/ia64/kernel/ptrace.c            |   6 +-
  arch/ia64/kernel/signal.c            |   1 -
  arch/m68k/kernel/ptrace.c            |   5 +-
  arch/m68k/kernel/signal.c            |   4 +-
  arch/microblaze/kernel/ptrace.c      |   5 +-
  arch/microblaze/kernel/signal.c      |   4 +-
  arch/mips/kernel/ptrace.c            |   5 +-
  arch/mips/kernel/signal.c            |   4 +-
  arch/nds32/include/asm/syscall.h     |   2 +-
  arch/nds32/kernel/ptrace.c           |   5 +-
  arch/nds32/kernel/signal.c           |   4 +-
  arch/nios2/kernel/ptrace.c           |   5 +-
  arch/nios2/kernel/signal.c           |   4 +-
  arch/openrisc/kernel/ptrace.c        |   5 +-
  arch/openrisc/kernel/signal.c        |   4 +-
  arch/parisc/kernel/ptrace.c          |   7 +-
  arch/parisc/kernel/signal.c          |   4 +-
  arch/powerpc/kernel/ptrace/ptrace.c  |   8 +-
  arch/powerpc/kernel/signal.c         |   4 +-
  arch/riscv/kernel/ptrace.c           |   5 +-
  arch/riscv/kernel/signal.c           |   4 +-
  arch/s390/include/asm/entry-common.h |   1 -
  arch/s390/kernel/ptrace.c            |   1 -
  arch/s390/kernel/signal.c            |   5 +-
  arch/sh/kernel/ptrace_32.c           |   5 +-
  arch/sh/kernel/signal_32.c           |   4 +-
  arch/sparc/kernel/ptrace_32.c        |   5 +-
  arch/sparc/kernel/ptrace_64.c        |   5 +-
  arch/sparc/kernel/signal32.c         |   1 -
  arch/sparc/kernel/signal_32.c        |   4 +-
  arch/sparc/kernel/signal_64.c        |   4 +-
  arch/um/kernel/process.c             |   4 +-
  arch/um/kernel/ptrace.c              |   5 +-
  arch/x86/kernel/ptrace.c             |   1 -
  arch/x86/kernel/signal.c             |   5 +-
  arch/x86/mm/tlb.c                    |   1 +
  arch/xtensa/kernel/ptrace.c          |   5 +-
  arch/xtensa/kernel/signal.c          |   4 +-
  block/blk-cgroup.c                   |   2 +-
  fs/coredump.c                        |   1 -
  fs/exec.c                            |   1 -
  fs/io-wq.c                           |   6 +-
  fs/io_uring.c                        |  11 +-
  fs/proc/array.c                      |   1 -
  fs/proc/base.c                       |   1 -
  include/asm-generic/syscall.h        |   2 +-
  include/linux/entry-common.h         |  47 +-------
  include/linux/entry-kvm.h            |   2 +-
  include/linux/posix-timers.h         |   1 -
  include/linux/ptrace.h               |  81 ++++++++++++-
  include/linux/resume_user_mode.h     |  64 ++++++++++
  include/linux/sched/signal.h         |  17 +++
  include/linux/task_work.h            |   5 +
  include/linux/tracehook.h            | 226 -----------------------------------
  include/uapi/linux/ptrace.h          |   2 +-
  kernel/entry/common.c                |  19 +--
  kernel/entry/kvm.c                   |   9 +-
  kernel/exit.c                        |   3 +-
  kernel/livepatch/transition.c        |   1 -
  kernel/ptrace.c                      |  47 +++++---
  kernel/seccomp.c                     |   1 -
  kernel/signal.c                      |  62 +++++-----
  kernel/task_work.c                   |   4 +-
  kernel/time/posix-cpu-timers.c       |   1 +
  mm/memcontrol.c                      |   2 +-
  security/apparmor/domain.c           |   1 -
  security/selinux/hooks.c             |   1 -
  85 files changed, 372 insertions(+), 495 deletions(-)
 
 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmJCQkoACgkQC/v6Eiaj
 j0DCWQ/5AZVFU+hX32obUNCLackHTwgcCtSOs3JNBmNA/zL/htPiYYG0ghkvtlDR
 Dw5J5DnxC6P7PVAdAqrpvx2uX2FebHYU0bRlyLx8LYUEP5dhyNicxX9jA882Z+vw
 Ud0Ue9EojwGWS76dC9YoKUj3slThMATbhA2r4GVEoof8fSNJaBxQIqath44t0FwU
 DinWa+tIOvZANGBZr6CUUINNIgqBIZCH/R4h6ArBhMlJpuQ5Ufk2kAaiWFwZCkX4
 0LuuAwbKsCKkF8eap5I2KrIg/7zZVgxAg9O3cHOzzm8OPbKzRnNnQClcDe8perqp
 S6e/f3MgpE+eavd1EiLxevZ660cJChnmikXVVh8ZYYoefaMKGqBaBSsB38bNcLjY
 3+f2dB+TNBFRnZs1aCujK3tWBT9QyjZDKtCBfzxDNWBpXGLhHH6j6lA5Lj+Cef5K
 /HNHFb+FuqedlFZh5m1Y+piFQ70hTgCa2u8b+FSOubI2hW9Zd+WzINV0ANaZ2LvZ
 4YGtcyDNk1q1+c87lxP9xMRl/xi6rNg+B9T2MCo4IUnHgpSVP6VEB3osgUmrrrN0
 eQlUI154G/AaDlqXLgmn1xhRmlPGfmenkxpok1AuzxvNJsfLKnpEwQSc13g3oiZr
 disZQxNY0kBO2Nv3G323Z6PLinhbiIIFez6cJzK5v0YJ2WtO3pY=
 =uEro
 -----END PGP SIGNATURE-----

Merge tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull ptrace cleanups from Eric Biederman:
 "This set of changes removes tracehook.h, moves modification of all of
  the ptrace fields inside of siglock to remove races, adds a missing
  permission check to ptrace.c

  The removal of tracehook.h is quite significant as it has been a major
  source of confusion in recent years. Much of that confusion was around
  task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled making the
  semantics clearer).

  For people who don't know tracehook.h is a vestiage of an attempt to
  implement uprobes like functionality that was never fully merged, and
  was later superseeded by uprobes when uprobes was merged. For many
  years now we have been removing what tracehook functionaly a little
  bit at a time. To the point where anything left in tracehook.h was
  some weird strange thing that was difficult to understand"

* tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  ptrace: Remove duplicated include in ptrace.c
  ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
  ptrace: Return the signal to continue with from ptrace_stop
  ptrace: Move setting/clearing ptrace_message into ptrace_stop
  tracehook: Remove tracehook.h
  resume_user_mode: Move to resume_user_mode.h
  resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
  signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
  task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
  task_work: Call tracehook_notify_signal from get_signal on all architectures
  task_work: Introduce task_work_pending
  task_work: Remove unnecessary include from posix_timers.h
  ptrace: Remove tracehook_signal_handler
  ptrace: Remove arch_syscall_{enter,exit}_tracehook
  ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
  ptrace/arm: Rename tracehook_report_syscall report_syscall
  ptrace: Move ptrace_report_syscall into ptrace.h
2022-03-28 17:29:53 -07:00
Hao Lee
3a72917ccf proc: alloc PATH_MAX bytes for /proc/${pid}/fd/ symlinks
It's not a standard approach that use __get_free_page() to alloc path
buffer directly.  We'd better use kmalloc and PATH_MAX.

	PAGE_SIZE is different on different archs. An unlinked file
	with very long canonical pathname will readlink differently
	because "(deleted)" eats into a buffer.	--adobriyan

[akpm@linux-foundation.org: remove now-unneeded cast]

Link: https://lkml.kernel.org/r/Ye1fCxyZZ0I5lgOL@localhost.localdomain
Signed-off-by: Hao Lee <haolee.swjtu@gmail.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-23 19:00:33 -07:00
Eric W. Biederman
355f841a3f tracehook: Remove tracehook.h
Now that all of the definitions have moved out of tracehook.h into
ptrace.h, sched/signal.h, resume_user_mode.h there is nothing left in
tracehook.h so remove it.

Update the few files that were depending upon tracehook.h to bring in
definitions to use the headers they need directly.

Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-13-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-10 16:51:51 -06:00
Qi Zheng
51a1873440 proc: convert the return type of proc_fd_access_allowed() to be boolean
Convert return type of proc_fd_access_allowed() and the 'allowed' in it
to be boolean since the return type of ptrace_may_access() is boolean.

Link: https://lkml.kernel.org/r/20211219024404.29779-1-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-20 08:52:52 +02:00
Linus Torvalds
59a2ceeef6 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "87 patches.

  Subsystems affected by this patch series: mm (pagecache and hugetlb),
  procfs, misc, MAINTAINERS, lib, checkpatch, binfmt, kallsyms, ramfs,
  init, codafs, nilfs2, hfs, crash_dump, signals, seq_file, fork,
  sysvfs, kcov, gdb, resource, selftests, and ipc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (87 commits)
  ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL
  ipc: check checkpoint_restore_ns_capable() to modify C/R proc files
  selftests/kselftest/runner/run_one(): allow running non-executable files
  virtio-mem: disallow mapping virtio-mem memory via /dev/mem
  kernel/resource: disallow access to exclusive system RAM regions
  kernel/resource: clean up and optimize iomem_is_exclusive()
  scripts/gdb: handle split debug for vmlinux
  kcov: replace local_irq_save() with a local_lock_t
  kcov: avoid enable+disable interrupts if !in_task()
  kcov: allocate per-CPU memory on the relevant node
  Documentation/kcov: define `ip' in the example
  Documentation/kcov: include types.h in the example
  sysv: use BUILD_BUG_ON instead of runtime check
  kernel/fork.c: unshare(): use swap() to make code cleaner
  seq_file: fix passing wrong private data
  seq_file: move seq_escape() to a header
  signal: remove duplicate include in signal.h
  crash_dump: remove duplicate include in crash_dump.h
  crash_dump: fix boolreturn.cocci warning
  hfs/hfsplus: use WARN_ON for sanity check
  ...
2021-11-09 10:11:53 -08:00