2019-05-19 12:08:55 +00:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-only
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* linux/fs/exec.c
|
|
|
|
*
|
|
|
|
* Copyright (C) 1991, 1992 Linus Torvalds
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* #!-checking implemented by tytso.
|
|
|
|
*/
|
|
|
|
/*
|
|
|
|
* Demand-loading implemented 01.12.91 - no need to read anything but
|
|
|
|
* the header into memory. The inode of the executable is put into
|
|
|
|
* "current->executable", and page faults do the actual loading. Clean.
|
|
|
|
*
|
|
|
|
* Once more I can proudly say that linux stood up to being changed: it
|
|
|
|
* was less than 2 hours work to get demand-loading completely implemented.
|
|
|
|
*
|
|
|
|
* Demand loading changed July 1993 by Eric Youngdale. Use mmap instead,
|
|
|
|
* current->executable is only used by the procfs. This allows a dispatch
|
|
|
|
* table to check for several different types of binary formats. We keep
|
|
|
|
* trying until we recognize the file or we run out of supported binary
|
2016-12-21 05:26:24 +00:00
|
|
|
* formats.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
|
|
|
|
2020-10-02 17:38:15 +00:00
|
|
|
#include <linux/kernel_read_file.h>
|
2005-04-16 22:20:36 +00:00
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/file.h>
|
2008-04-24 11:44:08 +00:00
|
|
|
#include <linux/fdtable.h>
|
2008-07-25 08:45:43 +00:00
|
|
|
#include <linux/mm.h>
|
2005-04-16 22:20:36 +00:00
|
|
|
#include <linux/stat.h>
|
|
|
|
#include <linux/fcntl.h>
|
2008-07-25 08:45:43 +00:00
|
|
|
#include <linux/swap.h>
|
2007-10-17 06:26:35 +00:00
|
|
|
#include <linux/string.h>
|
2005-04-16 22:20:36 +00:00
|
|
|
#include <linux/init.h>
|
2017-02-08 17:51:29 +00:00
|
|
|
#include <linux/sched/mm.h>
|
2017-02-08 17:51:30 +00:00
|
|
|
#include <linux/sched/coredump.h>
|
2017-02-08 17:51:30 +00:00
|
|
|
#include <linux/sched/signal.h>
|
2017-02-08 17:51:31 +00:00
|
|
|
#include <linux/sched/numa_balancing.h>
|
2017-02-08 17:51:36 +00:00
|
|
|
#include <linux/sched/task.h>
|
2008-07-28 22:46:18 +00:00
|
|
|
#include <linux/pagemap.h>
|
perf: Do the big rename: Performance Counters -> Performance Events
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has grown out its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in one, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES
for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done
FILES=$(find . -name perf_event.*)
sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\<event\>/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 10:02:48 +00:00
|
|
|
#include <linux/perf_event.h>
|
2005-04-16 22:20:36 +00:00
|
|
|
#include <linux/highmem.h>
|
|
|
|
#include <linux/spinlock.h>
|
|
|
|
#include <linux/key.h>
|
|
|
|
#include <linux/personality.h>
|
|
|
|
#include <linux/binfmts.h>
|
|
|
|
#include <linux/utsname.h>
|
2006-12-08 10:38:01 +00:00
|
|
|
#include <linux/pid_namespace.h>
|
2005-04-16 22:20:36 +00:00
|
|
|
#include <linux/module.h>
|
|
|
|
#include <linux/namei.h>
|
|
|
|
#include <linux/mount.h>
|
|
|
|
#include <linux/security.h>
|
|
|
|
#include <linux/syscalls.h>
|
2006-10-01 06:28:59 +00:00
|
|
|
#include <linux/tsacct_kern.h>
|
2005-11-07 08:59:16 +00:00
|
|
|
#include <linux/cn_proc.h>
|
2006-04-26 18:04:08 +00:00
|
|
|
#include <linux/audit.h>
|
2008-07-09 08:28:40 +00:00
|
|
|
#include <linux/kmod.h>
|
2008-12-17 18:53:20 +00:00
|
|
|
#include <linux/fsnotify.h>
|
2009-03-29 23:50:06 +00:00
|
|
|
#include <linux/fs_struct.h>
|
2010-10-26 21:21:23 +00:00
|
|
|
#include <linux/oom.h>
|
2011-03-06 17:02:54 +00:00
|
|
|
#include <linux/compat.h>
|
2015-12-28 21:02:29 +00:00
|
|
|
#include <linux/vmalloc.h>
|
2020-09-13 19:09:39 +00:00
|
|
|
#include <linux/io_uring.h>
|
kernel: Implement selective syscall userspace redirection
Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS. This is useful
for processes with parts that require syscall redirection and parts that
don't, but who need to perform this boundary crossing really fast,
without paying the cost of a system call to reconfigure syscall handling
on each boundary transition. This is particularly important for Windows
games running over Wine.
The proposed interface looks like this:
prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])
The range [<offset>,<offset>+<length>) is a part of the process memory
map that is allowed to by-pass the redirection code and dispatch
syscalls directly, such that in fast paths a process doesn't need to
disable the trap nor the kernel has to check the selector. This is
essential to return from SIGSYS to a blocked area without triggering
another SIGSYS from rt_sigreturn.
selector is an optional pointer to a char-sized userspace memory region
that has a key switch for the mechanism. This key switch is set to
either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the
redirection without calling the kernel.
The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.
Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support. I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism. And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.
For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only the direct dispatcher region to issue syscalls is
pretty much irrelevant. The overhead of using the selector goes around
40ns for a native (unredirected) syscall in my system, and it is (as
expected) dominated by the supervisor-mode user-address access. In
fact, with SMAP off, the overhead is consistently less than 5ns on my
test box.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com
2020-11-27 19:32:34 +00:00
|
|
|
#include <linux/syscall_user_dispatch.h>
|
2022-01-22 06:13:17 +00:00
|
|
|
#include <linux/coredump.h>
|
2022-09-21 00:31:19 +00:00
|
|
|
#include <linux/time_namespace.h>
|
2023-03-28 23:52:09 +00:00
|
|
|
#include <linux/user_events.h>
|
2023-12-15 20:58:20 +00:00
|
|
|
#include <linux/rseq.h>
|
2024-03-28 11:10:08 +00:00
|
|
|
#include <linux/ksm.h>
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2016-12-24 19:46:01 +00:00
|
|
|
#include <linux/uaccess.h>
|
2005-04-16 22:20:36 +00:00
|
|
|
#include <asm/mmu_context.h>
|
2007-07-19 08:48:16 +00:00
|
|
|
#include <asm/tlb.h>
|
2012-01-10 23:08:09 +00:00
|
|
|
|
|
|
|
#include <trace/events/task.h>
|
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13 23:39:24 +00:00
|
|
|
#include "internal.h"
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2012-02-07 16:11:05 +00:00
|
|
|
#include <trace/events/sched.h>
|
|
|
|
|
2020-05-30 03:00:54 +00:00
|
|
|
static int bprm_creds_from_file(struct linux_binprm *bprm);
|
|
|
|
|
2005-06-23 07:09:43 +00:00
|
|
|
int suid_dumpable = 0;
|
|
|
|
|
2007-10-17 06:26:03 +00:00
|
|
|
static LIST_HEAD(formats);
|
2005-04-16 22:20:36 +00:00
|
|
|
static DEFINE_RWLOCK(binfmt_lock);
|
|
|
|
|
2012-03-17 07:05:16 +00:00
|
|
|
void __register_binfmt(struct linux_binfmt * fmt, int insert)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
write_lock(&binfmt_lock);
|
2009-04-30 22:08:49 +00:00
|
|
|
insert ? list_add(&fmt->lh, &formats) :
|
|
|
|
list_add_tail(&fmt->lh, &formats);
|
2005-04-16 22:20:36 +00:00
|
|
|
write_unlock(&binfmt_lock);
|
|
|
|
}
|
|
|
|
|
2009-04-30 22:08:49 +00:00
|
|
|
EXPORT_SYMBOL(__register_binfmt);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2007-10-17 06:26:04 +00:00
|
|
|
void unregister_binfmt(struct linux_binfmt * fmt)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
write_lock(&binfmt_lock);
|
2007-10-17 06:26:03 +00:00
|
|
|
list_del(&fmt->lh);
|
2005-04-16 22:20:36 +00:00
|
|
|
write_unlock(&binfmt_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
EXPORT_SYMBOL(unregister_binfmt);
|
|
|
|
|
|
|
|
static inline void put_binfmt(struct linux_binfmt * fmt)
|
|
|
|
{
|
|
|
|
module_put(fmt->module);
|
|
|
|
}
|
|
|
|
|
2015-06-29 19:42:03 +00:00
|
|
|
bool path_noexec(const struct path *path)
|
|
|
|
{
|
|
|
|
return (path->mnt->mnt_flags & MNT_NOEXEC) ||
|
|
|
|
(path->mnt->mnt_sb->s_iflags & SB_I_NOEXEC);
|
|
|
|
}
|
|
|
|
|
2014-04-03 21:48:27 +00:00
|
|
|
#ifdef CONFIG_USELIB
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* Note that a shared library must be both readable and executable due to
|
|
|
|
* security reasons.
|
|
|
|
*
|
2022-02-11 16:09:40 +00:00
|
|
|
* Also note that we take the address to load from the file itself.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
2009-01-14 13:14:29 +00:00
|
|
|
SYSCALL_DEFINE1(uselib, const char __user *, library)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2013-09-22 20:27:52 +00:00
|
|
|
struct linux_binfmt *fmt;
|
2008-07-26 07:33:14 +00:00
|
|
|
struct file *file;
|
2012-10-10 19:25:28 +00:00
|
|
|
struct filename *tmp = getname(library);
|
2008-07-26 07:33:14 +00:00
|
|
|
int error = PTR_ERR(tmp);
|
2011-02-23 22:44:09 +00:00
|
|
|
static const struct open_flags uselib_flags = {
|
uselib: remove use of __FMODE_EXEC
Jann Horn points out that uselib() really shouldn't trigger the new
FMODE_EXEC logic introduced by commit 4759ff71f23e ("exec: __FMODE_EXEC
instead of in_execve for LSMs").
In fact, it shouldn't even have ever triggered the old pre-existing
logic for __FMODE_EXEC (like the NFS code that makes executables not
need read permissions). Unlike a real execve(), that can work even with
files that are purely executable by the user (not readable), uselib()
has that MAY_READ requirement becasue it's really just a convenience
wrapper around mmap() for legacy shared libraries.
The whole FMODE_EXEC bit was originally introduced by commit
b500531e6f5f ("[PATCH] Introduce FMODE_EXEC file flag"), primarily to
give ETXTBUSY error returns for distributed filesystems.
It has since grown a few other warts (like that NFS thing), but there
really isn't any reason to use it for uselib(), and now that we are
trying to use it to replace the horrid 'tsk->in_execve' flag, it's
actively wrong.
Of course, as Jann Horn also points out, nobody should be enabling
CONFIG_USELIB in the first place in this day and age, but that's a
different discussion entirely.
Reported-by: Jann Horn <jannh@google.com>
Fixes: 4759ff71f23e ("exec: __FMODE_EXEC instead of in_execve for LSMs")
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-01-24 21:12:20 +00:00
|
|
|
.open_flag = O_LARGEFILE | O_RDONLY,
|
2015-12-27 03:33:24 +00:00
|
|
|
.acc_mode = MAY_READ | MAY_EXEC,
|
2013-06-11 04:23:01 +00:00
|
|
|
.intent = LOOKUP_OPEN,
|
|
|
|
.lookup_flags = LOOKUP_FOLLOW,
|
2011-02-23 22:44:09 +00:00
|
|
|
};
|
2008-07-26 07:33:14 +00:00
|
|
|
|
2009-04-06 15:16:22 +00:00
|
|
|
if (IS_ERR(tmp))
|
|
|
|
goto out;
|
|
|
|
|
2013-06-11 04:23:01 +00:00
|
|
|
file = do_filp_open(AT_FDCWD, tmp, &uselib_flags);
|
2009-04-06 15:16:22 +00:00
|
|
|
putname(tmp);
|
|
|
|
error = PTR_ERR(file);
|
|
|
|
if (IS_ERR(file))
|
2005-04-16 22:20:36 +00:00
|
|
|
goto out;
|
|
|
|
|
exec: move S_ISREG() check earlier
The execve(2)/uselib(2) syscalls have always rejected non-regular files.
Recently, it was noticed that a deadlock was introduced when trying to
execute pipes, as the S_ISREG() test was happening too late. This was
fixed in commit 73601ea5b7b1 ("fs/open.c: allow opening only regular files
during execve()"), but it was added after inode_permission() had already
run, which meant LSMs could see bogus attempts to execute non-regular
files.
Move the test into the other inode type checks (which already look for
other pathological conditions[1]). Since there is no need to use
FMODE_EXEC while we still have access to "acc_mode", also switch the test
to MAY_EXEC.
Also include a comment with the redundant S_ISREG() checks at the end of
execve(2)/uselib(2) to note that they are present to avoid any mistakes.
My notes on the call path, and related arguments, checks, etc:
do_open_execat()
struct open_flags open_exec_flags = {
.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
.acc_mode = MAY_EXEC,
...
do_filp_open(dfd, filename, open_flags)
path_openat(nameidata, open_flags, flags)
file = alloc_empty_file(open_flags, current_cred());
do_open(nameidata, file, open_flags)
may_open(path, acc_mode, open_flag)
/* new location of MAY_EXEC vs S_ISREG() test */
inode_permission(inode, MAY_OPEN | acc_mode)
security_inode_permission(inode, acc_mode)
vfs_open(path, file)
do_dentry_open(file, path->dentry->d_inode, open)
/* old location of FMODE_EXEC vs S_ISREG() test */
security_file_open(f)
open()
[1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 01:36:26 +00:00
|
|
|
/*
|
2024-08-05 13:17:21 +00:00
|
|
|
* Check do_open_execat() for an explanation.
|
exec: move S_ISREG() check earlier
The execve(2)/uselib(2) syscalls have always rejected non-regular files.
Recently, it was noticed that a deadlock was introduced when trying to
execute pipes, as the S_ISREG() test was happening too late. This was
fixed in commit 73601ea5b7b1 ("fs/open.c: allow opening only regular files
during execve()"), but it was added after inode_permission() had already
run, which meant LSMs could see bogus attempts to execute non-regular
files.
Move the test into the other inode type checks (which already look for
other pathological conditions[1]). Since there is no need to use
FMODE_EXEC while we still have access to "acc_mode", also switch the test
to MAY_EXEC.
Also include a comment with the redundant S_ISREG() checks at the end of
execve(2)/uselib(2) to note that they are present to avoid any mistakes.
My notes on the call path, and related arguments, checks, etc:
do_open_execat()
struct open_flags open_exec_flags = {
.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
.acc_mode = MAY_EXEC,
...
do_filp_open(dfd, filename, open_flags)
path_openat(nameidata, open_flags, flags)
file = alloc_empty_file(open_flags, current_cred());
do_open(nameidata, file, open_flags)
may_open(path, acc_mode, open_flag)
/* new location of MAY_EXEC vs S_ISREG() test */
inode_permission(inode, MAY_OPEN | acc_mode)
security_inode_permission(inode, acc_mode)
vfs_open(path, file)
do_dentry_open(file, path->dentry->d_inode, open)
/* old location of FMODE_EXEC vs S_ISREG() test */
security_file_open(f)
open()
[1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 01:36:26 +00:00
|
|
|
*/
|
2020-08-12 01:36:23 +00:00
|
|
|
error = -EACCES;
|
2024-08-05 13:17:21 +00:00
|
|
|
if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode)) ||
|
|
|
|
path_noexec(&file->f_path))
|
2005-04-16 22:20:36 +00:00
|
|
|
goto exit;
|
|
|
|
|
|
|
|
error = -ENOEXEC;
|
|
|
|
|
2013-09-22 20:27:52 +00:00
|
|
|
read_lock(&binfmt_lock);
|
|
|
|
list_for_each_entry(fmt, &formats, lh) {
|
|
|
|
if (!fmt->load_shlib)
|
|
|
|
continue;
|
|
|
|
if (!try_module_get(fmt->module))
|
|
|
|
continue;
|
2005-04-16 22:20:36 +00:00
|
|
|
read_unlock(&binfmt_lock);
|
2013-09-22 20:27:52 +00:00
|
|
|
error = fmt->load_shlib(file);
|
|
|
|
read_lock(&binfmt_lock);
|
|
|
|
put_binfmt(fmt);
|
|
|
|
if (error != -ENOEXEC)
|
|
|
|
break;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2013-09-22 20:27:52 +00:00
|
|
|
read_unlock(&binfmt_lock);
|
2009-04-06 15:16:22 +00:00
|
|
|
exit:
|
2005-04-16 22:20:36 +00:00
|
|
|
fput(file);
|
|
|
|
out:
|
2022-10-18 07:14:20 +00:00
|
|
|
return error;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2014-04-03 21:48:27 +00:00
|
|
|
#endif /* #ifdef CONFIG_USELIB */
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2007-07-19 08:48:16 +00:00
|
|
|
#ifdef CONFIG_MMU
|
2011-03-06 17:03:11 +00:00
|
|
|
/*
|
|
|
|
* The nascent bprm->mm is not visible until exec_mmap() but it can
|
|
|
|
* use a lot of memory, account these pages in current->mm temporary
|
|
|
|
* for oom_badness()->get_mm_rss(). Once exec succeeds or fails, we
|
|
|
|
* change the counter back via acct_arg_size(0).
|
|
|
|
*/
|
2011-03-06 17:02:54 +00:00
|
|
|
static void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
|
2010-11-30 19:55:34 +00:00
|
|
|
{
|
|
|
|
struct mm_struct *mm = current->mm;
|
|
|
|
long diff = (long)(pages - bprm->vma_pages);
|
|
|
|
|
|
|
|
if (!mm || !diff)
|
|
|
|
return;
|
|
|
|
|
|
|
|
bprm->vma_pages = pages;
|
|
|
|
add_mm_counter(mm, MM_ANONPAGES, diff);
|
|
|
|
}
|
|
|
|
|
2011-03-06 17:02:54 +00:00
|
|
|
static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
|
2007-07-19 08:48:16 +00:00
|
|
|
int write)
|
|
|
|
{
|
|
|
|
struct page *page;
|
2023-06-19 18:34:15 +00:00
|
|
|
struct vm_area_struct *vma = bprm->vma;
|
|
|
|
struct mm_struct *mm = bprm->mm;
|
2007-07-19 08:48:16 +00:00
|
|
|
int ret;
|
|
|
|
|
2023-06-19 18:34:15 +00:00
|
|
|
/*
|
|
|
|
* Avoid relying on expanding the stack down in GUP (which
|
|
|
|
* does not work for STACK_GROWSUP anyway), and just do it
|
|
|
|
* by hand ahead of time.
|
|
|
|
*/
|
|
|
|
if (write && pos < vma->vm_start) {
|
|
|
|
mmap_write_lock(mm);
|
2023-06-24 20:45:51 +00:00
|
|
|
ret = expand_downwards(vma, pos);
|
2023-06-19 18:34:15 +00:00
|
|
|
if (unlikely(ret < 0)) {
|
|
|
|
mmap_write_unlock(mm);
|
2007-07-19 08:48:16 +00:00
|
|
|
return NULL;
|
2023-06-19 18:34:15 +00:00
|
|
|
}
|
|
|
|
mmap_write_downgrade(mm);
|
|
|
|
} else
|
|
|
|
mmap_read_lock(mm);
|
2016-10-13 00:20:17 +00:00
|
|
|
|
2016-02-12 21:01:54 +00:00
|
|
|
/*
|
|
|
|
* We are doing an exec(). 'current' is the process
|
2023-06-19 18:34:15 +00:00
|
|
|
* doing the exec and 'mm' is the new process's mm.
|
2016-02-12 21:01:54 +00:00
|
|
|
*/
|
2023-06-19 18:34:15 +00:00
|
|
|
ret = get_user_pages_remote(mm, pos, 1,
|
|
|
|
write ? FOLL_WRITE : 0,
|
2023-05-17 19:25:39 +00:00
|
|
|
&page, NULL);
|
2023-06-19 18:34:15 +00:00
|
|
|
mmap_read_unlock(mm);
|
2007-07-19 08:48:16 +00:00
|
|
|
if (ret <= 0)
|
|
|
|
return NULL;
|
|
|
|
|
2019-01-03 23:28:11 +00:00
|
|
|
if (write)
|
2023-06-19 18:34:15 +00:00
|
|
|
acct_arg_size(bprm, vma_pages(vma));
|
2007-07-19 08:48:16 +00:00
|
|
|
|
|
|
|
return page;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void put_arg_page(struct page *page)
|
|
|
|
{
|
|
|
|
put_page(page);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void free_arg_pages(struct linux_binprm *bprm)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static void flush_arg_page(struct linux_binprm *bprm, unsigned long pos,
|
|
|
|
struct page *page)
|
|
|
|
{
|
|
|
|
flush_cache_page(bprm->vma, pos, page_to_pfn(page));
|
|
|
|
}
|
|
|
|
|
|
|
|
static int __bprm_mm_init(struct linux_binprm *bprm)
|
|
|
|
{
|
2009-01-06 22:40:44 +00:00
|
|
|
int err;
|
2007-07-19 08:48:16 +00:00
|
|
|
struct vm_area_struct *vma = NULL;
|
|
|
|
struct mm_struct *mm = bprm->mm;
|
|
|
|
|
2018-07-21 22:24:03 +00:00
|
|
|
bprm->vma = vma = vm_area_alloc(mm);
|
2007-07-19 08:48:16 +00:00
|
|
|
if (!vma)
|
2009-01-06 22:40:44 +00:00
|
|
|
return -ENOMEM;
|
mm: fix vma_is_anonymous() false-positives
vma_is_anonymous() relies on ->vm_ops being NULL to detect anonymous
VMA. This is unreliable as ->mmap may not set ->vm_ops.
False-positive vma_is_anonymous() may lead to crashes:
next ffff8801ce5e7040 prev ffff8801d20eca50 mm ffff88019c1e13c0
prot 27 anon_vma ffff88019680cdd8 vm_ops 0000000000000000
pgoff 0 file ffff8801b2ec2d00 private_data 0000000000000000
flags: 0xff(read|write|exec|shared|mayread|maywrite|mayexec|mayshare)
------------[ cut here ]------------
kernel BUG at mm/memory.c:1422!
invalid opcode: 0000 [#1] SMP KASAN
CPU: 0 PID: 18486 Comm: syz-executor3 Not tainted 4.18.0-rc3+ #136
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
01/01/2011
RIP: 0010:zap_pmd_range mm/memory.c:1421 [inline]
RIP: 0010:zap_pud_range mm/memory.c:1466 [inline]
RIP: 0010:zap_p4d_range mm/memory.c:1487 [inline]
RIP: 0010:unmap_page_range+0x1c18/0x2220 mm/memory.c:1508
Call Trace:
unmap_single_vma+0x1a0/0x310 mm/memory.c:1553
zap_page_range_single+0x3cc/0x580 mm/memory.c:1644
unmap_mapping_range_vma mm/memory.c:2792 [inline]
unmap_mapping_range_tree mm/memory.c:2813 [inline]
unmap_mapping_pages+0x3a7/0x5b0 mm/memory.c:2845
unmap_mapping_range+0x48/0x60 mm/memory.c:2880
truncate_pagecache+0x54/0x90 mm/truncate.c:800
truncate_setsize+0x70/0xb0 mm/truncate.c:826
simple_setattr+0xe9/0x110 fs/libfs.c:409
notify_change+0xf13/0x10f0 fs/attr.c:335
do_truncate+0x1ac/0x2b0 fs/open.c:63
do_sys_ftruncate+0x492/0x560 fs/open.c:205
__do_sys_ftruncate fs/open.c:215 [inline]
__se_sys_ftruncate fs/open.c:213 [inline]
__x64_sys_ftruncate+0x59/0x80 fs/open.c:213
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Reproducer:
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>
#define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
#define KCOV_ENABLE _IO('c', 100)
#define KCOV_DISABLE _IO('c', 101)
#define COVER_SIZE (1024<<10)
#define KCOV_TRACE_PC 0
#define KCOV_TRACE_CMP 1
int main(int argc, char **argv)
{
int fd;
unsigned long *cover;
system("mount -t debugfs none /sys/kernel/debug");
fd = open("/sys/kernel/debug/kcov", O_RDWR);
ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
munmap(cover, COVER_SIZE * sizeof(unsigned long));
cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
memset(cover, 0, COVER_SIZE * sizeof(unsigned long));
ftruncate(fd, 3UL << 20);
return 0;
}
This can be fixed by assigning anonymous VMAs own vm_ops and not relying
on it being NULL.
If ->mmap() failed to set ->vm_ops, mmap_region() will set it to
dummy_vm_ops. This way we will have non-NULL ->vm_ops for all VMAs.
Link: http://lkml.kernel.org/r/20180724121139.62570-4-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: syzbot+3f84280d52be9b7083cc@syzkaller.appspotmail.com
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-26 23:37:35 +00:00
|
|
|
vma_set_anonymous(vma);
|
2007-07-19 08:48:16 +00:00
|
|
|
|
2020-06-09 04:33:25 +00:00
|
|
|
if (mmap_write_lock_killable(mm)) {
|
2016-05-23 23:26:02 +00:00
|
|
|
err = -EINTR;
|
|
|
|
goto err_free;
|
|
|
|
}
|
2007-07-19 08:48:16 +00:00
|
|
|
|
2024-03-28 11:10:08 +00:00
|
|
|
/*
|
|
|
|
* Need to be called with mmap write lock
|
|
|
|
* held, to avoid race with ksmd.
|
|
|
|
*/
|
|
|
|
err = ksm_execve(mm);
|
|
|
|
if (err)
|
|
|
|
goto err_ksm;
|
|
|
|
|
2007-07-19 08:48:16 +00:00
|
|
|
/*
|
|
|
|
* Place the stack at the largest stack address the architecture
|
|
|
|
* supports. Later, we'll move this to an appropriate place. We don't
|
|
|
|
* use STACK_TOP because that can depend on attributes which aren't
|
|
|
|
* configured yet.
|
|
|
|
*/
|
2011-07-26 23:08:40 +00:00
|
|
|
BUILD_BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP);
|
2007-07-19 08:48:16 +00:00
|
|
|
vma->vm_end = STACK_TOP_MAX;
|
|
|
|
vma->vm_start = vma->vm_end - PAGE_SIZE;
|
2023-01-26 19:37:49 +00:00
|
|
|
vm_flags_init(vma, VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP);
|
2007-10-19 06:39:15 +00:00
|
|
|
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
|
2010-12-09 14:29:42 +00:00
|
|
|
|
2007-07-19 08:48:16 +00:00
|
|
|
err = insert_vm_struct(mm, vma);
|
2009-01-06 22:40:44 +00:00
|
|
|
if (err)
|
2007-07-19 08:48:16 +00:00
|
|
|
goto err;
|
|
|
|
|
|
|
|
mm->stack_vm = mm->total_vm = 1;
|
2020-06-09 04:33:25 +00:00
|
|
|
mmap_write_unlock(mm);
|
2007-07-19 08:48:16 +00:00
|
|
|
bprm->p = vma->vm_end - sizeof(void *);
|
|
|
|
return 0;
|
|
|
|
err:
|
2024-03-28 11:10:08 +00:00
|
|
|
ksm_exit(mm);
|
|
|
|
err_ksm:
|
2020-06-09 04:33:25 +00:00
|
|
|
mmap_write_unlock(mm);
|
2016-05-23 23:26:02 +00:00
|
|
|
err_free:
|
2009-01-06 22:40:44 +00:00
|
|
|
bprm->vma = NULL;
|
2018-07-21 20:48:51 +00:00
|
|
|
vm_area_free(vma);
|
2007-07-19 08:48:16 +00:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool valid_arg_len(struct linux_binprm *bprm, long len)
|
|
|
|
{
|
|
|
|
return len <= MAX_ARG_STRLEN;
|
|
|
|
}
|
|
|
|
|
|
|
|
#else
|
|
|
|
|
2011-03-06 17:02:54 +00:00
|
|
|
static inline void acct_arg_size(struct linux_binprm *bprm, unsigned long pages)
|
2010-11-30 19:55:34 +00:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2011-03-06 17:02:54 +00:00
|
|
|
static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
|
2007-07-19 08:48:16 +00:00
|
|
|
int write)
|
|
|
|
{
|
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
page = bprm->page[pos / PAGE_SIZE];
|
|
|
|
if (!page && write) {
|
|
|
|
page = alloc_page(GFP_HIGHUSER|__GFP_ZERO);
|
|
|
|
if (!page)
|
|
|
|
return NULL;
|
|
|
|
bprm->page[pos / PAGE_SIZE] = page;
|
|
|
|
}
|
|
|
|
|
|
|
|
return page;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void put_arg_page(struct page *page)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static void free_arg_page(struct linux_binprm *bprm, int i)
|
|
|
|
{
|
|
|
|
if (bprm->page[i]) {
|
|
|
|
__free_page(bprm->page[i]);
|
|
|
|
bprm->page[i] = NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void free_arg_pages(struct linux_binprm *bprm)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < MAX_ARG_PAGES; i++)
|
|
|
|
free_arg_page(bprm, i);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void flush_arg_page(struct linux_binprm *bprm, unsigned long pos,
|
|
|
|
struct page *page)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static int __bprm_mm_init(struct linux_binprm *bprm)
|
|
|
|
{
|
|
|
|
bprm->p = PAGE_SIZE * MAX_ARG_PAGES - sizeof(void *);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool valid_arg_len(struct linux_binprm *bprm, long len)
|
|
|
|
{
|
|
|
|
return len <= bprm->p;
|
|
|
|
}
|
|
|
|
|
|
|
|
#endif /* CONFIG_MMU */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Create a new mm_struct and populate it with a temporary stack
|
|
|
|
* vm_area_struct. We don't have enough context at this point to set the stack
|
|
|
|
* flags, permissions, and offset, so we use temporary values. We'll update
|
|
|
|
* them later in setup_arg_pages().
|
|
|
|
*/
|
2013-02-20 02:16:01 +00:00
|
|
|
static int bprm_mm_init(struct linux_binprm *bprm)
|
2007-07-19 08:48:16 +00:00
|
|
|
{
|
|
|
|
int err;
|
|
|
|
struct mm_struct *mm = NULL;
|
|
|
|
|
|
|
|
bprm->mm = mm = mm_alloc();
|
|
|
|
err = -ENOMEM;
|
|
|
|
if (!mm)
|
|
|
|
goto err;
|
|
|
|
|
2018-04-10 23:35:01 +00:00
|
|
|
/* Save current stack limit for all calculations made during exec. */
|
|
|
|
task_lock(current->group_leader);
|
|
|
|
bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK];
|
|
|
|
task_unlock(current->group_leader);
|
|
|
|
|
2007-07-19 08:48:16 +00:00
|
|
|
err = __bprm_mm_init(bprm);
|
|
|
|
if (err)
|
|
|
|
goto err;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
err:
|
|
|
|
if (mm) {
|
|
|
|
bprm->mm = NULL;
|
|
|
|
mmdrop(mm);
|
|
|
|
}
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2011-03-06 17:02:37 +00:00
|
|
|
struct user_arg_ptr {
|
2011-03-06 17:02:54 +00:00
|
|
|
#ifdef CONFIG_COMPAT
|
|
|
|
bool is_compat;
|
|
|
|
#endif
|
|
|
|
union {
|
|
|
|
const char __user *const __user *native;
|
|
|
|
#ifdef CONFIG_COMPAT
|
2012-09-30 17:38:55 +00:00
|
|
|
const compat_uptr_t __user *compat;
|
2011-03-06 17:02:54 +00:00
|
|
|
#endif
|
|
|
|
} ptr;
|
2011-03-06 17:02:37 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
static const char __user *get_user_arg_ptr(struct user_arg_ptr argv, int nr)
|
2011-03-06 17:02:21 +00:00
|
|
|
{
|
2011-03-06 17:02:54 +00:00
|
|
|
const char __user *native;
|
|
|
|
|
|
|
|
#ifdef CONFIG_COMPAT
|
|
|
|
if (unlikely(argv.is_compat)) {
|
|
|
|
compat_uptr_t compat;
|
|
|
|
|
|
|
|
if (get_user(compat, argv.ptr.compat + nr))
|
|
|
|
return ERR_PTR(-EFAULT);
|
2011-03-06 17:02:21 +00:00
|
|
|
|
2011-03-06 17:02:54 +00:00
|
|
|
return compat_ptr(compat);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
if (get_user(native, argv.ptr.native + nr))
|
2011-03-06 17:02:21 +00:00
|
|
|
return ERR_PTR(-EFAULT);
|
|
|
|
|
2011-03-06 17:02:54 +00:00
|
|
|
return native;
|
2011-03-06 17:02:21 +00:00
|
|
|
}
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* count() counts the number of strings in array ARGV.
|
|
|
|
*/
|
2011-03-06 17:02:37 +00:00
|
|
|
static int count(struct user_arg_ptr argv, int max)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
int i = 0;
|
|
|
|
|
2011-03-06 17:02:54 +00:00
|
|
|
if (argv.ptr.native != NULL) {
|
2005-04-16 22:20:36 +00:00
|
|
|
for (;;) {
|
2011-03-06 17:02:21 +00:00
|
|
|
const char __user *p = get_user_arg_ptr(argv, i);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (!p)
|
|
|
|
break;
|
2011-03-06 17:02:21 +00:00
|
|
|
|
|
|
|
if (IS_ERR(p))
|
|
|
|
return -EFAULT;
|
|
|
|
|
2013-01-11 22:31:48 +00:00
|
|
|
if (i >= max)
|
2005-04-16 22:20:36 +00:00
|
|
|
return -E2BIG;
|
2013-01-11 22:31:48 +00:00
|
|
|
++i;
|
2010-09-08 02:37:06 +00:00
|
|
|
|
|
|
|
if (fatal_signal_pending(current))
|
|
|
|
return -ERESTARTNOHAND;
|
2005-04-16 22:20:36 +00:00
|
|
|
cond_resched();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return i;
|
|
|
|
}
|
|
|
|
|
2020-07-13 17:06:48 +00:00
|
|
|
static int count_strings_kernel(const char *const *argv)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (!argv)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
for (i = 0; argv[i]; ++i) {
|
|
|
|
if (i >= MAX_ARG_STRINGS)
|
|
|
|
return -E2BIG;
|
|
|
|
if (fatal_signal_pending(current))
|
|
|
|
return -ERESTARTNOHAND;
|
|
|
|
cond_resched();
|
|
|
|
}
|
|
|
|
return i;
|
|
|
|
}
|
|
|
|
|
2024-06-21 20:50:43 +00:00
|
|
|
static inline int bprm_set_stack_limit(struct linux_binprm *bprm,
|
|
|
|
unsigned long limit)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_MMU
|
2024-06-21 20:50:44 +00:00
|
|
|
/* Avoid a pathological bprm->p. */
|
|
|
|
if (bprm->p < limit)
|
|
|
|
return -E2BIG;
|
2024-06-21 20:50:43 +00:00
|
|
|
bprm->argmin = bprm->p - limit;
|
|
|
|
#endif
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline bool bprm_hit_stack_limit(struct linux_binprm *bprm)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_MMU
|
|
|
|
return bprm->p < bprm->argmin;
|
|
|
|
#else
|
|
|
|
return false;
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2024-05-20 00:17:59 +00:00
|
|
|
/*
|
|
|
|
* Calculate bprm->argmin from:
|
|
|
|
* - _STK_LIM
|
|
|
|
* - ARG_MAX
|
|
|
|
* - bprm->rlim_stack.rlim_cur
|
|
|
|
* - bprm->argc
|
|
|
|
* - bprm->envc
|
|
|
|
* - bprm->p
|
|
|
|
*/
|
2020-07-12 13:23:54 +00:00
|
|
|
static int bprm_stack_limits(struct linux_binprm *bprm)
|
2019-01-03 23:28:11 +00:00
|
|
|
{
|
|
|
|
unsigned long limit, ptr_size;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Limit to 1/4 of the max stack size or 3/4 of _STK_LIM
|
|
|
|
* (whichever is smaller) for the argv+env strings.
|
|
|
|
* This ensures that:
|
|
|
|
* - the remaining binfmt code will not run out of stack space,
|
|
|
|
* - the program will have a reasonable amount of stack left
|
|
|
|
* to work from.
|
|
|
|
*/
|
|
|
|
limit = _STK_LIM / 4 * 3;
|
|
|
|
limit = min(limit, bprm->rlim_stack.rlim_cur / 4);
|
|
|
|
/*
|
|
|
|
* We've historically supported up to 32 pages (ARG_MAX)
|
|
|
|
* of argument strings even with small stacks
|
|
|
|
*/
|
|
|
|
limit = max_t(unsigned long, limit, ARG_MAX);
|
2024-06-21 20:50:44 +00:00
|
|
|
/* Reject totally pathological counts. */
|
|
|
|
if (bprm->argc < 0 || bprm->envc < 0)
|
|
|
|
return -E2BIG;
|
2019-01-03 23:28:11 +00:00
|
|
|
/*
|
|
|
|
* We must account for the size of all the argv and envp pointers to
|
|
|
|
* the argv and envp strings, since they will also take up space in
|
|
|
|
* the stack. They aren't stored until much later when we can't
|
|
|
|
* signal to the parent that the child has run out of stack space.
|
|
|
|
* Instead, calculate it here so it's possible to fail gracefully.
|
2022-02-01 00:09:47 +00:00
|
|
|
*
|
|
|
|
* In the case of argc = 0, make sure there is space for adding a
|
|
|
|
* empty string (which will bump argc to 1), to ensure confused
|
|
|
|
* userspace programs don't start processing from argv[1], thinking
|
|
|
|
* argc can never be 0, to keep them from walking envp by accident.
|
|
|
|
* See do_execveat_common().
|
2019-01-03 23:28:11 +00:00
|
|
|
*/
|
2024-06-21 20:50:44 +00:00
|
|
|
if (check_add_overflow(max(bprm->argc, 1), bprm->envc, &ptr_size) ||
|
|
|
|
check_mul_overflow(ptr_size, sizeof(void *), &ptr_size))
|
|
|
|
return -E2BIG;
|
2019-01-03 23:28:11 +00:00
|
|
|
if (limit <= ptr_size)
|
|
|
|
return -E2BIG;
|
|
|
|
limit -= ptr_size;
|
|
|
|
|
2024-06-21 20:50:43 +00:00
|
|
|
return bprm_set_stack_limit(bprm, limit);
|
2019-01-03 23:28:11 +00:00
|
|
|
}
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
2007-07-19 08:48:16 +00:00
|
|
|
* 'copy_strings()' copies argument/environment strings from the old
|
|
|
|
* processes's memory to the new process's stack. The call to get_user_pages()
|
|
|
|
* ensures the destination page is created and not swapped out.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
2011-03-06 17:02:37 +00:00
|
|
|
static int copy_strings(int argc, struct user_arg_ptr argv,
|
2005-05-05 23:16:09 +00:00
|
|
|
struct linux_binprm *bprm)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
struct page *kmapped_page = NULL;
|
|
|
|
char *kaddr = NULL;
|
2007-07-19 08:48:16 +00:00
|
|
|
unsigned long kpos = 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
while (argc-- > 0) {
|
2010-08-17 22:52:56 +00:00
|
|
|
const char __user *str;
|
2005-04-16 22:20:36 +00:00
|
|
|
int len;
|
|
|
|
unsigned long pos;
|
|
|
|
|
2011-03-06 17:02:21 +00:00
|
|
|
ret = -EFAULT;
|
|
|
|
str = get_user_arg_ptr(argv, argc);
|
|
|
|
if (IS_ERR(str))
|
2005-04-16 22:20:36 +00:00
|
|
|
goto out;
|
|
|
|
|
2011-03-06 17:02:21 +00:00
|
|
|
len = strnlen_user(str, MAX_ARG_STRLEN);
|
|
|
|
if (!len)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = -E2BIG;
|
|
|
|
if (!valid_arg_len(bprm, len))
|
2005-04-16 22:20:36 +00:00
|
|
|
goto out;
|
|
|
|
|
2022-02-11 16:09:40 +00:00
|
|
|
/* We're going to work our way backwards. */
|
2005-04-16 22:20:36 +00:00
|
|
|
pos = bprm->p;
|
2007-07-19 08:48:16 +00:00
|
|
|
str += len;
|
|
|
|
bprm->p -= len;
|
2024-06-21 20:50:43 +00:00
|
|
|
if (bprm_hit_stack_limit(bprm))
|
2019-01-03 23:28:11 +00:00
|
|
|
goto out;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
while (len > 0) {
|
|
|
|
int offset, bytes_to_copy;
|
|
|
|
|
2010-09-08 02:37:06 +00:00
|
|
|
if (fatal_signal_pending(current)) {
|
|
|
|
ret = -ERESTARTNOHAND;
|
|
|
|
goto out;
|
|
|
|
}
|
2010-09-08 02:36:28 +00:00
|
|
|
cond_resched();
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
offset = pos % PAGE_SIZE;
|
2007-07-19 08:48:16 +00:00
|
|
|
if (offset == 0)
|
|
|
|
offset = PAGE_SIZE;
|
|
|
|
|
|
|
|
bytes_to_copy = offset;
|
|
|
|
if (bytes_to_copy > len)
|
|
|
|
bytes_to_copy = len;
|
|
|
|
|
|
|
|
offset -= bytes_to_copy;
|
|
|
|
pos -= bytes_to_copy;
|
|
|
|
str -= bytes_to_copy;
|
|
|
|
len -= bytes_to_copy;
|
|
|
|
|
|
|
|
if (!kmapped_page || kpos != (pos & PAGE_MASK)) {
|
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
page = get_arg_page(bprm, pos, 1);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (!page) {
|
2007-07-19 08:48:16 +00:00
|
|
|
ret = -E2BIG;
|
2005-04-16 22:20:36 +00:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2007-07-19 08:48:16 +00:00
|
|
|
if (kmapped_page) {
|
2021-09-02 21:56:36 +00:00
|
|
|
flush_dcache_page(kmapped_page);
|
exec: Replace kmap{,_atomic}() with kmap_local_page()
The use of kmap() and kmap_atomic() are being deprecated in favor of
kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and are still valid.
Since the use of kmap_local_page() in exec.c is safe, it should be
preferred everywhere in exec.c.
As said, since kmap_local_page() can be also called from atomic context,
and since remove_arg_zero() doesn't (and shouldn't ever) rely on an
implicit preempt_disable(), this function can also safely replace
kmap_atomic().
Therefore, replace kmap() and kmap_atomic() with kmap_local_page() in
fs/exec.c.
Tested with xfstests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220803182856.28246-1-fmdefrancesco@gmail.com
2022-08-03 18:28:56 +00:00
|
|
|
kunmap_local(kaddr);
|
2007-07-19 08:48:16 +00:00
|
|
|
put_arg_page(kmapped_page);
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
kmapped_page = page;
|
exec: Replace kmap{,_atomic}() with kmap_local_page()
The use of kmap() and kmap_atomic() are being deprecated in favor of
kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and are still valid.
Since the use of kmap_local_page() in exec.c is safe, it should be
preferred everywhere in exec.c.
As said, since kmap_local_page() can be also called from atomic context,
and since remove_arg_zero() doesn't (and shouldn't ever) rely on an
implicit preempt_disable(), this function can also safely replace
kmap_atomic().
Therefore, replace kmap() and kmap_atomic() with kmap_local_page() in
fs/exec.c.
Tested with xfstests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220803182856.28246-1-fmdefrancesco@gmail.com
2022-08-03 18:28:56 +00:00
|
|
|
kaddr = kmap_local_page(kmapped_page);
|
2007-07-19 08:48:16 +00:00
|
|
|
kpos = pos & PAGE_MASK;
|
|
|
|
flush_arg_page(bprm, kpos, kmapped_page);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2007-07-19 08:48:16 +00:00
|
|
|
if (copy_from_user(kaddr+offset, str, bytes_to_copy)) {
|
2005-04-16 22:20:36 +00:00
|
|
|
ret = -EFAULT;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
ret = 0;
|
|
|
|
out:
|
2007-07-19 08:48:16 +00:00
|
|
|
if (kmapped_page) {
|
2021-09-02 21:56:36 +00:00
|
|
|
flush_dcache_page(kmapped_page);
|
exec: Replace kmap{,_atomic}() with kmap_local_page()
The use of kmap() and kmap_atomic() are being deprecated in favor of
kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and are still valid.
Since the use of kmap_local_page() in exec.c is safe, it should be
preferred everywhere in exec.c.
As said, since kmap_local_page() can be also called from atomic context,
and since remove_arg_zero() doesn't (and shouldn't ever) rely on an
implicit preempt_disable(), this function can also safely replace
kmap_atomic().
Therefore, replace kmap() and kmap_atomic() with kmap_local_page() in
fs/exec.c.
Tested with xfstests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220803182856.28246-1-fmdefrancesco@gmail.com
2022-08-03 18:28:56 +00:00
|
|
|
kunmap_local(kaddr);
|
2007-07-19 08:48:16 +00:00
|
|
|
put_arg_page(kmapped_page);
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2020-06-04 23:51:14 +00:00
|
|
|
* Copy and argument/environment string from the kernel to the processes stack.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
2020-06-04 23:51:14 +00:00
|
|
|
int copy_string_kernel(const char *arg, struct linux_binprm *bprm)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2020-06-04 23:51:18 +00:00
|
|
|
int len = strnlen(arg, MAX_ARG_STRLEN) + 1 /* terminating NUL */;
|
|
|
|
unsigned long pos = bprm->p;
|
|
|
|
|
|
|
|
if (len == 0)
|
|
|
|
return -EFAULT;
|
|
|
|
if (!valid_arg_len(bprm, len))
|
|
|
|
return -E2BIG;
|
|
|
|
|
|
|
|
/* We're going to work our way backwards. */
|
|
|
|
arg += len;
|
|
|
|
bprm->p -= len;
|
2024-06-21 20:50:43 +00:00
|
|
|
if (bprm_hit_stack_limit(bprm))
|
2020-06-04 23:51:18 +00:00
|
|
|
return -E2BIG;
|
2011-03-06 17:02:37 +00:00
|
|
|
|
2020-06-04 23:51:18 +00:00
|
|
|
while (len > 0) {
|
|
|
|
unsigned int bytes_to_copy = min_t(unsigned int, len,
|
|
|
|
min_not_zero(offset_in_page(pos), PAGE_SIZE));
|
|
|
|
struct page *page;
|
2011-03-06 17:02:37 +00:00
|
|
|
|
2020-06-04 23:51:18 +00:00
|
|
|
pos -= bytes_to_copy;
|
|
|
|
arg -= bytes_to_copy;
|
|
|
|
len -= bytes_to_copy;
|
2011-03-06 17:02:37 +00:00
|
|
|
|
2020-06-04 23:51:18 +00:00
|
|
|
page = get_arg_page(bprm, pos, 1);
|
|
|
|
if (!page)
|
|
|
|
return -E2BIG;
|
|
|
|
flush_arg_page(bprm, pos & PAGE_MASK, page);
|
2022-07-24 21:25:23 +00:00
|
|
|
memcpy_to_page(page, offset_in_page(pos), arg, bytes_to_copy);
|
2020-06-04 23:51:18 +00:00
|
|
|
put_arg_page(page);
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2020-06-04 23:51:14 +00:00
|
|
|
EXPORT_SYMBOL(copy_string_kernel);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2020-07-13 17:06:48 +00:00
|
|
|
static int copy_strings_kernel(int argc, const char *const *argv,
|
|
|
|
struct linux_binprm *bprm)
|
|
|
|
{
|
|
|
|
while (argc-- > 0) {
|
|
|
|
int ret = copy_string_kernel(argv[argc], bprm);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
if (fatal_signal_pending(current))
|
|
|
|
return -ERESTARTNOHAND;
|
|
|
|
cond_resched();
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
#ifdef CONFIG_MMU
|
2007-07-19 08:48:16 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Finalizes the stack vm_area_struct. The flags and permissions are updated,
|
|
|
|
* the stack is optionally relocated, and some extra space is added.
|
|
|
|
*/
|
2005-04-16 22:20:36 +00:00
|
|
|
int setup_arg_pages(struct linux_binprm *bprm,
|
|
|
|
unsigned long stack_top,
|
|
|
|
int executable_stack)
|
|
|
|
{
|
2007-07-19 08:48:16 +00:00
|
|
|
unsigned long ret;
|
|
|
|
unsigned long stack_shift;
|
2005-04-16 22:20:36 +00:00
|
|
|
struct mm_struct *mm = current->mm;
|
2007-07-19 08:48:16 +00:00
|
|
|
struct vm_area_struct *vma = bprm->vma;
|
|
|
|
struct vm_area_struct *prev = NULL;
|
|
|
|
unsigned long vm_flags;
|
|
|
|
unsigned long stack_base;
|
2010-02-10 21:56:42 +00:00
|
|
|
unsigned long stack_size;
|
|
|
|
unsigned long stack_expand;
|
|
|
|
unsigned long rlim_stack;
|
mm/mprotect: use mmu_gather
Patch series "mm/mprotect: avoid unnecessary TLB flushes", v6.
This patchset is intended to remove unnecessary TLB flushes during
mprotect() syscalls. Once this patch-set make it through, similar and
further optimizations for MADV_COLD and userfaultfd would be possible.
Basically, there are 3 optimizations in this patch-set:
1. Use TLB batching infrastructure to batch flushes across VMAs and do
better/fewer flushes. This would also be handy for later userfaultfd
enhancements.
2. Avoid unnecessary TLB flushes. This optimization is the one that
provides most of the performance benefits. Unlike previous versions,
we now only avoid flushes that would not result in spurious
page-faults.
3. Avoiding TLB flushes on change_huge_pmd() that are only needed to
prevent the A/D bits from changing.
Andrew asked for some benchmark numbers. I do not have an easy
determinate macrobenchmark in which it is easy to show benefit. I
therefore ran a microbenchmark: a loop that does the following on
anonymous memory, just as a sanity check to see that time is saved by
avoiding TLB flushes. The loop goes:
mprotect(p, PAGE_SIZE, PROT_READ)
mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
*p = 0; // make the page writable
The test was run in KVM guest with 1 or 2 threads (the second thread was
busy-looping). I measured the time (cycles) of each operation:
1 thread 2 threads
mmots +patch mmots +patch
PROT_READ 3494 2725 (-22%) 8630 7788 (-10%)
PROT_READ|WRITE 3952 2724 (-31%) 9075 2865 (-68%)
[ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]
The exact numbers are really meaningless, but the benefit is clear. There
are 2 interesting results though.
(1) PROT_READ is cheaper, while one can expect it not to be affected.
This is presumably due to TLB miss that is saved
(2) Without memory access (*p = 0), the speedup of the patch is even
greater. In that scenario mprotect(PROT_READ) also avoids the TLB flush.
As a result both operations on the patched kernel take roughly ~1500
cycles (with either 1 or 2 threads), whereas on mmotm their cost is as
high as presented in the table.
This patch (of 3):
change_pXX_range() currently does not use mmu_gather, but instead
implements its own deferred TLB flushes scheme. This both complicates the
code, as developers need to be aware of different invalidation schemes,
and prevents opportunities to avoid TLB flushes or perform them in finer
granularity.
The use of mmu_gather for modified PTEs has benefits in various scenarios
even if pages are not released. For instance, if only a single page needs
to be flushed out of a range of many pages, only that page would be
flushed. If a THP page is flushed, on x86 a single TLB invlpg instruction
can be used instead of 512 instructions (or a full TLB flush, which would
Linux would actually use by default). mprotect() over multiple VMAs
requires a single flush.
Use mmu_gather in change_pXX_range(). As the pages are not released, only
record the flushed range using tlb_flush_pXX_range().
Handle THP similarly and get rid of flush_cache_range() which becomes
redundant since tlb_start_vma() calls it when needed.
Link: https://lkml.kernel.org/r/20220401180821.1986781-1-namit@vmware.com
Link: https://lkml.kernel.org/r/20220401180821.1986781-2-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-10 01:20:50 +00:00
|
|
|
struct mmu_gather tlb;
|
2023-01-20 16:26:18 +00:00
|
|
|
struct vma_iterator vmi;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
#ifdef CONFIG_STACK_GROWSUP
|
2014-05-13 22:58:24 +00:00
|
|
|
/* Limit stack size */
|
2018-04-10 23:35:01 +00:00
|
|
|
stack_base = bprm->rlim_stack.rlim_max;
|
2020-11-06 18:41:36 +00:00
|
|
|
|
|
|
|
stack_base = calc_max_stack_size(stack_base);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2015-05-11 20:01:27 +00:00
|
|
|
/* Add space for stack randomization. */
|
2024-09-07 16:28:11 +00:00
|
|
|
if (current->flags & PF_RANDOMIZE)
|
|
|
|
stack_base += (STACK_RND_MASK << PAGE_SHIFT);
|
2015-05-11 20:01:27 +00:00
|
|
|
|
2007-07-19 08:48:16 +00:00
|
|
|
/* Make sure we didn't let the argument array grow too large. */
|
|
|
|
if (vma->vm_end - vma->vm_start > stack_base)
|
|
|
|
return -ENOMEM;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2007-07-19 08:48:16 +00:00
|
|
|
stack_base = PAGE_ALIGN(stack_top - stack_base);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2007-07-19 08:48:16 +00:00
|
|
|
stack_shift = vma->vm_start - stack_base;
|
|
|
|
mm->arg_start = bprm->p - stack_shift;
|
|
|
|
bprm->p = vma->vm_end - stack_shift;
|
2005-04-16 22:20:36 +00:00
|
|
|
#else
|
2007-07-19 08:48:16 +00:00
|
|
|
stack_top = arch_align_stack(stack_top);
|
|
|
|
stack_top = PAGE_ALIGN(stack_top);
|
2010-09-08 02:35:49 +00:00
|
|
|
|
|
|
|
if (unlikely(stack_top < mmap_min_addr) ||
|
|
|
|
unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2007-07-19 08:48:16 +00:00
|
|
|
stack_shift = vma->vm_end - stack_top;
|
|
|
|
|
|
|
|
bprm->p -= stack_shift;
|
2005-04-16 22:20:36 +00:00
|
|
|
mm->arg_start = bprm->p;
|
|
|
|
#endif
|
|
|
|
|
|
|
|
if (bprm->loader)
|
2007-07-19 08:48:16 +00:00
|
|
|
bprm->loader -= stack_shift;
|
|
|
|
bprm->exec -= stack_shift;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2020-06-09 04:33:25 +00:00
|
|
|
if (mmap_write_lock_killable(mm))
|
2016-05-23 23:26:02 +00:00
|
|
|
return -EINTR;
|
|
|
|
|
2008-07-10 20:19:20 +00:00
|
|
|
vm_flags = VM_STACK_FLAGS;
|
2007-07-19 08:48:16 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Adjust stack execute permissions; explicitly enable for
|
|
|
|
* EXSTACK_ENABLE_X, disable for EXSTACK_DISABLE_X and leave alone
|
|
|
|
* (arch default) otherwise.
|
|
|
|
*/
|
|
|
|
if (unlikely(executable_stack == EXSTACK_ENABLE_X))
|
|
|
|
vm_flags |= VM_EXEC;
|
|
|
|
else if (executable_stack == EXSTACK_DISABLE_X)
|
|
|
|
vm_flags &= ~VM_EXEC;
|
|
|
|
vm_flags |= mm->def_flags;
|
2010-05-24 21:32:24 +00:00
|
|
|
vm_flags |= VM_STACK_INCOMPLETE_SETUP;
|
2007-07-19 08:48:16 +00:00
|
|
|
|
2023-01-20 16:26:18 +00:00
|
|
|
vma_iter_init(&vmi, mm, vma->vm_start);
|
|
|
|
|
mm/mprotect: use mmu_gather
Patch series "mm/mprotect: avoid unnecessary TLB flushes", v6.
This patchset is intended to remove unnecessary TLB flushes during
mprotect() syscalls. Once this patch-set make it through, similar and
further optimizations for MADV_COLD and userfaultfd would be possible.
Basically, there are 3 optimizations in this patch-set:
1. Use TLB batching infrastructure to batch flushes across VMAs and do
better/fewer flushes. This would also be handy for later userfaultfd
enhancements.
2. Avoid unnecessary TLB flushes. This optimization is the one that
provides most of the performance benefits. Unlike previous versions,
we now only avoid flushes that would not result in spurious
page-faults.
3. Avoiding TLB flushes on change_huge_pmd() that are only needed to
prevent the A/D bits from changing.
Andrew asked for some benchmark numbers. I do not have an easy
determinate macrobenchmark in which it is easy to show benefit. I
therefore ran a microbenchmark: a loop that does the following on
anonymous memory, just as a sanity check to see that time is saved by
avoiding TLB flushes. The loop goes:
mprotect(p, PAGE_SIZE, PROT_READ)
mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
*p = 0; // make the page writable
The test was run in KVM guest with 1 or 2 threads (the second thread was
busy-looping). I measured the time (cycles) of each operation:
1 thread 2 threads
mmots +patch mmots +patch
PROT_READ 3494 2725 (-22%) 8630 7788 (-10%)
PROT_READ|WRITE 3952 2724 (-31%) 9075 2865 (-68%)
[ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]
The exact numbers are really meaningless, but the benefit is clear. There
are 2 interesting results though.
(1) PROT_READ is cheaper, while one can expect it not to be affected.
This is presumably due to TLB miss that is saved
(2) Without memory access (*p = 0), the speedup of the patch is even
greater. In that scenario mprotect(PROT_READ) also avoids the TLB flush.
As a result both operations on the patched kernel take roughly ~1500
cycles (with either 1 or 2 threads), whereas on mmotm their cost is as
high as presented in the table.
This patch (of 3):
change_pXX_range() currently does not use mmu_gather, but instead
implements its own deferred TLB flushes scheme. This both complicates the
code, as developers need to be aware of different invalidation schemes,
and prevents opportunities to avoid TLB flushes or perform them in finer
granularity.
The use of mmu_gather for modified PTEs has benefits in various scenarios
even if pages are not released. For instance, if only a single page needs
to be flushed out of a range of many pages, only that page would be
flushed. If a THP page is flushed, on x86 a single TLB invlpg instruction
can be used instead of 512 instructions (or a full TLB flush, which would
Linux would actually use by default). mprotect() over multiple VMAs
requires a single flush.
Use mmu_gather in change_pXX_range(). As the pages are not released, only
record the flushed range using tlb_flush_pXX_range().
Handle THP similarly and get rid of flush_cache_range() which becomes
redundant since tlb_start_vma() calls it when needed.
Link: https://lkml.kernel.org/r/20220401180821.1986781-1-namit@vmware.com
Link: https://lkml.kernel.org/r/20220401180821.1986781-2-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-10 01:20:50 +00:00
|
|
|
tlb_gather_mmu(&tlb, mm);
|
2023-01-20 16:26:18 +00:00
|
|
|
ret = mprotect_fixup(&vmi, &tlb, vma, &prev, vma->vm_start, vma->vm_end,
|
2007-07-19 08:48:16 +00:00
|
|
|
vm_flags);
|
mm/mprotect: use mmu_gather
Patch series "mm/mprotect: avoid unnecessary TLB flushes", v6.
This patchset is intended to remove unnecessary TLB flushes during
mprotect() syscalls. Once this patch-set make it through, similar and
further optimizations for MADV_COLD and userfaultfd would be possible.
Basically, there are 3 optimizations in this patch-set:
1. Use TLB batching infrastructure to batch flushes across VMAs and do
better/fewer flushes. This would also be handy for later userfaultfd
enhancements.
2. Avoid unnecessary TLB flushes. This optimization is the one that
provides most of the performance benefits. Unlike previous versions,
we now only avoid flushes that would not result in spurious
page-faults.
3. Avoiding TLB flushes on change_huge_pmd() that are only needed to
prevent the A/D bits from changing.
Andrew asked for some benchmark numbers. I do not have an easy
determinate macrobenchmark in which it is easy to show benefit. I
therefore ran a microbenchmark: a loop that does the following on
anonymous memory, just as a sanity check to see that time is saved by
avoiding TLB flushes. The loop goes:
mprotect(p, PAGE_SIZE, PROT_READ)
mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
*p = 0; // make the page writable
The test was run in KVM guest with 1 or 2 threads (the second thread was
busy-looping). I measured the time (cycles) of each operation:
1 thread 2 threads
mmots +patch mmots +patch
PROT_READ 3494 2725 (-22%) 8630 7788 (-10%)
PROT_READ|WRITE 3952 2724 (-31%) 9075 2865 (-68%)
[ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]
The exact numbers are really meaningless, but the benefit is clear. There
are 2 interesting results though.
(1) PROT_READ is cheaper, while one can expect it not to be affected.
This is presumably due to TLB miss that is saved
(2) Without memory access (*p = 0), the speedup of the patch is even
greater. In that scenario mprotect(PROT_READ) also avoids the TLB flush.
As a result both operations on the patched kernel take roughly ~1500
cycles (with either 1 or 2 threads), whereas on mmotm their cost is as
high as presented in the table.
This patch (of 3):
change_pXX_range() currently does not use mmu_gather, but instead
implements its own deferred TLB flushes scheme. This both complicates the
code, as developers need to be aware of different invalidation schemes,
and prevents opportunities to avoid TLB flushes or perform them in finer
granularity.
The use of mmu_gather for modified PTEs has benefits in various scenarios
even if pages are not released. For instance, if only a single page needs
to be flushed out of a range of many pages, only that page would be
flushed. If a THP page is flushed, on x86 a single TLB invlpg instruction
can be used instead of 512 instructions (or a full TLB flush, which would
Linux would actually use by default). mprotect() over multiple VMAs
requires a single flush.
Use mmu_gather in change_pXX_range(). As the pages are not released, only
record the flushed range using tlb_flush_pXX_range().
Handle THP similarly and get rid of flush_cache_range() which becomes
redundant since tlb_start_vma() calls it when needed.
Link: https://lkml.kernel.org/r/20220401180821.1986781-1-namit@vmware.com
Link: https://lkml.kernel.org/r/20220401180821.1986781-2-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-10 01:20:50 +00:00
|
|
|
tlb_finish_mmu(&tlb);
|
|
|
|
|
2007-07-19 08:48:16 +00:00
|
|
|
if (ret)
|
|
|
|
goto out_unlock;
|
|
|
|
BUG_ON(prev != vma);
|
|
|
|
|
2020-01-31 06:17:29 +00:00
|
|
|
if (unlikely(vm_flags & VM_EXEC)) {
|
|
|
|
pr_warn_once("process '%pD4' started with executable stack\n",
|
|
|
|
bprm->file);
|
|
|
|
}
|
|
|
|
|
2007-07-19 08:48:16 +00:00
|
|
|
/* Move stack pages down in memory. */
|
|
|
|
if (stack_shift) {
|
2024-07-29 11:50:37 +00:00
|
|
|
/*
|
|
|
|
* During bprm_mm_init(), we create a temporary stack at STACK_TOP_MAX. Once
|
|
|
|
* the binfmt code determines where the new stack should reside, we shift it to
|
|
|
|
* its final location.
|
|
|
|
*/
|
|
|
|
ret = relocate_vma_down(vma, stack_shift);
|
2009-11-11 22:26:48 +00:00
|
|
|
if (ret)
|
|
|
|
goto out_unlock;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2010-05-24 21:32:24 +00:00
|
|
|
/* mprotect_fixup is overkill to remove the temporary stack flags */
|
2023-01-26 19:37:49 +00:00
|
|
|
vm_flags_clear(vma, VM_STACK_INCOMPLETE_SETUP);
|
2010-05-24 21:32:24 +00:00
|
|
|
|
2010-03-05 21:42:57 +00:00
|
|
|
stack_expand = 131072UL; /* randomly 32*4k (or 2*64k) pages */
|
2010-02-10 21:56:42 +00:00
|
|
|
stack_size = vma->vm_end - vma->vm_start;
|
|
|
|
/*
|
|
|
|
* Align this down to a page boundary as expand_stack
|
|
|
|
* will align it up.
|
|
|
|
*/
|
2018-04-10 23:35:01 +00:00
|
|
|
rlim_stack = bprm->rlim_stack.rlim_cur & PAGE_MASK;
|
2022-10-19 07:32:35 +00:00
|
|
|
|
|
|
|
stack_expand = min(rlim_stack, stack_size + stack_expand);
|
|
|
|
|
2007-07-19 08:48:16 +00:00
|
|
|
#ifdef CONFIG_STACK_GROWSUP
|
2022-10-19 07:32:35 +00:00
|
|
|
stack_base = vma->vm_start + stack_expand;
|
2007-07-19 08:48:16 +00:00
|
|
|
#else
|
2022-10-19 07:32:35 +00:00
|
|
|
stack_base = vma->vm_end - stack_expand;
|
2007-07-19 08:48:16 +00:00
|
|
|
#endif
|
2010-05-18 14:30:49 +00:00
|
|
|
current->mm->start_stack = bprm->p;
|
2023-06-24 20:45:51 +00:00
|
|
|
ret = expand_stack_locked(vma, stack_base);
|
2007-07-19 08:48:16 +00:00
|
|
|
if (ret)
|
|
|
|
ret = -EFAULT;
|
|
|
|
|
|
|
|
out_unlock:
|
2020-06-09 04:33:25 +00:00
|
|
|
mmap_write_unlock(mm);
|
2009-11-11 22:26:48 +00:00
|
|
|
return ret;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(setup_arg_pages);
|
|
|
|
|
2016-07-24 15:30:18 +00:00
|
|
|
#else
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Transfer the program arguments and environment from the holding pages
|
|
|
|
* onto the stack. The provided stack pointer is adjusted accordingly.
|
|
|
|
*/
|
|
|
|
int transfer_args_to_stack(struct linux_binprm *bprm,
|
|
|
|
unsigned long *sp_location)
|
|
|
|
{
|
|
|
|
unsigned long index, stop, sp;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
stop = bprm->p >> PAGE_SHIFT;
|
|
|
|
sp = *sp_location;
|
|
|
|
|
|
|
|
for (index = MAX_ARG_PAGES - 1; index >= stop; index--) {
|
|
|
|
unsigned int offset = index == stop ? bprm->p & ~PAGE_MASK : 0;
|
exec: Replace kmap{,_atomic}() with kmap_local_page()
The use of kmap() and kmap_atomic() are being deprecated in favor of
kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and are still valid.
Since the use of kmap_local_page() in exec.c is safe, it should be
preferred everywhere in exec.c.
As said, since kmap_local_page() can be also called from atomic context,
and since remove_arg_zero() doesn't (and shouldn't ever) rely on an
implicit preempt_disable(), this function can also safely replace
kmap_atomic().
Therefore, replace kmap() and kmap_atomic() with kmap_local_page() in
fs/exec.c.
Tested with xfstests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220803182856.28246-1-fmdefrancesco@gmail.com
2022-08-03 18:28:56 +00:00
|
|
|
char *src = kmap_local_page(bprm->page[index]) + offset;
|
2016-07-24 15:30:18 +00:00
|
|
|
sp -= PAGE_SIZE - offset;
|
|
|
|
if (copy_to_user((void *) sp, src, PAGE_SIZE - offset) != 0)
|
|
|
|
ret = -EFAULT;
|
exec: Replace kmap{,_atomic}() with kmap_local_page()
The use of kmap() and kmap_atomic() are being deprecated in favor of
kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and are still valid.
Since the use of kmap_local_page() in exec.c is safe, it should be
preferred everywhere in exec.c.
As said, since kmap_local_page() can be also called from atomic context,
and since remove_arg_zero() doesn't (and shouldn't ever) rely on an
implicit preempt_disable(), this function can also safely replace
kmap_atomic().
Therefore, replace kmap() and kmap_atomic() with kmap_local_page() in
fs/exec.c.
Tested with xfstests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220803182856.28246-1-fmdefrancesco@gmail.com
2022-08-03 18:28:56 +00:00
|
|
|
kunmap_local(src);
|
2016-07-24 15:30:18 +00:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2024-03-20 18:26:07 +00:00
|
|
|
bprm->exec += *sp_location - MAX_ARG_PAGES * PAGE_SIZE;
|
2016-07-24 15:30:18 +00:00
|
|
|
*sp_location = sp;
|
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(transfer_args_to_stack);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
#endif /* CONFIG_MMU */
|
|
|
|
|
2022-09-17 00:11:18 +00:00
|
|
|
/*
|
|
|
|
* On success, caller must call do_close_execat() on the returned
|
|
|
|
* struct file to close it.
|
|
|
|
*/
|
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 00:57:29 +00:00
|
|
|
static struct file *do_open_execat(int fd, struct filename *name, int flags)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
struct file *file;
|
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 00:57:29 +00:00
|
|
|
struct open_flags open_exec_flags = {
|
2011-02-23 22:44:09 +00:00
|
|
|
.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
|
2015-12-27 03:33:24 +00:00
|
|
|
.acc_mode = MAY_EXEC,
|
2013-06-11 04:23:01 +00:00
|
|
|
.intent = LOOKUP_OPEN,
|
|
|
|
.lookup_flags = LOOKUP_FOLLOW,
|
2011-02-23 22:44:09 +00:00
|
|
|
};
|
2005-04-16 22:20:36 +00:00
|
|
|
|
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 00:57:29 +00:00
|
|
|
if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
if (flags & AT_SYMLINK_NOFOLLOW)
|
|
|
|
open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW;
|
|
|
|
if (flags & AT_EMPTY_PATH)
|
|
|
|
open_exec_flags.lookup_flags |= LOOKUP_EMPTY;
|
|
|
|
|
|
|
|
file = do_filp_open(fd, name, &open_exec_flags);
|
2009-04-06 15:16:22 +00:00
|
|
|
if (IS_ERR(file))
|
2024-08-05 13:17:21 +00:00
|
|
|
return file;
|
2008-05-19 05:53:34 +00:00
|
|
|
|
exec: move S_ISREG() check earlier
The execve(2)/uselib(2) syscalls have always rejected non-regular files.
Recently, it was noticed that a deadlock was introduced when trying to
execute pipes, as the S_ISREG() test was happening too late. This was
fixed in commit 73601ea5b7b1 ("fs/open.c: allow opening only regular files
during execve()"), but it was added after inode_permission() had already
run, which meant LSMs could see bogus attempts to execute non-regular
files.
Move the test into the other inode type checks (which already look for
other pathological conditions[1]). Since there is no need to use
FMODE_EXEC while we still have access to "acc_mode", also switch the test
to MAY_EXEC.
Also include a comment with the redundant S_ISREG() checks at the end of
execve(2)/uselib(2) to note that they are present to avoid any mistakes.
My notes on the call path, and related arguments, checks, etc:
do_open_execat()
struct open_flags open_exec_flags = {
.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
.acc_mode = MAY_EXEC,
...
do_filp_open(dfd, filename, open_flags)
path_openat(nameidata, open_flags, flags)
file = alloc_empty_file(open_flags, current_cred());
do_open(nameidata, file, open_flags)
may_open(path, acc_mode, open_flag)
/* new location of MAY_EXEC vs S_ISREG() test */
inode_permission(inode, MAY_OPEN | acc_mode)
security_inode_permission(inode, acc_mode)
vfs_open(path, file)
do_dentry_open(file, path->dentry->d_inode, open)
/* old location of FMODE_EXEC vs S_ISREG() test */
security_file_open(f)
open()
[1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 01:36:26 +00:00
|
|
|
/*
|
2024-08-05 13:17:21 +00:00
|
|
|
* In the past the regular type check was here. It moved to may_open() in
|
|
|
|
* 633fb6ac3980 ("exec: move S_ISREG() check earlier"). Since then it is
|
|
|
|
* an invariant that all non-regular files error out before we get here.
|
exec: move S_ISREG() check earlier
The execve(2)/uselib(2) syscalls have always rejected non-regular files.
Recently, it was noticed that a deadlock was introduced when trying to
execute pipes, as the S_ISREG() test was happening too late. This was
fixed in commit 73601ea5b7b1 ("fs/open.c: allow opening only regular files
during execve()"), but it was added after inode_permission() had already
run, which meant LSMs could see bogus attempts to execute non-regular
files.
Move the test into the other inode type checks (which already look for
other pathological conditions[1]). Since there is no need to use
FMODE_EXEC while we still have access to "acc_mode", also switch the test
to MAY_EXEC.
Also include a comment with the redundant S_ISREG() checks at the end of
execve(2)/uselib(2) to note that they are present to avoid any mistakes.
My notes on the call path, and related arguments, checks, etc:
do_open_execat()
struct open_flags open_exec_flags = {
.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
.acc_mode = MAY_EXEC,
...
do_filp_open(dfd, filename, open_flags)
path_openat(nameidata, open_flags, flags)
file = alloc_empty_file(open_flags, current_cred());
do_open(nameidata, file, open_flags)
may_open(path, acc_mode, open_flag)
/* new location of MAY_EXEC vs S_ISREG() test */
inode_permission(inode, MAY_OPEN | acc_mode)
security_inode_permission(inode, acc_mode)
vfs_open(path, file)
do_dentry_open(file, path->dentry->d_inode, open)
/* old location of FMODE_EXEC vs S_ISREG() test */
security_file_open(f)
open()
[1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 01:36:26 +00:00
|
|
|
*/
|
2024-08-05 13:17:21 +00:00
|
|
|
if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode)) ||
|
|
|
|
path_noexec(&file->f_path)) {
|
|
|
|
fput(file);
|
|
|
|
return ERR_PTR(-EACCES);
|
|
|
|
}
|
2008-05-19 05:53:34 +00:00
|
|
|
|
|
|
|
return file;
|
|
|
|
}
|
2014-02-05 20:54:53 +00:00
|
|
|
|
2022-09-17 00:11:18 +00:00
|
|
|
/**
|
|
|
|
* open_exec - Open a path name for execution
|
|
|
|
*
|
|
|
|
* @name: path name to open with the intent of executing it.
|
|
|
|
*
|
|
|
|
* Returns ERR_PTR on failure or allocated struct file on success.
|
|
|
|
*
|
fs: don't block i_writecount during exec
Back in 2021 we already discussed removing deny_write_access() for
executables. Back then I was hesistant because I thought that this might
cause issues in userspace. But even back then I had started taking some
notes on what could potentially depend on this and I didn't come up with
a lot so I've changed my mind and I would like to try this.
Here are some of the notes that I took:
(1) The deny_write_access() mechanism is causing really pointless issues
such as [1]. If a thread in a thread-group opens a file writable,
then writes some stuff, then closing the file descriptor and then
calling execve() they can fail the execve() with ETXTBUSY because
another thread in the thread-group could have concurrently called
fork(). Multi-threaded libraries such as go suffer from this.
(2) There are userspace attacks that rely on overwriting the binary of a
running process. These attacks are _mitigated_ but _not at all
prevented_ from ocurring by the deny_write_access() mechanism.
I'll go over some details. The clearest example of such attacks was
the attack against runC in CVE-2019-5736 (cf. [3]).
An attack could compromise the runC host binary from inside a
_privileged_ runC container. The malicious binary could then be used
to take over the host.
(It is crucial to note that this attack is _not_ possible with
unprivileged containers. IOW, the setup here is already insecure.)
The attack can be made when attaching to a running container or when
starting a container running a specially crafted image. For example,
when runC attaches to a container the attacker can trick it into
executing itself.
This could be done by replacing the target binary inside the
container with a custom binary pointing back at the runC binary
itself. As an example, if the target binary was /bin/bash, this
could be replaced with an executable script specifying the
interpreter path #!/proc/self/exe.
As such when /bin/bash is executed inside the container, instead the
target of /proc/self/exe will be executed. That magic link will
point to the runc binary on the host. The attacker can then proceed
to write to the target of /proc/self/exe to try and overwrite the
runC binary on the host.
However, this will not succeed because of deny_write_access(). Now,
one might think that this would prevent the attack but it doesn't.
To overcome this, the attacker has multiple ways:
* Open a file descriptor to /proc/self/exe using the O_PATH flag and
then proceed to reopen the binary as O_WRONLY through
/proc/self/fd/<nr> and try to write to it in a busy loop from a
separate process. Ultimately it will succeed when the runC binary
exits. After this the runC binary is compromised and can be used
to attack other containers or the host itself.
* Use a malicious shared library annotating a function in there with
the constructor attribute making the malicious function run as an
initializor. The malicious library will then open /proc/self/exe
for creating a new entry under /proc/self/fd/<nr>. It'll then call
exec to a) force runC to exit and b) hand the file descriptor off
to a program that then reopens /proc/self/fd/<nr> for writing
(which is now possible because runC has exited) and overwriting
that binary.
To sum up: the deny_write_access() mechanism doesn't prevent such
attacks in insecure setups. It just makes them minimally harder.
That's all.
The only way back then to prevent this is to create a temporary copy
of the calling binary itself when it starts or attaches to
containers. So what I did back then for LXC (and Aleksa for runC)
was to create an anonymous, in-memory file using the memfd_create()
system call and to copy itself into the temporary in-memory file,
which is then sealed to prevent further modifications. This sealed,
in-memory file copy is then executed instead of the original on-disk
binary.
Any compromising write operations from a privileged container to the
host binary will then write to the temporary in-memory binary and
not to the host binary on-disk, preserving the integrity of the host
binary. Also as the temporary, in-memory binary is sealed, writes to
this will also fail.
The point is that deny_write_access() is uselss to prevent these
attacks.
(3) Denying write access to an inode because it's currently used in an
exec path could easily be done on an LSM level. It might need an
additional hook but that should be about it.
(4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time
ago so while we do protect the main executable the bigger portion of
the things you'd think need protecting such as the shared libraries
aren't. IOW, we let anyone happily overwrite shared libraries.
(5) We removed all remaining uses of VM_DENYWRITE in [2]. That means:
(5.1) We removed the legacy uselib() protection for preventing
overwriting of shared libraries. Nobody cared in 3 years.
(5.2) We allow write access to the elf interpreter after exec
completed treating it on a par with shared libraries.
Yes, someone in userspace could potentially be relying on this. It's not
completely out of the realm of possibility but let's find out if that's
actually the case and not guess.
Link: https://github.com/golang/go/issues/22315 [1]
Link: 49624efa65ac ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2]
Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3]
Link: https://lwn.net/Articles/866493
Link: https://github.com/golang/go/issues/22220
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61
Link: https://github.com/buildkite/agent/pull/2736
Link: https://github.com/rust-lang/rust/issues/114554
Link: https://bugs.openjdk.org/browse/JDK-8068370
Link: https://github.com/dotnet/runtime/issues/58964
Link: https://lore.kernel.org/r/20240531-vfs-i_writecount-v1-1-a17bea7ee36b@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-31 13:01:43 +00:00
|
|
|
* As this is a wrapper for the internal do_open_execat(). Also see
|
2022-09-17 00:11:18 +00:00
|
|
|
* do_close_execat().
|
|
|
|
*/
|
2014-02-05 20:54:53 +00:00
|
|
|
struct file *open_exec(const char *name)
|
|
|
|
{
|
2015-01-22 05:00:03 +00:00
|
|
|
struct filename *filename = getname_kernel(name);
|
|
|
|
struct file *f = ERR_CAST(filename);
|
|
|
|
|
|
|
|
if (!IS_ERR(filename)) {
|
|
|
|
f = do_open_execat(AT_FDCWD, filename, 0);
|
|
|
|
putname(filename);
|
|
|
|
}
|
|
|
|
return f;
|
2014-02-05 20:54:53 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
EXPORT_SYMBOL(open_exec);
|
|
|
|
|
2022-09-26 22:15:32 +00:00
|
|
|
#if defined(CONFIG_BINFMT_FLAT) || defined(CONFIG_BINFMT_ELF_FDPIC)
|
2013-04-14 00:31:37 +00:00
|
|
|
ssize_t read_code(struct file *file, unsigned long addr, loff_t pos, size_t len)
|
|
|
|
{
|
2014-02-05 00:08:21 +00:00
|
|
|
ssize_t res = vfs_read(file, (void __user *)addr, len, &pos);
|
2013-04-14 00:31:37 +00:00
|
|
|
if (res > 0)
|
2020-06-08 04:42:43 +00:00
|
|
|
flush_icache_user_range(addr, addr + len);
|
2013-04-14 00:31:37 +00:00
|
|
|
return res;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(read_code);
|
2020-06-08 04:42:40 +00:00
|
|
|
#endif
|
2013-04-14 00:31:37 +00:00
|
|
|
|
2020-03-25 15:03:36 +00:00
|
|
|
/*
|
|
|
|
* Maps the mm_struct mm into the current task struct.
|
2020-12-03 20:12:00 +00:00
|
|
|
* On success, this function returns with exec_update_lock
|
|
|
|
* held for writing.
|
2020-03-25 15:03:36 +00:00
|
|
|
*/
|
2005-04-16 22:20:36 +00:00
|
|
|
static int exec_mmap(struct mm_struct *mm)
|
|
|
|
{
|
|
|
|
struct task_struct *tsk;
|
mm: per-thread vma caching
This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches were needed. There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma(). Improving the hit-rate does not necessarily
translate in finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.
We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality. On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.
The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number. The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed. Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question. Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:
1) System bootup: Most programs are single threaded, so the per-thread
scheme does improve ~50% hit rate by just adding a few more slots to
the cache.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 50.61% | 19.90 |
| patched | 73.45% | 13.58 |
+----------------+----------+------------------+
2) Kernel build: This one is already pretty good with the current
approach as we're dealing with good locality.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 75.28% | 11.03 |
| patched | 88.09% | 9.31 |
+----------------+----------+------------------+
3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 70.66% | 17.14 |
| patched | 91.15% | 12.57 |
+----------------+----------+------------------+
4) Ebizzy: There's a fair amount of variation from run to run, but this
approach always shows nearly perfect hit rates, while baseline is just
about non-existent. The amounts of cycles can fluctuate between
anywhere from ~60 to ~116 for the baseline scheme, but this approach
reduces it considerably. For instance, with 80 threads:
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 1.06% | 91.54 |
| patched | 99.97% | 14.18 |
+----------------+----------+------------------+
[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 22:37:25 +00:00
|
|
|
struct mm_struct *old_mm, *active_mm;
|
2020-03-25 15:03:36 +00:00
|
|
|
int ret;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/* Notify parent that we're no longer interested in the old VM */
|
|
|
|
tsk = current;
|
|
|
|
old_mm = current->mm;
|
2019-11-06 21:55:38 +00:00
|
|
|
exec_mm_release(tsk, old_mm);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2020-12-03 20:12:00 +00:00
|
|
|
ret = down_write_killable(&tsk->signal->exec_update_lock);
|
2020-03-25 15:03:36 +00:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
if (old_mm) {
|
|
|
|
/*
|
2021-09-03 15:26:05 +00:00
|
|
|
* If there is a pending fatal signal perhaps a signal
|
|
|
|
* whose default action is to create a coredump get
|
|
|
|
* out and die instead of going through with the exec.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
2021-09-03 15:26:05 +00:00
|
|
|
ret = mmap_read_lock_killable(old_mm);
|
|
|
|
if (ret) {
|
2020-12-03 20:12:00 +00:00
|
|
|
up_write(&tsk->signal->exec_update_lock);
|
2021-09-03 15:26:05 +00:00
|
|
|
return ret;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
}
|
2020-03-25 15:03:36 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
task_lock(tsk);
|
2019-09-19 17:37:02 +00:00
|
|
|
membarrier_exec_mmap(mm);
|
mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.
This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).
Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:
call_usermodehelper()
kernel_execve()
old_mm = current->mm;
active_mm = current->active_mm;
*** preempt *** --------------------> schedule()
prev->active_mm = NULL;
mmdrop(prev active_mm);
...
<-------------------- schedule()
current->mm = mm;
current->active_mm = mm;
if (!old_mm)
mmdrop(active_mm);
If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.
Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().
So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.
This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914045219.3736466-2-npiggin@gmail.com
2020-09-14 04:52:16 +00:00
|
|
|
|
|
|
|
local_irq_disable();
|
|
|
|
active_mm = tsk->active_mm;
|
2005-04-16 22:20:36 +00:00
|
|
|
tsk->active_mm = mm;
|
mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.
This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).
Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:
call_usermodehelper()
kernel_execve()
old_mm = current->mm;
active_mm = current->active_mm;
*** preempt *** --------------------> schedule()
prev->active_mm = NULL;
mmdrop(prev active_mm);
...
<-------------------- schedule()
current->mm = mm;
current->active_mm = mm;
if (!old_mm)
mmdrop(active_mm);
If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.
Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().
So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.
This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914045219.3736466-2-npiggin@gmail.com
2020-09-14 04:52:16 +00:00
|
|
|
tsk->mm = mm;
|
sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads
commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
a reference to the concurrency id allocated for each CPU. This reference
expires shortly after a 100ms delay.
These per-CPU references keep the per-mm-cid data cache-local in
situations where threads are running at least once on each CPU within
each 100ms window, thus keeping the per-cpu reference alive.
However, intermittent workloads behaving in bursts spaced by more than
100ms on each CPU exhibit bad cache locality and degraded performance
compared to purely per-cpu data indexing, because concurrency IDs are
allocated over various CPUs and cores, therefore losing cache locality
of the associated data.
Introduce the following changes to improve per-mm-cid cache locality:
- Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep
track of which mm_cid value was last used, and use it as a hint to
attempt re-allocating the same concurrency ID the next time this
mm/cpu needs to allocate a concurrency ID,
- Add a per-mm CPUs allowed mask, which keeps track of the union of
CPUs allowed for all threads belonging to this mm. This cpumask is
only set during the lifetime of the mm, never cleared, so it
represents the union of all the CPUs allowed since the beginning of
the mm lifetime (note that the mm_cpumask() is really arch-specific
and tailored to the TLB flush needs, and is thus _not_ a viable
approach for this),
- Add a per-mm nr_cpus_allowed to keep track of the weight of the
per-mm CPUs allowed mask (for fast access),
- Add a per-mm max_nr_cid to keep track of the highest number of
concurrency IDs allocated for the mm. This is used for expanding the
concurrency ID allocation within the upper bound defined by:
min(mm->nr_cpus_allowed, mm->mm_users)
When the next unused CID value reaches this threshold, stop trying
to expand the cid allocation and use the first available cid value
instead.
Spreading allocation to use all the cid values within the range
[ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]
improves cache locality while preserving mm_cid compactness within the
expected user limits,
- In __mm_cid_try_get, only return cid values within the range
[ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This
prevents allocating cids above the number of allowed cpus in
rare scenarios where cid allocation races with a concurrent
remote-clear of the per-mm/cpu cid. This improvement is made
possible by the addition of the per-mm CPUs allowed mask,
- In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than
t->nr_cpus_allowed. This criterion was really meant to compare
the number of mm->mm_users to the number of CPUs allowed for the
entire mm. Therefore, the prior comparison worked fine when all
threads shared the same CPUs allowed mask, but not so much in
scenarios where those threads have different masks (e.g. each
thread pinned to a single CPU). This improvement is made
possible by the addition of the per-mm CPUs allowed mask.
* Benchmarks
Each thread increments 16kB worth of 8-bit integers in bursts, with
a configurable delay between each thread's execution. Each thread run
one after the other (no threads run concurrently). The order of
thread execution in the sequence is random. The thread execution
sequence begins again after all threads have executed. The 16kB areas
are allocated with rseq_mempool and indexed by either cpu_id, mm_cid
(not cache-local), or cache-local mm_cid. Each thread is pinned to its
own core.
Testing configurations:
8-core/1-L3: Use 8 cores within a single L3
24-core/24-L3: Use 24 cores, 1 core per L3
192-core/24-L3: Use 192 cores (all cores in the system)
384-thread/24-L3: Use 384 HW threads (all HW threads in the system)
Intermittent workload delays between threads: 200ms, 10ms.
Hardware:
CPU(s): 384
On-line CPU(s) list: 0-383
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9654 96-Core Processor
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
Caches (sum of all):
L1d: 6 MiB (192 instances)
L1i: 6 MiB (192 instances)
L2: 192 MiB (192 instances)
L3: 768 MiB (24 instances)
Each result is an average of 5 test runs. The cache-local speedup
is calculated as: (cache-local mm_cid) / (mm_cid).
Intermittent workload delay: 200ms
per-cpu mm_cid cache-local mm_cid cache-local speedup
(ns) (ns) (ns)
8-core/1-L3 1374 19289 1336 14.4x
24-core/24-L3 2423 26721 1594 16.7x
192-core/24-L3 2291 15826 2153 7.3x
384-thread/24-L3 1874 13234 1907 6.9x
Intermittent workload delay: 10ms
per-cpu mm_cid cache-local mm_cid cache-local speedup
(ns) (ns) (ns)
8-core/1-L3 662 756 686 1.1x
24-core/24-L3 1378 3648 1035 3.5x
192-core/24-L3 1439 10833 1482 7.3x
384-thread/24-L3 1503 10570 1556 6.8x
[ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs"
patch series with a simpler and more general approach. ]
[ This patch applies on top of v6.12-rc1. ]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/lkml/20240823185946.418340-1-mathieu.desnoyers@efficios.com/
2024-10-09 13:50:07 +00:00
|
|
|
mm_init_cid(mm, tsk);
|
mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.
This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).
Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:
call_usermodehelper()
kernel_execve()
old_mm = current->mm;
active_mm = current->active_mm;
*** preempt *** --------------------> schedule()
prev->active_mm = NULL;
mmdrop(prev active_mm);
...
<-------------------- schedule()
current->mm = mm;
current->active_mm = mm;
if (!old_mm)
mmdrop(active_mm);
If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.
Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().
So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.
This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914045219.3736466-2-npiggin@gmail.com
2020-09-14 04:52:16 +00:00
|
|
|
/*
|
|
|
|
* This prevents preemption while active_mm is being loaded and
|
|
|
|
* it and mm are being updated, which could cause problems for
|
|
|
|
* lazy tlb mm refcounting when these are updated by context
|
|
|
|
* switches. Not all architectures can handle irqs off over
|
|
|
|
* activate_mm yet.
|
|
|
|
*/
|
|
|
|
if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
|
|
|
|
local_irq_enable();
|
2005-04-16 22:20:36 +00:00
|
|
|
activate_mm(active_mm, mm);
|
mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.
This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).
Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:
call_usermodehelper()
kernel_execve()
old_mm = current->mm;
active_mm = current->active_mm;
*** preempt *** --------------------> schedule()
prev->active_mm = NULL;
mmdrop(prev active_mm);
...
<-------------------- schedule()
current->mm = mm;
current->active_mm = mm;
if (!old_mm)
mmdrop(active_mm);
If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.
Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().
So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.
This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914045219.3736466-2-npiggin@gmail.com
2020-09-14 04:52:16 +00:00
|
|
|
if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
|
|
|
|
local_irq_enable();
|
2022-10-26 13:48:30 +00:00
|
|
|
lru_gen_add_mm(mm);
|
2005-04-16 22:20:36 +00:00
|
|
|
task_unlock(tsk);
|
mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page tables
to search for young PTEs and promote hot pages. A kill switch will be
added in the next patch to disable this behavior. When disabled, the
aging relies on the rmap only.
NB: this behavior has nothing similar with the page table scanning in the
2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
to swapcache and unmaps them.
To avoid confusion, the term "iteration" specifically means the traversal
of an entire mm_struct list; the term "walk" will be applied to page
tables and the rmap, as usual.
An mm_struct list is maintained for each memcg, and an mm_struct follows
its owner task to the new memcg when this task is migrated. Given an
lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot pages
before it increments max_seq.
When multiple page table walkers iterate the same list, each of them gets
a unique mm_struct; therefore they can run concurrently. Page table
walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
pages it left in the previous memcg will not be promoted when its current
memcg is under reclaim. Similarly, page table walkers will not promote
pages from nodes other than the one under reclaim.
This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
page table walkers can skip processes that have been sleeping since
the last iteration.
2. It uses generational Bloom filters to record populated branches so
that page table walkers can reduce their search space based on the
query results, e.g., to skip page tables containing mostly holes or
misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
spanning multiple VMAs. IOW, it finishes all the VMAs within the
range of the same PMD table before it returns to a PGD table. This
improves the cache performance for workloads that have large
numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
Server benchmark results:
Single workload:
fio (buffered I/O): no change
Single workload:
memcached (anon): +[8, 10]%
Ops/sec KB/sec
patch1-7: 1147696.57 44640.29
patch1-8: 1245274.91 48435.66
Configurations:
no change
Client benchmark results:
kswapd profiles:
patch1-7
48.16% lzo1x_1_do_compress (real work)
8.20% page_vma_mapped_walk (overhead)
7.06% _raw_spin_unlock_irq
2.92% ptep_clear_flush
2.53% __zram_bvec_write
2.11% do_raw_spin_lock
2.02% memmove
1.93% lru_gen_look_around
1.56% free_unref_page_list
1.40% memset
patch1-8
49.44% lzo1x_1_do_compress (real work)
6.19% page_vma_mapped_walk (overhead)
5.97% _raw_spin_unlock_irq
3.13% get_pfn_folio
2.85% ptep_clear_flush
2.42% __zram_bvec_write
2.08% do_raw_spin_lock
1.92% memmove
1.44% alloc_zspage
1.36% memset
Configurations:
no change
Thanks to the following developers for their efforts [3].
kernel test robot <lkp@intel.com>
[1] https://lwn.net/Articles/23732/
[2] https://llvm.org/docs/ScudoHardenedAllocator.html
[3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/
Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 08:00:05 +00:00
|
|
|
lru_gen_use_mm(mm);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (old_mm) {
|
2020-06-09 04:33:25 +00:00
|
|
|
mmap_read_unlock(old_mm);
|
2006-03-31 23:13:38 +00:00
|
|
|
BUG_ON(active_mm != old_mm);
|
2012-03-19 16:04:01 +00:00
|
|
|
setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
|
mm owner: fix race between swapoff and exit
There's a race between mm->owner assignment and swapoff, more easily
seen when task slab poisoning is turned on. The condition occurs when
try_to_unuse() runs in parallel with an exiting task. A similar race
can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats>
or ptrace or page migration.
CPU0 CPU1
try_to_unuse
looks at mm = task0->mm
increments mm->mm_users
task 0 exits
mm->owner needs to be updated, but no
new owner is found (mm_users > 1, but
no other task has task->mm = task0->mm)
mm_update_next_owner() leaves
mmput(mm) decrements mm->mm_users
task0 freed
dereferencing mm->owner fails
The fix is to notify the subsystem via mm_owner_changed callback(),
if no new owner is found, by specifying the new task as NULL.
Jiri Slaby:
mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
must be set after that, so as not to pass NULL as old owner causing oops.
Daisuke Nishimura:
mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
and its callers need to take account of this situation to avoid oops.
Hugh Dickins:
Lockdep warning and hang below exec_mmap() when testing these patches.
exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
so exec_mmap() now needs to do the same. And with that repositioning,
there's now no point in mm_need_new_owner() allowing for NULL mm.
Reported-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-28 22:09:31 +00:00
|
|
|
mm_update_next_owner(old_mm);
|
2005-04-16 22:20:36 +00:00
|
|
|
mmput(old_mm);
|
|
|
|
return 0;
|
|
|
|
}
|
2023-02-03 07:18:34 +00:00
|
|
|
mmdrop_lazy_tlb(active_mm);
|
2005-04-16 22:20:36 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2006-01-14 21:20:43 +00:00
|
|
|
static int de_thread(struct task_struct *tsk)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
struct signal_struct *sig = tsk->signal;
|
2007-10-17 06:27:22 +00:00
|
|
|
struct sighand_struct *oldsighand = tsk->sighand;
|
2005-04-16 22:20:36 +00:00
|
|
|
spinlock_t *lock = &oldsighand->siglock;
|
|
|
|
|
2006-09-27 08:51:13 +00:00
|
|
|
if (thread_group_empty(tsk))
|
2005-04-16 22:20:36 +00:00
|
|
|
goto no_thread_group;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Kill all other threads in the thread group.
|
|
|
|
*/
|
|
|
|
spin_lock_irq(lock);
|
2021-06-24 07:14:30 +00:00
|
|
|
if ((sig->flags & SIGNAL_GROUP_EXIT) || sig->group_exec_task) {
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* Another group action in progress, just
|
|
|
|
* return so that the signal is processed.
|
|
|
|
*/
|
|
|
|
spin_unlock_irq(lock);
|
|
|
|
return -EAGAIN;
|
|
|
|
}
|
2010-05-26 21:43:11 +00:00
|
|
|
|
2021-06-06 18:47:53 +00:00
|
|
|
sig->group_exec_task = tsk;
|
2010-05-26 21:43:11 +00:00
|
|
|
sig->notify_count = zap_other_threads(tsk);
|
|
|
|
if (!thread_group_leader(tsk))
|
|
|
|
sig->notify_count--;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2010-05-26 21:43:11 +00:00
|
|
|
while (sig->notify_count) {
|
2012-10-08 17:13:01 +00:00
|
|
|
__set_current_state(TASK_KILLABLE);
|
2005-04-16 22:20:36 +00:00
|
|
|
spin_unlock_irq(lock);
|
2018-12-03 12:04:18 +00:00
|
|
|
schedule();
|
2019-01-03 23:28:58 +00:00
|
|
|
if (__fatal_signal_pending(tsk))
|
2012-10-08 17:13:01 +00:00
|
|
|
goto killed;
|
2005-04-16 22:20:36 +00:00
|
|
|
spin_lock_irq(lock);
|
|
|
|
}
|
|
|
|
spin_unlock_irq(lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* At this point all other threads have exited, all we have to
|
|
|
|
* do is to wait for the thread group leader to become inactive,
|
|
|
|
* and to assume its PID:
|
|
|
|
*/
|
2006-09-27 08:51:13 +00:00
|
|
|
if (!thread_group_leader(tsk)) {
|
2008-12-01 22:18:16 +00:00
|
|
|
struct task_struct *leader = tsk->group_leader;
|
2007-10-17 06:27:23 +00:00
|
|
|
|
|
|
|
for (;;) {
|
2017-02-02 10:50:56 +00:00
|
|
|
cgroup_threadgroup_change_begin(tsk);
|
2007-10-17 06:27:23 +00:00
|
|
|
write_lock_irq(&tasklist_lock);
|
2015-04-16 19:48:01 +00:00
|
|
|
/*
|
|
|
|
* Do this under tasklist_lock to ensure that
|
2021-06-06 18:47:53 +00:00
|
|
|
* exit_notify() can't miss ->group_exec_task
|
2015-04-16 19:48:01 +00:00
|
|
|
*/
|
|
|
|
sig->notify_count = -1;
|
2007-10-17 06:27:23 +00:00
|
|
|
if (likely(leader->exit_state))
|
|
|
|
break;
|
2012-10-08 17:13:01 +00:00
|
|
|
__set_current_state(TASK_KILLABLE);
|
2007-10-17 06:27:23 +00:00
|
|
|
write_unlock_irq(&tasklist_lock);
|
2017-02-02 10:50:56 +00:00
|
|
|
cgroup_threadgroup_change_end(tsk);
|
2018-12-03 12:04:18 +00:00
|
|
|
schedule();
|
2019-01-03 23:28:58 +00:00
|
|
|
if (__fatal_signal_pending(tsk))
|
2012-10-08 17:13:01 +00:00
|
|
|
goto killed;
|
2007-10-17 06:27:23 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-04-11 05:54:16 +00:00
|
|
|
/*
|
|
|
|
* The only record we have of the real-time age of a
|
|
|
|
* process, regardless of execs it's done, is start_time.
|
|
|
|
* All the past CPU time is accumulated in signal_struct
|
|
|
|
* from sister threads now dead. But in this non-leader
|
|
|
|
* exec, nothing survives from the original leader thread,
|
|
|
|
* whose birth marks the true age of this process now.
|
|
|
|
* When we take on its identity by switching to its PID, we
|
|
|
|
* also take its birthdate (always earlier than our own).
|
|
|
|
*/
|
2006-09-27 08:51:13 +00:00
|
|
|
tsk->start_time = leader->start_time;
|
2019-11-07 10:07:58 +00:00
|
|
|
tsk->start_boottime = leader->start_boottime;
|
2006-04-11 05:54:16 +00:00
|
|
|
|
2007-10-19 06:40:18 +00:00
|
|
|
BUG_ON(!same_thread_group(leader, tsk));
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* An exec() starts a new thread group with the
|
|
|
|
* TGID of the previous thread group. Rehash the
|
|
|
|
* two threads with a switched PID, and release
|
|
|
|
* the former thread group leader:
|
|
|
|
*/
|
2006-03-29 00:11:03 +00:00
|
|
|
|
|
|
|
/* Become a process group leader with the old leader's pid.
|
2006-09-27 08:51:06 +00:00
|
|
|
* The old leader becomes a thread of the this thread group.
|
2006-03-29 00:11:03 +00:00
|
|
|
*/
|
2020-04-19 11:35:02 +00:00
|
|
|
exchange_tids(tsk, leader);
|
2017-06-04 09:32:13 +00:00
|
|
|
transfer_pid(leader, tsk, PIDTYPE_TGID);
|
2006-09-27 08:51:13 +00:00
|
|
|
transfer_pid(leader, tsk, PIDTYPE_PGID);
|
|
|
|
transfer_pid(leader, tsk, PIDTYPE_SID);
|
2009-12-17 23:27:15 +00:00
|
|
|
|
2006-09-27 08:51:13 +00:00
|
|
|
list_replace_rcu(&leader->tasks, &tsk->tasks);
|
2009-12-17 23:27:15 +00:00
|
|
|
list_replace_init(&leader->sibling, &tsk->sibling);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-09-27 08:51:13 +00:00
|
|
|
tsk->group_leader = tsk;
|
|
|
|
leader->group_leader = tsk;
|
2006-04-10 23:16:49 +00:00
|
|
|
|
2006-09-27 08:51:13 +00:00
|
|
|
tsk->exit_signal = SIGCHLD;
|
2011-06-22 21:10:26 +00:00
|
|
|
leader->exit_signal = -1;
|
2005-11-23 21:37:43 +00:00
|
|
|
|
|
|
|
BUG_ON(leader->exit_state != EXIT_ZOMBIE);
|
|
|
|
leader->exit_state = EXIT_DEAD;
|
ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
Test-case:
void *tfunc(void *arg)
{
execvp("true", NULL);
return NULL;
}
int main(void)
{
int pid;
if (fork()) {
pthread_t t;
kill(getpid(), SIGSTOP);
pthread_create(&t, NULL, tfunc, NULL);
for (;;)
pause();
}
pid = getppid();
assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0);
while (wait(NULL) > 0)
ptrace(PTRACE_CONT, pid, 0,0);
return 0;
}
It is racy, exit_notify() does __wake_up_parent() too. But in the
likely case it triggers the problem: de_thread() does release_task()
and the old leader goes away without the notification, the tracer
sleeps in do_wait() without children/tracees.
Change de_thread() to do __wake_up_parent(traced_leader->parent).
Since it is already EXIT_DEAD we can do this without ptrace_unlink(),
EXIT_DEAD threads do not exist from do_wait's pov.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
2011-07-21 18:00:43 +00:00
|
|
|
/*
|
|
|
|
* We are going to release_task()->ptrace_unlink() silently,
|
|
|
|
* the tracer can sleep in do_wait(). EXIT_DEAD guarantees
|
2022-06-29 07:29:32 +00:00
|
|
|
* the tracer won't block again waiting for this thread.
|
ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
Test-case:
void *tfunc(void *arg)
{
execvp("true", NULL);
return NULL;
}
int main(void)
{
int pid;
if (fork()) {
pthread_t t;
kill(getpid(), SIGSTOP);
pthread_create(&t, NULL, tfunc, NULL);
for (;;)
pause();
}
pid = getppid();
assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0);
while (wait(NULL) > 0)
ptrace(PTRACE_CONT, pid, 0,0);
return 0;
}
It is racy, exit_notify() does __wake_up_parent() too. But in the
likely case it triggers the problem: de_thread() does release_task()
and the old leader goes away without the notification, the tracer
sleeps in do_wait() without children/tracees.
Change de_thread() to do __wake_up_parent(traced_leader->parent).
Since it is already EXIT_DEAD we can do this without ptrace_unlink(),
EXIT_DEAD threads do not exist from do_wait's pov.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
2011-07-21 18:00:43 +00:00
|
|
|
*/
|
|
|
|
if (unlikely(leader->ptrace))
|
|
|
|
__wake_up_parent(leader, leader->parent);
|
2005-04-16 22:20:36 +00:00
|
|
|
write_unlock_irq(&tasklist_lock);
|
2017-02-02 10:50:56 +00:00
|
|
|
cgroup_threadgroup_change_end(tsk);
|
2008-12-01 22:18:16 +00:00
|
|
|
|
|
|
|
release_task(leader);
|
2008-02-05 06:27:24 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2021-06-06 18:47:53 +00:00
|
|
|
sig->group_exec_task = NULL;
|
2007-10-17 06:27:23 +00:00
|
|
|
sig->notify_count = 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
no_thread_group:
|
2012-03-19 16:03:22 +00:00
|
|
|
/* we have changed execution domain */
|
|
|
|
tsk->exit_signal = SIGCHLD;
|
|
|
|
|
2020-03-25 15:00:21 +00:00
|
|
|
BUG_ON(!thread_group_leader(tsk));
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
killed:
|
|
|
|
/* protects against exit_notify() and __exit_signal() */
|
|
|
|
read_lock(&tasklist_lock);
|
2021-06-06 18:47:53 +00:00
|
|
|
sig->group_exec_task = NULL;
|
2020-03-25 15:00:21 +00:00
|
|
|
sig->notify_count = 0;
|
|
|
|
read_unlock(&tasklist_lock);
|
|
|
|
return -EAGAIN;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2020-03-08 17:04:44 +00:00
|
|
|
/*
|
|
|
|
* This function makes sure the current process has its own signal table,
|
|
|
|
* so that flush_signal_handlers can later reset the handlers without
|
|
|
|
* disturbing other processes. (Other processes might share the signal
|
|
|
|
* table via the CLONE_SIGHAND option to clone().)
|
|
|
|
*/
|
2020-03-25 15:00:21 +00:00
|
|
|
static int unshare_sighand(struct task_struct *me)
|
|
|
|
{
|
|
|
|
struct sighand_struct *oldsighand = me->sighand;
|
2005-11-07 18:12:43 +00:00
|
|
|
|
2019-01-18 12:27:26 +00:00
|
|
|
if (refcount_read(&oldsighand->count) != 1) {
|
2007-10-17 06:27:22 +00:00
|
|
|
struct sighand_struct *newsighand;
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
2007-10-17 06:27:22 +00:00
|
|
|
* This ->sighand is shared with the CLONE_SIGHAND
|
|
|
|
* but not CLONE_THREAD task, switch to the new one.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
2007-10-17 06:27:22 +00:00
|
|
|
newsighand = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
|
|
|
|
if (!newsighand)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2019-01-18 12:27:26 +00:00
|
|
|
refcount_set(&newsighand->count, 1);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
write_lock_irq(&tasklist_lock);
|
|
|
|
spin_lock(&oldsighand->siglock);
|
2021-06-07 13:54:27 +00:00
|
|
|
memcpy(newsighand->action, oldsighand->action,
|
|
|
|
sizeof(newsighand->action));
|
2020-03-25 15:00:21 +00:00
|
|
|
rcu_assign_pointer(me->sighand, newsighand);
|
2005-04-16 22:20:36 +00:00
|
|
|
spin_unlock(&oldsighand->siglock);
|
|
|
|
write_unlock_irq(&tasklist_lock);
|
|
|
|
|
signal/timer/event: signalfd core
This patch series implements the new signalfd() system call.
I took part of the original Linus code (and you know how badly it can be
broken :), and I added even more breakage ;) Signals are fetched from the same
signal queue used by the process, so signalfd will compete with standard
kernel delivery in dequeue_signal(). If you want to reliably fetch signals on
the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This
seems to be working fine on my Dual Opteron machine. I made a quick test
program for it:
http://www.xmailserver.org/signafd-test.c
The signalfd() system call implements signal delivery into a file descriptor
receiver. The signalfd file descriptor if created with the following API:
int signalfd(int ufd, const sigset_t *mask, size_t masksize);
The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
to close/create cycle (Linus idea). Use "ufd" == -1 if you want a brand new
signalfd file.
The "mask" allows to specify the signal mask of signals that we are interested
in. The "masksize" parameter is the size of "mask".
The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
will return POLLIN when signals are available to be dequeued. As a direct
consequence of supporting the Linux poll subsystem, the signalfd fd can use
used together with epoll(2) too.
The read(2) system call will return a "struct signalfd_siginfo" structure in
the userspace supplied buffer. The return value is the number of bytes copied
in the supplied buffer, or -1 in case of error. The read(2) call can also
return 0, in case the sighand structure to which the signalfd was attached,
has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will
return -EAGAIN in case no signal is available.
If the size of the buffer passed to read(2) is lower than sizeof(struct
signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also
return -ERESTARTSYS in case a signal hits the process. The format of the
struct signalfd_siginfo is, and the valid fields depends of the (->code &
__SI_MASK) value, in the same way a struct siginfo would:
struct signalfd_siginfo {
__u32 signo; /* si_signo */
__s32 err; /* si_errno */
__s32 code; /* si_code */
__u32 pid; /* si_pid */
__u32 uid; /* si_uid */
__s32 fd; /* si_fd */
__u32 tid; /* si_fd */
__u32 band; /* si_band */
__u32 overrun; /* si_overrun */
__u32 trapno; /* si_trapno */
__s32 status; /* si_status */
__s32 svint; /* si_int */
__u64 svptr; /* si_ptr */
__u64 utime; /* si_utime */
__u64 stime; /* si_stime */
__u64 addr; /* si_addr */
};
[akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 05:23:13 +00:00
|
|
|
__cleanup_sighand(oldsighand);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
2007-10-17 06:27:22 +00:00
|
|
|
|
2017-12-14 23:32:41 +00:00
|
|
|
char *__get_task_comm(char *buf, size_t buf_size, struct task_struct *tsk)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
task_lock(tsk);
|
2022-01-20 02:08:22 +00:00
|
|
|
/* Always NUL terminated and zero-padded */
|
|
|
|
strscpy_pad(buf, tsk->comm, buf_size);
|
2005-04-16 22:20:36 +00:00
|
|
|
task_unlock(tsk);
|
2008-02-05 06:27:21 +00:00
|
|
|
return buf;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2017-12-14 23:32:41 +00:00
|
|
|
EXPORT_SYMBOL_GPL(__get_task_comm);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2012-08-21 13:56:33 +00:00
|
|
|
/*
|
|
|
|
* These functions flushes out all traces of the currently running executable
|
|
|
|
* so that a new one can be started
|
|
|
|
*/
|
|
|
|
|
2014-05-28 08:45:04 +00:00
|
|
|
void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
task_lock(tsk);
|
2012-01-10 23:08:09 +00:00
|
|
|
trace_task_rename(tsk, buf);
|
2022-01-20 02:08:19 +00:00
|
|
|
strscpy_pad(tsk->comm, buf, sizeof(tsk->comm));
|
2005-04-16 22:20:36 +00:00
|
|
|
task_unlock(tsk);
|
2014-05-28 08:45:04 +00:00
|
|
|
perf_event_comm(tsk, exec);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
exec: Correct comments about "point of no return"
In commit 221af7f87b97 ("Split 'flush_old_exec' into two functions"),
the comment about the point of no return should have stayed in
flush_old_exec() since it refers to "bprm->mm = NULL;" line, but prior
changes in commits c89681ed7d0e ("remove steal_locks()"), and
fd8328be874f ("sanitize handling of shared descriptor tables in failing
execve()") made it look like it meant the current->sas_ss_sp line instead.
The comment was referring to the fact that once bprm->mm is NULL, all
failures from a binfmt load_binary hook (e.g. load_elf_binary), will
get SEGV raised against current. Move this comment and expand the
explanation a bit, putting it above the assignment this time, and add
details about the true nature of "point of no return" being the call
to flush_old_exec() itself.
This also removes an erroneous commet about when credentials are being
installed. That has its own dedicated function, install_exec_creds(),
which carries a similar (and correct) comment, so remove the bogus comment
where installation is not actually happening.
Cc: David Howells <dhowells@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
2017-07-18 22:25:30 +00:00
|
|
|
/*
|
|
|
|
* Calling this is the point of no return. None of the failures will be
|
|
|
|
* seen by userspace since either the process is already taking a fatal
|
|
|
|
* signal (via de_thread() or coredump), or will have SEGV raised
|
2020-03-19 22:16:12 +00:00
|
|
|
* (after exec_mmap()) by search_binary_handler (see below).
|
exec: Correct comments about "point of no return"
In commit 221af7f87b97 ("Split 'flush_old_exec' into two functions"),
the comment about the point of no return should have stayed in
flush_old_exec() since it refers to "bprm->mm = NULL;" line, but prior
changes in commits c89681ed7d0e ("remove steal_locks()"), and
fd8328be874f ("sanitize handling of shared descriptor tables in failing
execve()") made it look like it meant the current->sas_ss_sp line instead.
The comment was referring to the fact that once bprm->mm is NULL, all
failures from a binfmt load_binary hook (e.g. load_elf_binary), will
get SEGV raised against current. Move this comment and expand the
explanation a bit, putting it above the assignment this time, and add
details about the true nature of "point of no return" being the call
to flush_old_exec() itself.
This also removes an erroneous commet about when credentials are being
installed. That has its own dedicated function, install_exec_creds(),
which carries a similar (and correct) comment, so remove the bogus comment
where installation is not actually happening.
Cc: David Howells <dhowells@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
2017-07-18 22:25:30 +00:00
|
|
|
*/
|
2020-05-03 12:54:10 +00:00
|
|
|
int begin_new_exec(struct linux_binprm * bprm)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2020-03-25 15:00:07 +00:00
|
|
|
struct task_struct *me = current;
|
Split 'flush_old_exec' into two functions
'flush_old_exec()' is the point of no return when doing an execve(), and
it is pretty badly misnamed. It doesn't just flush the old executable
environment, it also starts up the new one.
Which is very inconvenient for things like setting up the new
personality, because we want the new personality to affect the starting
of the new environment, but at the same time we do _not_ want the new
personality to take effect if flushing the old one fails.
As a result, the x86-64 '32-bit' personality is actually done using this
insane "I'm going to change the ABI, but I haven't done it yet" bit
(TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
personality, but just the "pending" bit, so that "flush_thread()" can do
the actual personality magic.
This patch in no way changes any of that insanity, but it does split the
'flush_old_exec()' function up into a preparatory part that can fail
(still called flush_old_exec()), and a new part that will actually set
up the new exec environment (setup_new_exec()). All callers are changed
to trivially comply with the new world order.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-29 06:14:42 +00:00
|
|
|
int retval;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2020-05-30 03:00:54 +00:00
|
|
|
/* Once we are committed compute the creds */
|
|
|
|
retval = bprm_creds_from_file(bprm);
|
|
|
|
if (retval)
|
|
|
|
return retval;
|
|
|
|
|
2024-04-11 10:20:57 +00:00
|
|
|
/*
|
|
|
|
* This tracepoint marks the point before flushing the old exec where
|
|
|
|
* the current task is still unchanged, but errors are fatal (point of
|
|
|
|
* no return). The later "sched_process_exec" tracepoint is called after
|
|
|
|
* the current task has successfully switched to the new exec.
|
|
|
|
*/
|
|
|
|
trace_sched_prepare_exec(current, bprm);
|
|
|
|
|
2020-04-04 17:01:37 +00:00
|
|
|
/*
|
|
|
|
* Ensure all future errors are fatal.
|
|
|
|
*/
|
|
|
|
bprm->point_of_no_return = true;
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
2020-03-25 15:00:21 +00:00
|
|
|
* Make this the only thread in the thread group.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
2020-03-25 15:00:07 +00:00
|
|
|
retval = de_thread(me);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (retval)
|
|
|
|
goto out;
|
|
|
|
|
2020-11-30 22:58:46 +00:00
|
|
|
/*
|
|
|
|
* Cancel any io_uring activity across execve
|
|
|
|
*/
|
|
|
|
io_uring_task_cancel();
|
|
|
|
|
2020-11-20 23:14:18 +00:00
|
|
|
/* Ensure the files table is not shared. */
|
2020-11-20 23:14:19 +00:00
|
|
|
retval = unshare_files();
|
2020-11-20 23:14:18 +00:00
|
|
|
if (retval)
|
|
|
|
goto out;
|
|
|
|
|
2015-04-16 19:47:59 +00:00
|
|
|
/*
|
|
|
|
* Must be called _before_ exec_mmap() as bprm->mm is
|
2023-08-14 17:21:40 +00:00
|
|
|
* not visible until then. Doing it here also ensures
|
|
|
|
* we don't race against replace_mm_exe_file().
|
2015-04-16 19:47:59 +00:00
|
|
|
*/
|
2021-04-23 08:29:59 +00:00
|
|
|
retval = set_mm_exe_file(bprm->mm, bprm->file);
|
|
|
|
if (retval)
|
|
|
|
goto out;
|
2015-04-16 19:47:59 +00:00
|
|
|
|
2020-05-14 20:17:40 +00:00
|
|
|
/* If the binary is not readable then enforce mm->dumpable=0 */
|
2020-05-16 21:29:20 +00:00
|
|
|
would_dump(bprm, bprm->file);
|
2020-05-14 20:17:40 +00:00
|
|
|
if (bprm->have_execfd)
|
|
|
|
would_dump(bprm, bprm->executable);
|
2020-05-16 21:29:20 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* Release all of the old mmap stuff
|
|
|
|
*/
|
2010-11-30 19:55:34 +00:00
|
|
|
acct_arg_size(bprm, 0);
|
2005-04-16 22:20:36 +00:00
|
|
|
retval = exec_mmap(bprm->mm);
|
|
|
|
if (retval)
|
2008-04-22 09:11:59 +00:00
|
|
|
goto out;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
exec: Correct comments about "point of no return"
In commit 221af7f87b97 ("Split 'flush_old_exec' into two functions"),
the comment about the point of no return should have stayed in
flush_old_exec() since it refers to "bprm->mm = NULL;" line, but prior
changes in commits c89681ed7d0e ("remove steal_locks()"), and
fd8328be874f ("sanitize handling of shared descriptor tables in failing
execve()") made it look like it meant the current->sas_ss_sp line instead.
The comment was referring to the fact that once bprm->mm is NULL, all
failures from a binfmt load_binary hook (e.g. load_elf_binary), will
get SEGV raised against current. Move this comment and expand the
explanation a bit, putting it above the assignment this time, and add
details about the true nature of "point of no return" being the call
to flush_old_exec() itself.
This also removes an erroneous commet about when credentials are being
installed. That has its own dedicated function, install_exec_creds(),
which carries a similar (and correct) comment, so remove the bogus comment
where installation is not actually happening.
Cc: David Howells <dhowells@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
2017-07-18 22:25:30 +00:00
|
|
|
bprm->mm = NULL;
|
2010-02-02 20:37:44 +00:00
|
|
|
|
2022-09-21 00:31:19 +00:00
|
|
|
retval = exec_task_namespaces();
|
|
|
|
if (retval)
|
|
|
|
goto out_unlock;
|
|
|
|
|
2020-03-25 15:03:21 +00:00
|
|
|
#ifdef CONFIG_POSIX_TIMERS
|
2022-08-09 17:07:51 +00:00
|
|
|
spin_lock_irq(&me->sighand->siglock);
|
|
|
|
posix_cpu_timers_exit(me);
|
|
|
|
spin_unlock_irq(&me->sighand->siglock);
|
2022-07-11 16:16:25 +00:00
|
|
|
exit_itimers(me);
|
2020-03-25 15:03:21 +00:00
|
|
|
flush_itimer_signals();
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Make the signal table private.
|
|
|
|
*/
|
|
|
|
retval = unshare_sighand(me);
|
|
|
|
if (retval)
|
2020-04-02 23:04:54 +00:00
|
|
|
goto out_unlock;
|
2020-03-25 15:03:21 +00:00
|
|
|
|
2022-04-11 19:15:54 +00:00
|
|
|
me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC |
|
2014-01-23 23:55:57 +00:00
|
|
|
PF_NOFREEZE | PF_NO_SETAFFINITY);
|
2010-02-02 20:37:44 +00:00
|
|
|
flush_thread();
|
2020-03-25 15:00:07 +00:00
|
|
|
me->personality &= ~bprm->per_clear;
|
2010-02-02 20:37:44 +00:00
|
|
|
|
kernel: Implement selective syscall userspace redirection
Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS. This is useful
for processes with parts that require syscall redirection and parts that
don't, but who need to perform this boundary crossing really fast,
without paying the cost of a system call to reconfigure syscall handling
on each boundary transition. This is particularly important for Windows
games running over Wine.
The proposed interface looks like this:
prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])
The range [<offset>,<offset>+<length>) is a part of the process memory
map that is allowed to by-pass the redirection code and dispatch
syscalls directly, such that in fast paths a process doesn't need to
disable the trap nor the kernel has to check the selector. This is
essential to return from SIGSYS to a blocked area without triggering
another SIGSYS from rt_sigreturn.
selector is an optional pointer to a char-sized userspace memory region
that has a key switch for the mechanism. This key switch is set to
either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the
redirection without calling the kernel.
The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.
Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support. I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism. And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.
For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only the direct dispatcher region to issue syscalls is
pretty much irrelevant. The overhead of using the selector goes around
40ns for a native (unredirected) syscall in my system, and it is (as
expected) dominated by the supervisor-mode user-address access. In
fact, with SMAP off, the overhead is consistently less than 5ns on my
test box.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com
2020-11-27 19:32:34 +00:00
|
|
|
clear_syscall_work_syscall_user_dispatch(me);
|
|
|
|
|
2016-12-21 05:26:24 +00:00
|
|
|
/*
|
|
|
|
* We have to apply CLOEXEC before we change whether the process is
|
|
|
|
* dumpable (in setup_new_exec) to avoid a race with a process in userspace
|
|
|
|
* trying to access the should-be-closed file descriptors of a process
|
|
|
|
* undergoing exec(2).
|
|
|
|
*/
|
2020-03-25 15:00:07 +00:00
|
|
|
do_close_on_exec(me->files);
|
2016-11-17 04:06:51 +00:00
|
|
|
|
2017-07-18 22:25:35 +00:00
|
|
|
if (bprm->secureexec) {
|
2017-07-18 22:25:36 +00:00
|
|
|
/* Make sure parent cannot signal privileged process. */
|
2020-04-02 23:35:14 +00:00
|
|
|
me->pdeath_signal = 0;
|
2017-07-18 22:25:36 +00:00
|
|
|
|
2017-07-18 22:25:35 +00:00
|
|
|
/*
|
|
|
|
* For secureexec, reset the stack limit to sane default to
|
|
|
|
* avoid bad behavior from the prior rlimits. This has to
|
|
|
|
* happen before arch_pick_mmap_layout(), which examines
|
|
|
|
* RLIMIT_STACK, but after the point of no return to avoid
|
2017-12-12 19:28:38 +00:00
|
|
|
* needing to clean up the change on failure.
|
2017-07-18 22:25:35 +00:00
|
|
|
*/
|
2018-04-10 23:35:01 +00:00
|
|
|
if (bprm->rlim_stack.rlim_cur > _STK_LIM)
|
|
|
|
bprm->rlim_stack.rlim_cur = _STK_LIM;
|
2017-07-18 22:25:35 +00:00
|
|
|
}
|
|
|
|
|
2020-04-02 23:35:14 +00:00
|
|
|
me->sas_ss_sp = me->sas_ss_size = 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2018-01-02 23:21:33 +00:00
|
|
|
/*
|
|
|
|
* Figure out dumpability. Note that this checking only of current
|
|
|
|
* is wrong, but userspace depends on it. This should be testing
|
|
|
|
* bprm->secureexec instead.
|
|
|
|
*/
|
2017-07-18 22:25:34 +00:00
|
|
|
if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
|
2018-01-02 23:21:33 +00:00
|
|
|
!(uid_eq(current_euid(), current_uid()) &&
|
|
|
|
gid_eq(current_egid(), current_gid())))
|
2007-07-19 08:48:27 +00:00
|
|
|
set_dumpable(current->mm, suid_dumpable);
|
2017-07-18 22:25:34 +00:00
|
|
|
else
|
|
|
|
set_dumpable(current->mm, SUID_DUMP_USER);
|
2005-06-23 07:09:43 +00:00
|
|
|
|
2014-05-21 15:32:19 +00:00
|
|
|
perf_event_exec();
|
2020-04-02 23:35:14 +00:00
|
|
|
__set_task_comm(me, kbasename(bprm->filename), true);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/* An exec changes our domain. We are no longer part of the thread
|
|
|
|
group */
|
2020-04-02 23:35:14 +00:00
|
|
|
WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
|
|
|
|
flush_signal_handlers(me, 0);
|
2020-05-03 11:48:17 +00:00
|
|
|
|
2021-04-22 12:27:09 +00:00
|
|
|
retval = set_cred_ucounts(bprm->cred);
|
|
|
|
if (retval < 0)
|
|
|
|
goto out_unlock;
|
|
|
|
|
2020-05-03 11:48:17 +00:00
|
|
|
/*
|
|
|
|
* install the new credentials for this executable
|
|
|
|
*/
|
|
|
|
security_bprm_committing_creds(bprm);
|
|
|
|
|
|
|
|
commit_creds(bprm->cred);
|
|
|
|
bprm->cred = NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Disable monitoring for regular users
|
|
|
|
* when executing setuid binaries. Must
|
|
|
|
* wait until new credentials are committed
|
|
|
|
* by commit_creds() above
|
|
|
|
*/
|
2020-04-02 23:35:14 +00:00
|
|
|
if (get_dumpable(me->mm) != SUID_DUMP_USER)
|
|
|
|
perf_event_exit_task(me);
|
2020-05-03 11:48:17 +00:00
|
|
|
/*
|
|
|
|
* cred_guard_mutex must be held at least to this point to prevent
|
|
|
|
* ptrace_attach() from altering our determination of the task's
|
|
|
|
* credentials; any time after this it may be unlocked.
|
|
|
|
*/
|
|
|
|
security_bprm_committed_creds(bprm);
|
2020-05-14 20:17:40 +00:00
|
|
|
|
|
|
|
/* Pass the opened binary to the interpreter. */
|
|
|
|
if (bprm->have_execfd) {
|
|
|
|
retval = get_unused_fd_flags(0);
|
|
|
|
if (retval < 0)
|
|
|
|
goto out_unlock;
|
|
|
|
fd_install(retval, bprm->executable);
|
|
|
|
bprm->executable = NULL;
|
|
|
|
bprm->execfd = retval;
|
|
|
|
}
|
Split 'flush_old_exec' into two functions
'flush_old_exec()' is the point of no return when doing an execve(), and
it is pretty badly misnamed. It doesn't just flush the old executable
environment, it also starts up the new one.
Which is very inconvenient for things like setting up the new
personality, because we want the new personality to affect the starting
of the new environment, but at the same time we do _not_ want the new
personality to take effect if flushing the old one fails.
As a result, the x86-64 '32-bit' personality is actually done using this
insane "I'm going to change the ABI, but I haven't done it yet" bit
(TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
personality, but just the "pending" bit, so that "flush_thread()" can do
the actual personality magic.
This patch in no way changes any of that insanity, but it does split the
'flush_old_exec()' function up into a preparatory part that can fail
(still called flush_old_exec()), and a new part that will actually set
up the new exec environment (setup_new_exec()). All callers are changed
to trivially comply with the new world order.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-29 06:14:42 +00:00
|
|
|
return 0;
|
|
|
|
|
2020-05-03 12:15:28 +00:00
|
|
|
out_unlock:
|
2020-12-03 20:12:00 +00:00
|
|
|
up_write(&me->signal->exec_update_lock);
|
2024-01-22 18:34:21 +00:00
|
|
|
if (!bprm->cred)
|
|
|
|
mutex_unlock(&me->signal->cred_guard_mutex);
|
|
|
|
|
Split 'flush_old_exec' into two functions
'flush_old_exec()' is the point of no return when doing an execve(), and
it is pretty badly misnamed. It doesn't just flush the old executable
environment, it also starts up the new one.
Which is very inconvenient for things like setting up the new
personality, because we want the new personality to affect the starting
of the new environment, but at the same time we do _not_ want the new
personality to take effect if flushing the old one fails.
As a result, the x86-64 '32-bit' personality is actually done using this
insane "I'm going to change the ABI, but I haven't done it yet" bit
(TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
personality, but just the "pending" bit, so that "flush_thread()" can do
the actual personality magic.
This patch in no way changes any of that insanity, but it does split the
'flush_old_exec()' function up into a preparatory part that can fail
(still called flush_old_exec()), and a new part that will actually set
up the new exec environment (setup_new_exec()). All callers are changed
to trivially comply with the new world order.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-29 06:14:42 +00:00
|
|
|
out:
|
|
|
|
return retval;
|
|
|
|
}
|
2020-05-03 12:54:10 +00:00
|
|
|
EXPORT_SYMBOL(begin_new_exec);
|
Split 'flush_old_exec' into two functions
'flush_old_exec()' is the point of no return when doing an execve(), and
it is pretty badly misnamed. It doesn't just flush the old executable
environment, it also starts up the new one.
Which is very inconvenient for things like setting up the new
personality, because we want the new personality to affect the starting
of the new environment, but at the same time we do _not_ want the new
personality to take effect if flushing the old one fails.
As a result, the x86-64 '32-bit' personality is actually done using this
insane "I'm going to change the ABI, but I haven't done it yet" bit
(TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
personality, but just the "pending" bit, so that "flush_thread()" can do
the actual personality magic.
This patch in no way changes any of that insanity, but it does split the
'flush_old_exec()' function up into a preparatory part that can fail
(still called flush_old_exec()), and a new part that will actually set
up the new exec environment (setup_new_exec()). All callers are changed
to trivially comply with the new world order.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-29 06:14:42 +00:00
|
|
|
|
2011-06-19 16:49:47 +00:00
|
|
|
void would_dump(struct linux_binprm *bprm, struct file *file)
|
|
|
|
{
|
2016-11-17 04:06:51 +00:00
|
|
|
struct inode *inode = file_inode(file);
|
2023-01-13 11:49:22 +00:00
|
|
|
struct mnt_idmap *idmap = file_mnt_idmap(file);
|
|
|
|
if (inode_permission(idmap, inode, MAY_READ) < 0) {
|
2016-11-17 04:06:51 +00:00
|
|
|
struct user_namespace *old, *user_ns;
|
2011-06-19 16:49:47 +00:00
|
|
|
bprm->interp_flags |= BINPRM_FLAGS_ENFORCE_NONDUMP;
|
2016-11-17 04:06:51 +00:00
|
|
|
|
|
|
|
/* Ensure mm->user_ns contains the executable */
|
|
|
|
user_ns = old = bprm->mm->user_ns;
|
|
|
|
while ((user_ns != &init_user_ns) &&
|
2023-01-13 11:49:27 +00:00
|
|
|
!privileged_wrt_inode_uidgid(user_ns, idmap, inode))
|
2016-11-17 04:06:51 +00:00
|
|
|
user_ns = user_ns->parent;
|
|
|
|
|
|
|
|
if (old != user_ns) {
|
|
|
|
bprm->mm->user_ns = get_user_ns(user_ns);
|
|
|
|
put_user_ns(old);
|
|
|
|
}
|
|
|
|
}
|
2011-06-19 16:49:47 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(would_dump);
|
|
|
|
|
Split 'flush_old_exec' into two functions
'flush_old_exec()' is the point of no return when doing an execve(), and
it is pretty badly misnamed. It doesn't just flush the old executable
environment, it also starts up the new one.
Which is very inconvenient for things like setting up the new
personality, because we want the new personality to affect the starting
of the new environment, but at the same time we do _not_ want the new
personality to take effect if flushing the old one fails.
As a result, the x86-64 '32-bit' personality is actually done using this
insane "I'm going to change the ABI, but I haven't done it yet" bit
(TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
personality, but just the "pending" bit, so that "flush_thread()" can do
the actual personality magic.
This patch in no way changes any of that insanity, but it does split the
'flush_old_exec()' function up into a preparatory part that can fail
(still called flush_old_exec()), and a new part that will actually set
up the new exec environment (setup_new_exec()). All callers are changed
to trivially comply with the new world order.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-29 06:14:42 +00:00
|
|
|
void setup_new_exec(struct linux_binprm * bprm)
|
|
|
|
{
|
2020-05-03 12:15:28 +00:00
|
|
|
/* Setup things that can depend upon the personality */
|
|
|
|
struct task_struct *me = current;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2020-05-03 12:15:28 +00:00
|
|
|
arch_pick_mmap_layout(me->mm, &bprm->rlim_stack);
|
2005-06-23 07:09:43 +00:00
|
|
|
|
2017-03-20 08:16:26 +00:00
|
|
|
arch_setup_new_exec();
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-03-01 00:59:19 +00:00
|
|
|
/* Set the new mm task size. We have to do that late because it may
|
|
|
|
* depend on TIF_32BIT which is only updated in flush_thread() on
|
|
|
|
* some architectures like powerpc
|
|
|
|
*/
|
2020-05-03 12:15:28 +00:00
|
|
|
me->mm->task_size = TASK_SIZE;
|
2020-12-03 20:12:00 +00:00
|
|
|
up_write(&me->signal->exec_update_lock);
|
2020-04-02 23:35:14 +00:00
|
|
|
mutex_unlock(&me->signal->cred_guard_mutex);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
Split 'flush_old_exec' into two functions
'flush_old_exec()' is the point of no return when doing an execve(), and
it is pretty badly misnamed. It doesn't just flush the old executable
environment, it also starts up the new one.
Which is very inconvenient for things like setting up the new
personality, because we want the new personality to affect the starting
of the new environment, but at the same time we do _not_ want the new
personality to take effect if flushing the old one fails.
As a result, the x86-64 '32-bit' personality is actually done using this
insane "I'm going to change the ABI, but I haven't done it yet" bit
(TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
personality, but just the "pending" bit, so that "flush_thread()" can do
the actual personality magic.
This patch in no way changes any of that insanity, but it does split the
'flush_old_exec()' function up into a preparatory part that can fail
(still called flush_old_exec()), and a new part that will actually set
up the new exec environment (setup_new_exec()). All callers are changed
to trivially comply with the new world order.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-29 06:14:42 +00:00
|
|
|
EXPORT_SYMBOL(setup_new_exec);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2018-04-10 23:34:57 +00:00
|
|
|
/* Runs immediately before start_thread() takes over. */
|
|
|
|
void finalize_exec(struct linux_binprm *bprm)
|
|
|
|
{
|
2018-04-10 23:35:01 +00:00
|
|
|
/* Store any stack rlimit changes before starting thread. */
|
|
|
|
task_lock(current->group_leader);
|
|
|
|
current->signal->rlim[RLIMIT_STACK] = bprm->rlim_stack;
|
|
|
|
task_unlock(current->group_leader);
|
2018-04-10 23:34:57 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(finalize_exec);
|
|
|
|
|
2009-09-05 18:17:13 +00:00
|
|
|
/*
|
|
|
|
* Prepare credentials and lock ->cred_guard_mutex.
|
2020-05-03 11:48:17 +00:00
|
|
|
* setup_new_exec() commits the new creds and drops the lock.
|
2021-02-24 20:00:48 +00:00
|
|
|
* Or, if exec fails before, free_bprm() should release ->cred
|
2009-09-05 18:17:13 +00:00
|
|
|
* and unlock.
|
|
|
|
*/
|
2018-12-10 07:49:54 +00:00
|
|
|
static int prepare_bprm_creds(struct linux_binprm *bprm)
|
2009-09-05 18:17:13 +00:00
|
|
|
{
|
2010-10-27 22:34:08 +00:00
|
|
|
if (mutex_lock_interruptible(¤t->signal->cred_guard_mutex))
|
2009-09-05 18:17:13 +00:00
|
|
|
return -ERESTARTNOINTR;
|
|
|
|
|
|
|
|
bprm->cred = prepare_exec_creds();
|
|
|
|
if (likely(bprm->cred))
|
|
|
|
return 0;
|
|
|
|
|
2010-10-27 22:34:08 +00:00
|
|
|
mutex_unlock(¤t->signal->cred_guard_mutex);
|
2009-09-05 18:17:13 +00:00
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
2022-09-17 00:11:18 +00:00
|
|
|
/* Matches do_open_execat() */
|
|
|
|
static void do_close_execat(struct file *file)
|
|
|
|
{
|
fs: don't block i_writecount during exec
Back in 2021 we already discussed removing deny_write_access() for
executables. Back then I was hesistant because I thought that this might
cause issues in userspace. But even back then I had started taking some
notes on what could potentially depend on this and I didn't come up with
a lot so I've changed my mind and I would like to try this.
Here are some of the notes that I took:
(1) The deny_write_access() mechanism is causing really pointless issues
such as [1]. If a thread in a thread-group opens a file writable,
then writes some stuff, then closing the file descriptor and then
calling execve() they can fail the execve() with ETXTBUSY because
another thread in the thread-group could have concurrently called
fork(). Multi-threaded libraries such as go suffer from this.
(2) There are userspace attacks that rely on overwriting the binary of a
running process. These attacks are _mitigated_ but _not at all
prevented_ from ocurring by the deny_write_access() mechanism.
I'll go over some details. The clearest example of such attacks was
the attack against runC in CVE-2019-5736 (cf. [3]).
An attack could compromise the runC host binary from inside a
_privileged_ runC container. The malicious binary could then be used
to take over the host.
(It is crucial to note that this attack is _not_ possible with
unprivileged containers. IOW, the setup here is already insecure.)
The attack can be made when attaching to a running container or when
starting a container running a specially crafted image. For example,
when runC attaches to a container the attacker can trick it into
executing itself.
This could be done by replacing the target binary inside the
container with a custom binary pointing back at the runC binary
itself. As an example, if the target binary was /bin/bash, this
could be replaced with an executable script specifying the
interpreter path #!/proc/self/exe.
As such when /bin/bash is executed inside the container, instead the
target of /proc/self/exe will be executed. That magic link will
point to the runc binary on the host. The attacker can then proceed
to write to the target of /proc/self/exe to try and overwrite the
runC binary on the host.
However, this will not succeed because of deny_write_access(). Now,
one might think that this would prevent the attack but it doesn't.
To overcome this, the attacker has multiple ways:
* Open a file descriptor to /proc/self/exe using the O_PATH flag and
then proceed to reopen the binary as O_WRONLY through
/proc/self/fd/<nr> and try to write to it in a busy loop from a
separate process. Ultimately it will succeed when the runC binary
exits. After this the runC binary is compromised and can be used
to attack other containers or the host itself.
* Use a malicious shared library annotating a function in there with
the constructor attribute making the malicious function run as an
initializor. The malicious library will then open /proc/self/exe
for creating a new entry under /proc/self/fd/<nr>. It'll then call
exec to a) force runC to exit and b) hand the file descriptor off
to a program that then reopens /proc/self/fd/<nr> for writing
(which is now possible because runC has exited) and overwriting
that binary.
To sum up: the deny_write_access() mechanism doesn't prevent such
attacks in insecure setups. It just makes them minimally harder.
That's all.
The only way back then to prevent this is to create a temporary copy
of the calling binary itself when it starts or attaches to
containers. So what I did back then for LXC (and Aleksa for runC)
was to create an anonymous, in-memory file using the memfd_create()
system call and to copy itself into the temporary in-memory file,
which is then sealed to prevent further modifications. This sealed,
in-memory file copy is then executed instead of the original on-disk
binary.
Any compromising write operations from a privileged container to the
host binary will then write to the temporary in-memory binary and
not to the host binary on-disk, preserving the integrity of the host
binary. Also as the temporary, in-memory binary is sealed, writes to
this will also fail.
The point is that deny_write_access() is uselss to prevent these
attacks.
(3) Denying write access to an inode because it's currently used in an
exec path could easily be done on an LSM level. It might need an
additional hook but that should be about it.
(4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time
ago so while we do protect the main executable the bigger portion of
the things you'd think need protecting such as the shared libraries
aren't. IOW, we let anyone happily overwrite shared libraries.
(5) We removed all remaining uses of VM_DENYWRITE in [2]. That means:
(5.1) We removed the legacy uselib() protection for preventing
overwriting of shared libraries. Nobody cared in 3 years.
(5.2) We allow write access to the elf interpreter after exec
completed treating it on a par with shared libraries.
Yes, someone in userspace could potentially be relying on this. It's not
completely out of the realm of possibility but let's find out if that's
actually the case and not guess.
Link: https://github.com/golang/go/issues/22315 [1]
Link: 49624efa65ac ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2]
Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3]
Link: https://lwn.net/Articles/866493
Link: https://github.com/golang/go/issues/22220
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61
Link: https://github.com/buildkite/agent/pull/2736
Link: https://github.com/rust-lang/rust/issues/114554
Link: https://bugs.openjdk.org/browse/JDK-8068370
Link: https://github.com/dotnet/runtime/issues/58964
Link: https://lore.kernel.org/r/20240531-vfs-i_writecount-v1-1-a17bea7ee36b@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-31 13:01:43 +00:00
|
|
|
if (file)
|
|
|
|
fput(file);
|
2022-09-17 00:11:18 +00:00
|
|
|
}
|
|
|
|
|
2014-02-05 20:54:53 +00:00
|
|
|
static void free_bprm(struct linux_binprm *bprm)
|
2009-09-05 18:17:13 +00:00
|
|
|
{
|
2020-07-10 20:54:54 +00:00
|
|
|
if (bprm->mm) {
|
|
|
|
acct_arg_size(bprm, 0);
|
|
|
|
mmput(bprm->mm);
|
|
|
|
}
|
2009-09-05 18:17:13 +00:00
|
|
|
free_arg_pages(bprm);
|
|
|
|
if (bprm->cred) {
|
2010-10-27 22:34:08 +00:00
|
|
|
mutex_unlock(¤t->signal->cred_guard_mutex);
|
2009-09-05 18:17:13 +00:00
|
|
|
abort_creds(bprm->cred);
|
|
|
|
}
|
2022-09-17 00:11:18 +00:00
|
|
|
do_close_execat(bprm->file);
|
2020-05-14 20:17:40 +00:00
|
|
|
if (bprm->executable)
|
|
|
|
fput(bprm->executable);
|
exec: do not leave bprm->interp on stack
If a series of scripts are executed, each triggering module loading via
unprintable bytes in the script header, kernel stack contents can leak
into the command line.
Normally execution of binfmt_script and binfmt_misc happens recursively.
However, when modules are enabled, and unprintable bytes exist in the
bprm->buf, execution will restart after attempting to load matching
binfmt modules. Unfortunately, the logic in binfmt_script and
binfmt_misc does not expect to get restarted. They leave bprm->interp
pointing to their local stack. This means on restart bprm->interp is
left pointing into unused stack memory which can then be copied into the
userspace argv areas.
After additional study, it seems that both recursion and restart remains
the desirable way to handle exec with scripts, misc, and modules. As
such, we need to protect the changes to interp.
This changes the logic to require allocation for any changes to the
bprm->interp. To avoid adding a new kmalloc to every exec, the default
value is left as-is. Only when passing through binfmt_script or
binfmt_misc does an allocation take place.
For a proof of concept, see DoTest.sh from:
http://www.halfdog.net/Security/2012/LinuxKernelBinfmtScriptStackDataDisclosure/
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: halfdog <me@halfdog.net>
Cc: P J P <ppandit@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 23:05:16 +00:00
|
|
|
/* If a binfmt changed the interp, free it. */
|
|
|
|
if (bprm->interp != bprm->filename)
|
|
|
|
kfree(bprm->interp);
|
2020-07-11 13:16:15 +00:00
|
|
|
kfree(bprm->fdpath);
|
2009-09-05 18:17:13 +00:00
|
|
|
kfree(bprm);
|
|
|
|
}
|
|
|
|
|
2024-01-09 00:43:04 +00:00
|
|
|
static struct linux_binprm *alloc_bprm(int fd, struct filename *filename, int flags)
|
2020-07-10 20:39:45 +00:00
|
|
|
{
|
2024-01-09 00:43:04 +00:00
|
|
|
struct linux_binprm *bprm;
|
|
|
|
struct file *file;
|
2020-07-11 13:16:15 +00:00
|
|
|
int retval = -ENOMEM;
|
2024-01-09 00:43:04 +00:00
|
|
|
|
|
|
|
file = do_open_execat(fd, filename, flags);
|
|
|
|
if (IS_ERR(file))
|
|
|
|
return ERR_CAST(file);
|
|
|
|
|
|
|
|
bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
|
|
|
|
if (!bprm) {
|
2022-09-17 00:11:18 +00:00
|
|
|
do_close_execat(file);
|
2024-01-09 00:43:04 +00:00
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
}
|
|
|
|
|
|
|
|
bprm->file = file;
|
2020-07-11 13:16:15 +00:00
|
|
|
|
|
|
|
if (fd == AT_FDCWD || filename->name[0] == '/') {
|
|
|
|
bprm->filename = filename->name;
|
|
|
|
} else {
|
|
|
|
if (filename->name[0] == '\0')
|
|
|
|
bprm->fdpath = kasprintf(GFP_KERNEL, "/dev/fd/%d", fd);
|
|
|
|
else
|
|
|
|
bprm->fdpath = kasprintf(GFP_KERNEL, "/dev/fd/%d/%s",
|
|
|
|
fd, filename->name);
|
|
|
|
if (!bprm->fdpath)
|
|
|
|
goto out_free;
|
|
|
|
|
2024-01-09 00:43:04 +00:00
|
|
|
/*
|
|
|
|
* Record that a name derived from an O_CLOEXEC fd will be
|
|
|
|
* inaccessible after exec. This allows the code in exec to
|
|
|
|
* choose to fail when the executable is not mmaped into the
|
|
|
|
* interpreter and an open file descriptor is not passed to
|
|
|
|
* the interpreter. This makes for a better user experience
|
|
|
|
* than having the interpreter start and then immediately fail
|
|
|
|
* when it finds the executable is inaccessible.
|
|
|
|
*/
|
|
|
|
if (get_close_on_exec(fd))
|
|
|
|
bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
|
|
|
|
|
2020-07-11 13:16:15 +00:00
|
|
|
bprm->filename = bprm->fdpath;
|
|
|
|
}
|
|
|
|
bprm->interp = bprm->filename;
|
2020-07-10 20:54:54 +00:00
|
|
|
|
|
|
|
retval = bprm_mm_init(bprm);
|
2024-01-09 00:43:04 +00:00
|
|
|
if (!retval)
|
|
|
|
return bprm;
|
2020-07-11 13:16:15 +00:00
|
|
|
|
|
|
|
out_free:
|
|
|
|
free_bprm(bprm);
|
|
|
|
return ERR_PTR(retval);
|
2020-07-10 20:39:45 +00:00
|
|
|
}
|
|
|
|
|
2017-10-03 23:15:42 +00:00
|
|
|
int bprm_change_interp(const char *interp, struct linux_binprm *bprm)
|
exec: do not leave bprm->interp on stack
If a series of scripts are executed, each triggering module loading via
unprintable bytes in the script header, kernel stack contents can leak
into the command line.
Normally execution of binfmt_script and binfmt_misc happens recursively.
However, when modules are enabled, and unprintable bytes exist in the
bprm->buf, execution will restart after attempting to load matching
binfmt modules. Unfortunately, the logic in binfmt_script and
binfmt_misc does not expect to get restarted. They leave bprm->interp
pointing to their local stack. This means on restart bprm->interp is
left pointing into unused stack memory which can then be copied into the
userspace argv areas.
After additional study, it seems that both recursion and restart remains
the desirable way to handle exec with scripts, misc, and modules. As
such, we need to protect the changes to interp.
This changes the logic to require allocation for any changes to the
bprm->interp. To avoid adding a new kmalloc to every exec, the default
value is left as-is. Only when passing through binfmt_script or
binfmt_misc does an allocation take place.
For a proof of concept, see DoTest.sh from:
http://www.halfdog.net/Security/2012/LinuxKernelBinfmtScriptStackDataDisclosure/
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: halfdog <me@halfdog.net>
Cc: P J P <ppandit@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 23:05:16 +00:00
|
|
|
{
|
|
|
|
/* If a binfmt changed the interp, free it first. */
|
|
|
|
if (bprm->interp != bprm->filename)
|
|
|
|
kfree(bprm->interp);
|
|
|
|
bprm->interp = kstrdup(interp, GFP_KERNEL);
|
|
|
|
if (!bprm->interp)
|
|
|
|
return -ENOMEM;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(bprm_change_interp);
|
|
|
|
|
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13 23:39:24 +00:00
|
|
|
/*
|
|
|
|
* determine how safe it is to execute the proposed program
|
2010-10-27 22:34:08 +00:00
|
|
|
* - the caller must hold ->cred_guard_mutex to protect against
|
2014-06-05 07:23:17 +00:00
|
|
|
* PTRACE_ATTACH or seccomp thread-sync
|
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13 23:39:24 +00:00
|
|
|
*/
|
2014-01-23 23:55:50 +00:00
|
|
|
static void check_unsafe_exec(struct linux_binprm *bprm)
|
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13 23:39:24 +00:00
|
|
|
{
|
CRED: Fix SUID exec regression
The patch:
commit a6f76f23d297f70e2a6b3ec607f7aeeea9e37e8d
CRED: Make execve() take advantage of copy-on-write credentials
moved the place in which the 'safeness' of a SUID/SGID exec was performed to
before de_thread() was called. This means that LSM_UNSAFE_SHARE is now
calculated incorrectly. This flag is set if any of the usage counts for
fs_struct, files_struct and sighand_struct are greater than 1 at the time the
determination is made. All of which are true for threads created by the
pthread library.
However, since we wish to make the security calculation before irrevocably
damaging the process so that we can return it an error code in the case where
we decide we want to reject the exec request on this basis, we have to make the
determination before calling de_thread().
So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
(CLONE_SIGHAND/CLONE_THREAD) with us. These will be killed by de_thread() and
so can be discounted by check_unsafe_exec().
We do have to be careful because CLONE_THREAD does not imply FS or FILES.
We _assume_ that there will be no extra references to these structs held by the
threads we're going to kill.
This can be tested with the attached pair of programs. Build the two programs
using the Makefile supplied, and run ./test1 as a non-root user. If
successful, you should see something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=0 suid=0
SUCCESS - Correct effective user ID
and if unsuccessful, something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=4043 suid=4043
ERROR - Incorrect effective user ID!
The non-root user ID you see will depend on the user you run as.
[test1.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
static void *thread_func(void *arg)
{
while (1) {}
}
int main(int argc, char **argv)
{
pthread_t tid;
uid_t uid, euid, suid;
printf("--TEST1--\n");
getresuid(&uid, &euid, &suid);
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (pthread_create(&tid, NULL, thread_func, NULL) < 0) {
perror("pthread_create");
exit(1);
}
printf("exec ./test2\n");
execlp("./test2", "test2", NULL);
perror("./test2");
_exit(1);
}
[test2.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main(int argc, char **argv)
{
uid_t uid, euid, suid;
getresuid(&uid, &euid, &suid);
printf("--TEST2--\n");
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (euid != 0) {
fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
exit(1);
}
printf("SUCCESS - Correct effective user ID\n");
exit(0);
}
[Makefile]
CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
all: test1 test2
test1: test1.c
gcc $(CFLAGS) -o test1 test1.c -lpthread
test2: test2.c
gcc $(CFLAGS) -o test2 test2.c
sudo chown root.root test2
sudo chmod +s test2
Reported-by: David Smith <dsmith@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David Smith <dsmith@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-02-06 11:45:46 +00:00
|
|
|
struct task_struct *p = current, *t;
|
2009-03-30 11:35:18 +00:00
|
|
|
unsigned n_fs;
|
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13 23:39:24 +00:00
|
|
|
|
2017-01-23 04:26:31 +00:00
|
|
|
if (p->ptrace)
|
|
|
|
bprm->unsafe |= LSM_UNSAFE_PTRACE;
|
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13 23:39:24 +00:00
|
|
|
|
Add PR_{GET,SET}_NO_NEW_PRIVS to prevent execve from granting privs
With this change, calling
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
disables privilege granting operations at execve-time. For example, a
process will not be able to execute a setuid binary to change their uid
or gid if this bit is set. The same is true for file capabilities.
Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that
LSMs respect the requested behavior.
To determine if the NO_NEW_PRIVS bit is set, a task may call
prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
It returns 1 if set and 0 if it is not set. If any of the arguments are
non-zero, it will return -1 and set errno to -EINVAL.
(PR_SET_NO_NEW_PRIVS behaves similarly.)
This functionality is desired for the proposed seccomp filter patch
series. By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the
system call behavior for itself and its child tasks without being
able to impact the behavior of a more privileged task.
Another potential use is making certain privileged operations
unprivileged. For example, chroot may be considered "safe" if it cannot
affect privileged tasks.
Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is
set and AppArmor is in use. It is fixed in a subsequent patch.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Eric Paris <eparis@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
v18: updated change desc
v17: using new define values as per 3.4
Signed-off-by: James Morris <james.l.morris@oracle.com>
2012-04-12 21:47:50 +00:00
|
|
|
/*
|
|
|
|
* This isn't strictly necessary, but it makes it harder for LSMs to
|
|
|
|
* mess up.
|
|
|
|
*/
|
2014-05-21 22:23:46 +00:00
|
|
|
if (task_no_new_privs(current))
|
Add PR_{GET,SET}_NO_NEW_PRIVS to prevent execve from granting privs
With this change, calling
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
disables privilege granting operations at execve-time. For example, a
process will not be able to execute a setuid binary to change their uid
or gid if this bit is set. The same is true for file capabilities.
Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that
LSMs respect the requested behavior.
To determine if the NO_NEW_PRIVS bit is set, a task may call
prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
It returns 1 if set and 0 if it is not set. If any of the arguments are
non-zero, it will return -1 and set errno to -EINVAL.
(PR_SET_NO_NEW_PRIVS behaves similarly.)
This functionality is desired for the proposed seccomp filter patch
series. By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the
system call behavior for itself and its child tasks without being
able to impact the behavior of a more privileged task.
Another potential use is making certain privileged operations
unprivileged. For example, chroot may be considered "safe" if it cannot
affect privileged tasks.
Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is
set and AppArmor is in use. It is fixed in a subsequent patch.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Eric Paris <eparis@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
v18: updated change desc
v17: using new define values as per 3.4
Signed-off-by: James Morris <james.l.morris@oracle.com>
2012-04-12 21:47:50 +00:00
|
|
|
bprm->unsafe |= LSM_UNSAFE_NO_NEW_PRIVS;
|
|
|
|
|
2022-10-18 07:17:24 +00:00
|
|
|
/*
|
|
|
|
* If another task is sharing our fs, we cannot safely
|
|
|
|
* suid exec because the differently privileged task
|
|
|
|
* will be able to manipulate the current directory, etc.
|
|
|
|
* It would be nice to force an unshare instead...
|
|
|
|
*/
|
CRED: Fix SUID exec regression
The patch:
commit a6f76f23d297f70e2a6b3ec607f7aeeea9e37e8d
CRED: Make execve() take advantage of copy-on-write credentials
moved the place in which the 'safeness' of a SUID/SGID exec was performed to
before de_thread() was called. This means that LSM_UNSAFE_SHARE is now
calculated incorrectly. This flag is set if any of the usage counts for
fs_struct, files_struct and sighand_struct are greater than 1 at the time the
determination is made. All of which are true for threads created by the
pthread library.
However, since we wish to make the security calculation before irrevocably
damaging the process so that we can return it an error code in the case where
we decide we want to reject the exec request on this basis, we have to make the
determination before calling de_thread().
So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
(CLONE_SIGHAND/CLONE_THREAD) with us. These will be killed by de_thread() and
so can be discounted by check_unsafe_exec().
We do have to be careful because CLONE_THREAD does not imply FS or FILES.
We _assume_ that there will be no extra references to these structs held by the
threads we're going to kill.
This can be tested with the attached pair of programs. Build the two programs
using the Makefile supplied, and run ./test1 as a non-root user. If
successful, you should see something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=0 suid=0
SUCCESS - Correct effective user ID
and if unsuccessful, something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=4043 suid=4043
ERROR - Incorrect effective user ID!
The non-root user ID you see will depend on the user you run as.
[test1.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
static void *thread_func(void *arg)
{
while (1) {}
}
int main(int argc, char **argv)
{
pthread_t tid;
uid_t uid, euid, suid;
printf("--TEST1--\n");
getresuid(&uid, &euid, &suid);
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (pthread_create(&tid, NULL, thread_func, NULL) < 0) {
perror("pthread_create");
exit(1);
}
printf("exec ./test2\n");
execlp("./test2", "test2", NULL);
perror("./test2");
_exit(1);
}
[test2.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main(int argc, char **argv)
{
uid_t uid, euid, suid;
getresuid(&uid, &euid, &suid);
printf("--TEST2--\n");
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (euid != 0) {
fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
exit(1);
}
printf("SUCCESS - Correct effective user ID\n");
exit(0);
}
[Makefile]
CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
all: test1 test2
test1: test1.c
gcc $(CFLAGS) -o test1 test1.c -lpthread
test2: test2.c
gcc $(CFLAGS) -o test2 test2.c
sudo chown root.root test2
sudo chmod +s test2
Reported-by: David Smith <dsmith@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David Smith <dsmith@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-02-06 11:45:46 +00:00
|
|
|
n_fs = 1;
|
2010-08-17 18:37:33 +00:00
|
|
|
spin_lock(&p->fs->lock);
|
2009-04-23 23:02:45 +00:00
|
|
|
rcu_read_lock();
|
2023-10-30 15:57:10 +00:00
|
|
|
for_other_threads(p, t) {
|
CRED: Fix SUID exec regression
The patch:
commit a6f76f23d297f70e2a6b3ec607f7aeeea9e37e8d
CRED: Make execve() take advantage of copy-on-write credentials
moved the place in which the 'safeness' of a SUID/SGID exec was performed to
before de_thread() was called. This means that LSM_UNSAFE_SHARE is now
calculated incorrectly. This flag is set if any of the usage counts for
fs_struct, files_struct and sighand_struct are greater than 1 at the time the
determination is made. All of which are true for threads created by the
pthread library.
However, since we wish to make the security calculation before irrevocably
damaging the process so that we can return it an error code in the case where
we decide we want to reject the exec request on this basis, we have to make the
determination before calling de_thread().
So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
(CLONE_SIGHAND/CLONE_THREAD) with us. These will be killed by de_thread() and
so can be discounted by check_unsafe_exec().
We do have to be careful because CLONE_THREAD does not imply FS or FILES.
We _assume_ that there will be no extra references to these structs held by the
threads we're going to kill.
This can be tested with the attached pair of programs. Build the two programs
using the Makefile supplied, and run ./test1 as a non-root user. If
successful, you should see something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=0 suid=0
SUCCESS - Correct effective user ID
and if unsuccessful, something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=4043 suid=4043
ERROR - Incorrect effective user ID!
The non-root user ID you see will depend on the user you run as.
[test1.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
static void *thread_func(void *arg)
{
while (1) {}
}
int main(int argc, char **argv)
{
pthread_t tid;
uid_t uid, euid, suid;
printf("--TEST1--\n");
getresuid(&uid, &euid, &suid);
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (pthread_create(&tid, NULL, thread_func, NULL) < 0) {
perror("pthread_create");
exit(1);
}
printf("exec ./test2\n");
execlp("./test2", "test2", NULL);
perror("./test2");
_exit(1);
}
[test2.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main(int argc, char **argv)
{
uid_t uid, euid, suid;
getresuid(&uid, &euid, &suid);
printf("--TEST2--\n");
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (euid != 0) {
fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
exit(1);
}
printf("SUCCESS - Correct effective user ID\n");
exit(0);
}
[Makefile]
CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
all: test1 test2
test1: test1.c
gcc $(CFLAGS) -o test1 test1.c -lpthread
test2: test2.c
gcc $(CFLAGS) -o test2 test2.c
sudo chown root.root test2
sudo chmod +s test2
Reported-by: David Smith <dsmith@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David Smith <dsmith@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-02-06 11:45:46 +00:00
|
|
|
if (t->fs == p->fs)
|
|
|
|
n_fs++;
|
|
|
|
}
|
2009-04-23 23:02:45 +00:00
|
|
|
rcu_read_unlock();
|
CRED: Fix SUID exec regression
The patch:
commit a6f76f23d297f70e2a6b3ec607f7aeeea9e37e8d
CRED: Make execve() take advantage of copy-on-write credentials
moved the place in which the 'safeness' of a SUID/SGID exec was performed to
before de_thread() was called. This means that LSM_UNSAFE_SHARE is now
calculated incorrectly. This flag is set if any of the usage counts for
fs_struct, files_struct and sighand_struct are greater than 1 at the time the
determination is made. All of which are true for threads created by the
pthread library.
However, since we wish to make the security calculation before irrevocably
damaging the process so that we can return it an error code in the case where
we decide we want to reject the exec request on this basis, we have to make the
determination before calling de_thread().
So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
(CLONE_SIGHAND/CLONE_THREAD) with us. These will be killed by de_thread() and
so can be discounted by check_unsafe_exec().
We do have to be careful because CLONE_THREAD does not imply FS or FILES.
We _assume_ that there will be no extra references to these structs held by the
threads we're going to kill.
This can be tested with the attached pair of programs. Build the two programs
using the Makefile supplied, and run ./test1 as a non-root user. If
successful, you should see something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=0 suid=0
SUCCESS - Correct effective user ID
and if unsuccessful, something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=4043 suid=4043
ERROR - Incorrect effective user ID!
The non-root user ID you see will depend on the user you run as.
[test1.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
static void *thread_func(void *arg)
{
while (1) {}
}
int main(int argc, char **argv)
{
pthread_t tid;
uid_t uid, euid, suid;
printf("--TEST1--\n");
getresuid(&uid, &euid, &suid);
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (pthread_create(&tid, NULL, thread_func, NULL) < 0) {
perror("pthread_create");
exit(1);
}
printf("exec ./test2\n");
execlp("./test2", "test2", NULL);
perror("./test2");
_exit(1);
}
[test2.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main(int argc, char **argv)
{
uid_t uid, euid, suid;
getresuid(&uid, &euid, &suid);
printf("--TEST2--\n");
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (euid != 0) {
fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
exit(1);
}
printf("SUCCESS - Correct effective user ID\n");
exit(0);
}
[Makefile]
CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
all: test1 test2
test1: test1.c
gcc $(CFLAGS) -o test1 test1.c -lpthread
test2: test2.c
gcc $(CFLAGS) -o test2 test2.c
sudo chown root.root test2
sudo chmod +s test2
Reported-by: David Smith <dsmith@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David Smith <dsmith@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-02-06 11:45:46 +00:00
|
|
|
|
2024-01-24 19:15:33 +00:00
|
|
|
/* "users" and "in_exec" locked for copy_fs() */
|
2014-01-23 23:55:50 +00:00
|
|
|
if (p->fs->users > n_fs)
|
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13 23:39:24 +00:00
|
|
|
bprm->unsafe |= LSM_UNSAFE_SHARE;
|
2014-01-23 23:55:50 +00:00
|
|
|
else
|
|
|
|
p->fs->in_exec = 1;
|
2010-08-17 18:37:33 +00:00
|
|
|
spin_unlock(&p->fs->lock);
|
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13 23:39:24 +00:00
|
|
|
}
|
|
|
|
|
2020-05-30 03:00:54 +00:00
|
|
|
static void bprm_fill_uid(struct linux_binprm *bprm, struct file *file)
|
2015-04-19 00:48:39 +00:00
|
|
|
{
|
2020-05-30 03:00:54 +00:00
|
|
|
/* Handle suid and sgid on files */
|
2023-01-13 11:49:30 +00:00
|
|
|
struct mnt_idmap *idmap;
|
2022-08-20 15:46:10 +00:00
|
|
|
struct inode *inode = file_inode(file);
|
2015-04-19 00:48:39 +00:00
|
|
|
unsigned int mode;
|
2022-06-22 20:12:16 +00:00
|
|
|
vfsuid_t vfsuid;
|
|
|
|
vfsgid_t vfsgid;
|
exec: Fix ToCToU between perm check and set-uid/gid usage
When opening a file for exec via do_filp_open(), permission checking is
done against the file's metadata at that moment, and on success, a file
pointer is passed back. Much later in the execve() code path, the file
metadata (specifically mode, uid, and gid) is used to determine if/how
to set the uid and gid. However, those values may have changed since the
permissions check, meaning the execution may gain unintended privileges.
For example, if a file could change permissions from executable and not
set-id:
---------x 1 root root 16048 Aug 7 13:16 target
to set-id and non-executable:
---S------ 1 root root 16048 Aug 7 13:16 target
it is possible to gain root privileges when execution should have been
disallowed.
While this race condition is rare in real-world scenarios, it has been
observed (and proven exploitable) when package managers are updating
the setuid bits of installed programs. Such files start with being
world-executable but then are adjusted to be group-exec with a set-uid
bit. For example, "chmod o-x,u+s target" makes "target" executable only
by uid "root" and gid "cdrom", while also becoming setuid-root:
-rwxr-xr-x 1 root cdrom 16048 Aug 7 13:16 target
becomes:
-rwsr-xr-- 1 root cdrom 16048 Aug 7 13:16 target
But racing the chmod means users without group "cdrom" membership can
get the permission to execute "target" just before the chmod, and when
the chmod finishes, the exec reaches brpm_fill_uid(), and performs the
setuid to root, violating the expressed authorization of "only cdrom
group members can setuid to root".
Re-check that we still have execute permissions in case the metadata
has changed. It would be better to keep a copy from the perm-check time,
but until we can do that refactoring, the least-bad option is to do a
full inode_permission() call (under inode lock). It is understood that
this is safe against dead-locks, but hardly optimal.
Reported-by: Marco Vanotti <mvanotti@google.com>
Tested-by: Marco Vanotti <mvanotti@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
2024-08-08 18:39:08 +00:00
|
|
|
int err;
|
2015-04-19 00:48:39 +00:00
|
|
|
|
2020-05-30 03:00:54 +00:00
|
|
|
if (!mnt_may_suid(file->f_path.mnt))
|
2015-04-19 00:48:39 +00:00
|
|
|
return;
|
|
|
|
|
|
|
|
if (task_no_new_privs(current))
|
|
|
|
return;
|
|
|
|
|
|
|
|
mode = READ_ONCE(inode->i_mode);
|
|
|
|
if (!(mode & (S_ISUID|S_ISGID)))
|
|
|
|
return;
|
|
|
|
|
2023-01-13 11:49:30 +00:00
|
|
|
idmap = file_mnt_idmap(file);
|
2021-01-21 13:19:42 +00:00
|
|
|
|
2015-04-19 00:48:39 +00:00
|
|
|
/* Be careful if suid/sgid is set */
|
2016-01-22 20:40:57 +00:00
|
|
|
inode_lock(inode);
|
2015-04-19 00:48:39 +00:00
|
|
|
|
exec: Fix ToCToU between perm check and set-uid/gid usage
When opening a file for exec via do_filp_open(), permission checking is
done against the file's metadata at that moment, and on success, a file
pointer is passed back. Much later in the execve() code path, the file
metadata (specifically mode, uid, and gid) is used to determine if/how
to set the uid and gid. However, those values may have changed since the
permissions check, meaning the execution may gain unintended privileges.
For example, if a file could change permissions from executable and not
set-id:
---------x 1 root root 16048 Aug 7 13:16 target
to set-id and non-executable:
---S------ 1 root root 16048 Aug 7 13:16 target
it is possible to gain root privileges when execution should have been
disallowed.
While this race condition is rare in real-world scenarios, it has been
observed (and proven exploitable) when package managers are updating
the setuid bits of installed programs. Such files start with being
world-executable but then are adjusted to be group-exec with a set-uid
bit. For example, "chmod o-x,u+s target" makes "target" executable only
by uid "root" and gid "cdrom", while also becoming setuid-root:
-rwxr-xr-x 1 root cdrom 16048 Aug 7 13:16 target
becomes:
-rwsr-xr-- 1 root cdrom 16048 Aug 7 13:16 target
But racing the chmod means users without group "cdrom" membership can
get the permission to execute "target" just before the chmod, and when
the chmod finishes, the exec reaches brpm_fill_uid(), and performs the
setuid to root, violating the expressed authorization of "only cdrom
group members can setuid to root".
Re-check that we still have execute permissions in case the metadata
has changed. It would be better to keep a copy from the perm-check time,
but until we can do that refactoring, the least-bad option is to do a
full inode_permission() call (under inode lock). It is understood that
this is safe against dead-locks, but hardly optimal.
Reported-by: Marco Vanotti <mvanotti@google.com>
Tested-by: Marco Vanotti <mvanotti@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
2024-08-08 18:39:08 +00:00
|
|
|
/* Atomically reload and check mode/uid/gid now that lock held. */
|
2015-04-19 00:48:39 +00:00
|
|
|
mode = inode->i_mode;
|
2023-01-13 11:49:30 +00:00
|
|
|
vfsuid = i_uid_into_vfsuid(idmap, inode);
|
|
|
|
vfsgid = i_gid_into_vfsgid(idmap, inode);
|
exec: Fix ToCToU between perm check and set-uid/gid usage
When opening a file for exec via do_filp_open(), permission checking is
done against the file's metadata at that moment, and on success, a file
pointer is passed back. Much later in the execve() code path, the file
metadata (specifically mode, uid, and gid) is used to determine if/how
to set the uid and gid. However, those values may have changed since the
permissions check, meaning the execution may gain unintended privileges.
For example, if a file could change permissions from executable and not
set-id:
---------x 1 root root 16048 Aug 7 13:16 target
to set-id and non-executable:
---S------ 1 root root 16048 Aug 7 13:16 target
it is possible to gain root privileges when execution should have been
disallowed.
While this race condition is rare in real-world scenarios, it has been
observed (and proven exploitable) when package managers are updating
the setuid bits of installed programs. Such files start with being
world-executable but then are adjusted to be group-exec with a set-uid
bit. For example, "chmod o-x,u+s target" makes "target" executable only
by uid "root" and gid "cdrom", while also becoming setuid-root:
-rwxr-xr-x 1 root cdrom 16048 Aug 7 13:16 target
becomes:
-rwsr-xr-- 1 root cdrom 16048 Aug 7 13:16 target
But racing the chmod means users without group "cdrom" membership can
get the permission to execute "target" just before the chmod, and when
the chmod finishes, the exec reaches brpm_fill_uid(), and performs the
setuid to root, violating the expressed authorization of "only cdrom
group members can setuid to root".
Re-check that we still have execute permissions in case the metadata
has changed. It would be better to keep a copy from the perm-check time,
but until we can do that refactoring, the least-bad option is to do a
full inode_permission() call (under inode lock). It is understood that
this is safe against dead-locks, but hardly optimal.
Reported-by: Marco Vanotti <mvanotti@google.com>
Tested-by: Marco Vanotti <mvanotti@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
2024-08-08 18:39:08 +00:00
|
|
|
err = inode_permission(idmap, inode, MAY_EXEC);
|
2016-01-22 20:40:57 +00:00
|
|
|
inode_unlock(inode);
|
2015-04-19 00:48:39 +00:00
|
|
|
|
exec: Fix ToCToU between perm check and set-uid/gid usage
When opening a file for exec via do_filp_open(), permission checking is
done against the file's metadata at that moment, and on success, a file
pointer is passed back. Much later in the execve() code path, the file
metadata (specifically mode, uid, and gid) is used to determine if/how
to set the uid and gid. However, those values may have changed since the
permissions check, meaning the execution may gain unintended privileges.
For example, if a file could change permissions from executable and not
set-id:
---------x 1 root root 16048 Aug 7 13:16 target
to set-id and non-executable:
---S------ 1 root root 16048 Aug 7 13:16 target
it is possible to gain root privileges when execution should have been
disallowed.
While this race condition is rare in real-world scenarios, it has been
observed (and proven exploitable) when package managers are updating
the setuid bits of installed programs. Such files start with being
world-executable but then are adjusted to be group-exec with a set-uid
bit. For example, "chmod o-x,u+s target" makes "target" executable only
by uid "root" and gid "cdrom", while also becoming setuid-root:
-rwxr-xr-x 1 root cdrom 16048 Aug 7 13:16 target
becomes:
-rwsr-xr-- 1 root cdrom 16048 Aug 7 13:16 target
But racing the chmod means users without group "cdrom" membership can
get the permission to execute "target" just before the chmod, and when
the chmod finishes, the exec reaches brpm_fill_uid(), and performs the
setuid to root, violating the expressed authorization of "only cdrom
group members can setuid to root".
Re-check that we still have execute permissions in case the metadata
has changed. It would be better to keep a copy from the perm-check time,
but until we can do that refactoring, the least-bad option is to do a
full inode_permission() call (under inode lock). It is understood that
this is safe against dead-locks, but hardly optimal.
Reported-by: Marco Vanotti <mvanotti@google.com>
Tested-by: Marco Vanotti <mvanotti@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
2024-08-08 18:39:08 +00:00
|
|
|
/* Did the exec bit vanish out from under us? Give up. */
|
|
|
|
if (err)
|
|
|
|
return;
|
|
|
|
|
2015-04-19 00:48:39 +00:00
|
|
|
/* We ignore suid/sgid if there are no mappings for them in the ns */
|
2022-06-22 20:12:16 +00:00
|
|
|
if (!vfsuid_has_mapping(bprm->cred->user_ns, vfsuid) ||
|
|
|
|
!vfsgid_has_mapping(bprm->cred->user_ns, vfsgid))
|
2015-04-19 00:48:39 +00:00
|
|
|
return;
|
|
|
|
|
|
|
|
if (mode & S_ISUID) {
|
|
|
|
bprm->per_clear |= PER_CLEAR_ON_SETID;
|
2022-06-22 20:12:16 +00:00
|
|
|
bprm->cred->euid = vfsuid_into_kuid(vfsuid);
|
2015-04-19 00:48:39 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
|
|
|
|
bprm->per_clear |= PER_CLEAR_ON_SETID;
|
2022-06-22 20:12:16 +00:00
|
|
|
bprm->cred->egid = vfsgid_into_kgid(vfsgid);
|
2015-04-19 00:48:39 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-05-30 03:00:54 +00:00
|
|
|
/*
|
|
|
|
* Compute brpm->cred based upon the final binary.
|
|
|
|
*/
|
|
|
|
static int bprm_creds_from_file(struct linux_binprm *bprm)
|
|
|
|
{
|
|
|
|
/* Compute creds based on which file? */
|
|
|
|
struct file *file = bprm->execfd_creds ? bprm->executable : bprm->file;
|
|
|
|
|
|
|
|
bprm_fill_uid(bprm, file);
|
|
|
|
return security_bprm_creds_from_file(bprm, file);
|
|
|
|
}
|
|
|
|
|
2014-01-23 23:55:50 +00:00
|
|
|
/*
|
|
|
|
* Fill the binprm structure from the inode.
|
2020-05-30 03:00:54 +00:00
|
|
|
* Read the first BINPRM_BUF_SIZE bytes
|
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13 23:39:24 +00:00
|
|
|
*
|
|
|
|
* This may be called multiple times for binary chains (scripts for example).
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
2020-05-14 03:25:20 +00:00
|
|
|
static int prepare_binprm(struct linux_binprm *bprm)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2017-09-01 15:39:13 +00:00
|
|
|
loff_t pos = 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13 23:39:24 +00:00
|
|
|
memset(bprm->buf, 0, BINPRM_BUF_SIZE);
|
2017-09-01 15:39:13 +00:00
|
|
|
return kernel_read(bprm->file, bprm->buf, BINPRM_BUF_SIZE, &pos);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2007-05-08 07:25:16 +00:00
|
|
|
/*
|
|
|
|
* Arguments are '\0' separated strings found at the location bprm->p
|
|
|
|
* points to; chop off the first by relocating brpm->p to right after
|
|
|
|
* the first '\0' encountered.
|
|
|
|
*/
|
2007-07-19 08:48:16 +00:00
|
|
|
int remove_arg_zero(struct linux_binprm *bprm)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2007-07-19 08:48:16 +00:00
|
|
|
unsigned long offset;
|
|
|
|
char *kaddr;
|
|
|
|
struct page *page;
|
2007-05-08 07:25:16 +00:00
|
|
|
|
2007-07-19 08:48:16 +00:00
|
|
|
if (!bprm->argc)
|
|
|
|
return 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2007-07-19 08:48:16 +00:00
|
|
|
do {
|
|
|
|
offset = bprm->p & ~PAGE_MASK;
|
|
|
|
page = get_arg_page(bprm, bprm->p, 0);
|
2024-03-09 21:46:30 +00:00
|
|
|
if (!page)
|
|
|
|
return -EFAULT;
|
exec: Replace kmap{,_atomic}() with kmap_local_page()
The use of kmap() and kmap_atomic() are being deprecated in favor of
kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and are still valid.
Since the use of kmap_local_page() in exec.c is safe, it should be
preferred everywhere in exec.c.
As said, since kmap_local_page() can be also called from atomic context,
and since remove_arg_zero() doesn't (and shouldn't ever) rely on an
implicit preempt_disable(), this function can also safely replace
kmap_atomic().
Therefore, replace kmap() and kmap_atomic() with kmap_local_page() in
fs/exec.c.
Tested with xfstests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220803182856.28246-1-fmdefrancesco@gmail.com
2022-08-03 18:28:56 +00:00
|
|
|
kaddr = kmap_local_page(page);
|
2007-05-08 07:25:16 +00:00
|
|
|
|
2007-07-19 08:48:16 +00:00
|
|
|
for (; offset < PAGE_SIZE && kaddr[offset];
|
|
|
|
offset++, bprm->p++)
|
|
|
|
;
|
2007-05-08 07:25:16 +00:00
|
|
|
|
exec: Replace kmap{,_atomic}() with kmap_local_page()
The use of kmap() and kmap_atomic() are being deprecated in favor of
kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and are still valid.
Since the use of kmap_local_page() in exec.c is safe, it should be
preferred everywhere in exec.c.
As said, since kmap_local_page() can be also called from atomic context,
and since remove_arg_zero() doesn't (and shouldn't ever) rely on an
implicit preempt_disable(), this function can also safely replace
kmap_atomic().
Therefore, replace kmap() and kmap_atomic() with kmap_local_page() in
fs/exec.c.
Tested with xfstests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220803182856.28246-1-fmdefrancesco@gmail.com
2022-08-03 18:28:56 +00:00
|
|
|
kunmap_local(kaddr);
|
2007-07-19 08:48:16 +00:00
|
|
|
put_arg_page(page);
|
|
|
|
} while (offset == PAGE_SIZE);
|
2007-05-08 07:25:16 +00:00
|
|
|
|
2007-07-19 08:48:16 +00:00
|
|
|
bprm->p++;
|
|
|
|
bprm->argc--;
|
2007-05-08 07:25:16 +00:00
|
|
|
|
2024-03-09 21:46:30 +00:00
|
|
|
return 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(remove_arg_zero);
|
|
|
|
|
2013-09-11 21:24:44 +00:00
|
|
|
#define printable(c) (((c)=='\t') || ((c)=='\n') || (0x20<=(c) && (c)<=0x7e))
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* cycle the list of binary formats handler, until one recognizes the image
|
|
|
|
*/
|
2020-05-18 23:43:20 +00:00
|
|
|
static int search_binary_handler(struct linux_binprm *bprm)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2013-09-11 21:24:44 +00:00
|
|
|
bool need_retry = IS_ENABLED(CONFIG_MODULES);
|
2005-04-16 22:20:36 +00:00
|
|
|
struct linux_binfmt *fmt;
|
2013-09-11 21:24:44 +00:00
|
|
|
int retval;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2020-05-14 03:25:20 +00:00
|
|
|
retval = prepare_binprm(bprm);
|
|
|
|
if (retval < 0)
|
|
|
|
return retval;
|
2012-12-18 00:03:20 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
retval = security_bprm_check(bprm);
|
|
|
|
if (retval)
|
|
|
|
return retval;
|
|
|
|
|
|
|
|
retval = -ENOENT;
|
2013-09-11 21:24:44 +00:00
|
|
|
retry:
|
|
|
|
read_lock(&binfmt_lock);
|
|
|
|
list_for_each_entry(fmt, &formats, lh) {
|
|
|
|
if (!try_module_get(fmt->module))
|
|
|
|
continue;
|
|
|
|
read_unlock(&binfmt_lock);
|
2019-05-14 22:44:37 +00:00
|
|
|
|
2013-09-11 21:24:44 +00:00
|
|
|
retval = fmt->load_binary(bprm);
|
2019-05-14 22:44:37 +00:00
|
|
|
|
2014-05-05 00:11:36 +00:00
|
|
|
read_lock(&binfmt_lock);
|
|
|
|
put_binfmt(fmt);
|
2020-05-18 23:43:20 +00:00
|
|
|
if (bprm->point_of_no_return || (retval != -ENOEXEC)) {
|
2014-05-05 00:11:36 +00:00
|
|
|
read_unlock(&binfmt_lock);
|
2013-09-11 21:24:44 +00:00
|
|
|
return retval;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
}
|
2013-09-11 21:24:44 +00:00
|
|
|
read_unlock(&binfmt_lock);
|
|
|
|
|
2014-05-05 00:11:36 +00:00
|
|
|
if (need_retry) {
|
2013-09-11 21:24:44 +00:00
|
|
|
if (printable(bprm->buf[0]) && printable(bprm->buf[1]) &&
|
|
|
|
printable(bprm->buf[2]) && printable(bprm->buf[3]))
|
|
|
|
return retval;
|
2013-09-11 21:24:45 +00:00
|
|
|
if (request_module("binfmt-%04x", *(ushort *)(bprm->buf + 2)) < 0)
|
|
|
|
return retval;
|
2013-09-11 21:24:44 +00:00
|
|
|
need_retry = false;
|
|
|
|
goto retry;
|
|
|
|
}
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
2022-10-18 07:17:24 +00:00
|
|
|
/* binfmt handlers will call back into begin_new_exec() on success. */
|
2013-09-11 21:24:38 +00:00
|
|
|
static int exec_binprm(struct linux_binprm *bprm)
|
|
|
|
{
|
|
|
|
pid_t old_pid, old_vpid;
|
2020-05-18 23:43:20 +00:00
|
|
|
int ret, depth;
|
2013-09-11 21:24:38 +00:00
|
|
|
|
|
|
|
/* Need to fetch pid before load_binary changes it */
|
|
|
|
old_pid = current->pid;
|
|
|
|
rcu_read_lock();
|
|
|
|
old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent));
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
2020-05-18 23:43:20 +00:00
|
|
|
/* This allows 4 levels of binfmt rewrites before failing hard. */
|
|
|
|
for (depth = 0;; depth++) {
|
|
|
|
struct file *exec;
|
|
|
|
if (depth > 5)
|
|
|
|
return -ELOOP;
|
|
|
|
|
|
|
|
ret = search_binary_handler(bprm);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
if (!bprm->interpreter)
|
|
|
|
break;
|
|
|
|
|
|
|
|
exec = bprm->file;
|
|
|
|
bprm->file = bprm->interpreter;
|
|
|
|
bprm->interpreter = NULL;
|
|
|
|
|
|
|
|
if (unlikely(bprm->have_execfd)) {
|
|
|
|
if (bprm->executable) {
|
|
|
|
fput(exec);
|
|
|
|
return -ENOEXEC;
|
|
|
|
}
|
|
|
|
bprm->executable = exec;
|
|
|
|
} else
|
|
|
|
fput(exec);
|
2013-09-11 21:24:38 +00:00
|
|
|
}
|
|
|
|
|
2020-05-18 23:43:20 +00:00
|
|
|
audit_bprm(bprm);
|
|
|
|
trace_sched_process_exec(current, old_pid, bprm);
|
|
|
|
ptrace_event(PTRACE_EVENT_EXEC, old_vpid);
|
|
|
|
proc_exec_connector(current);
|
|
|
|
return 0;
|
2013-09-11 21:24:38 +00:00
|
|
|
}
|
|
|
|
|
2024-01-09 00:43:04 +00:00
|
|
|
static int bprm_execve(struct linux_binprm *bprm)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
int retval;
|
2020-09-13 19:09:39 +00:00
|
|
|
|
2009-09-05 18:17:13 +00:00
|
|
|
retval = prepare_bprm_creds(bprm);
|
|
|
|
if (retval)
|
2020-11-20 23:14:18 +00:00
|
|
|
return retval;
|
2009-03-30 11:20:30 +00:00
|
|
|
|
2022-10-18 07:17:24 +00:00
|
|
|
/*
|
|
|
|
* Check for unsafe execution states before exec_binprm(), which
|
|
|
|
* will call back into begin_new_exec(), into bprm_creds_from_file(),
|
|
|
|
* where setuid-ness is evaluated.
|
|
|
|
*/
|
2014-01-23 23:55:50 +00:00
|
|
|
check_unsafe_exec(bprm);
|
2009-09-05 18:17:13 +00:00
|
|
|
current->in_execve = 1;
|
2022-11-22 20:39:09 +00:00
|
|
|
sched_mm_cid_before_execve(current);
|
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13 23:39:24 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
sched_exec();
|
|
|
|
|
2020-03-22 20:46:24 +00:00
|
|
|
/* Set the unchanging part of bprm->cred */
|
|
|
|
retval = security_bprm_creds_for_exec(bprm);
|
|
|
|
if (retval)
|
2005-04-16 22:20:36 +00:00
|
|
|
goto out;
|
|
|
|
|
2013-09-11 21:24:38 +00:00
|
|
|
retval = exec_binprm(bprm);
|
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13 23:39:24 +00:00
|
|
|
if (retval < 0)
|
|
|
|
goto out;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2022-11-22 20:39:09 +00:00
|
|
|
sched_mm_cid_after_execve(current);
|
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13 23:39:24 +00:00
|
|
|
/* execve succeeded */
|
2009-03-30 11:20:30 +00:00
|
|
|
current->fs->in_exec = 0;
|
2009-02-05 08:18:11 +00:00
|
|
|
current->in_execve = 0;
|
rseq: Introduce restartable sequences system call
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.
* Restartable sequences (per-cpu atomics)
Restartables sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.
The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path.
Here are benchmarks of various rseq use-cases.
Test hardware:
arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
The following benchmarks were all performed on a single thread.
* Per-CPU statistic counter increment
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 344.0 31.4 11.0
x86-64: 15.3 2.0 7.7
* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
per-cpu buffer
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 2502.0 2250.0 1.1
x86-64: 117.4 98.0 1.2
* liburcu percpu: lock-unlock pair, dereference, read/compare word
getcpu+atomic (ns/op) rseq (ns/op) speedup
arm32: 751.0 128.5 5.8
x86-64: 53.4 28.6 1.9
* jemalloc memory allocator adapted to use rseq
Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):
The production workload response-time has 1-2% gain avg. latency, and
the P99 overall latency drops by 2-3%.
* Reading the current CPU number
Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up do date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over current mechanisms available to read
the current CPU number, which has the following benefits over
alternative approaches:
- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
assembly, which makes it a useful building block for restartable
sequences.
- The approach of reading the cpu id through memory mapping shared
between kernel and user-space is portable (e.g. ARM), which is not the
case for the lsl-based x86 vdso.
On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.
Benchmarking various approaches for reading the current CPU number:
ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop): 8.4 ns
- Read CPU from rseq cpu_id: 16.7 ns
- Read CPU from rseq cpu_id (lazy register): 19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
- getcpu system call: 234.9 ns
x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop): 0.8 ns
- Read CPU from rseq cpu_id: 0.8 ns
- Read CPU from rseq cpu_id (lazy register): 0.8 ns
- Read using gs segment selector: 0.8 ns
- "lsl" inline assembly: 13.0 ns
- glibc 2.19-0ubuntu6 getcpu: 16.6 ns
- getcpu system call: 53.9 ns
- Speed (benchmark taken on v8 of patchset)
Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:
Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.
* CONFIG_RSEQ=n
avg.: 41.37 s
std.dev.: 0.36 s
* CONFIG_RSEQ=y
avg.: 40.46 s
std.dev.: 0.33 s
- Size
On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.
[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
2018-06-02 12:43:54 +00:00
|
|
|
rseq_execve(current);
|
2023-03-28 23:52:09 +00:00
|
|
|
user_events_execve(current);
|
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13 23:39:24 +00:00
|
|
|
acct_update_integrals(current);
|
2019-07-16 15:20:45 +00:00
|
|
|
task_numa_free(current, false);
|
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13 23:39:24 +00:00
|
|
|
return retval;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13 23:39:24 +00:00
|
|
|
out:
|
2020-04-04 14:42:56 +00:00
|
|
|
/*
|
2021-02-24 20:00:48 +00:00
|
|
|
* If past the point of no return ensure the code never
|
2020-04-04 14:42:56 +00:00
|
|
|
* returns to the userspace process. Use an existing fatal
|
|
|
|
* signal if present otherwise terminate the process with
|
|
|
|
* SIGSEGV.
|
|
|
|
*/
|
|
|
|
if (bprm->point_of_no_return && !fatal_signal_pending(current))
|
2021-10-25 15:50:57 +00:00
|
|
|
force_fatal_sig(SIGSEGV);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2022-11-22 20:39:09 +00:00
|
|
|
sched_mm_cid_after_execve(current);
|
2014-01-23 23:55:50 +00:00
|
|
|
current->fs->in_exec = 0;
|
2009-02-05 08:18:11 +00:00
|
|
|
current->in_execve = 0;
|
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13 23:39:24 +00:00
|
|
|
|
2020-07-12 12:17:50 +00:00
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int do_execveat_common(int fd, struct filename *filename,
|
|
|
|
struct user_arg_ptr argv,
|
|
|
|
struct user_arg_ptr envp,
|
|
|
|
int flags)
|
|
|
|
{
|
|
|
|
struct linux_binprm *bprm;
|
|
|
|
int retval;
|
|
|
|
|
|
|
|
if (IS_ERR(filename))
|
|
|
|
return PTR_ERR(filename);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We move the actual failure in case of RLIMIT_NPROC excess from
|
|
|
|
* set*uid() to execve() because too many poorly written programs
|
|
|
|
* don't check setuid() return code. Here we additionally recheck
|
|
|
|
* whether NPROC limit is still exceeded.
|
|
|
|
*/
|
|
|
|
if ((current->flags & PF_NPROC_EXCEEDED) &&
|
2022-05-18 17:17:30 +00:00
|
|
|
is_rlimit_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
|
2020-07-12 12:17:50 +00:00
|
|
|
retval = -EAGAIN;
|
|
|
|
goto out_ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* We're below the limit (still or again), so we don't want to make
|
|
|
|
* further execve() calls fail. */
|
|
|
|
current->flags &= ~PF_NPROC_EXCEEDED;
|
|
|
|
|
2024-01-09 00:43:04 +00:00
|
|
|
bprm = alloc_bprm(fd, filename, flags);
|
2020-07-12 12:17:50 +00:00
|
|
|
if (IS_ERR(bprm)) {
|
|
|
|
retval = PTR_ERR(bprm);
|
|
|
|
goto out_ret;
|
|
|
|
}
|
|
|
|
|
2020-07-12 13:23:54 +00:00
|
|
|
retval = count(argv, MAX_ARG_STRINGS);
|
2022-02-01 00:09:47 +00:00
|
|
|
if (retval == 0)
|
|
|
|
pr_warn_once("process '%s' launched '%s' with NULL argv: empty string added\n",
|
|
|
|
current->comm, bprm->filename);
|
2020-07-12 13:23:54 +00:00
|
|
|
if (retval < 0)
|
|
|
|
goto out_free;
|
|
|
|
bprm->argc = retval;
|
|
|
|
|
|
|
|
retval = count(envp, MAX_ARG_STRINGS);
|
|
|
|
if (retval < 0)
|
|
|
|
goto out_free;
|
|
|
|
bprm->envc = retval;
|
|
|
|
|
|
|
|
retval = bprm_stack_limits(bprm);
|
2020-07-12 12:17:50 +00:00
|
|
|
if (retval < 0)
|
|
|
|
goto out_free;
|
|
|
|
|
|
|
|
retval = copy_string_kernel(bprm->filename, bprm);
|
|
|
|
if (retval < 0)
|
|
|
|
goto out_free;
|
|
|
|
bprm->exec = bprm->p;
|
|
|
|
|
|
|
|
retval = copy_strings(bprm->envc, envp, bprm);
|
|
|
|
if (retval < 0)
|
|
|
|
goto out_free;
|
|
|
|
|
|
|
|
retval = copy_strings(bprm->argc, argv, bprm);
|
|
|
|
if (retval < 0)
|
|
|
|
goto out_free;
|
|
|
|
|
2022-02-01 00:09:47 +00:00
|
|
|
/*
|
|
|
|
* When argv is empty, add an empty string ("") as argv[0] to
|
|
|
|
* ensure confused userspace programs that start processing
|
|
|
|
* from argv[1] won't end up walking envp. See also
|
|
|
|
* bprm_stack_limits().
|
|
|
|
*/
|
|
|
|
if (bprm->argc == 0) {
|
|
|
|
retval = copy_string_kernel("", bprm);
|
|
|
|
if (retval < 0)
|
|
|
|
goto out_free;
|
|
|
|
bprm->argc = 1;
|
|
|
|
}
|
|
|
|
|
2024-01-09 00:43:04 +00:00
|
|
|
retval = bprm_execve(bprm);
|
CRED: Make execve() take advantage of copy-on-write credentials
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13 23:39:24 +00:00
|
|
|
out_free:
|
2008-05-10 20:38:25 +00:00
|
|
|
free_bprm(bprm);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
out_ret:
|
2020-06-25 18:56:40 +00:00
|
|
|
putname(filename);
|
2005-04-16 22:20:36 +00:00
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
2020-07-13 17:06:48 +00:00
|
|
|
int kernel_execve(const char *kernel_filename,
|
|
|
|
const char *const *argv, const char *const *envp)
|
|
|
|
{
|
|
|
|
struct filename *filename;
|
|
|
|
struct linux_binprm *bprm;
|
|
|
|
int fd = AT_FDCWD;
|
|
|
|
int retval;
|
|
|
|
|
2022-04-11 19:15:54 +00:00
|
|
|
/* It is non-sense for kernel threads to call execve */
|
|
|
|
if (WARN_ON_ONCE(current->flags & PF_KTHREAD))
|
2022-04-11 16:40:14 +00:00
|
|
|
return -EINVAL;
|
|
|
|
|
2020-07-13 17:06:48 +00:00
|
|
|
filename = getname_kernel(kernel_filename);
|
|
|
|
if (IS_ERR(filename))
|
|
|
|
return PTR_ERR(filename);
|
|
|
|
|
2024-01-09 00:43:04 +00:00
|
|
|
bprm = alloc_bprm(fd, filename, 0);
|
2020-07-13 17:06:48 +00:00
|
|
|
if (IS_ERR(bprm)) {
|
|
|
|
retval = PTR_ERR(bprm);
|
|
|
|
goto out_ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
retval = count_strings_kernel(argv);
|
2022-02-01 00:09:47 +00:00
|
|
|
if (WARN_ON_ONCE(retval == 0))
|
|
|
|
retval = -EINVAL;
|
2020-07-13 17:06:48 +00:00
|
|
|
if (retval < 0)
|
|
|
|
goto out_free;
|
|
|
|
bprm->argc = retval;
|
|
|
|
|
|
|
|
retval = count_strings_kernel(envp);
|
|
|
|
if (retval < 0)
|
|
|
|
goto out_free;
|
|
|
|
bprm->envc = retval;
|
|
|
|
|
|
|
|
retval = bprm_stack_limits(bprm);
|
|
|
|
if (retval < 0)
|
|
|
|
goto out_free;
|
|
|
|
|
|
|
|
retval = copy_string_kernel(bprm->filename, bprm);
|
|
|
|
if (retval < 0)
|
|
|
|
goto out_free;
|
|
|
|
bprm->exec = bprm->p;
|
|
|
|
|
|
|
|
retval = copy_strings_kernel(bprm->envc, envp, bprm);
|
|
|
|
if (retval < 0)
|
|
|
|
goto out_free;
|
|
|
|
|
|
|
|
retval = copy_strings_kernel(bprm->argc, argv, bprm);
|
|
|
|
if (retval < 0)
|
|
|
|
goto out_free;
|
|
|
|
|
2024-01-09 00:43:04 +00:00
|
|
|
retval = bprm_execve(bprm);
|
2020-07-13 17:06:48 +00:00
|
|
|
out_free:
|
|
|
|
free_bprm(bprm);
|
|
|
|
out_ret:
|
|
|
|
putname(filename);
|
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int do_execve(struct filename *filename,
|
2011-03-06 17:02:37 +00:00
|
|
|
const char __user *const __user *__argv,
|
2012-10-21 01:49:33 +00:00
|
|
|
const char __user *const __user *__envp)
|
2011-03-06 17:02:37 +00:00
|
|
|
{
|
2011-03-06 17:02:54 +00:00
|
|
|
struct user_arg_ptr argv = { .ptr.native = __argv };
|
|
|
|
struct user_arg_ptr envp = { .ptr.native = __envp };
|
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 00:57:29 +00:00
|
|
|
return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
|
|
|
|
}
|
|
|
|
|
2020-07-13 17:06:48 +00:00
|
|
|
static int do_execveat(int fd, struct filename *filename,
|
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 00:57:29 +00:00
|
|
|
const char __user *const __user *__argv,
|
|
|
|
const char __user *const __user *__envp,
|
|
|
|
int flags)
|
|
|
|
{
|
|
|
|
struct user_arg_ptr argv = { .ptr.native = __argv };
|
|
|
|
struct user_arg_ptr envp = { .ptr.native = __envp };
|
|
|
|
|
|
|
|
return do_execveat_common(fd, filename, argv, envp, flags);
|
2011-03-06 17:02:54 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef CONFIG_COMPAT
|
2014-02-05 20:54:53 +00:00
|
|
|
static int compat_do_execve(struct filename *filename,
|
2012-09-30 17:38:55 +00:00
|
|
|
const compat_uptr_t __user *__argv,
|
2012-10-21 01:46:25 +00:00
|
|
|
const compat_uptr_t __user *__envp)
|
2011-03-06 17:02:54 +00:00
|
|
|
{
|
|
|
|
struct user_arg_ptr argv = {
|
|
|
|
.is_compat = true,
|
|
|
|
.ptr.compat = __argv,
|
|
|
|
};
|
|
|
|
struct user_arg_ptr envp = {
|
|
|
|
.is_compat = true,
|
|
|
|
.ptr.compat = __envp,
|
|
|
|
};
|
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 00:57:29 +00:00
|
|
|
return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int compat_do_execveat(int fd, struct filename *filename,
|
|
|
|
const compat_uptr_t __user *__argv,
|
|
|
|
const compat_uptr_t __user *__envp,
|
|
|
|
int flags)
|
|
|
|
{
|
|
|
|
struct user_arg_ptr argv = {
|
|
|
|
.is_compat = true,
|
|
|
|
.ptr.compat = __argv,
|
|
|
|
};
|
|
|
|
struct user_arg_ptr envp = {
|
|
|
|
.is_compat = true,
|
|
|
|
.ptr.compat = __envp,
|
|
|
|
};
|
|
|
|
return do_execveat_common(fd, filename, argv, envp, flags);
|
2011-03-06 17:02:37 +00:00
|
|
|
}
|
2011-03-06 17:02:54 +00:00
|
|
|
#endif
|
2011-03-06 17:02:37 +00:00
|
|
|
|
2009-09-23 22:56:59 +00:00
|
|
|
void set_binfmt(struct linux_binfmt *new)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2009-09-23 22:57:41 +00:00
|
|
|
struct mm_struct *mm = current->mm;
|
|
|
|
|
|
|
|
if (mm->binfmt)
|
|
|
|
module_put(mm->binfmt->module);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2009-09-23 22:57:41 +00:00
|
|
|
mm->binfmt = new;
|
2009-09-23 22:56:59 +00:00
|
|
|
if (new)
|
|
|
|
__module_get(new->module);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(set_binfmt);
|
|
|
|
|
2007-07-19 08:48:27 +00:00
|
|
|
/*
|
2014-01-23 23:55:32 +00:00
|
|
|
* set_dumpable stores three-value SUID_DUMP_* into mm->flags.
|
2007-07-19 08:48:27 +00:00
|
|
|
*/
|
|
|
|
void set_dumpable(struct mm_struct *mm, int value)
|
|
|
|
{
|
2014-01-23 23:55:32 +00:00
|
|
|
if (WARN_ON((unsigned)value > SUID_DUMP_ROOT))
|
|
|
|
return;
|
|
|
|
|
2019-03-08 00:29:23 +00:00
|
|
|
set_mask_bits(&mm->flags, MMF_DUMPABLE_MASK, value);
|
2007-07-19 08:48:27 +00:00
|
|
|
}
|
|
|
|
|
2012-09-30 17:38:55 +00:00
|
|
|
SYSCALL_DEFINE3(execve,
|
|
|
|
const char __user *, filename,
|
|
|
|
const char __user *const __user *, argv,
|
|
|
|
const char __user *const __user *, envp)
|
|
|
|
{
|
2014-02-05 20:54:53 +00:00
|
|
|
return do_execve(getname(filename), argv, envp);
|
2012-09-30 17:38:55 +00:00
|
|
|
}
|
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 00:57:29 +00:00
|
|
|
|
|
|
|
SYSCALL_DEFINE5(execveat,
|
|
|
|
int, fd, const char __user *, filename,
|
|
|
|
const char __user *const __user *, argv,
|
|
|
|
const char __user *const __user *, envp,
|
|
|
|
int, flags)
|
|
|
|
{
|
|
|
|
return do_execveat(fd,
|
2021-07-08 06:34:42 +00:00
|
|
|
getname_uflags(filename, flags),
|
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 00:57:29 +00:00
|
|
|
argv, envp, flags);
|
|
|
|
}
|
|
|
|
|
2012-09-30 17:38:55 +00:00
|
|
|
#ifdef CONFIG_COMPAT
|
2014-03-04 09:53:50 +00:00
|
|
|
COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
|
|
|
|
const compat_uptr_t __user *, argv,
|
|
|
|
const compat_uptr_t __user *, envp)
|
2012-09-30 17:38:55 +00:00
|
|
|
{
|
2014-02-05 20:54:53 +00:00
|
|
|
return compat_do_execve(getname(filename), argv, envp);
|
2012-09-30 17:38:55 +00:00
|
|
|
}
|
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 00:57:29 +00:00
|
|
|
|
|
|
|
COMPAT_SYSCALL_DEFINE5(execveat, int, fd,
|
|
|
|
const char __user *, filename,
|
|
|
|
const compat_uptr_t __user *, argv,
|
|
|
|
const compat_uptr_t __user *, envp,
|
|
|
|
int, flags)
|
|
|
|
{
|
|
|
|
return compat_do_execveat(fd,
|
2021-07-08 06:34:42 +00:00
|
|
|
getname_uflags(filename, flags),
|
syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 00:57:29 +00:00
|
|
|
argv, envp, flags);
|
|
|
|
}
|
2012-09-30 17:38:55 +00:00
|
|
|
#endif
|
2022-01-22 06:13:17 +00:00
|
|
|
|
|
|
|
#ifdef CONFIG_SYSCTL
|
|
|
|
|
sysctl: treewide: constify the ctl_table argument of proc_handlers
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.
This patch has been generated by the following coccinelle script:
```
virtual patch
@r1@
identifier ctl, write, buffer, lenp, ppos;
identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos);
@r2@
identifier func, ctl, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos)
{ ... }
@r3@
identifier func;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int , void *, size_t *, loff_t *);
@r4@
identifier func, ctl;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int , void *, size_t *, loff_t *);
@r5@
identifier func, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int write, void *buffer, size_t *lenp, loff_t *ppos);
```
* Code formatting was adjusted in xfs_sysctl.c to comply with code
conventions. The xfs_stats_clear_proc_handler,
xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
adjusted.
* The ctl_table argument in proc_watchdog_common was const qualified.
This is called from a proc_handler itself and is calling back into
another proc_handler, making it necessary to change it as part of the
proc_handler migration.
Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-07-24 18:59:29 +00:00
|
|
|
static int proc_dointvec_minmax_coredump(const struct ctl_table *table, int write,
|
2022-01-22 06:13:17 +00:00
|
|
|
void *buffer, size_t *lenp, loff_t *ppos)
|
|
|
|
{
|
|
|
|
int error = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
|
|
|
|
|
|
|
|
if (!error)
|
|
|
|
validate_coredump_safety();
|
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct ctl_table fs_exec_sysctls[] = {
|
|
|
|
{
|
|
|
|
.procname = "suid_dumpable",
|
|
|
|
.data = &suid_dumpable,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
|
|
|
.proc_handler = proc_dointvec_minmax_coredump,
|
|
|
|
.extra1 = SYSCTL_ZERO,
|
|
|
|
.extra2 = SYSCTL_TWO,
|
|
|
|
},
|
|
|
|
};
|
|
|
|
|
|
|
|
static int __init init_fs_exec_sysctls(void)
|
|
|
|
{
|
|
|
|
register_sysctl_init("fs", fs_exec_sysctls);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
fs_initcall(init_fs_exec_sysctls);
|
|
|
|
#endif /* CONFIG_SYSCTL */
|
2024-05-20 00:17:59 +00:00
|
|
|
|
|
|
|
#ifdef CONFIG_EXEC_KUNIT_TEST
|
2024-07-20 17:03:14 +00:00
|
|
|
#include "tests/exec_kunit.c"
|
2024-05-20 00:17:59 +00:00
|
|
|
#endif
|