sched/numa: Fix unsafe get_task_struct() in task_numa_assign()

Unlocked access to dst_rq->curr in task_numa_compare() is racy.
If the curr task is exiting, this can lead to a use-after-free:

task_numa_compare()                    do_exit()
    ...                                        current->flags |= PF_EXITING;
    ...                                    release_task()
    ...                                        ~~delayed_put_task_struct()~~
    ...                                    schedule()
    rcu_read_lock()                        ...
    cur = ACCESS_ONCE(dst_rq->curr)        ...
        ...                                rq->curr = next;
        ...                                    context_switch()
        ...                                        finish_task_switch()
        ...                                            put_task_struct()
        ...                                                __put_task_struct()
        ...                                                    free_task_struct()
        task_numa_assign()                                     ...
            get_task_struct()                                  ...
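
The interleaving above can be replayed deterministically in userspace. This is only a sketch: replay_task, replay_put() and replay_race() are hypothetical names, and the model assumes the task holds two references, one dropped by the RCU callback queued from release_task() and one dropped by finish_task_switch().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a task's reference count; not kernel code. */
struct replay_task {
	int usage;	/* models the task's usage counter */
	bool freed;	/* set once the count reaches zero */
};

static void replay_put(struct replay_task *t)
{
	if (--t->usage == 0)
		t->freed = true;	/* stands in for free_task_struct() */
}

/* Returns true if the final "get" would touch freed memory. */
static bool replay_race(void)
{
	/* Assumption: one reference for release_task(), one for the last schedule(). */
	struct replay_task t = { .usage = 2, .freed = false };
	struct replay_task *cur;

	/*
	 * Exiting side: release_task() queued its callback before the
	 * reader entered its RCU section, so the grace period does not
	 * wait for the reader and the callback may already have run.
	 */
	replay_put(&t);		/* delayed_put_task_struct() */

	/* Reader side: rcu_read_lock(); cur = dst_rq->curr; */
	cur = &t;

	/* Exiting side: finish_task_switch() drops the final reference. */
	replay_put(&t);

	/* Reader side: get_task_struct(cur) would run here. */
	return cur->freed;
}
```

In this model replay_race() reports that the object is already freed when the reader reaches its get, which is the use-after-free shown in the diagram.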

As noted by Oleg:

  <<The lockless get_task_struct(tsk) is only safe if tsk == current
    and didn't pass exit_notify(), or if this tsk was found on a rcu
    protected list (say, for_each_process() or find_task_by_vpid()).
    IOW, it is only safe if release_task() was not called before we
    take rcu_read_lock(), in this case we can rely on the fact that
    delayed_put_pid() can not drop the (potentially) last reference
    until rcu_read_unlock().

    And as Kirill pointed out task_numa_compare()->task_numa_assign()
    path does get_task_struct(dst_rq->curr) and this is not safe. The
    task_struct itself can't go away, but rcu_read_lock() can't save
    us from the final put_task_struct() in finish_task_switch(); this
    reference goes away without rcu gp>>

The patch adds a simple check of the PF_EXITING flag. If the flag is
not set, the call_rcu() of the delayed_put_task_struct() callback is
guaranteed not to have happened yet, so we can safely do
get_task_struct() in task_numa_assign().

Holding dst_rq->lock protects against concurrency with the last
schedule(). Without it, cur's memory may be reused or unmapped.
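
A minimal userspace sketch of the same pattern, checking an exiting flag under the runqueue lock before taking a reference, might look like the following. All model_* names and the MODEL_PF_EXITING value are hypothetical stand-ins, and a C11 atomic_flag spinlock replaces dst_rq->lock. Unlike the patch, which takes the reference later in task_numa_assign() under rcu_read_lock(), this model has no RCU and therefore takes the reference while the lock is still held.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PF_EXITING 0x4u	/* hypothetical flag bit for this model */

struct model_task {
	unsigned int flags;	/* MODEL_PF_EXITING once exit has begun */
	int usage;		/* reference count */
	bool is_idle;		/* stands in for is_idle_task() */
};

struct model_rq {
	atomic_flag lock;	/* stands in for dst_rq->lock */
	struct model_task *curr;
};

static void model_rq_lock(struct model_rq *rq)
{
	while (atomic_flag_test_and_set_explicit(&rq->lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
}

static void model_rq_unlock(struct model_rq *rq)
{
	atomic_flag_clear_explicit(&rq->lock, memory_order_release);
}

/*
 * Pick rq->curr as a candidate and take a reference on it.
 * Returns NULL for idle or exiting tasks, leaving their count untouched.
 */
static struct model_task *model_pick_curr(struct model_rq *rq)
{
	struct model_task *cur;

	model_rq_lock(rq);
	cur = rq->curr;
	if ((cur->flags & MODEL_PF_EXITING) || cur->is_idle)
		cur = NULL;
	else
		cur->usage++;	/* simplification: reference taken under the lock */
	model_rq_unlock(rq);

	return cur;
}
```

A caller would treat a NULL return the same way task_numa_compare() treats cur == NULL, that is, as "no task to swap with".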

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Kirill Tkhai 2014-10-22 11:17:11 +04:00 committed by Ingo Molnar
parent aee38ea954
commit 1effd9f193

@@ -1164,9 +1164,19 @@ static void task_numa_compare(struct task_numa_env *env,
 	long moveimp = imp;
 
 	rcu_read_lock();
-	cur = ACCESS_ONCE(dst_rq->curr);
-	if (cur->pid == 0) /* idle */
+
+	raw_spin_lock_irq(&dst_rq->lock);
+	cur = dst_rq->curr;
+	/*
+	 * No need to move the exiting task, and this ensures that ->curr
+	 * wasn't reaped and thus get_task_struct() in task_numa_assign()
+	 * is safe under RCU read lock.
+	 * Note that rcu_read_lock() itself can't protect from the final
+	 * put_task_struct() after the last schedule().
+	 */
+	if ((cur->flags & PF_EXITING) || is_idle_task(cur))
 		cur = NULL;
+	raw_spin_unlock_irq(&dst_rq->lock);
 
 	/*
 	 * "imp" is the fault differential for the source task between the