Commit Graph

25348 Commits

Author SHA1 Message Date
Al Viro
db68ce10c4 new helper: uaccess_kernel()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-03-28 16:43:25 -04:00
Dave Airlie
e5c1ff1475 Linux 4.11-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJY2C9qAAoJEHm+PkMAQRiGaBQIAIGzdlZ6ImiP6zoukrRv7qUr
 44ITm0lsBiL85QGedhQQL+Y9UqwUmlqgFqnH0Gr8YHNbLJWXzdjGbl5aVo4KjASq
 104NLUDXtPww/xZdH4wJMzhuwucYwZOUyDOjOr0ak3cGxOE2xjNjHMZXxWUf20GO
 EpRr6WhV1DUAvAdjdNa9KlcOjMluNpMLLyL1CFLjrkkArrWAyqOURKHAb6ZLghfv
 iZV1qJTVPyYGpnlI3kuEgu2GuDjxqpoNLSr3wHyEHm/pBPEl7MX6zPbzcegBV8TY
 cRRlXo4notdsuknmSNcj0hHuTQvw1kl7BhieLKVsnCyCIM6jjX4TSQZFutmbzwM=
 =5iRl
 -----END PGP SIGNATURE-----

Backmerge tag 'v4.11-rc4' into drm-next

Linux 4.11-rc4

The i915 GVT team need the rc4 code to base some more code on.
2017-03-28 17:34:19 +10:00
Ingo Molnar
d652f4bbca Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-28 07:44:25 +02:00
Tetsuo Handa
e4e55b47ed LSM: Revive security_task_alloc() hook and per "struct task_struct" security blob.
We switched from "struct task_struct"->security to "struct cred"->security
in Linux 2.6.29. But not all LSM modules were happy with that change.
TOMOYO LSM module is an example which want to use per "struct task_struct"
security blob, for TOMOYO's security context is defined based on "struct
task_struct" rather than "struct cred". AppArmor LSM module is another
example which want to use it, for AppArmor is currently abusing the cred
a little bit to store the change_hat and setexeccon info. Although
security_task_free() hook was revived in Linux 3.4 because Yama LSM module
wanted to release per "struct task_struct" security blob,
security_task_alloc() hook and "struct task_struct"->security field were
not revived. Nowadays, we are getting proposals of lightweight LSM modules
which want to use per "struct task_struct" security blob.

We are already allowing multiple concurrent LSM modules (up to one fully
armored module which uses "struct cred"->security field or exclusive hooks
like security_xfrm_state_pol_flow_match(), plus unlimited number of
lightweight modules which do not use "struct cred"->security nor exclusive
hooks) as long as they are built into the kernel. But this patch does not
implement variable length "struct task_struct"->security field which will
become needed when multiple LSM modules want to use "struct task_struct"->
security field. Although it won't be difficult to implement variable length
"struct task_struct"->security field, let's think about it after we merged
this patch.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: John Johansen <john.johansen@canonical.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Tested-by: Djalal Harouni <tixxdz@gmail.com>
Acked-by: José Bollo <jobol@nonadev.net>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: James Morris <james.l.morris@oracle.com>
Cc: José Bollo <jobol@nonadev.net>
Signed-off-by: James Morris <james.l.morris@oracle.com>
2017-03-28 11:05:14 +11:00
James Morris
840c91dc6a update to v4.11-rc4 due to memory corruption bug in rc2 2017-03-28 11:03:35 +11:00
Paul Moore
ab6434a137 audit: move audit_signal_info() into kernel/auditsc.c
Commit 5b52330bbf ("audit: fix auditd/kernel connection state
tracking") made inlining audit_signal_info() a bit pointless as
it was always calling into auditd_test_task() so let's remove the
inline function in kernel/audit.h and convert __audit_signal_info()
in kernel/auditsc.c into audit_signal_info().

Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-03-27 14:30:06 -04:00
Nicholas Mc Guire
75fa8e5d3b cgroup: switch to BUG_ON()
Use BUG_ON() rather than an explicit if followed by BUG() for
improved readability and also consistency.

Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-27 13:59:12 -04:00
Pavel Tatashin
7b09cc5a9d sched/clock: Fix broken stable to unstable transfer
When it is determined that the clock is actually unstable, and
we switch from stable to unstable, the __clear_sched_clock_stable()
function is eventually called.

In this function we set gtod_offset so the following holds true:

  sched_clock() + raw_offset == ktime_get_ns() + gtod_offset

But instead of getting the latest timestamps, we use the last values
from scd, so instead of sched_clock() we use scd->tick_raw, and
instead of ktime_get_ns() we use scd->tick_gtod.

However, later, when we use gtod_offset sched_clock_local() we do not
add it to scd->tick_gtod to calculate the correct clock value when we
determine the boundaries for min/max clocks.

This can result in tick granularity sched_clock() values, so fix it.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Fixes: 5680d8094f ("sched/clock: Provide better clock continuity")
Link: http://lkml.kernel.org/r/1490214265-899964-2-git-send-email-pasha.tatashin@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-27 10:23:48 +02:00
Srikar Dronamraju
05b40e0577 sched/fair: Prefer sibiling only if local group is under-utilized
If the child domain prefers tasks to go siblings, the local group could
end up pulling tasks to itself even if the local group is almost equally
loaded as the source group.

Lets assume a 4 core,smt==2 machine running 5 thread ebizzy workload.
Everytime, local group has capacity and source group has atleast 2 threads,
local group tries to pull the task. This causes the threads to constantly
move between different cores. This is even more profound if the cores have
more threads, like in Power 8, smt 8 mode.

Fix this by only allowing local group to pull a task, if the source group
has more number of tasks than the local group.

Here are the relevant perf stat numbers of a 22 core,smt 8 Power 8 machine.

Without patch:
 Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):

             1,440      context-switches          #    0.001 K/sec                    ( +-  1.26% )
               366      cpu-migrations            #    0.000 K/sec                    ( +-  5.58% )
             3,933      page-faults               #    0.002 K/sec                    ( +- 11.08% )

 Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):

             6,287      context-switches          #    0.001 K/sec                    ( +-  3.65% )
             3,776      cpu-migrations            #    0.001 K/sec                    ( +-  4.84% )
             5,702      page-faults               #    0.001 K/sec                    ( +-  9.36% )

 Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):

             8,776      context-switches          #    0.001 K/sec                    ( +-  0.73% )
             2,790      cpu-migrations            #    0.000 K/sec                    ( +-  0.98% )
            10,540      page-faults               #    0.001 K/sec                    ( +-  3.12% )

With patch:

 Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):

             1,133      context-switches          #    0.001 K/sec                    ( +-  4.72% )
               123      cpu-migrations            #    0.000 K/sec                    ( +-  3.42% )
             3,858      page-faults               #    0.002 K/sec                    ( +-  8.52% )

 Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):

             2,169      context-switches          #    0.000 K/sec                    ( +-  6.19% )
               189      cpu-migrations            #    0.000 K/sec                    ( +- 12.75% )
             5,917      page-faults               #    0.001 K/sec                    ( +-  8.09% )

 Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):

             5,333      context-switches          #    0.001 K/sec                    ( +-  5.91% )
               506      cpu-migrations            #    0.000 K/sec                    ( +-  3.35% )
            10,792      page-faults               #    0.001 K/sec                    ( +-  7.75% )

Which show that in these workloads CPU migrations get reduced significantly.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/1490205470-10249-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-27 10:22:26 +02:00
Peter Zijlstra
8ce371f984 lockdep: Fix per-cpu static objects
Since commit 383776fa75 ("locking/lockdep: Handle statically initialized
PER_CPU locks properly") we try to collapse per-cpu locks into a single
class by giving them all the same key. For this key we choose the canonical
address of the per-cpu object, which would be the offset into the per-cpu
area.

This has two problems:

 - there is a case where we run !0 lock->key through static_obj() and
   expect this to pass; it doesn't for canonical pointers.

 - 0 is a valid canonical address.

Cure both issues by redefining the canonical address as the address of the
per-cpu variable on the boot CPU.

Since I didn't want to rely on CPU0 being the boot-cpu, or even existing at
all, track the boot CPU in a variable.

Fixes: 383776fa75 ("locking/lockdep: Handle statically initialized PER_CPU locks properly")
Reported-by: kernel test robot <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-mm@kvack.org
Cc: wfg@linux.intel.com
Cc: kernel test robot <fengguang.wu@intel.com>
Cc: LKP <lkp@01.org>
Link: http://lkml.kernel.org/r/20170320114108.kbvcsuepem45j5cr@hirez.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-26 15:09:45 +02:00
Linus Torvalds
4c3de7e5bf Merge branch 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit
Pull audit fix from Paul Moore:
 "We've got an audit fix, and unfortunately it is big.

  While I'm not excited that we need to be sending you something this
  large during the -rcX phase, it does fix some very real, and very
  tangled, problems relating to locking, backlog queues, and the audit
  daemon connection.

  This code has passed our testsuite without problem and it has held up
  to my ad-hoc stress tests (arguably better than the existing code),
  please consider pulling this as fix for the next v4.11-rcX tag"

* 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit:
  audit: fix auditd/kernel connection state tracking
2017-03-25 15:13:55 -07:00
Alexei Starovoitov
b1977682a3 bpf: improve verifier packet range checks
llvm can optimize the 'if (ptr > data_end)' checks to be in the order
slightly different than the original C code which will confuse verifier.
Like:
if (ptr + 16 > data_end)
  return TC_ACT_SHOT;
// may be followed by
if (ptr + 14 > data_end)
  return TC_ACT_SHOT;
while llvm can see that 'ptr' is valid for all 16 bytes,
the verifier could not.
Fix verifier logic to account for such case and add a test.

Reported-by: Huapeng Zhou <hzhou@fb.com>
Fixes: 969bf05eb3 ("bpf: direct packet access")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 20:51:28 -07:00
Rafael J. Wysocki
70e493f3bb Merge back schedutil governor updates for 4.12. 2017-03-25 02:35:48 +01:00
Steven Rostedt (VMware)
af0009fc16 tracing: Move trace_handle_return() out of line
Currently trace_handle_return() looks like this:

 static inline enum print_line_t trace_handle_return(struct trace_seq *s)
 {
        return trace_seq_has_overflowed(s) ?
                TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED;
 }

Where trace_seq_overflowed(s) is:

 static inline bool trace_seq_has_overflowed(struct trace_seq *s)
 {
	return s->full || seq_buf_has_overflowed(&s->seq);
 }

And seq_buf_has_overflowed(&s->seq) is:

 static inline bool
 seq_buf_has_overflowed(struct seq_buf *s)
 {
	return s->len > s->size;
 }

Making trace_handle_return() into:

 return (s->full || (s->seq->len > s->seq->size)) ?
           TRACE_TYPE_PARTIAL_LINE :
           TRACE_TYPE_HANDLED;

One would think this is not an issue to keep as an inline. But because this
is used in the TRACE_EVENT() macro, it is extended for every tracepoint in
the system. Taking a look at a single tracepoint x86_irq_vector (was the
first one I randomly chosen). As trace_handle_return is used in the
TRACE_EVENT() macro of trace_raw_output_##call() we disassemble
trace_raw_output_x86_irq_vector and do a diff:

- is the original
+ is the out-of-line code

I removed identical lines that were different just due to different
addresses.

--- /tmp/irq-vec-orig	2017-03-16 09:12:48.569384851 -0400
+++ /tmp/irq-vec-ool	2017-03-16 09:13:39.378153385 -0400
@@ -6,27 +6,23 @@
        53                      push   %rbx
        48 89 fb                mov    %rdi,%rbx
        4c 8b a7 c0 20 00 00    mov    0x20c0(%rdi),%r12
        e8 f7 72 13 00          callq  ffffffff81155c80 <trace_raw_output_prep>
        83 f8 01                cmp    $0x1,%eax
        74 05                   je     ffffffff8101e993 <trace_raw_output_x86_irq_vector+0x23>
        5b                      pop    %rbx
        41 5c                   pop    %r12
        5d                      pop    %rbp
        c3                      retq
        41 8b 54 24 08          mov    0x8(%r12),%edx
-       48 8d bb 98 10 00 00    lea    0x1098(%rbx),%rdi
+       48 81 c3 98 10 00 00    add    $0x1098,%rbx
-       48 c7 c6 7b 8a a0 81    mov    $0xffffffff81a08a7b,%rsi
+       48 c7 c6 ab 8a a0 81    mov    $0xffffffff81a08aab,%rsi
-       e8 c5 85 13 00          callq  ffffffff81156f70 <trace_seq_printf>

 === here's the start of the main difference ===

+       48 89 df                mov    %rbx,%rdi
+       e8 62 7e 13 00          callq  ffffffff81156810 <trace_seq_printf>
-       8b 93 b8 20 00 00       mov    0x20b8(%rbx),%edx
-       31 c0                   xor    %eax,%eax
-       85 d2                   test   %edx,%edx
-       75 11                   jne    ffffffff8101e9c8 <trace_raw_output_x86_irq_vector+0x58>
-       48 8b 83 a8 20 00 00    mov    0x20a8(%rbx),%rax
-       48 39 83 a0 20 00 00    cmp    %rax,0x20a0(%rbx)
-       0f 93 c0                setae  %al
+       48 89 df                mov    %rbx,%rdi
+       e8 4a c5 12 00          callq  ffffffff8114af00 <trace_handle_return>
        5b                      pop    %rbx
-       0f b6 c0                movzbl %al,%eax

 === end ===

        41 5c                   pop    %r12
        5d                      pop    %rbp
        c3                      retq

If you notice, the original has 22 bytes of text more than the out of line
version. As this is for every TRACE_EVENT() defined in the system, this can
become quite large.

   text	   data	    bss	    dec	    hex	filename
8690305	5450490	1298432	15439227	 eb957b	vmlinux-orig
8681725	5450490	1298432	15430647	 eb73f7	vmlinux-handle

This change has a total of 8580 bytes in savings.

 $ objdump -dr /tmp/vmlinux-orig | grep '^[0-9a-f]* <trace_raw_output' | wc -l
324

That's 324 tracepoints. But this does not include modules (which contain
many more tracepoints). For an allyesconfig build:

 $ objdump -dr vmlinux-allyes-orig | grep '^[0-9a-f]* <trace_raw_output' | wc -l
1401

That's 1401 tracepoints giving us:

   text    data     bss     dec     hex filename
137920629       140221067       53264384        331406080       13c0db00 vmlinux-allyes-orig
137827709       140221067       53264384        331313160       13bf7008 vmlinux-allyes-handle

92920 bytes in savings!!!

Link: http://lkml.kernel.org/r/20170315021431.13107-2-andi@firstfloor.org

Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-24 20:51:50 -04:00
Steven Rostedt (VMware)
42c269c88d ftrace: Allow for function tracing to record init functions on boot up
Adding a hook into free_reserve_area() that informs ftrace that boot up init
text is being free, lets ftrace safely remove those init functions from its
records, which keeps ftrace from trying to modify text that no longer
exists.

Note, this still does not allow for tracing .init text of modules, as
modules require different work for freeing its init code.

Link: http://lkml.kernel.org/r/1488502497.7212.24.camel@linux.intel.com

Cc: linux-mm@kvack.org
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Requested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-24 20:51:49 -04:00
Steven Rostedt (VMware)
dbeafd0d61 ftrace: Have function tracing start in early boot up
Register the function tracer right after the tracing buffers are initialized
in early boot up. This will allow function tracing to begin early if it is
enabled via the kernel command line.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-24 20:51:48 -04:00
Steven Rostedt (VMware)
9afecfbb95 tracing: Postpone tracer start-up tests till the system is more robust
As tracing can now be enabled very early in boot up, even before some
critical system services (like scheduling), do not run the tracer selftests
until after early_initcall() is performed. If a tracer is registered before
such time, it is saved off in a list and the test is run when the system is
able to handle more diverse functions.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-24 20:51:46 -04:00
Steven Rostedt (VMware)
e725c731e3 tracing: Split tracing initialization into two for early initialization
Create an early_trace_init() function that will initialize the buffers and
allow for ealier use of trace_printk(). This will also allow for future work
to have function tracing start earlier at boot up.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-24 13:08:43 -04:00
Sergey Senozhatsky
64ca752dcb printk: use console_trylock() in console_cpu_notify()
There is no need to always call blocking console_lock() in
console_cpu_notify(), it's quite possible that console_sem can
be locked by other CPU on the system, either already printing
or soon to begin printing the messages. console_lock() in this
case can simply block CPU hotplug for unknown period of time
(console_unlock() is time unbound). Not that hotplug is very
fast, but still, with other CPUs being online and doing
printk() console_cpu_notify() can stuck.

Use console_trylock() instead and opt-out if console_sem is
already acquired from another CPU, since that CPU will do
the printing for us.

Link: http://lkml.kernel.org/r/20170121104729.8585-1-sergey.senozhatsky@gmail.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2017-03-24 16:09:46 +01:00
Masanari Iida
0ba42a599f treewide: Fix typo in xml/driver-api/basics.xml
This patch fix spelling typos found in
Documentation/output/xml/driver-api/basics.xml.
It is because the xml file was generated from comments in source,
so I had to fix the comments.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-24 15:47:57 +01:00
Jason A. Donenfeld
de5540d088 padata: avoid race in reordering
Under extremely heavy uses of padata, crashes occur, and with list
debugging turned on, this happens instead:

[87487.298728] WARNING: CPU: 1 PID: 882 at lib/list_debug.c:33
__list_add+0xae/0x130
[87487.301868] list_add corruption. prev->next should be next
(ffffb17abfc043d0), but was ffff8dba70872c80. (prev=ffff8dba70872b00).
[87487.339011]  [<ffffffff9a53d075>] dump_stack+0x68/0xa3
[87487.342198]  [<ffffffff99e119a1>] ? console_unlock+0x281/0x6d0
[87487.345364]  [<ffffffff99d6b91f>] __warn+0xff/0x140
[87487.348513]  [<ffffffff99d6b9aa>] warn_slowpath_fmt+0x4a/0x50
[87487.351659]  [<ffffffff9a58b5de>] __list_add+0xae/0x130
[87487.354772]  [<ffffffff9add5094>] ? _raw_spin_lock+0x64/0x70
[87487.357915]  [<ffffffff99eefd66>] padata_reorder+0x1e6/0x420
[87487.361084]  [<ffffffff99ef0055>] padata_do_serial+0xa5/0x120

padata_reorder calls list_add_tail with the list to which its adding
locked, which seems correct:

spin_lock(&squeue->serial.lock);
list_add_tail(&padata->list, &squeue->serial.list);
spin_unlock(&squeue->serial.lock);

This therefore leaves only place where such inconsistency could occur:
if padata->list is added at the same time on two different threads.
This pdata pointer comes from the function call to
padata_get_next(pd), which has in it the following block:

next_queue = per_cpu_ptr(pd->pqueue, cpu);
padata = NULL;
reorder = &next_queue->reorder;
if (!list_empty(&reorder->list)) {
       padata = list_entry(reorder->list.next,
                           struct padata_priv, list);
       spin_lock(&reorder->lock);
       list_del_init(&padata->list);
       atomic_dec(&pd->reorder_objects);
       spin_unlock(&reorder->lock);

       pd->processed++;

       goto out;
}
out:
return padata;

I strongly suspect that the problem here is that two threads can race
on reorder list. Even though the deletion is locked, call to
list_entry is not locked, which means it's feasible that two threads
pick up the same padata object and subsequently call list_add_tail on
them at the same time. The fix is thus be hoist that lock outside of
that block.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-03-24 21:51:33 +08:00
Linus Torvalds
ebe64824e9 Power management fixes for v4.11-rc4
- Make intel_pstate use one set of global P-state limits in the
    active mode regardless of the scaling_governor settings for
    individual CPUs instead of switching back and forth between two
    of them in a way that is hard to control (Rafael Wysocki).
 
  - Drop a useless function from intel_pstate to prevent it from
    modifying the maximum supported frequency value unexpectedly
    which may confuse the cpufreq core (Rafael Wysocki).
 
  - Fix the cpufreq core to restore policy limits on CPU online so
    that the limits are not reset over system suspend/resume, among
    other things (Viresh Kumar).
 
  - Fix the initialization of the schedutil cpufreq governor to
    make the IO-wait boosting mechanism in it actually work on
    systems with one CPU per cpufreq policy (Rafael Wysocki).
 
  - Add a sanity check to the cpuidle core to prevent crashes from
    happening if the architecture code initialization fails to set
    up things as expected (Vaidyanathan Srinivasan).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJY1HLGAAoJEILEb/54YlRxjAsQAIFcYfJKosA8IlmKcricR/WH
 CqBMCH9S6Y5YsggrFjcj3lVJbf5P43/h3U++O+/97lJsevfp4inCChpvVSQIWX4v
 xzk2v+5Ms8ROqVSLy34yTaB0ysCC//J5FvdQLsj0Zw9W/8yvi0DosPfeiAD7sYeb
 4qkJK3+yv5sLtZ41FmdVYabhC5KHQSAV6p6X+KOZnFV8cm+8TfOSERhStXASMTGc
 tvDpjIjPA1GLpHYdOK4UQ+Er1Hgwk2fNX7eXrpHh7QCQx4eZEN+g7DAC95Ify9Am
 gkTFc5eUfOFKU5KMshdQh6gnfoNaKi4d3E/ahmnU+KQuyKiy4KMyTNcuUBszSaDM
 ZTm0GooseV0UajaLH08BNbfpqDsiKc2fm1qkCQkXxpGjs80/bYn5gK+fpvkziq9x
 210Wc7XTWf7JfmPs0d3gZekaohHtqJVigCuA4dXH6kvbDvbDcfKOza6rqNmpSFrQ
 ifWH6M12Ut/G5NfwihhTRhoKkeQjqHFgikNC8BjF2Myem20026Vr6MKZrubDAlkq
 VWP7lT2zNSs1btsNqDrA9+wejwK8OwwrpfZOx3hbYL6Q+u/AuljIJ79aRz8ROZcE
 jQZeOKprmlAaDIASdxIM4yjwzSQE0l/CaHteHfKaddmcWTrtKR+CEfU0ODzYWoK0
 dcQwyMF3tOToj7BjWqKM
 =jy4C
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "One of these is an intel_pstate regression fix and it is not a small
  change, but it mostly removes code that shouldn't be there. That code
  was acquired by mistake and has been a source of constant pain since
  then, so the time has come to get rid of it finally. We have not seen
  problems with this change in the lab, so fingers crossed.

  The rest is more usual: one more intel_pstate commit removing useless
  code, a cpufreq core fix to make it restore policy limits on CPU
  online (which prevents the limits from being reset over system
  suspend/resume), a schedutil cpufreq governor initialization fix to
  make it actually work as advertised on all systems and an extra sanity
  check in the cpuidle core to prevent crashes from happening if the
  arch code messes things up.

  Specifics:

   - Make intel_pstate use one set of global P-state limits in the
     active mode regardless of the scaling_governor settings for
     individual CPUs instead of switching back and forth between two of
     them in a way that is hard to control (Rafael Wysocki).

   - Drop a useless function from intel_pstate to prevent it from
     modifying the maximum supported frequency value unexpectedly which
     may confuse the cpufreq core (Rafael Wysocki).

   - Fix the cpufreq core to restore policy limits on CPU online so that
     the limits are not reset over system suspend/resume, among other
     things (Viresh Kumar).

   - Fix the initialization of the schedutil cpufreq governor to make
     the IO-wait boosting mechanism in it actually work on systems with
     one CPU per cpufreq policy (Rafael Wysocki).

   - Add a sanity check to the cpuidle core to prevent crashes from
     happening if the architecture code initialization fails to set up
     things as expected (Vaidyanathan Srinivasan)"

* tag 'pm-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: Restore policy min/max limits on CPU online
  cpuidle: Validate cpu_dev in cpuidle_add_sysfs()
  cpufreq: intel_pstate: Fix policy data management in passive mode
  cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()
  cpufreq: intel_pstate: One set of global limits in active mode
2017-03-23 20:00:39 -07:00
Rafael J. Wysocki
38d4ea229d cpufreq: schedutil: Trace frequency only if it has changed
sugov_update_commit() calls trace_cpu_frequency() to record the
current CPU frequency if it has not changed in the fast switch case
to prevent utilities from getting confused (they may report that the
CPU is idle if the frequency has not been recorded for too long, for
example).

However, that may cause the tracepoint to be triggered quite often
for no real reason (if the frequency doesn't change, we will not
modify the last update time stamp and governor computations may
run again shortly when that happens), so don't do that (arguably, it
is done to work around a utilities bug anyway).

That allows code duplication in sugov_update_commit() to be reduced
somewhat too.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2017-03-24 02:57:22 +01:00
David S. Miller
16ae1f2236 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/genet/bcmmii.c
	drivers/net/hyperv/netvsc.c
	kernel/bpf/hashtab.c

Almost entirely overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-23 16:41:27 -07:00
Tom Hromatka
0107042768 sysrq: Reset the watchdog timers while displaying high-resolution timers
On systems with a large number of CPUs, running sysrq-<q> can cause
watchdog timeouts.  There are two slow sections of code in the sysrq-<q>
path in timer_list.c.

1. print_active_timers() - This function is called by print_cpu() and
   contains a slow goto loop.  On a machine with hundreds of CPUs, this
   loop took approximately 100ms for the first CPU in a NUMA node.
   (Subsequent CPUs in the same node ran much quicker.)  The total time
   to print all of the CPUs is ultimately long enough to trigger the
   soft lockup watchdog.

2. print_tickdevice() - This function outputs a large amount of textual
   information.  This function also took approximately 100ms per CPU.

Since sysrq-<q> is not a performance critical path, there should be no
harm in touching the nmi watchdog in both slow sections above.  Touching
it in just one location was insufficient on systems with hundreds of
CPUs as occasional timeouts were still observed during testing.

This issue was observed on an Oracle T7 machine with 128 CPUs, but I
anticipate it may affect other systems with similarly large numbers of
CPUs.

Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>
Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2017-03-23 12:46:53 -07:00
David Engraf
1b8955bc5a timers, sched_clock: Update timeout for clock wrap
The scheduler clock framework may not use the correct timeout for the clock
wrap. This happens when a new clock driver calls sched_clock_register()
after the kernel called sched_clock_postinit(). In this case the clock wrap
timeout is too long thus sched_clock_poll() is called too late and the clock
already wrapped.

On my ARM system the scheduler was no longer scheduling any other task than
the idle task because the sched_clock() wrapped.

Signed-off-by: David Engraf <david.engraf@sysgo.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2017-03-23 12:30:27 -07:00
Nicolai Stange
0695bd99c0 clockevents: Make clockevents_config() static
A clockevent device's rate should be configured before or at registration
and changed afterwards through clockevents_update_freq() only.

For the configuration at registration, we already have
clockevents_config_and_register().

Right now, there are no clockevents_config() users outside of the
clockevents core.

To mitigiate the risk of drivers errorneously reconfiguring their rates
through clockevents_config() *after* device registration, make
clockevents_config() static.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2017-03-23 12:14:05 -07:00
Linus Torvalds
f341d9f08a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Several netfilter fixes from Pablo and the crew:
      - Handle fragmented packets properly in netfilter conntrack, from
        Florian Westphal.
      - Fix SCTP ICMP packet handling, from Ying Xue.
      - Fix big-endian bug in nftables, from Liping Zhang.
      - Fix alignment of fake conntrack entry, from Steven Rostedt.

 2) Fix feature flags setting in fjes driver, from Taku Izumi.

 3) Openvswitch ipv6 tunnel source address not set properly, from Or
    Gerlitz.

 4) Fix jumbo MTU handling in amd-xgbe driver, from Thomas Lendacky.

 5) sk->sk_frag.page not released properly in some cases, from Eric
    Dumazet.

 6) Fix RTNL deadlocks in nl80211, from Johannes Berg.

 7) Fix erroneous RTNL lockdep splat in crypto, from Herbert Xu.

 8) Cure improper inflight handling during AF_UNIX GC, from Andrey
    Ulanov.

 9) sch_dsmark doesn't write to packet headers properly, from Eric
    Dumazet.

10) Fix SCM_TIMESTAMPING_OPT_STATS handling in TCP, from Soheil Hassas
    Yeganeh.

11) Add some IDs for Motorola qmi_wwan chips, from Tony Lindgren.

12) Fix nametbl deadlock in tipc, from Ying Xue.

13) GRO and LRO packets not counted correctly in mlx5 driver, from Gal
    Pressman.

14) Fix reset of internal PHYs in bcmgenet, from Doug Berger.

15) Fix hashmap allocation handling, from Alexei Starovoitov.

16) nl_fib_input() needs stronger netlink message length checking, from
    Eric Dumazet.

17) Fix double-free of sk->sk_filter during sock clone, from Daniel
    Borkmann.

18) Fix RX checksum offloading in aquantia driver, from Pavel Belous.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (85 commits)
  net:ethernet:aquantia: Fix for RX checksum offload.
  amd-xgbe: Fix the ECC-related bit position definitions
  sfc: cleanup a condition in efx_udp_tunnel_del()
  Bluetooth: btqcomsmd: fix compile-test dependency
  inet: frag: release spinlock before calling icmp_send()
  tcp: initialize icsk_ack.lrcvtime at session start time
  genetlink: fix counting regression on ctrl_dumpfamily()
  socket, bpf: fix sk_filter use after free in sk_clone_lock
  ipv4: provide stronger user input validation in nl_fib_input()
  bpf: fix hashmap extra_elems logic
  enic: update enic maintainers
  net: bcmgenet: remove bcmgenet_internal_phy_setup()
  ipv6: make sure to initialize sockc.tsflags before first use
  fjes: Do not load fjes driver if extended socket device is not power on.
  fjes: Do not load fjes driver if system does not have extended socket device.
  net/mlx5e: Count LRO packets correctly
  net/mlx5e: Count GSO packets correctly
  net/mlx5: Increase number of max QPs in default profile
  net/mlx5e: Avoid supporting udp tunnel port ndo for VF reps
  net/mlx5e: Use the proper UAPI values when offloading TC vlan actions
  ...
2017-03-23 11:29:49 -07:00
Peter Zijlstra
56222b212e futex: Drop hb->lock before enqueueing on the rtmutex
When PREEMPT_RT_FULL does the spinlock -> rt_mutex substitution the PI
chain code will (falsely) report a deadlock and BUG.

The problem is that it hold hb->lock (now an rt_mutex) while doing
task_blocks_on_rt_mutex on the futex's pi_state::rtmutex. This, when
interleaved just right with futex_unlock_pi() leads it to believe to see an
AB-BA deadlock.

  Task1 (holds rt_mutex,	Task2 (does FUTEX_LOCK_PI)
         does FUTEX_UNLOCK_PI)

				lock hb->lock
				lock rt_mutex (as per start_proxy)
  lock hb->lock

Which is a trivial AB-BA.

It is not an actual deadlock, because it won't be holding hb->lock by the
time it actually blocks on the rt_mutex, but the chainwalk code doesn't
know that and it would be a nightmare to handle this gracefully.

To avoid this problem, do the same as in futex_unlock_pi() and drop
hb->lock after acquiring wait_lock. This still fully serializes against
futex_unlock_pi(), since adding to the wait_list does the very same lock
dance, and removing it holds both locks.

Aside of solving the RT problem this makes the lock and unlock mechanism
symetric and reduces the hb->lock held time.

Reported-and-tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.161341537@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:14:59 +01:00
Peter Zijlstra
bebe5b5143 futex: Futex_unlock_pi() determinism
The problem with returning -EAGAIN when the waiter state mismatches is that
it becomes very hard to proof a bounded execution time on the
operation. And seeing that this is a RT operation, this is somewhat
important.

While in practise; given the previous patch; it will be very unlikely to
ever really take more than one or two rounds, proving so becomes rather
hard.

However, now that modifying wait_list is done while holding both hb->lock
and wait_lock, the scenario can be avoided entirely by acquiring wait_lock
while still holding hb-lock. Doing a hand-over, without leaving a hole.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.112378812@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:10 +01:00
Peter Zijlstra
cfafcd117d futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() all wait_list
modifications are done under both hb->lock and wait_lock.

This closes the obvious interleave pattern between futex_lock_pi() and
futex_unlock_pi(), but not entirely so. See below:

Before:

futex_lock_pi()			futex_unlock_pi()
  unlock hb->lock

				  lock hb->lock
				  unlock hb->lock

				  lock rt_mutex->wait_lock
				  unlock rt_mutex_wait_lock
				    -EAGAIN

  lock rt_mutex->wait_lock
  list_add
  unlock rt_mutex->wait_lock

  schedule()

  lock rt_mutex->wait_lock
  list_del
  unlock rt_mutex->wait_lock

				  <idem>
				    -EAGAIN

  lock hb->lock


After:

futex_lock_pi()			futex_unlock_pi()

  lock hb->lock
  lock rt_mutex->wait_lock
  list_add
  unlock rt_mutex->wait_lock
  unlock hb->lock

  schedule()
				  lock hb->lock
				  unlock hb->lock
  lock hb->lock
  lock rt_mutex->wait_lock
  list_del
  unlock rt_mutex->wait_lock

				  lock rt_mutex->wait_lock
				  unlock rt_mutex_wait_lock
				    -EAGAIN

  unlock hb->lock


It does however solve the earlier starvation/live-lock scenario which got
introduced with the -EAGAIN since unlike the before scenario; where the
-EAGAIN happens while futex_unlock_pi() doesn't hold any locks; in the
after scenario it happens while futex_unlock_pi() actually holds a lock,
and then it is serialized on that lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.062785528@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:09 +01:00
Peter Zijlstra
38d589f2fd futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
With the ultimate goal of keeping rt_mutex wait_list and futex_q waiters
consistent it's necessary to split 'rt_mutex_futex_lock()' into finer
parts, such that only the actual blocking can be done without hb->lock
held.

Split split_mutex_finish_proxy_lock() into two parts, one that does the
blocking and one that does remove_waiter() when the lock acquire failed.

When the rtmutex was acquired successfully the waiter can be removed in the
acquisiton path safely, since there is no concurrency on the lock owner.

This means that, except for futex_lock_pi(), all wait_list modifications
are done with both hb->lock and wait_lock held.

[bigeasy@linutronix.de: fix for futex_requeue_pi_signal_restart]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.001659630@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:09 +01:00
Peter Zijlstra
50809358dd futex,rt_mutex: Introduce rt_mutex_init_waiter()
Since there's already two copies of this code, introduce a helper now
before adding a third one.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.950039479@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:09 +01:00
Peter Zijlstra
16ffa12d74 futex: Pull rt_mutex_futex_unlock() out from under hb->lock
There's a number of 'interesting' problems, all caused by holding
hb->lock while doing the rt_mutex_unlock() equivalient.

Notably:

 - a PI inversion on hb->lock; and,

 - a SCHED_DEADLINE crash because of pointer instability.

The previous changes:

 - changed the locking rules to cover {uval,pi_state} with wait_lock.

 - allow to do rt_mutex_futex_unlock() without dropping wait_lock; which in
   turn allows to rely on wait_lock atomicity completely.

 - simplified the waiter conundrum.

It's now sufficient to hold rtmutex::wait_lock and a reference on the
pi_state to protect the state consistency, so hb->lock can be dropped
before calling rt_mutex_futex_unlock().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.900002056@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:08 +01:00
Peter Zijlstra
73d786bd04 futex: Rework inconsistent rt_mutex/futex_q state
There is a weird state in the futex_unlock_pi() path when it interleaves
with a concurrent futex_lock_pi() at the point where it drops hb->lock.

In this case, it can happen that the rt_mutex wait_list and the futex_q
disagree on pending waiters, in particular rt_mutex will find no pending
waiters where futex_q thinks there are. In this case the rt_mutex unlock
code cannot assign an owner.

The futex side fixup code has to cleanup the inconsistencies with quite a
bunch of interesting corner cases.

Simplify all this by changing wake_futex_pi() to return -EAGAIN when this
situation occurs. This then gives the futex_lock_pi() code the opportunity
to continue and the retried futex_unlock_pi() will now observe a coherent
state.

The only problem is that this breaks RT timeliness guarantees. That
is, consider the following scenario:

  T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1)

    CPU0

    T1
      lock_pi()
      queue_me()  <- Waiter is visible

    preemption

    T2
      unlock_pi()
	loops with -EAGAIN forever

Which is undesirable for PI primitives. Future patches will rectify
this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.850383690@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:08 +01:00
Peter Zijlstra
bf92cf3a51 futex: Cleanup refcounting
Add a put_pit_state() as counterpart for get_pi_state() so the refcounting
becomes consistent.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.801778516@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:08 +01:00
Peter Zijlstra
734009e96d futex: Change locking rules
Currently futex-pi relies on hb->lock to serialize everything. But hb->lock
creates another set of problems, especially priority inversions on RT where
hb->lock becomes a rt_mutex itself.

The rt_mutex::wait_lock is the most obvious protection for keeping the
futex user space value and the kernel internal pi_state in sync.

Rework and document the locking so rt_mutex::wait_lock is held accross all
operations which modify the user space value and the pi state.

This allows to invoke rt_mutex_unlock() (including deboost) without holding
hb->lock as a next step.

Nothing yet relies on the new locking rules.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.751993333@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:07 +01:00
Peter Zijlstra
5293c2efda futex,rt_mutex: Provide futex specific rt_mutex API
Part of what makes futex_unlock_pi() intricate is that
rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop
rt_mutex::wait_lock.

This means it cannot rely on the atomicy of wait_lock, which would be
preferred in order to not rely on hb->lock so much.

The reason rt_mutex_slowunlock() needs to drop wait_lock is because it can
race with the rt_mutex fastpath, however futexes have their own fast path.

Since futexes already have a bunch of separate rt_mutex accessors, complete
that set and implement a rt_mutex variant without fastpath for them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.702962446@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:07 +01:00
Peter Zijlstra
fffa954fb5 futex: Remove rt_mutex_deadlock_account_*()
These are unused and clutter up the code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.652692478@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:07 +01:00
Peter Zijlstra
1b367ece0d futex: Use smp_store_release() in mark_wake_futex()
Since the futex_q can dissapear the instruction after assigning NULL,
this really should be a RELEASE barrier. That stops loads from hitting
dead memory too.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.604296452@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:06 +01:00
Peter Zijlstra
499f5aca2c futex: Cleanup variable names for futex_top_waiter()
futex_top_waiter() returns the top-waiter on the pi_mutex. Assinging
this to a variable 'match' totally obscures the code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.554710645@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-23 19:10:06 +01:00
Vincent Guittot
bc4278987e sched/fair: Fix FTQ noise bench regression
A regression of the FTQ noise has been reported by Ying Huang,
on the following hardware:

  8 threads Intel(R) Core(TM)i7-4770 CPU @ 3.40GHz with 8G memory

... which was caused by this commit:

  commit 4e5160766f ("sched/fair: Propagate asynchrous detach")

The only part of the patch that can increase the noise is the update
of blocked load of group entity in update_blocked_averages().

We can optimize this call and skip the update of group entity if its load
and utilization are already null and there is no pending propagation of load
in the task group.

This optimization partly restores the noise score. A more agressive
optimization has been tried but has shown worse score.

Reported-by: ying.huang@linux.intel.com
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: ying.huang@intel.com
Fixes: 4e5160766f ("sched/fair: Propagate asynchrous detach")
Link: http://lkml.kernel.org/r/1489758442-2877-1-git-send-email-vincent.guittot@linaro.org
[ Fixed typos, improved layout. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-23 07:44:51 +01:00
Wanpeng Li
d7921a5dda sched/core: Fix rq lock pinning warning after call balance callbacks
This can be reproduced by running rt-migrate-test:

 WARNING: CPU: 2 PID: 2195 at kernel/locking/lockdep.c:3670 lock_unpin_lock()
 unpinning an unpinned lock
 ...
 Call Trace:
  dump_stack()
  __warn()
  warn_slowpath_fmt()
  lock_unpin_lock()
  __balance_callback()
  __schedule()
  schedule()
  futex_wait_queue_me()
  futex_wait()
  do_futex()
  SyS_futex()
  do_syscall_64()
  entry_SYSCALL64_slow_path()

Revert the rq_lock_irqsave() usage here, the whole point of the
balance_callback() was to allow dropping rq->lock.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 8a8c69c327 ("sched/core: Add rq->lock wrappers")
Link: http://lkml.kernel.org/r/1489718719-3951-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-23 07:44:51 +01:00
Peter Zijlstra
698eff6355 sched/clock, x86/perf: Fix "perf test tsc"
People reported that commit:

  5680d8094f ("sched/clock: Provide better clock continuity")

broke "perf test tsc".

That commit added another offset to the reported clock value; so
take that into account when computing the provided offset values.

Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 5680d8094f ("sched/clock: Provide better clock continuity")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-23 07:31:49 +01:00
Peter Zijlstra
71fdb70eb4 sched/clock: Fix clear_sched_clock_stable() preempt wobbly
Paul reported a problems with clear_sched_clock_stable(). Since we run
all of __clear_sched_clock_stable() from workqueue context, there's a
preempt problem.

Solve it by only running the static_key_disable() from workqueue.

Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fweisbec@gmail.com
Link: http://lkml.kernel.org/r/20170313124621.GA3328@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-23 07:31:48 +01:00
Dave Airlie
65d1086c44 Linux 4.11-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJYzznuAAoJEHm+PkMAQRiGAzMIAJDBo5otTMMLhg8eKj8Cnab4
 2NyaoWDN6mtU427rzEKEfZlTtp3gIBVdFex5x442weIdw6BgRQW0dvF/uwEn08yI
 9Wx7VJmIUyH9M8VmhDtkUTFrhwUGr29qb3JhENMd7tv/CiJaehGRHCT3xqo5BDdu
 xiyPcwSkwP/NH24TS91G87gV6r0I0oKLSAxu+KifEFESrb8gaZaduslzpEj3m/Ds
 o9EPpfzaiGAdW5EdNfPtviYbBk7ZOXwtxdMV+zlvsLcaqtYnFEsJZd2WyZL0zGML
 VXBVxaYtlyTeA7Mt8YYUL+rDHELSOtCeN5zLfxUvYt+Yc0Y6LFBLDOE5h8b3eCw=
 =uKUo
 -----END PGP SIGNATURE-----

BackMerge tag 'v4.11-rc3' into drm-next

Linux 4.11-rc3 as requested by Daniel
2017-03-23 12:05:13 +10:00
Rafael J. Wysocki
b7eaf1aab9 cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
The way the schedutil governor uses the PELT metric causes it to
underestimate the CPU utilization in some cases.

That can be easily demonstrated by running kernel compilation on
a Sandy Bridge Intel processor, running turbostat in parallel with
it and looking at the values written to the MSR_IA32_PERF_CTL
register.  Namely, the expected result would be that when all CPUs
were 100% busy, all of them would be requested to run in the maximum
P-state, but observation shows that this clearly isn't the case.
The CPUs run in the maximum P-state for a while and then are
requested to run slower and go back to the maximum P-state after
a while again.  That causes the actual frequency of the processor to
visibly oscillate below the sustainable maximum in a jittery fashion
which clearly is not desirable.

That has been attributed to CPU utilization metric updates on task
migration that cause the total utilization value for the CPU to be
reduced by the utilization of the migrated task.  If that happens,
the schedutil governor may see a CPU utilization reduction and will
attempt to reduce the CPU frequency accordingly right away.  That
may be premature, though, for example if the system is generally
busy and there are other runnable tasks waiting to be run on that
CPU already.

This is unlikely to be an issue on systems where cpufreq policies are
shared between multiple CPUs, because in those cases the policy
utilization is computed as the maximum of the CPU utilization values
over the whole policy and if that turns out to be low, reducing the
frequency for the policy most likely is a good idea anyway.  On
systems with one CPU per policy, however, it may affect performance
adversely and even lead to increased energy consumption in some cases.

On those systems it may be addressed by taking another utilization
metric into consideration, like whether or not the CPU whose
frequency is about to be reduced has been idle recently, because if
that's not the case, the CPU is likely to be busy in the near future
and its frequency should not be reduced.

To that end, use the counter of idle calls in the timekeeping code.
Namely, make the schedutil governor look at that counter for the
current CPU every time before its frequency is about to be reduced.
If the counter has not changed since the previous iteration of the
governor computations for that CPU, the CPU has been busy for all
that time and its frequency should not be decreased, so if the new
frequency would be lower than the one set previously, the governor
will skip the frequency update.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joel Fernandes <joelaf@google.com>
2017-03-23 02:12:14 +01:00
Martin KaFai Lau
bcc6b1b7eb bpf: Add hash of maps support
This patch adds hash of maps support (hashmap->bpf_map).
BPF_MAP_TYPE_HASH_OF_MAPS is added.

A map-in-map contains a pointer to another map and lets call
this pointer 'inner_map_ptr'.

Notes on deleting inner_map_ptr from a hash map:

1. For BPF_F_NO_PREALLOC map-in-map, when deleting
   an inner_map_ptr, the htab_elem itself will go through
   a rcu grace period and the inner_map_ptr resides
   in the htab_elem.

2. For pre-allocated htab_elem (!BPF_F_NO_PREALLOC),
   when deleting an inner_map_ptr, the htab_elem may
   get reused immediately.  This situation is similar
   to the existing prealloc-ated use cases.

   However, the bpf_map_fd_put_ptr() calls bpf_map_put() which calls
   inner_map->ops->map_free(inner_map) which will go
   through a rcu grace period (i.e. all bpf_map's map_free
   currently goes through a rcu grace period).  Hence,
   the inner_map_ptr is still safe for the rcu reader side.

This patch also includes BPF_MAP_TYPE_HASH_OF_MAPS to the
check_map_prealloc() in the verifier.  preallocation is a
must for BPF_PROG_TYPE_PERF_EVENT.  Hence, even we don't expect
heavy updates to map-in-map, enforcing BPF_F_NO_PREALLOC for map-in-map
is impossible without disallowing BPF_PROG_TYPE_PERF_EVENT from using
map-in-map first.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 15:45:45 -07:00
Martin KaFai Lau
56f668dfe0 bpf: Add array of maps support
This patch adds a few helper funcs to enable map-in-map
support (i.e. outer_map->inner_map).  The first outer_map type
BPF_MAP_TYPE_ARRAY_OF_MAPS is also added in this patch.
The next patch will introduce a hash of maps type.

Any bpf map type can be acted as an inner_map.  The exception
is BPF_MAP_TYPE_PROG_ARRAY because the extra level of
indirection makes it harder to verify the owner_prog_type
and owner_jited.

Multi-level map-in-map is not supported (i.e. map->map is ok
but not map->map->map).

When adding an inner_map to an outer_map, it currently checks the
map_type, key_size, value_size, map_flags, max_entries and ops.
The verifier also uses those map's properties to do static analysis.
map_flags is needed because we need to ensure BPF_PROG_TYPE_PERF_EVENT
is using a preallocated hashtab for the inner_hash also.  ops and
max_entries are needed to generate inlined map-lookup instructions.
For simplicity reason, a simple '==' test is used for both map_flags
and max_entries.  The equality of ops is implied by the equality of
map_type.

During outer_map creation time, an inner_map_fd is needed to create an
outer_map.  However, the inner_map_fd's life time does not depend on the
outer_map.  The inner_map_fd is merely used to initialize
the inner_map_meta of the outer_map.

Also, for the outer_map:

* It allows element update and delete from syscall
* It allows element lookup from bpf_prog

The above is similar to the current fd_array pattern.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 15:45:45 -07:00
Martin KaFai Lau
fad73a1a35 bpf: Fix and simplifications on inline map lookup
Fix in verifier:
For the same bpf_map_lookup_elem() instruction (i.e. "call 1"),
a broken case is "a different type of map could be used for the
same lookup instruction". For example, an array in one case and a
hashmap in another.  We have to resort to the old dynamic call behavior
in this case.  The fix is to check for collision on insn_aux->map_ptr.
If there is collision, don't inline the map lookup.

Please see the "do_reg_lookup()" in test_map_in_map_kern.c in the later
patch for how-to trigger the above case.

Simplifications on array_map_gen_lookup():
1. Calculate elem_size from map->value_size.  It removes the
   need for 'struct bpf_array' which makes the later map-in-map
   implementation easier.
2. Remove the 'elem_size == 1' test

Fixes: 81ed18ab30 ("bpf: add helper inlining infra and optimize map_array lookup")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 15:45:45 -07:00
Alexei Starovoitov
8c290e60fa bpf: fix hashmap extra_elems logic
In both kmalloc and prealloc mode the bpf_map_update_elem() is using
per-cpu extra_elems to do atomic update when the map is full.
There are two issues with it. The logic can be misused, since it allows
max_entries+num_cpus elements to be present in the map. And alloc_extra_elems()
at map creation time can fail percpu alloc for large map values with a warn:
WARNING: CPU: 3 PID: 2752 at ../mm/percpu.c:892 pcpu_alloc+0x119/0xa60
illegal size (32824) or align (8) for percpu allocation

The fixes for both of these issues are different for kmalloc and prealloc modes.
For prealloc mode allocate extra num_possible_cpus elements and store
their pointers into extra_elems array instead of actual elements.
Hence we can use these hidden(spare) elements not only when the map is full
but during bpf_map_update_elem() that replaces existing element too.
That also improves performance, since pcpu_freelist_pop/push is avoided.
Unfortunately this approach cannot be used for kmalloc mode which needs
to kfree elements after rcu grace period. Therefore switch it back to normal
kmalloc even when full and old element exists like it was prior to
commit 6c90598174 ("bpf: pre-allocate hash map elements").

Add tests to check for over max_entries and large map values.

Reported-by: Dave Jones <davej@codemonkey.org.uk>
Fixes: 6c90598174 ("bpf: pre-allocate hash map elements")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 14:12:18 -07:00
Paul Moore
5b52330bbf audit: fix auditd/kernel connection state tracking
What started as a rather straightforward race condition reported by
Dmitry using the syzkaller fuzzer ended up revealing some major
problems with how the audit subsystem managed its netlink sockets and
its connection with the userspace audit daemon.  Fixing this properly
had quite the cascading effect and what we are left with is this rather
large and complicated patch.  My initial goal was to try and decompose
this patch into multiple smaller patches, but the way these changes
are intertwined makes it difficult to split these changes into
meaningful pieces that don't break or somehow make things worse for
the intermediate states.

The patch makes a number of changes, but the most significant are
highlighted below:

* The auditd tracking variables, e.g. audit_sock, are now gone and
replaced by a RCU/spin_lock protected variable auditd_conn which is
a structure containing all of the auditd tracking information.

* We no longer track the auditd sock directly, instead we track it
via the network namespace in which it resides and we use the audit
socket associated with that namespace.  In spirit, this is what the
code was trying to do prior to this patch (at least I think that is
what the original authors intended), but it was done rather poorly
and added a layer of obfuscation that only masked the underlying
problems.

* Big backlog queue cleanup, again.  In v4.10 we made some pretty big
changes to how the audit backlog queues work, here we haven't changed
the queue design so much as cleaned up the implementation.  Brought
about by the locking changes, we've simplified kauditd_thread() quite
a bit by consolidating the queue handling into a new helper function,
kauditd_send_queue(), which allows us to eliminate a lot of very
similar code and makes the looping logic in kauditd_thread() clearer.

* All netlink messages sent to auditd are now sent via
auditd_send_unicast_skb().  Other than just making sense, this makes
the lock handling easier.

* Change the audit_log_start() sleep behavior so that we never sleep
on auditd events (unchanged) or if the caller is holding the
audit_cmd_mutex (changed).  Previously we didn't sleep if the caller
was auditd or if the message type fell between a certain range; the
type check was a poor effort of doing what the cmd_mutex check now
does.  Richard Guy Briggs originally proposed not sleeping the
cmd_mutex owner several years ago but his patch wasn't acceptable
at the time.  At least the idea lives on here.

* A problem with the lost record counter has been resolved.  Steve
Grubb and I both happened to notice this problem and according to
some quick testing by Steve, this problem goes back quite some time.
It's largely a harmless problem, although it may have left some
careful sysadmins quite puzzled.

Cc: <stable@vger.kernel.org> # 4.10.x-
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-03-21 11:26:35 -04:00
Rafael J. Wysocki
4296f23ed4 cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()
sugov_start() only initializes struct sugov_cpu per-CPU structures
for shared policies, but it should do that for single-CPU policies too.

That in particular makes the IO-wait boost mechanism work in the
cases when cpufreq policies correspond to individual CPUs.

Fixes: 21ca6d2c52 (cpufreq: schedutil: Add iowait boosting)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.9+ <stable@vger.kernel.org> # 4.9+
2017-03-21 01:03:08 +01:00
Linus Torvalds
3e51f893de Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug fix from Thomas Gleixner:
 "A single fix preventing the concurrent execution of the CPU hotplug
  callback install/invocation machinery. Long standing bug caused by a
  massive brain slip of that Gleixner dude, which went unnoticed for
  almost a year"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Serialize callback invocations proper
2017-03-18 08:33:44 -07:00
Linus Torvalds
a7fc726bb2 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "A set of perf related fixes:

   - fix a CR4.PCE propagation issue caused by usage of mm instead of
     active_mm and therefore propagated the wrong value.

   - perf core fixes, which plug a use-after-free issue and make the
     event inheritance on fork more robust.

   - a tooling fix for symbol handling"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf symbols: Fix symbols__fixup_end heuristic for corner cases
  x86/perf: Clarify why x86_pmu_event_mapped() isn't racy
  x86/perf: Fix CR4.PCE propagation to use active_mm instead of mm
  perf/core: Better explain the inherit magic
  perf/core: Simplify perf_event_free_task()
  perf/core: Fix event inheritance on fork()
  perf/core: Fix use-after-free in perf_release()
2017-03-17 13:59:52 -07:00
Linus Torvalds
cd21debe53 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "From the scheduler departement:

   - a bunch of sched deadline related fixes which deal with various
     buglets and corner cases.

   - two fixes for the loadavg spikes which are caused by the delayed
     NOHZ accounting"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Use deadline instead of period when calculating overflow
  sched/deadline: Throttle a constrained deadline task activated after the deadline
  sched/deadline: Make sure the replenishment timer fires in the next period
  sched/loadavg: Use {READ,WRITE}_ONCE() for sample window
  sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
  sched/deadline: Add missing update_rq_clock() in dl_task_timer()
2017-03-17 13:19:07 -07:00
Linus Torvalds
b5f13082b1 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
 "Three fixes related to locking:

   - fix a SIGKILL issue for RWSEM_GENERIC_SPINLOCK which has been fixed
     for the XCHGADD variant already

   - plug a potential use after free in the futex code

   - prevent leaking a held spinlock in an futex error handling code
     path"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rwsem: Fix down_write_killable() for CONFIG_RWSEM_GENERIC_SPINLOCK=y
  futex: Add missing error handling to FUTEX_REQUEUE_PI
  futex: Fix potential use-after-free in FUTEX_REQUEUE_PI
2017-03-17 13:16:24 -07:00
Stephen Boyd
016da20148 hrtimer: Remove hrtimer_peek_ahead_timers() leftovers
This function was removed in commit c6eb3f70d4 (hrtimer: Get rid of
hrtimer softirq, 2015-04-14) but the prototype wasn't ever deleted.

Delete it now.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/20170317010814.2591-1-sboyd@codeaurora.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-17 18:24:58 +01:00
Tejun Heo
77f88796ce cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator.  Once the new kthread starts
running, it initializes itself and wakes up the creator.  The creator
then can further configure the kthread and then let it start doing its
job by waking it up.

In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland.  When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity.  This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.

Unfortunately, the cgroup side of protection is racy.  While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.

This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one.  Per-cpu
workqueue workers got caught while being created and ended up with
incorrected CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.

This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland.  kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed.  The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.

It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side.  Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.

v2: Switch to a simpler implementation using a new task_struct bit
    field suggested by Oleg.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-17 10:18:47 -04:00
Alexei Starovoitov
9015d2f595 bpf: inline htab_map_lookup_elem()
Optimize:
bpf_call
  bpf_map_lookup_elem
    map->ops->map_lookup_elem
      htab_map_lookup_elem
        __htab_map_lookup_elem
into:
bpf_call
  __htab_map_lookup_elem

to improve performance of JITed programs.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 20:44:11 -07:00
Alexei Starovoitov
81ed18ab30 bpf: add helper inlining infra and optimize map_array lookup
Optimize bpf_call -> bpf_map_lookup_elem() -> array_map_lookup_elem()
into a sequence of bpf instructions.
When JIT is on the sequence of bpf instructions is the sequence
of native cpu instructions with significantly faster performance
than indirect call and two function's prologue/epilogue.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 20:44:11 -07:00
Alexei Starovoitov
8041902dae bpf: adjust insn_aux_data when patching insns
convert_ctx_accesses() replaces single bpf instruction with a set of
instructions. Adjust corresponding insn_aux_data while patching.
It's needed to make sure subsequent 'for(all insn)' loops
have matching insn and insn_aux_data.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 20:44:11 -07:00
Alexei Starovoitov
79741b3bde bpf: refactor fixup_bpf_calls()
reduce indent and make it iterate over instructions similar to
convert_ctx_accesses(). Also convert hard BUG_ON into soft verifier error.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 20:44:11 -07:00
Alexei Starovoitov
e245c5c6a5 bpf: move fixup_bpf_calls() function
no functional change.
move fixup_bpf_calls() to verifier.c
it's being refactored in the next patch

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 20:44:11 -07:00
Heiko Carstens
55adc1d05d mm: add private lock to serialize memory hotplug operations
Commit bfc8c90139 ("mem-hotplug: implement get/put_online_mems")
introduced new functions get/put_online_mems() and mem_hotplug_begin/end()
in order to allow similar semantics for memory hotplug like for cpu
hotplug.

The corresponding functions for cpu hotplug are get/put_online_cpus()
and cpu_hotplug_begin/done() for cpu hotplug.

The commit however missed to introduce functions that would serialize
memory hotplug operations like they are done for cpu hotplug with
cpu_maps_update_begin/done().

This basically leaves mem_hotplug.active_writer unprotected and allows
concurrent writers to modify it, which may lead to problems as outlined
by commit f931ab479d ("mm: fix devm_memremap_pages crash, use
mem_hotplug_{begin, done}").

That commit was extended again with commit b5d24fda9c ("mm,
devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin,
done}") which serializes memory hotplug operations for some call sites
by using the device_hotplug lock.

In addition with commit 3fc2192410 ("mm: validate device_hotplug is held
for memory hotplug") a sanity check was added to mem_hotplug_begin() to
verify that the device_hotplug lock is held.

This in turn triggers the following warning on s390:

WARNING: CPU: 6 PID: 1 at drivers/base/core.c:643 assert_held_device_hotplug+0x4a/0x58
 Call Trace:
  assert_held_device_hotplug+0x40/0x58)
  mem_hotplug_begin+0x34/0xc8
  add_memory_resource+0x7e/0x1f8
  add_memory+0xda/0x130
  add_memory_merged+0x15c/0x178
  sclp_detect_standby_memory+0x2ae/0x2f8
  do_one_initcall+0xa2/0x150
  kernel_init_freeable+0x228/0x2d8
  kernel_init+0x2a/0x140
  kernel_thread_starter+0x6/0xc

One possible fix would be to add more lock_device_hotplug() and
unlock_device_hotplug() calls around each call site of
mem_hotplug_begin/end().  But that would give the device_hotplug lock
additional semantics it better should not have (serialize memory hotplug
operations).

Instead add a new memory_add_remove_lock which has the similar semantics
like cpu_add_remove_lock for cpu hotplug.

To keep things hopefully a bit easier the lock will be locked and unlocked
within the mem_hotplug_begin/end() functions.

Link: http://lkml.kernel.org/r/20170314125226.16779-2-heiko.carstens@de.ibm.com
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-16 16:56:18 -07:00
Ingo Molnar
61f63e3837 perf/core improvements and fixes:
New features:
 
 - Add 'brstackinsn' field in 'perf script' to reuse the x86 instruction
   decoder used in the Intel PT code to study hot paths to samples (Andi Kleen)
 
 Kernel:
 
 - Default UPROBES_EVENTS to Y (Alexei Starovoitov)
 
 - Fix check for kretprobe offset within function entry (Naveen N. Rao)
 
 Infrastructure:
 
 - Introduce util func is_sdt_event() (Ravi Bangoria)
 
 - Make perf_event__synthesize_mmap_events() scale on older kernels where
   reading /proc/pid/maps is way slower than reading /proc/pid/task/pid/maps (Stephane Eranian)
 
 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJYyrdSAAoJENZQFvNTUqpAe+4P/3c4ilBSOxLCCxGO7jDYo9oq
 /KqlvsCIg7+vo5eqrOUJAb4qXFnvpYxwjMMkL5rx7gdsBCRfRXIINGWUMrq5mNyk
 MgxuqYnp+yRuxLYml2wn+tdwLzcHWSN2EO9mqQ14N4I+HvgdLmVPQ44ACQXs6KfL
 dk/Ix8YtnFWl2sDZjvyr7ZBqwCPzzklZgHM6erxNUr/WJspzUiixAWqUmewodOUl
 P3PitlHXkITOK3AxSqOjJ4g1k933215nGih7hr0XdjEm4pIYaYksShQ6k9DASCrv
 dn2o1pF1LTu7KCtAo70aaSB7GXydwoA//o2gRbDkSwJJ25DIImZxJXQz9PAYDOo1
 vXSIhmlQ72c4/Yv/XzVOrIoMMMpmWKS3lGZxMVGR/Ie9Gw4kbotkaoEqEpNQsaDZ
 iIaU5v/EcvvToT7T7VHrGg0+vmHgYxm5gSlyASi2IrO2/wJAs0v2pYfuL6gYhXGp
 mhv/pHUv4l9OW+Ubm+zJEEcg337c2RQU5wT/bk4PihxY6nQyEH2Pn5VzdNbZLuMR
 eWnqTH/md+8/bkhmuZJp71wm60oPHoPvbDjvtfVmXAa52AzO+NWSc9Veke3C/QRm
 XgNkrXlzeKopEso3j4gw2iAolqw9t8FHFLGgbTkS+6UCKjAM7vNLiIV02LQqhM50
 qCnKEusMDCRgzeOXxYt+
 =Bg5M
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-for-mingo-4.12-20170316' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core

Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:

New features:

 - Add 'brstackinsn' field in 'perf script' to reuse the x86 instruction
   decoder used in the Intel PT code to study hot paths to samples (Andi Kleen)

Kernel changes:

 - Default UPROBES_EVENTS to Y (Alexei Starovoitov)

 - Fix check for kretprobe offset within function entry (Naveen N. Rao)

Infrastructure changes:

 - Introduce util func is_sdt_event() (Ravi Bangoria)

 - Make perf_event__synthesize_mmap_events() scale on older kernels where
   reading /proc/pid/maps is way slower than reading /proc/pid/task/pid/maps (Stephane Eranian)

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 17:29:23 +01:00
Arnaldo Carvalho de Melo
61f35d7506 uprobes: Default UPROBES_EVENTS to Y
As it is already turned on by most distros, so just flip the default to
Y.

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Wang Nan <wangnan0@huawei.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: David Miller <davem@davemloft.net>
Cc: Hemant Kumar <hemant@linux.vnet.ibm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20170316005817.GA6805@ast-mbp.thefacebook.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-03-16 12:42:02 -03:00
Peter Zijlstra
d8a8cfc769 perf/core: Better explain the inherit magic
While going through the event inheritance code Oleg got confused.

Add some comments to better explain the silent dissapearance of
orphaned events.

So what happens is that at perf_event_release_kernel() time; when an
event looses its connection to userspace (and ceases to exist from the
user's perspective) we can still have an arbitrary amount of inherited
copies of the event. We want to synchronously find and remove all
these child events.

Since that requires a bit of lock juggling, there is the possibility
that concurrent clone()s will create new child events. Therefore we
first mark the parent event as DEAD, which marks all the extant child
events as orphaned.

We then avoid copying orphaned events; in order to avoid getting more
of them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: fweisbec@gmail.com
Link: http://lkml.kernel.org/r/20170316125823.289567442@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 14:16:53 +01:00
Peter Zijlstra
15121c789e perf/core: Simplify perf_event_free_task()
We have ctx->event_list that contains all events; no need to
repeatedly iterate the group lists to find them all.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: fweisbec@gmail.com
Link: http://lkml.kernel.org/r/20170316125823.239678244@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 14:16:53 +01:00
Peter Zijlstra
e7cc4865f0 perf/core: Fix event inheritance on fork()
While hunting for clues to a use-after-free, Oleg spotted that
perf_event_init_context() can loose an error value with the result
that fork() can succeed even though we did not fully inherit the perf
event context.

Spotted-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: oleg@redhat.com
Cc: stable@vger.kernel.org
Fixes: 889ff01506 ("perf/core: Split context's event group list into pinned and non-pinned lists")
Link: http://lkml.kernel.org/r/20170316125823.190342547@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 14:16:52 +01:00
Peter Zijlstra
e552a8389a perf/core: Fix use-after-free in perf_release()
Dmitry reported syzcaller tripped a use-after-free in perf_release().

After much puzzlement Oleg spotted the below scenario:

  Task1                           Task2

  fork()
    perf_event_init_task()
    /* ... */
    goto bad_fork_$foo;
    /* ... */
    perf_event_free_task()
      mutex_lock(ctx->lock)
      perf_free_event(B)

                                  perf_event_release_kernel(A)
                                    mutex_lock(A->child_mutex)
                                    list_for_each_entry(child, ...) {
                                      /* child == B */
                                      ctx = B->ctx;
                                      get_ctx(ctx);
                                      mutex_unlock(A->child_mutex);

        mutex_lock(A->child_mutex)
        list_del_init(B->child_list)
        mutex_unlock(A->child_mutex)

        /* ... */

      mutex_unlock(ctx->lock);
      put_ctx() /* >0 */
    free_task();
                                      mutex_lock(ctx->lock);
                                      mutex_lock(A->child_mutex);
                                      /* ... */
                                      mutex_unlock(A->child_mutex);
                                      mutex_unlock(ctx->lock)
                                      put_ctx() /* 0 */
                                        ctx->task && !TOMBSTONE
                                          put_task_struct() /* UAF */

This patch closes the hole by making perf_event_free_task() destroy the
task <-> ctx relation such that perf_event_release_kernel() will no longer
observe the now dead task.

Spotted-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: fweisbec@gmail.com
Cc: oleg@redhat.com
Cc: stable@vger.kernel.org
Fixes: c6e5b73242 ("perf: Synchronously clean up child events")
Link: http://lkml.kernel.org/r/20170314155949.GE32474@worktop
Link: http://lkml.kernel.org/r/20170316125823.140295131@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 14:16:52 +01:00
Peter Zijlstra
bf7b3ac2e3 locking/ww_mutex: Improve test to cover acquire context changes
Currently each thread starts an acquire context only once, and
performs all its loop iterations under it.

This means that the Wound/Wait relations between threads are fixed.

To make things a little more realistic and cover more of the
functionality with the test, open a new acquire context for each loop.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:57:09 +01:00
Thomas Gleixner
383776fa75 locking/lockdep: Handle statically initialized PER_CPU locks properly
If a PER_CPU struct which contains a spin_lock is statically initialized
via:

DEFINE_PER_CPU(struct foo, bla) = {
	.lock = __SPIN_LOCK_UNLOCKED(bla.lock)
};

then lockdep assigns a seperate key to each lock because the logic for
assigning a key to statically initialized locks is to use the address as
the key. With per CPU locks the address is obvioulsy different on each CPU.

That's wrong, because all locks should have the same key.

To solve this the following modifications are required:

 1) Extend the is_kernel/module_percpu_addr() functions to hand back the
    canonical address of the per CPU address, i.e. the per CPU address
    minus the per CPU offset.

 2) Check the lock address with these functions and if the per CPU check
    matches use the returned canonical address as the lock key, so all per
    CPU locks have the same key.

 3) Move the static_obj(key) check into look_up_lock_class() so this check
    can be avoided for statically initialized per CPU locks.  That's
    required because the canonical address fails the static_obj(key) check
    for obvious reasons.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ Merged Dan's fixups for !MODULES and !SMP into this patch. ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Murphy <dmurphy@ti.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170227143736.pectaimkjkan5kow@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:57:08 +01:00
J. R. Okajima
6419c4af77 locking/lockdep: Add new check to lock_downgrade()
Commit:

  f8319483f5 ("locking/lockdep: Provide a type check for lock_is_held")

didn't fully cover rwsems as downgrade_write() was left out.

Introduce lock_downgrade() and use it to add new checks.

See-also: http://marc.info/?l=linux-kernel&m=148581164003149&w=2
Originally-written-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1486053497-9948-3-git-send-email-hooanon05g@gmail.com
[ Rewrote the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:57:07 +01:00
J. R. Okajima
e969970be0 locking/lockdep: Factor out the validate_held_lock() helper function
Behaviour should not change.

Signed-off-by: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1486053497-9948-2-git-send-email-hooanon05g@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:57:07 +01:00
J. R. Okajima
41c2c5b86a locking/lockdep: Factor out the find_held_lock() helper function
A simple consolidataion to factor out repeated patterns.

The behaviour should not change.

Signed-off-by: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1486053497-9948-1-git-send-email-hooanon05g@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:57:06 +01:00
Alexander Shishkin
ae0c2d995d perf/core: Add a flag for partial AUX records
The Intel PT driver needs to be able to communicate partial AUX transactions,
that is, transactions with gaps in data for reasons other than no room
left in the buffer (i.e. truncated transactions). Therefore, this condition
does not imply a wakeup for the consumer.

To this end, add a new "partial" AUX flag.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170220133352.17995-4-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:51:11 +01:00
Will Deacon
f4c0b0aa58 perf/core: Keep AUX flags in the output handle
In preparation for adding more flags to perf AUX records, introduce a
separate API for setting the flags for a session, rather than appending
more bool arguments to perf_aux_output_end. This allows to set each
flag at the time a corresponding condition is detected, instead of
tracking it in each driver's private state.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170220133352.17995-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:51:10 +01:00
Ingo Molnar
2b95bd7d58 Merge branch 'linus' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:50:50 +01:00
Peter Zijlstra
15ff991e80 sched/core: Avoid double update_rq_clock() in move_queued_task()
Address this case:

  WARNING: CPU: 0 PID: 2070 at ../kernel/sched/core.c:109 update_rq_clock+0x74/0x80
  rq->clock_update_flags & RQCF_UPDATED

  Call Trace:
  update_rq_clock()
  move_queued_task()
  __set_cpus_allowed_ptr()
  ...

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:26 +01:00
Peter Zijlstra
5704ac0ae7 sched/core: Fix double update_rq_clock) calls in attach_task()/detach_task()
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:25 +01:00
Peter Zijlstra
7a57f32a4d sched/core: Avoid obvious double update_rq_clock()
Add DEQUEUE_NOCLOCK to all places where we just did an
update_rq_clock() already.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:25 +01:00
Peter Zijlstra
bce4dc80c6 sched/core: Simplify update_rq_clock() in __schedule()
Instead of relying on deactivate_task() to call update_rq_clock() and
handling the case where it didn't happen (task_on_rq_queued),
unconditionally do update_rq_clock() and skip any further updates.

This also avoids a double update on deactivate_task() + ttwu_local().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:24 +01:00
Peter Zijlstra
77558e4d01 sched/core: Make sched_ttwu_pending() atomic in time
Since all tasks on the wake_list are woken under a single rq->lock
avoid calling update_rq_clock() for each task.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:24 +01:00
Peter Zijlstra
7134b3e941 sched/core: Add ENQUEUE_NOCLOCK to ENQUEUE_RESTORE
In all cases, ENQUEUE_RESTORE should also have ENQUEUE_NOCLOCK because
DEQUEUE_SAVE will have done an update_rq_clock().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:23 +01:00
Peter Zijlstra
0a67d1ee30 sched/core: Add {EN,DE}QUEUE_NOCLOCK flags
Currently {en,de}queue_task() do an unconditional update_rq_clock().
However since we want to avoid duplicate updates, so that each
rq->lock section appears atomic in time, we need to be able to skip
these clock updates.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:23 +01:00
Peter Zijlstra
8a8c69c327 sched/core: Add rq->lock wrappers
The missing update_rq_clock() check can work with partial rq->lock
wrappery, since a missing wrapper can cause the warning to not be
emitted when it should have, but cannot cause the warning to trigger
when it should not have.

The duplicate update_rq_clock() check however can cause false warnings
to trigger. Therefore add more comprehensive rq->lock wrappery.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:22 +01:00
Peter Zijlstra
26ae58d23b sched/core: Add WARNING for multiple update_rq_clock() calls
Now that we have no missing calls, add a warning to find multiple
calls.

By having only a single update_rq_clock() call per rq-lock section,
the section appears 'atomic' wrt time.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:46:21 +01:00
Steven Rostedt (VMware)
3e777f9909 sched/rt: Add comments describing the RT IPI pull method
While looking into optimizations for the RT scheduler IPI logic, I realized
that the comments are lacking to describe it efficiently. It deserves a
lengthy description describing its design.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170228155030.30c69068@gandalf.local.home
[ Small typographical edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:41:35 +01:00
Steven Rostedt (VMware)
2317d5f1c3 sched/deadline: Use deadline instead of period when calculating overflow
I was testing Daniel's changes with his test case, and tweaked it a
little. Instead of having the runtime equal to the deadline, I
increased the deadline ten fold.

Daniel's test case had:

	attr.sched_runtime  = 2 * 1000 * 1000;		/* 2 ms */
	attr.sched_deadline = 2 * 1000 * 1000;		/* 2 ms */
	attr.sched_period   = 2 * 1000 * 1000 * 1000;	/* 2 s */

To make it more interesting, I changed it to:

	attr.sched_runtime  =  2 * 1000 * 1000;		/* 2 ms */
	attr.sched_deadline = 20 * 1000 * 1000;		/* 20 ms */
	attr.sched_period   =  2 * 1000 * 1000 * 1000;	/* 2 s */

The results were rather surprising. The behavior that Daniel's patch
was fixing came back. The task started using much more than .1% of the
CPU. More like 20%.

Looking into this I found that it was due to the dl_entity_overflow()
constantly returning true. That's because it uses the relative period
against relative runtime vs the absolute deadline against absolute
runtime.

  runtime / (deadline - t) > dl_runtime / dl_period

There's even a comment mentioning this, and saying that when relative
deadline equals relative period, that the equation is the same as using
deadline instead of period. That comment is backwards! What we really
want is:

  runtime / (deadline - t) > dl_runtime / dl_deadline

We care about if the runtime can make its deadline, not its period. And
then we can say "when the deadline equals the period, the equation is
the same as using dl_period instead of dl_deadline".

After correcting this, now when the task gets enqueued, it can throttle
correctly, and Daniel's fix to the throttling of sleeping deadline
tasks works even when the runtime and deadline are not the same.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/02135a27f1ae3fe5fd032568a5a2f370e190e8d7.1488392936.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:37:38 +01:00
Daniel Bristot de Oliveira
df8eac8caf sched/deadline: Throttle a constrained deadline task activated after the deadline
During the activation, CBS checks if it can reuse the current task's
runtime and period. If the deadline of the task is in the past, CBS
cannot use the runtime, and so it replenishes the task. This rule
works fine for implicit deadline tasks (deadline == period), and the
CBS was designed for implicit deadline tasks. However, a task with
constrained deadline (deadine < period) might be awakened after the
deadline, but before the next period. In this case, replenishing the
task would allow it to run for runtime / deadline. As in this case
deadline < period, CBS enables a task to run for more than the
runtime / period. In a very loaded system, this can cause a domino
effect, making other tasks miss their deadlines.

To avoid this problem, in the activation of a constrained deadline
task after the deadline but before the next period, throttle the
task and set the replenishing timer to the begin of the next period,
unless it is boosted.

Reproducer:

 --------------- %< ---------------
  int main (int argc, char **argv)
  {
	int ret;
	int flags = 0;
	unsigned long l = 0;
	struct timespec ts;
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);

	attr.sched_policy   = SCHED_DEADLINE;
	attr.sched_runtime  = 2 * 1000 * 1000;		/* 2 ms */
	attr.sched_deadline = 2 * 1000 * 1000;		/* 2 ms */
	attr.sched_period   = 2 * 1000 * 1000 * 1000;	/* 2 s */

	ts.tv_sec = 0;
	ts.tv_nsec = 2000 * 1000;			/* 2 ms */

	ret = sched_setattr(0, &attr, flags);

	if (ret < 0) {
		perror("sched_setattr");
		exit(-1);
	}

	for(;;) {
		/* XXX: you may need to adjust the loop */
		for (l = 0; l < 150000; l++);
		/*
		 * The ideia is to go to sleep right before the deadline
		 * and then wake up before the next period to receive
		 * a new replenishment.
		 */
		nanosleep(&ts, NULL);
	}

	exit(0);
  }
  --------------- >% ---------------

On my box, this reproducer uses almost 50% of the CPU time, which is
obviously wrong for a task with 2/2000 reservation.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/edf58354e01db46bf42df8d2dd32418833f68c89.1488392936.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:37:38 +01:00
Daniel Bristot de Oliveira
5ac69d3778 sched/deadline: Make sure the replenishment timer fires in the next period
Currently, the replenishment timer is set to fire at the deadline
of a task. Although that works for implicit deadline tasks because the
deadline is equals to the begin of the next period, that is not correct
for constrained deadline tasks (deadline < period).

For instance:

f.c:
 --------------- %< ---------------
int main (void)
{
	for(;;);
}
 --------------- >% ---------------

  # gcc -o f f.c

  # trace-cmd record -e sched:sched_switch                              \
				   -e syscalls:sys_exit_sched_setattr   \
   chrt -d --sched-runtime  490000000					\
           --sched-deadline 500000000					\
	   --sched-period  1000000000 0 ./f

  # trace-cmd report | grep "{pid of ./f}"

After setting parameters, the task is replenished and continue running
until being throttled:

         f-11295 [003] 13322.113776: sys_exit_sched_setattr: 0x0

The task is throttled after running 492318 ms, as expected:

         f-11295 [003] 13322.606094: sched_switch:   f:11295 [-1] R ==> watchdog/3:32 [0]

But then, the task is replenished 500719 ms after the first
replenishment:

    <idle>-0     [003] 13322.614495: sched_switch:   swapper/3:0 [120] R ==> f:11295 [-1]

Running for 490277 ms:

         f-11295 [003] 13323.104772: sched_switch:   f:11295 [-1] R ==>  swapper/3:0 [120]

Hence, in the first period, the task runs 2 * runtime, and that is a bug.

During the first replenishment, the next deadline is set one period away.
So the runtime / period starts to be respected. However, as the second
replenishment took place in the wrong instant, the next replenishment
will also be held in a wrong instant of time. Rather than occurring in
the nth period away from the first activation, it is taking place
in the (nth period - relative deadline).

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Luca Abeni <luca.abeni@santannapisa.it>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/ac50d89887c25285b47465638354b63362f8adff.1488392936.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:37:37 +01:00
Niklas Cassel
17fcbd590d locking/rwsem: Fix down_write_killable() for CONFIG_RWSEM_GENERIC_SPINLOCK=y
We hang if SIGKILL has been sent, but the task is stuck in down_read()
(after do_exit()), even though no task is doing down_write() on the
rwsem in question:

  INFO: task libupnp:21868 blocked for more than 120 seconds.
  libupnp         D    0 21868      1 0x08100008
  ...
  Call Trace:
  __schedule()
  schedule()
  __down_read()
  do_exit()
  do_group_exit()
  __wake_up_parent()

This bug has already been fixed for CONFIG_RWSEM_XCHGADD_ALGORITHM=y in
the following commit:

 04cafed7fc ("locking/rwsem: Fix down_write_killable()")

... however, this bug also exists for CONFIG_RWSEM_GENERIC_SPINLOCK=y.

Signed-off-by: Niklas Cassel <niklas.cassel@axis.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Niklas Cassel <niklass@axis.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: d47996082f ("locking/rwsem: Introduce basis for down_write_killable()")
Link: http://lkml.kernel.org/r/1487981873-12649-1-git-send-email-niklass@axis.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:28:30 +01:00
Matt Fleming
caeb588297 sched/loadavg: Use {READ,WRITE}_ONCE() for sample window
'calc_load_update' is accessed without any kind of locking and there's
a clear assumption in the code that only a single value is read or
written.

Make this explicit by using READ_ONCE() and WRITE_ONCE(), and avoid
unintentionally seeing multiple values, or having the load/stores
split.

Technically the loads in calc_global_*() don't require this since
those are the only functions that update 'calc_load_update', but I've
added the READ_ONCE() for consistency.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20170217120731.11868-3-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:21:01 +01:00
Matt Fleming
6e5f32f7a4 sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
the pending sample window time on exit, setting the next update not
one window into the future, but two.

This situation on exiting NO_HZ is described by:

  this_rq->calc_load_update < jiffies < calc_load_update

In this scenario, what we should be doing is:

  this_rq->calc_load_update = calc_load_update		     [ next window ]

But what we actually do is:

  this_rq->calc_load_update = calc_load_update + LOAD_FREQ   [ next+1 window ]

This has the effect of delaying load average updates for potentially
up to ~9seconds.

This can result in huge spikes in the load average values due to
per-cpu uninterruptible task counts being out of sync when accumulated
across all CPUs.

It's safe to update the per-cpu active count if we wake between sample
windows because any load that we left in 'calc_load_idle' will have
been zero'd when the idle load was folded in calc_global_load().

This issue is easy to reproduce before,

  commit 9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")

just by forking short-lived process pipelines built from ps(1) and
grep(1) in a loop. I'm unable to reproduce the spikes after that
commit, but the bug still seems to be present from code review.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:21:00 +01:00
Wanpeng Li
dcc3b5ffe1 sched/deadline: Add missing update_rq_clock() in dl_task_timer()
The following warning can be triggered by hot-unplugging the CPU
on which an active SCHED_DEADLINE task is running on:

 ------------[ cut here ]------------
 WARNING: CPU: 7 PID: 0 at kernel/sched/sched.h:833 replenish_dl_entity+0x71e/0xc40
 rq->clock_update_flags < RQCF_ACT_SKIP
 CPU: 7 PID: 0 Comm: swapper/7 Tainted: G    B           4.11.0-rc1+ #24
 Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
 Call Trace:
  <IRQ>
  dump_stack+0x85/0xc4
  __warn+0x172/0x1b0
  warn_slowpath_fmt+0xb4/0xf0
  ? __warn+0x1b0/0x1b0
  ? debug_check_no_locks_freed+0x2c0/0x2c0
  ? cpudl_set+0x3d/0x2b0
  replenish_dl_entity+0x71e/0xc40
  enqueue_task_dl+0x2ea/0x12e0
  ? dl_task_timer+0x777/0x990
  ? __hrtimer_run_queues+0x270/0xa50
  dl_task_timer+0x316/0x990
  ? enqueue_task_dl+0x12e0/0x12e0
  ? enqueue_task_dl+0x12e0/0x12e0
  __hrtimer_run_queues+0x270/0xa50
  ? hrtimer_cancel+0x20/0x20
  ? hrtimer_interrupt+0x119/0x600
  hrtimer_interrupt+0x19c/0x600
  ? trace_hardirqs_off+0xd/0x10
  local_apic_timer_interrupt+0x74/0xe0
  smp_apic_timer_interrupt+0x76/0xa0
  apic_timer_interrupt+0x93/0xa0

The DL task will be migrated to a suitable later deadline rq once the DL
timer fires and currnet rq is offline. The rq clock of the new rq should
be updated. This patch fixes it by updating the rq clock after holding
the new rq's rq lock.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1488865888-15894-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:20:59 +01:00
Naveen N. Rao
1d585e7090 trace/kprobes: Fix check for kretprobe offset within function entry
perf specifies an offset from _text and since this offset is fed
directly into the arch-specific helper, kprobes tracer rejects
installation of kretprobes through perf. Fix this by looking up the
actual offset from a function for the specified sym+offset.

Refactor and reuse existing routines to limit code duplication -- we
repurpose kprobe_addr() for determining final kprobe address and we
split out the function entry offset determination into a separate
generic helper.

Before patch:

  naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe -v do_open%return
  probe-definition(0): do_open%return
  symbol:do_open file:(null) line:0 offset:0 return:1 lazy:(null)
  0 arguments
  Looking at the vmlinux_path (8 entries long)
  Using /boot/vmlinux for symbols
  Open Debuginfo file: /boot/vmlinux
  Try to find probe point from debuginfo.
  Matched function: do_open [2d0c7ff]
  Probe point found: do_open+0
  Matched function: do_open [35d76dc]
  found inline addr: 0xc0000000004ba9c4
  Failed to find "do_open%return",
   because do_open is an inlined function and has no return point.
  An error occurred in debuginfo analysis (-22).
  Trying to use symbols.
  Opening /sys/kernel/debug/tracing//README write=0
  Opening /sys/kernel/debug/tracing//kprobe_events write=1
  Writing event: r:probe/do_open _text+4469776
  Failed to write event: Invalid argument
    Error: Failed to add events. Reason: Invalid argument (Code: -22)
  naveen@ubuntu:~/linux/tools/perf$ dmesg | tail
  <snip>
  [   33.568656] Given offset is not valid for return probe.

After patch:

  naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe -v do_open%return
  probe-definition(0): do_open%return
  symbol:do_open file:(null) line:0 offset:0 return:1 lazy:(null)
  0 arguments
  Looking at the vmlinux_path (8 entries long)
  Using /boot/vmlinux for symbols
  Open Debuginfo file: /boot/vmlinux
  Try to find probe point from debuginfo.
  Matched function: do_open [2d0c7d6]
  Probe point found: do_open+0
  Matched function: do_open [35d76b3]
  found inline addr: 0xc0000000004ba9e4
  Failed to find "do_open%return",
   because do_open is an inlined function and has no return point.
  An error occurred in debuginfo analysis (-22).
  Trying to use symbols.
  Opening /sys/kernel/debug/tracing//README write=0
  Opening /sys/kernel/debug/tracing//kprobe_events write=1
  Writing event: r:probe/do_open _text+4469808
  Writing event: r:probe/do_open_1 _text+4956344
  Added new events:
    probe:do_open        (on do_open%return)
    probe:do_open_1      (on do_open%return)

  You can now use it in all perf tools, such as:

	  perf record -e probe:do_open_1 -aR sleep 1

  naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/kprobes/list
  c000000000041370  k  kretprobe_trampoline+0x0    [OPTIMIZED]
  c0000000004ba0b8  r  do_open+0x8    [DISABLED]
  c000000000443430  r  do_open+0x0    [DISABLED]

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/d8cd1ef420ec22e3643ac332fdabcffc77319a42.1488961018.git.naveen.n.rao@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-03-15 17:48:37 -03:00
Linus Torvalds
ae50dfd616 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Ensure that mtu is at least IPV6_MIN_MTU in ipv6 VTI tunnel driver,
    from Steffen Klassert.

 2) Fix crashes when user tries to get_next_key on an LPM bpf map, from
    Alexei Starovoitov.

 3) Fix detection of VLAN fitlering feature for bnx2x VF devices, from
    Michal Schmidt.

 4) We can get a divide by zero when TCP socket are morphed into
    listening state, fix from Eric Dumazet.

 5) Fix socket refcounting bugs in skb_complete_wifi_ack() and
    skb_complete_tx_timestamp(). From Eric Dumazet.

 6) Use after free in dccp_feat_activate_values(), also from Eric
    Dumazet.

 7) Like bonding team needs to use ETH_MAX_MTU as netdev->max_mtu, from
    Jarod Wilson.

 8) Fix use after free in vrf_xmit(), from David Ahern.

 9) Don't do UDP Fragmentation Offload on IPComp ipsec packets, from
    Alexey Kodanev.

10) Properly check napi_complete_done() return value in order to decide
    whether to re-enable IRQs or not in amd-xgbe driver, from Thomas
    Lendacky.

11) Fix double free of hwmon device in marvell phy driver, from Andrew
    Lunn.

12) Don't crash on malformed netlink attributes in act_connmark, from
    Etienne Noss.

13) Don't remove routes with a higher metric in ipv6 ECMP route replace,
    from Sabrina Dubroca.

14) Don't write into a cloned SKB in ipv6 fragmentation handling, from
    Florian Westphal.

15) Fix routing redirect races in dccp and tcp, basically the ICMP
    handler can't modify the socket's cached route in it's locked by the
    user at this moment. From Jon Maxwell.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (108 commits)
  qed: Enable iSCSI Out-of-Order
  qed: Correct out-of-bound access in OOO history
  qed: Fix interrupt flags on Rx LL2
  qed: Free previous connections when releasing iSCSI
  qed: Fix mapping leak on LL2 rx flow
  qed: Prevent creation of too-big u32-chains
  qed: Align CIDs according to DORQ requirement
  mlxsw: reg: Fix SPVMLR max record count
  mlxsw: reg: Fix SPVM max record count
  net: Resend IGMP memberships upon peer notification.
  dccp: fix memory leak during tear-down of unsuccessful connection request
  tun: fix premature POLLOUT notification on tun devices
  dccp/tcp: fix routing redirect race
  ucc/hdlc: fix two little issue
  vxlan: fix ovs support
  net: use net->count to check whether a netns is alive or not
  bridge: drop netfilter fake rtable unconditionally
  ipv6: avoid write to a possibly cloned skb
  net: wimax/i2400m: fix NULL-deref at probe
  isdn/gigaset: fix NULL-deref at probe
  ...
2017-03-14 21:31:23 -07:00
Linus Torvalds
352526f453 Merge branch 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "Three cgroup fixes.  Nothing critical:

   - the pids controller could trigger suspicious RCU warning
     spuriously. Fixed.

   - in the debug controller, %p -> %pK to protect kernel pointer
     from getting exposed.

   - documentation formatting fix"

* 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroups: censor kernel pointer in debug files
  cgroup/pids: remove spurious suspicious RCU usage warning
  cgroup: Fix indenting in PID controller documentation
2017-03-14 15:11:19 -07:00
Linus Torvalds
bc25887935 Merge branch 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
 "If a delayed work is queued with NULL @wq, workqueue code explodes
  after the timer expires at which point it's difficult to tell who the
  culprit was.

  This actually happened and the offender was net/smc this time.

  Add an explicit sanity check for it in the queueing path"

* 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: trigger WARN if queue_delayed_work() is called with NULL @wq
2017-03-14 14:52:08 -07:00
Peter Zijlstra
9bbb25afeb futex: Add missing error handling to FUTEX_REQUEUE_PI
Thomas spotted that fixup_pi_state_owner() can return errors and we
fail to unlock the rt_mutex in that case.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170304093558.867401760@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-14 21:45:36 +01:00
Peter Zijlstra
c236c8e95a futex: Fix potential use-after-free in FUTEX_REQUEUE_PI
While working on the futex code, I stumbled over this potential
use-after-free scenario. Dmitry triggered it later with syzkaller.

pi_mutex is a pointer into pi_state, which we drop the reference on in
unqueue_me_pi(). So any access to that pointer after that is bad.

Since other sites already do rt_mutex_unlock() with hb->lock held, see
for example futex_lock_pi(), simply move the unlock before
unqueue_me_pi().

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170304093558.801744246@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-14 21:45:36 +01:00
Sebastian Andrzej Siewior
dc434e056f cpu/hotplug: Serialize callback invocations proper
The setup/remove_state/instance() functions in the hotplug core code are
serialized against concurrent CPU hotplug, but unfortunately not serialized
against themself.

As a consequence a concurrent invocation of these function results in
corruption of the callback machinery because two instances try to invoke
callbacks on remote cpus at the same time. This results in missing callback
invocations and initiator threads waiting forever on the completion.

The obvious solution to replace get_cpu_online() with cpu_hotplug_begin()
is not possible because at least one callsite calls into these functions
from a get_online_cpu() locked region.

Extend the protection scope of the cpuhp_state_mutex from solely protecting
the state arrays to cover the callback invocation machinery as well.

Fixes: 5b7aa87e04 ("cpu/hotplug: Implement setup/removal interface")
Reported-and-tested-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: hpa@zytor.com
Cc: mingo@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20170314150645.g4tdyoszlcbajmna@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-14 19:19:27 +01:00
Naveen N. Rao
5f6bee3470 kprobes: Convert kprobe_exceptions_notify to use NOKPROBE_SYMBOL
commit fc62d0207a ("kprobes: Introduce weak variant of
kprobe_exceptions_notify()") used the __kprobes annotation to exclude
kprobe_exceptions_notify from being probed. Since NOKPROBE_SYMBOL() is a
better way to do this enabling the symbol to be discovered as being
blacklisted, change over to using NOKPROBE_SYMBOL().

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/3f25bf400da5c222cd9b10eec6ded2d6b58209f8.1488991670.git.naveen.n.rao@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-03-14 15:17:40 -03:00
Daniel Vetter
05df49e73b drm/i915: annote drop_caches debugfs interface with lockdep
The trouble we have is that we can't really test all the shrinker
recursion stuff exhaustively in BAT because any kind of thrashing
stress test just takes too long.

But that leaves a really big gap open, since shrinker recursions are
one of the most annoying bugs. Now lockdep already has support for
checking allocation deadlocks:

- Direct reclaim paths are marked up with
  lockdep_set_current_reclaim_state() and
  lockdep_clear_current_reclaim_state().

- Any allocation paths are marked with lockdep_trace_alloc().

If we simply mark up our debugfs with the reclaim annotations, any
code and locks taken in there will automatically complete the picture
with any allocation paths we already have, as long as we have a simple
testcase in BAT which throws out a few objects using this interface.
Not stress test or thrashing needed at all.

v2: Need to EXPORT_SYMBOL_GPL to make it compile as a module.

v3: Fixup rebase fail (spotted by Chris).

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://patchwork.freedesktop.org/patch/msgid/20170312205340.16202-1-daniel.vetter@ffwll.ch
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
2017-03-14 10:10:14 +01:00
Arun Raghavan
e7ea7c9806 rlimits: Print more information when CPU/RT limits are exceeded
When a process is sent a SIGKILL because it exceeded CPU or RT limits,
the cause may not be obvious in userspace -- daemonised processes just
get killed, and even foreground process just see a 'Killed' message. The
lack of any information on why this might be happening in logs can be
confusing to users who are not aware of this mechanism.

Add messages which dump the process name and tid in dmesg when a process
exceeds its CPU or RT limits (soft and hard) in order to make it clearer to
people debugging such issues.

Signed-off-by: Arun Raghavan <arun@arunraghavan.net>
Link: http://lkml.kernel.org/r/20170301145309.27214-1-arun@arunraghavan.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-13 21:32:15 +01:00
Hari Bathini
e422267322 perf: Add PERF_RECORD_NAMESPACES to include namespaces related info
With the advert of container technologies like docker, that depend on
namespaces for isolation, there is a need for tracing support for
namespaces. This patch introduces new PERF_RECORD_NAMESPACES event for
recording namespaces related info. By recording info for every
namespace, it is left to userspace to take a call on the definition of a
container and trace containers by updating perf tool accordingly.

Each namespace has a combination of device and inode numbers. Though
every namespace has the same device number currently, that may change in
future to avoid the need for a namespace of namespaces. Considering such
possibility, record both device and inode numbers separately for each
namespace.

Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/148891929686.25309.2827618988917007768.stgit@hbathini.in.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-03-13 15:57:41 -03:00
Viresh Kumar
cba1dfb57b cpufreq: schedutil: Refactor sugov_next_freq_shared()
The loop in sugov_next_freq_shared() contains an if block to skip the
loop for the current CPU. This turns out to be an unnecessary
conditional in the scheduler's hot-path for every CPU in the policy.

It would be better to drop the conditional and make the loop treat all
the CPUs in the same way. That would eliminate the need of calling
sugov_iowait_boost() at the top of the routine.

To keep the code optimized to return early if the current CPU has RT/DL
flags set, move the flags check to sugov_update_shared() instead in
order to avoid the function call entirely.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-12 23:11:33 +01:00
Viresh Kumar
994a8f2514 cpufreq: schedutil: Redefine the rate_limit_us tunable
The rate_limit_us tunable is intended to reduce the possible overhead
from running the schedutil governor.  However, that overhead can be
divided into two separate parts: the governor computations and the
invocation of the scaling driver to set the CPU frequency.  The latter
is where the real overhead comes from.  The former is much less
expensive in terms of execution time and running it every time the
governor callback is invoked by the scheduler, after rate_limit_us
interval has passed since the last frequency update, would not be a
problem.

For this reason, redefine the rate_limit_us tunable so that it means the
minimum time that has to pass between two consecutive invocations of the
scaling driver by the schedutil governor (to set the CPU frequency).

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-12 23:11:33 +01:00
Linus Torvalds
5a45a5a881 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:

 - a fix for the kexec/purgatory regression which was introduced in the
   merge window via an innocent sparse fix. We could have reverted that
   commit, but on deeper inspection it turned out that the whole
   machinery is neither documented nor robust. So a proper cleanup was
   done instead

 - the fix for the TLB flush issue which was discovered recently

 - a simple typo fix for a reboot quirk

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/tlb: Fix tlb flushing when lguest clears PGE
  kexec, x86/purgatory: Unbreak it and clean it up
  x86/reboot/quirks: Fix typo in ASUS EeeBook X205TA reboot quirk
2017-03-12 14:18:49 -07:00
Charles Keepax
45e5202213 genirq: Add support for nested shared IRQs
On a specific audio system an interrupt input of an audio CODEC is used as a
shared interrupt. That interrupt input is handled by a CODEC specific irq
chip driver and triggers a CPU interrupt via the CODEC irq output line.

The CODEC interrupt handler demultiplexes the CODEC interrupt inputs and
the interrupt handlers for these demultiplexed inputs run nested in the
context of the CODEC interrupt handler.

The demultiplexed interrupts use handle_nested_irq() as their interrupt
handler, which unfortunately has no support for shared interrupts. So the
above hardware cannot be supported.

Add shared interrupt support to handle_nested_irq() by iterating over the
interrupt action chain.

[ tglx: Massaged changelog ]

Signed-off-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com>
Cc: patches@opensource.wolfsonmicro.com
Link: http://lkml.kernel.org/r/1488904098-5350-1-git-send-email-ckeepax@opensource.wolfsonmicro.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-11 12:57:03 +01:00
Thomas Gleixner
40c50c1fec kexec, x86/purgatory: Unbreak it and clean it up
The purgatory code defines global variables which are referenced via a
symbol lookup in the kexec code (core and arch).

A recent commit addressing sparse warnings made these static and thereby
broke kexec_file.

Why did this happen? Simply because the whole machinery is undocumented and
lacks any form of forward declarations. The variable names are unspecific
and lack a prefix, so adding forward declarations creates shadow variables
in the core code. Aside of that the code relies on magic constants and
duplicate struct definitions with no way to ensure that these things stay
in sync. The section placement of the purgatory variables happened by
chance and not by design.

Unbreak kexec and cleanup the mess:

 - Add proper forward declarations and document the usage
 - Use common struct definition
 - Use the proper common defines instead of magic constants
 - Add a purgatory_ prefix to have a proper name space
 - Use ARRAY_SIZE() instead of a homebrewn reimplementation
 - Add proper sections to the purgatory variables [ From Mike ]

Fixes: 72042a8c7b ("x86/purgatory: Make functions and variables static")
Reported-by: Mike Galbraith <<efault@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Nicholas Mc Guire <der.herr@hofr.at>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Tobin C. Harding" <me@tobin.cc>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1703101315140.3681@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-10 20:55:09 +01:00
Linus Torvalds
8fe3ccaed0 Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "26 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits)
  userfaultfd: remove wrong comment from userfaultfd_ctx_get()
  fat: fix using uninitialized fields of fat_inode/fsinfo_inode
  sh: cayman: IDE support fix
  kasan: fix races in quarantine_remove_cache()
  kasan: resched in quarantine_remove_cache()
  mm: do not call mem_cgroup_free() from within mem_cgroup_alloc()
  thp: fix another corner case of munlock() vs. THPs
  rmap: fix NULL-pointer dereference on THP munlocking
  mm/memblock.c: fix memblock_next_valid_pfn()
  userfaultfd: selftest: vm: allow to build in vm/ directory
  userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED
  userfaultfd: non-cooperative: fix fork fctx->new memleak
  mm/cgroup: avoid panic when init with low memory
  drivers/md/bcache/util.h: remove duplicate inclusion of blkdev.h
  mm/vmstats: add thp_split_pud event for clarity
  include/linux/fs.h: fix unsigned enum warning with gcc-4.2
  userfaultfd: non-cooperative: release all ctx in dup_userfaultfd_complete
  userfaultfd: non-cooperative: robustness check
  userfaultfd: non-cooperative: rollback userfaultfd_exit
  x86, mm: unify exit paths in gup_pte_range()
  ...
2017-03-10 08:34:42 -08:00
Andrea Arcangeli
dd0db88d80 userfaultfd: non-cooperative: rollback userfaultfd_exit
Patch series "userfaultfd non-cooperative further update for 4.11 merge
window".

Unfortunately I noticed one relevant bug in userfaultfd_exit while doing
more testing.  I've been doing testing before and this was also tested
by kbuild bot and exercised by the selftest, but this bug never
reproduced before.

I dropped userfaultfd_exit as result.  I dropped it because of
implementation difficulty in receiving signals in __mmput and because I
think -ENOSPC as result from the background UFFDIO_COPY should be enough
already.

Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit
wasn't exercised by the selftest and when I tried to exercise it, after
moving it to a more correct place in __mmput where it would make more
sense and where the vma list is stable, it resulted in the
event_wait_completion in D state.  So then I added the second patch to
be sure even if we call userfaultfd_event_wait_completion too late
during task exit(), we won't risk to generate tasks in D state.  The
same check exists in handle_userfault() for the same reason, except it
makes a difference there, while here is just a robustness check and it's
run under WARN_ON_ONCE.

While looking at the userfaultfd_event_wait_completion() function I
looked back at its callers too while at it and I think it's not ok to
stop executing dup_fctx on the fcs list because we relay on
userfaultfd_event_wait_completion to execute
userfaultfd_ctx_put(fctx->orig) which is paired against
userfaultfd_ctx_get(fctx->orig) in dup_userfault just before
list_add(fcs).  This change only takes care of fctx->orig but this area
also needs further review looking for similar problems in fctx->new.

The only patch that is urgent is the first because it's an use after
free during a SMP race condition that affects all processes if
CONFIG_USERFAULTFD=y.  Very hard to reproduce though and probably
impossible without SLUB poisoning enabled.

This patch (of 3):

I once reproduced this oops with the userfaultfd selftest, it's not
easily reproducible and it requires SLUB poisoning to reproduce.

    general protection fault: 0000 [#1] SMP
    Modules linked in:
    CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G               ------------ T 3.10.0+ #15
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
    task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000
    RIP: 0010:[<ffffffff81451299>]  [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
    RSP: 0018:ffff8801f833fe80  EFLAGS: 00010202
    RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600
    RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000
    R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700
    FS:  0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Call Trace:
      do_exit+0x297/0xd10
      SyS_exit+0x17/0x20
      tracesys+0xdd/0xe2
    Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 <4c> 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80
    RIP  [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
     RSP <ffff8801f833fe80>
    ---[ end trace 9fecd6dcb442846a ]---

In the debugger I located the "mm" pointer in the stack and walking
mm->mmap->vm_next through the end shows the vma->vm_next list is fully
consistent and it is null terminated list as expected.  So this has to
be an SMP race condition where userfaultfd_exit was running while the
vma list was being modified by another CPU.

When userfaultfd_exit() run one of the ->vm_next pointers pointed to
SLAB_POISON (RBX is the vma pointer and is 0x6b6b..).

The reason is that it's not running in __mmput but while there are still
other threads running and it's not holding the mmap_sem (it can't as it
has to wait the even to be received by the manager).  So this is an use
after free that was happening for all processes.

One more implementation problem aside from the race condition:
userfaultfd_exit has really to check a flag in mm->flags before walking
the vma or it's going to slowdown the exit() path for regular tasks.

One more implementation problem: at that point signals can't be
delivered so it would also create a task in D state if the manager
doesn't read the event.

The major design issue: it overall looks superfluous as the manager can
check for -ENOSPC in the background transfer:

	if (mmget_not_zero(ctx->mm)) {
[..]
	} else {
		return -ENOSPC;
	}

It's safer to roll it back and re-introduce it later if at all.

[rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT]
  Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:09 -08:00
Masahiro Yamada
505d3085d7 scripts/spelling.txt: add "overide" pattern and fix typo instances
Fix typos and add the following to the scripts/spelling.txt:

  overide||override

While we are here, fix the doubled "address" in the touched line
Documentation/devicetree/bindings/regulator/ti-abb-regulator.txt.

Also, fix the comment block style in the touched hunks in
drivers/media/dvb-frontends/drx39xyj/drx_driver.h.

Link: http://lkml.kernel.org/r/1481573103-11329-21-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:09 -08:00
Masahiro Yamada
8a1115ff6b scripts/spelling.txt: add "disble(d)" pattern and fix typo instances
Fix typos and add the following to the scripts/spelling.txt:

  disble||disable
  disbled||disabled

I kept the TSL2563_INT_DISBLED in /drivers/iio/light/tsl2563.c
untouched.  The macro is not referenced at all, but this commit is
touching only comment blocks just in case.

Link: http://lkml.kernel.org/r/1481573103-11329-20-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:09 -08:00
Linus Torvalds
c1aa905a30 Power management updates for v4.11-rc2
- Three fixes for intel_pstate problems related to the passive
    mode (in which it acts as a regular cpufreq scaling driver), two
    for the handling of global P-state limits and one for the handling
    of the cpu_frequency tracepoint in that mode (Rafael Wysocki).
 
  - Three fixes for the handling of P-state limits in intel_pstate in
    the active mode (Rafael Wysocki).
 
  - Introduction of a new cpufreq.off=1 kernel command line argument
    that will disable cpufreq entirely if passed to the kernel and
    is simply hooked up to the existing code used by Xen (Len Brown).
 
  - Fix for the schedutil cpufreq governor to prevent it from using
    stale raw frequency values in configurations with mutiple CPUs
    sharing one policy object and a cleanup for it reducing its
    overhead slightly (Viresh Kumar).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJYwc9bAAoJEILEb/54YlRxU24P/2RPFpRYNtGNfbnvmyNjcA5H
 MLQ/i0HEqhBVsfsj2Kmy6sRkY/X/kqnOiD2PzGcaOMTtMx2FcLNHzNqH0P0I7A4W
 9wzUcq0SC5FzTJcLryN4gtJa3tfBDHjr6peAcZyji5t3DbGa+mygVwH8IGyr4feo
 u7PE/6eXLfkRIySwG/kCI/YVE8tWuhjDHbjhR1pjgyrMjhDbPP480Mgsi4eDRRTO
 BwGVyQJXoMo9e7/vZ8Y8Fkt7PMxYeyeE1yAGHkJzjkFfdrkZrn3IUPfgVgxy8IPV
 N3w1BB+H84duEswPZH2patdIKQxXG3r0eaGHTXZIwy+5sHyc3mUMJ1FeHR67gasd
 Z4p4hylTP+dGZGDo4G7PbVX4V6zEP+LSxgOQYjbo4n6k3nJJOrIF4wt+l5+tQz5m
 Y+V6XVHgZm1LyFgjj06jMeVSmk6SS7X0qBHJ4WcfGzV/J+vStXmIvAmljs+cd/u2
 R9z1W/rxelZFKT+4lRCuztpBYfJvlE5nXyM1XwC6Rz6QjFax0pJAHUPFznWyq24C
 GlMAvQWPapDEoOvDrS/QKeqX1ROOYf2FwHs+uvPPCvOxMjL9NCQsQ34tqCM3MiL/
 ebk3uZ0xC1e0LaOB87xr7tfsag12MyZzSCwX0D8nBLBLvntpa/xQwE5+4vu9ZguD
 xlarnIsEh6bTwTVH6DJP
 =Gwp7
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix several issues in the intel_pstate driver and one issue in
  the schedutil cpufreq governor, clean up that governor a bit and hook
  up existing code for disabling cpufreq to a new kernel command line
  option.

  Specifics:

   - Three fixes for intel_pstate problems related to the passive mode
     (in which it acts as a regular cpufreq scaling driver), two for the
     handling of global P-state limits and one for the handling of the
     cpu_frequency tracepoint in that mode (Rafael Wysocki).

   - Three fixes for the handling of P-state limits in intel_pstate in
     the active mode (Rafael Wysocki).

   - Introduction of a new cpufreq.off=1 kernel command line argument
     that will disable cpufreq entirely if passed to the kernel and is
     simply hooked up to the existing code used by Xen (Len Brown).

   - Fix for the schedutil cpufreq governor to prevent it from using
     stale raw frequency values in configurations with mutiple CPUs
     sharing one policy object and a cleanup for it reducing its
     overhead slightly (Viresh Kumar)"

* tag 'pm-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy
  cpufreq: intel_pstate: Fix intel_pstate_verify_policy()
  cpufreq: intel_pstate: Fix global settings in active mode
  cpufreq: Add the "cpufreq.off=1" cmdline option
  cpufreq: schedutil: Pass sg_policy to get_next_freq()
  cpufreq: schedutil: move cached_raw_freq to struct sugov_policy
  cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily
  cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy()
  cpufreq: intel_pstate: Do not use performance_limits in passive mode
2017-03-09 16:30:37 -08:00
Alexei Starovoitov
4fe8435909 bpf: convert htab map to hlist_nulls
when all map elements are pre-allocated one cpu can delete and reuse htab_elem
while another cpu is still walking the hlist. In such case the lookup may
miss the element. Convert hlist to hlist_nulls to avoid such scenario.
When bucket lock is taken there is no need to take such precautions,
so only convert map_lookup and map_get_next to nulls.
The race window is extremely small and only reproducible with explicit
udelay() inside lookup_nulls_elem_raw()

Similar to hlist add hlist_nulls_for_each_entry_safe() and
hlist_nulls_entry_safe() helpers.

Fixes: 6c90598174 ("bpf: pre-allocate hash map elements")
Reported-by: Jonathan Perry <jonperry@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 13:27:17 -08:00
Alexei Starovoitov
9f691549f7 bpf: fix struct htab_elem layout
when htab_elem is removed from the bucket list the htab_elem.hash_node.next
field should not be overridden too early otherwise we have a tiny race window
between lookup and delete.
The bug was discovered by manual code analysis and reproducible
only with explicit udelay() in lookup_elem_raw().

Fixes: 6c90598174 ("bpf: pre-allocate hash map elements")
Reported-by: Jonathan Perry <jonperry@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 13:27:17 -08:00
Elena Reshetova
4b9502e63b kernel: convert css_set.refcount from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-08 17:46:03 -05:00
Linus Torvalds
bd0f9b356d sched/headers: fix up header file dependency on <linux/sched/signal.h>
The scheduler header file split and cleanups ended up exposing a few
nasty header file dependencies, and in particular it showed how we in
<linux/wait.h> ended up depending on "signal_pending()", which now comes
from <linux/sched/signal.h>.

That's a very subtle and annoying dependency, which already caused a
semantic merge conflict (see commit e58bc92783 "Pull overlayfs updates
from Miklos Szeredi", which added that fixup in the merge commit).

It turns out that we can avoid this dependency _and_ improve code
generation by moving the guts of the fairly nasty helper #define
__wait_event_interruptible_locked() to out-of-line code.  The code that
includes the signal_pending() check is all in the slow-path where we
actually go to sleep waiting for the event anyway, so using a helper
function is the right thing to do.

Using a helper function is also what we already did for the non-locked
versions, see the "__wait_event*()" macros and the "prepare_to_wait*()"
set of helper functions.

We might want to try to unify all these macro games, we have a _lot_ of
subtly different wait-event loops.  But this is the minimal patch to fix
the annoying header dependency.

Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-08 10:36:03 -08:00
Jiri Kosina
10517429b5 livepatch: make klp_mutex proper part of API
klp_mutex is shared between core.c and transition.c, and as such would
rather be properly located in a header so that we don't have to play
'extern' games from .c sources.

This also silences sparse warning (wrongly) suggesting that klp_mutex
should be defined static.

Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08 14:40:53 +01:00
Josh Poimboeuf
3ec24776bf livepatch: allow removal of a disabled patch
Currently we do not allow patch module to unload since there is no
method to determine if a task is still running in the patched code.

The consistency model gives us the way because when the unpatching
finishes we know that all tasks were marked as safe to call an original
function. Thus every new call to the function calls the original code
and at the same time no task can be somewhere in the patched code,
because it had to leave that code to be marked as safe.

We can safely let the patch module go after that.

Completion is used for synchronization between module removal and sysfs
infrastructure in a similar way to commit 942e443127 ("module: Fix
mod->mkobj.kobj potentially freed too early").

Note that we still do not allow the removal for immediate model, that is
no consistency model. The module refcount may increase in this case if
somebody disables and enables the patch several times. This should not
cause any harm.

With this change a call to try_module_get() is moved to
__klp_enable_patch from klp_register_patch to make module reference
counting symmetric (module_put() is in a patch disable path) and to
allow to take a new reference to a disabled module when being enabled.

Finally, we need to be very careful about possible races between
klp_unregister_patch(), kobject_put() functions and operations
on the related sysfs files.

kobject_put(&patch->kobj) must be called without klp_mutex. Otherwise,
it might be blocked by enabled_store() that needs the mutex as well.
In addition, enabled_store() must check if the patch was not
unregisted in the meantime.

There is no need to do the same for other kobject_put() callsites
at the moment. Their sysfs operations neither take the lock nor
they access any data that might be freed in the meantime.

There was an attempt to use kobjects the right way and prevent these
races by design. But it made the patch definition more complicated
and opened another can of worms. See
https://lkml.kernel.org/r/1464018848-4303-1-git-send-email-pmladek@suse.com

[Thanks to Petr Mladek for improving the commit message.]

Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08 09:38:43 +01:00
Josh Poimboeuf
d83a7cb375 livepatch: change to a per-task consistency model
Change livepatch to use a basic per-task consistency model.  This is the
foundation which will eventually enable us to patch those ~10% of
security patches which change function or data semantics.  This is the
biggest remaining piece needed to make livepatch more generally useful.

This code stems from the design proposal made by Vojtech [1] in November
2014.  It's a hybrid of kGraft and kpatch: it uses kGraft's per-task
consistency and syscall barrier switching combined with kpatch's stack
trace switching.  There are also a number of fallback options which make
it quite flexible.

Patches are applied on a per-task basis, when the task is deemed safe to
switch over.  When a patch is enabled, livepatch enters into a
transition state where tasks are converging to the patched state.
Usually this transition state can complete in a few seconds.  The same
sequence occurs when a patch is disabled, except the tasks converge from
the patched state to the unpatched state.

An interrupt handler inherits the patched state of the task it
interrupts.  The same is true for forked tasks: the child inherits the
patched state of the parent.

Livepatch uses several complementary approaches to determine when it's
safe to patch tasks:

1. The first and most effective approach is stack checking of sleeping
   tasks.  If no affected functions are on the stack of a given task,
   the task is patched.  In most cases this will patch most or all of
   the tasks on the first try.  Otherwise it'll keep trying
   periodically.  This option is only available if the architecture has
   reliable stacks (HAVE_RELIABLE_STACKTRACE).

2. The second approach, if needed, is kernel exit switching.  A
   task is switched when it returns to user space from a system call, a
   user space IRQ, or a signal.  It's useful in the following cases:

   a) Patching I/O-bound user tasks which are sleeping on an affected
      function.  In this case you have to send SIGSTOP and SIGCONT to
      force it to exit the kernel and be patched.
   b) Patching CPU-bound user tasks.  If the task is highly CPU-bound
      then it will get patched the next time it gets interrupted by an
      IRQ.
   c) In the future it could be useful for applying patches for
      architectures which don't yet have HAVE_RELIABLE_STACKTRACE.  In
      this case you would have to signal most of the tasks on the
      system.  However this isn't supported yet because there's
      currently no way to patch kthreads without
      HAVE_RELIABLE_STACKTRACE.

3. For idle "swapper" tasks, since they don't ever exit the kernel, they
   instead have a klp_update_patch_state() call in the idle loop which
   allows them to be patched before the CPU enters the idle state.

   (Note there's not yet such an approach for kthreads.)

All the above approaches may be skipped by setting the 'immediate' flag
in the 'klp_patch' struct, which will disable per-task consistency and
patch all tasks immediately.  This can be useful if the patch doesn't
change any function or data semantics.  Note that, even with this flag
set, it's possible that some tasks may still be running with an old
version of the function, until that function returns.

There's also an 'immediate' flag in the 'klp_func' struct which allows
you to specify that certain functions in the patch can be applied
without per-task consistency.  This might be useful if you want to patch
a common function like schedule(), and the function change doesn't need
consistency but the rest of the patch does.

For architectures which don't have HAVE_RELIABLE_STACKTRACE, the user
must set patch->immediate which causes all tasks to be patched
immediately.  This option should be used with care, only when the patch
doesn't change any function or data semantics.

In the future, architectures which don't have HAVE_RELIABLE_STACKTRACE
may be allowed to use per-task consistency if we can come up with
another way to patch kthreads.

The /sys/kernel/livepatch/<patch>/transition file shows whether a patch
is in transition.  Only a single patch (the topmost patch on the stack)
can be in transition at a given time.  A patch can remain in transition
indefinitely, if any of the tasks are stuck in the initial patch state.

A transition can be reversed and effectively canceled by writing the
opposite value to the /sys/kernel/livepatch/<patch>/enabled file while
the transition is in progress.  Then all the tasks will attempt to
converge back to the original patch state.

[1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Ingo Molnar <mingo@kernel.org>        # for the scheduler changes
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08 09:36:21 +01:00
Josh Poimboeuf
f5e547f4ac livepatch: store function sizes
For the consistency model we'll need to know the sizes of the old and
new functions to determine if they're on the stacks of any tasks.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08 09:24:04 +01:00
Josh Poimboeuf
68ae4b2b68 livepatch: use kstrtobool() in enabled_store()
The sysfs enabled value is a boolean, so kstrtobool() is a better fit
for parsing the input string since it does the range checking for us.

Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08 09:23:52 +01:00
Josh Poimboeuf
c349cdcaba livepatch: move patching functions into patch.c
Move functions related to the actual patching of functions and objects
into a new patch.c file.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08 09:23:40 +01:00
Josh Poimboeuf
aa82dc3e00 livepatch: remove unnecessary object loaded check
klp_patch_object()'s callers already ensure that the object is loaded,
so its call to klp_is_object_loaded() is unnecessary.

This will also make it possible to move the patching code into a
separate file.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08 09:23:28 +01:00
Josh Poimboeuf
0dade9f374 livepatch: separate enabled and patched states
Once we have a consistency model, patches and their objects will be
enabled and disabled at different times.  For example, when a patch is
disabled, its loaded objects' funcs can remain registered with ftrace
indefinitely until the unpatching operation is complete and they're no
longer in use.

It's less confusing if we give them different names: patches can be
enabled or disabled; objects (and their funcs) can be patched or
unpatched:

- Enabled means that a patch is logically enabled (but not necessarily
  fully applied).

- Patched means that an object's funcs are registered with ftrace and
  added to the klp_ops func stack.

Also, since these states are binary, represent them with booleans
instead of ints.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08 09:23:16 +01:00
Josh Poimboeuf
46c5a0113f livepatch: create temporary klp_update_patch_state() stub
Create temporary stubs for klp_update_patch_state() so we can add
TIF_PATCH_PENDING to different architectures in separate patches without
breaking build bisectability.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08 09:19:16 +01:00
Josh Poimboeuf
af085d9084 stacktrace/x86: add function for detecting reliable stack traces
For live patching and possibly other use cases, a stack trace is only
useful if it can be assured that it's completely reliable.  Add a new
save_stack_trace_tsk_reliable() function to achieve that.

Note that if the target task isn't the current task, and the target task
is allowed to run, then it could be writing the stack while the unwinder
is reading it, resulting in possible corruption.  So the caller of
save_stack_trace_tsk_reliable() must ensure that the task is either
'current' or inactive.

save_stack_trace_tsk_reliable() relies on the x86 unwinder's detection
of pt_regs on the stack.  If the pt_regs are not user-mode registers
from a syscall, then they indicate an in-kernel interrupt or exception
(e.g. preemption or a page fault), in which case the stack is considered
unreliable due to the nature of frame pointers.

It also relies on the x86 unwinder's detection of other issues, such as:

- corrupted stack data
- stack grows the wrong way
- stack walk doesn't reach the bottom
- user didn't provide a large enough entries array

Such issues are reported by checking unwind_error() and !unwind_done().

Also add CONFIG_HAVE_RELIABLE_STACKTRACE so arch-independent code can
determine at build time whether the function is implemented.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Ingo Molnar <mingo@kernel.org>	# for the x86 changes
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08 09:18:02 +01:00
Linus Torvalds
8a9172356f Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
 "This includes a fix for lockups caused by incorrect nsecs related
  cleanup, and a capabilities check fix for timerfd"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  jiffies: Revert bogus conversion of NSEC_PER_SEC to TICK_NSEC
  timerfd: Only check CAP_WAKE_ALARM when it is needed
2017-03-07 14:45:22 -08:00
Linus Torvalds
609b07b72d Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "A fix for KVM's scheduler clock which (erroneously) was always marked
  unstable, a fix for RT/DL load balancing, plus latency fixes"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface
  sched/core: Fix pick_next_task() for RT,DL
  sched/fair: Make select_idle_cpu() more aggressive
2017-03-07 14:42:34 -08:00
Linus Torvalds
c3abcabe81 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "This includes a fix for a crash if certain special addresses are
  kprobed, plus does a rename of two Kconfig variables that were a minor
  misnomer"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Rename CONFIG_[UK]PROBE_EVENT to CONFIG_[UK]PROBE_EVENTS
  kprobes/x86: Fix kernel panic when certain exception-handling addresses are probed
2017-03-07 14:38:16 -08:00
Linus Torvalds
500e1af252 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:

 - Change the new refcount_t warnings from WARN() to WARN_ONCE()

 - two ww_mutex fixes

 - plus a new lockdep self-consistency check for a bug that triggered in
   practice

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/ww_mutex: Adjust the lock number for stress test
  locking/lockdep: Add nest_lock integrity test
  locking/ww_mutex: Replace cpu_relax() with cond_resched() for tests
  locking/refcounts: Change WARN() to WARN_ONCE()
2017-03-07 14:33:11 -08:00
Linus Torvalds
304362a8bc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace fix from Eric Biederman:
 "This fixes a race between put_ucounts and get_ucounts that can cause a
  use after free. The fix works by simplifying the code and so there is
  not even a temptation to be clever and play spinlock vs atomic
  reference games"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  ucount: Remove the atomicity from ucount->count
2017-03-07 10:06:25 -08:00
Linus Torvalds
f26db9649a There was some breakage with the changes for jump labels in the 4.11 merge
window. Namely powerpc broke as jump labels uses the two LSB bits as flags
 in initialization. A check was added to make sure that all jump label
 entries were 4 bytes aligned, but powerpc didn't work that way for modules.
 Adding an alignment in the module linker script appeared to be the best
 solution.
 
 Jump labels also added an anonymous union to access those LSB bits as a
 normal long. But because this structure had static initialization, it broke
 older compilers that could not statically initialize anonymous unions
 without brackets.
 
 The command line parameter for setting function graph filter broke the
 "EMPTY_HASH" descriptor by modifying it instead of creating a new hash to
 hold the entries.
 
 The command line parameter ftrace_graph_max_depth was added to allow its
 setting at boot time. It uses existing code and only the command line hook
 was added. This is not really a fix, but as it uses existing code without
 affecting anything else, I added it to this release. It was ready before the
 merge window closed, but I wanted to let it sit in linux-next for a couple
 of days first.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJYvNrAFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 JGQIAMkayeZ0OCyYHRPR4EcCrdE3fATmt1huJWHrMPnT4/fLabL8XQqrOpnOBMq1
 GFZb1SMkBmvGtAHF4GbvCxnIUfDQko6BTQAd8EMea1WM8+Kb66/BLgJawjWIU9I0
 dNYre9ONgR2NOzkz6nfKRXnmy0lRcOweBb09YYGSzY11Md7d8T3T4TUrPNZdYrO9
 8ZMbF4qRd9KLMRHcsWqvhWhBISxWnmtUSlthfweukKgDMy8OKpb7pR0ckjtYwsWX
 RF41jqLqzSUqtd/nE2Sj/aT8XOP4pfrKEUuNM4SBj8q5jmNcZuqi8Q9wItu3LWR2
 jqM/9UKTzaCr9cchwuvUC0i+jWc=
 =kDql
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "There was some breakage with the changes for jump labels in the 4.11
  merge window:

   - powerpc broke as jump labels uses the two LSB bits as flags in
     initialization.

     A check was added to make sure that all jump label entries were 4
     bytes aligned, but powerpc didn't work that way for modules. Adding
     an alignment in the module linker script appeared to be the best
     solution.

   - Jump labels also added an anonymous union to access those LSB bits
     as a normal long. But because this structure had static
     initialization, it broke older compilers that could not statically
     initialize anonymous unions without brackets.

   - The command line parameter for setting function graph filter broke
     the "EMPTY_HASH" descriptor by modifying it instead of creating a
     new hash to hold the entries.

   - The command line parameter ftrace_graph_max_depth was added to
     allow its setting at boot time. It uses existing code and only the
     command line hook was added.

     This is not really a fix, but as it uses existing code without
     affecting anything else, I added it to this release. It was ready
     before the merge window closed, but I wanted to let it sit in
     linux-next for a couple of days first"

* tag 'trace-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace/graph: Add ftrace_graph_max_depth kernel parameter
  tracing: Add #undef to fix compile error
  jump_label: Add comment about initialization order for anonymous unions
  jump_label: Fix anonymous union initialization
  module: set __jump_table alignment to 8
  ftrace/graph: Do not modify the EMPTY_HASH for the function_graph filter
  tracing: Fix code comment for ftrace_ops_get_func()
2017-03-07 09:37:28 -08:00
Frederic Weisbecker
fa3aa7a54f jiffies: Revert bogus conversion of NSEC_PER_SEC to TICK_NSEC
commit 93825f2ec7 converted NSEC_PER_SEC to TICK_NSEC because the author
confused NSEC_PER_JIFFY with NSEC_PER_SEC.

As a result, the calculation of refined jiffies got broken, triggering
lockups.

Fixes: 93825f2ec7 ("jiffies: Reuse TICK_NSEC instead of NSEC_PER_JIFFY")
Reported-and-tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1488880534-3777-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-07 11:03:28 +01:00
Ingo Molnar
84e5b54921 perf/core improvements and fixes:
New features:
 
 - Allow sorting by symbol_size in 'perf report' and 'perf top' (Charles Baylis)
 
   E.g.:
 
   # perf report -s symbol_size,symbol
 
   Samples: 9K of event 'cycles:k', Event count (approx.): 2870461623
   Overhead  Symbol size  Symbol
     14.55%          326  [k] flush_tlb_mm_range
      7.20%         1045  [k] filemap_map_pages
      5.82%          124  [k] vma_interval_tree_insert
      5.18%         2430  [k] unmap_page_range
      2.57%          571  [k] vma_interval_tree_remove
      1.94%          494  [k] page_add_file_rmap
      1.82%          740  [k] page_remove_rmap
      1.66%         1017  [k] release_pages
      1.57%         1636  [k] update_blocked_averages
      1.57%           76  [k] unlock_page
 
 - Add support for -p/--pid, -a/--all-cpus and -C/--cpu in 'perf ftrace' (Namhyung Kim)
 
 Change in behaviour:
 
 - Make system wide (-a) the default option if no target was specified and one
   of following conditions is met:
 
   - No workload specified (current behaviour)
 
   - A workload is specified but all requested events are system wide ones,
     like uncore ones. (Jiri Olsa)
 
 Fixes:
 
 - Add missing initialization to the instruction decoder used in the
   intel PT/BTS code, which was causing lots of failures in 'perf test',
   looking for a value when there was none (Adrian Hunter)
 
 Infrastructure:
 
 - Add arch code needed to adopt the kernel's refcount_t to aid in
   catching bugs when using atomic_t as a reference counter, basically
   cmpxchg related functions (Arnaldo Carvalho de Melo)
 
 - Convert the code using atomic_t as reference counts to refcount_t
   (Elena Rashetova)
 
 - Add feature test for sched_getcpu() to more easily check for its
   presence in the many libc implementations and accross different
   versions of such C libraries (Arnaldo Carvalho de Melo)
 
 - Issue a HW watchdog disable hint in 'perf stat' for when some of the
   requested events can't get counted because a PMU counter is taken by that
   watchdog (Borislav Petkov).
 
 - Add mapping for Intel's KnightsMill PMU events (Karol Wachowski)
 
 Documentation:
 
 - Clarify the term 'convergence' in:
 
    perf bench numa numa-mem -h --show_convergence (Jiri Olsa)
 
 Kernel code:
 
 - Ensure probe location is at function entry in kretprobes (Naveen N. Rao)
 
 - Allow return probes with offsets and absolute addresses (Naveen N. Rao)
 
 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJYvbmTAAoJENZQFvNTUqpAN80QAJ2ETcTosR9fo06VrT2HqRT4
 +iGe55wSu261TOekIkXOEW+ww321eNPfy4rIZeLCEFcCd9p03n5JceVbFnOjBuAz
 Lk6jrKpaH+Ajp56nCLyWH4r3LYLJXdoIydNay4PZ08rl0GgagGqvevD8ZZCEO0sx
 vjD1TFH2uSOq3UTxKapO++FHwhy+XqZ5S5I+rMuLxg6Qi+rLubXDztzIlcCfQPGx
 g+zFkaJ/ms9TAtWK25xoj34QXsaqpBsF8qkCE1P8Zdjtnkp6zM2Rx3HvvbRDmgVx
 /h0b1iua5IVElgnai/84ttJG3Bi6ovRbf/PFy+IceM4Qfx0eQeWmA3CAtcGOh9Gv
 GTDCcJ7xWZBpM0g1wCk3ks2oApFTA6GkcnIt5alhTse5U3gNmImv3uvuN8d265KL
 2oGKps7MH1nWMgpL4G4BNuZg2oqmM/uX9ERiuNjtCqj6WoHy2QSDyEMJN5Od3lYj
 ar2PPGofHmiacsW3NNMT+LwQ/wL/d2dVfZTopMafeaxRDGTxdqkhLkB/sZT/wexQ
 ySVijQPO+x0eLSIK/BWdEmD8K6JiYGdpIRDWVW+D043I7iiXFvPgui1bFABJEqn5
 mZFa1qT4EQMuuSaLkkxtoOrdoF6YzJJA57sIx2IrouGDapJ2BDegiUOfE0PSp5l0
 oeRuFcJYfpITC/TzntE3
 =h2yo
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-for-mingo-4.11-20170306' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core

Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:

New features:

- Allow sorting by symbol_size in 'perf report' and 'perf top' (Charles Baylis)

  E.g.:

  # perf report -s symbol_size,symbol

  Samples: 9K of event 'cycles:k', Event count (approx.): 2870461623
  Overhead  Symbol size  Symbol
    14.55%          326  [k] flush_tlb_mm_range
     7.20%         1045  [k] filemap_map_pages
     5.82%          124  [k] vma_interval_tree_insert
     5.18%         2430  [k] unmap_page_range
     2.57%          571  [k] vma_interval_tree_remove
     1.94%          494  [k] page_add_file_rmap
     1.82%          740  [k] page_remove_rmap
     1.66%         1017  [k] release_pages
     1.57%         1636  [k] update_blocked_averages
     1.57%           76  [k] unlock_page

- Add support for -p/--pid, -a/--all-cpus and -C/--cpu in 'perf ftrace' (Namhyung Kim)

Change in behaviour:

- Make system wide (-a) the default option if no target was specified and one
  of following conditions is met:

  - No workload specified (current behaviour)

  - A workload is specified but all requested events are system wide ones,
    like uncore ones. (Jiri Olsa)

Fixes:

- Add missing initialization to the instruction decoder used in the
  intel PT/BTS code, which was causing lots of failures in 'perf test',
  looking for a value when there was none (Adrian Hunter)

Infrastructure changes:

- Add arch code needed to adopt the kernel's refcount_t to aid in
  catching bugs when using atomic_t as a reference counter, basically
  cmpxchg related functions (Arnaldo Carvalho de Melo)

- Convert the code using atomic_t as reference counts to refcount_t
  (Elena Rashetova)

- Add feature test for sched_getcpu() to more easily check for its
  presence in the many libc implementations and accross different
  versions of such C libraries (Arnaldo Carvalho de Melo)

- Issue a HW watchdog disable hint in 'perf stat' for when some of the
  requested events can't get counted because a PMU counter is taken by that
  watchdog (Borislav Petkov).

- Add mapping for Intel's KnightsMill PMU events (Karol Wachowski)

Documentation changes:

- Clarify the term 'convergence' in:

   perf bench numa numa-mem -h --show_convergence (Jiri Olsa)

Kernel code changes:

- Ensure probe location is at function entry in kretprobes (Naveen N. Rao)

- Allow return probes with offsets and absolute addresses (Naveen N. Rao)

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-07 08:14:14 +01:00
Eric W. Biederman
040757f738 ucount: Remove the atomicity from ucount->count
Always increment/decrement ucount->count under the ucounts_lock.  The
increments are there already and moving the decrements there means the
locking logic of the code is simpler.  This simplification in the
locking logic fixes a race between put_ucounts and get_ucounts that
could result in a use-after-free because the count could go zero then
be found by get_ucounts and then be freed by put_ucounts.

A bug presumably this one was found by a combination of syzkaller and
KASAN.  JongWhan Kim reported the syzkaller failure and Dmitry Vyukov
spotted the race in the code.

Cc: stable@vger.kernel.org
Fixes: f6b2db1a3e ("userns: Make the count of user namespaces per user")
Reported-by: JongHwan Kim <zzoru007@gmail.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-03-06 15:26:37 -06:00
Geliang Tang
c30fb26b11 workqueue: use setup_deferrable_timer
Use setup_deferrable_timer() instead of init_timer_deferrable() to
simplify the code.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-06 15:42:20 -05:00
Tejun Heo
637fdbae60 workqueue: trigger WARN if queue_delayed_work() is called with NULL @wq
If queue_delayed_work() gets called with NULL @wq, the kernel will
oops asynchronuosly on timer expiration which isn't too helpful in
tracking down the offender.  This actually happened with smc.

__queue_delayed_work() already does several input sanity checks
synchronously.  Add NULL @wq check.

Reported-by: Dave Jones <davej@codemonkey.org.uk>
Link: http://lkml.kernel.org/r/20170227171439.jshx3qplflyrgcv7@codemonkey.org.uk
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-06 15:33:42 -05:00
Kees Cook
b6a6759daf cgroups: censor kernel pointer in debug files
As found in grsecurity, this avoids exposing a kernel pointer through
the cgroup debug entries.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-06 15:16:03 -05:00
Tejun Heo
1d18c2747f cgroup/pids: remove spurious suspicious RCU usage warning
pids_can_fork() is special in that the css association is guaranteed
to be stable throughout the function and thus doesn't need RCU
protection around task_css access.  When determining the css to charge
the pid, task_css_check() is used to override the RCU sanity check.

While adding a warning message on fork rejection from pids limit,
135b8b37bd ("cgroup: Add pids controller event when fork fails
because of pid limit") incorrectly added a task_css access which is
neither RCU protected or explicitly annotated.  This triggers the
following suspicious RCU usage warning when RCU debugging is enabled.

  cgroup: fork rejected by pids controller in

  ===============================
  [ ERR: suspicious RCU usage.  ]
  4.10.0-work+ #1 Not tainted
  -------------------------------
  ./include/linux/cgroup.h:435 suspicious rcu_dereference_check() usage!

  other info that might help us debug this:

  rcu_scheduler_active = 2, debug_locks = 0
  1 lock held by bash/1748:
   #0:  (&cgroup_threadgroup_rwsem){+++++.}, at: [<ffffffff81052c96>] _do_fork+0xe6/0x6e0

  stack backtrace:
  CPU: 3 PID: 1748 Comm: bash Not tainted 4.10.0-work+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
  Call Trace:
   dump_stack+0x68/0x93
   lockdep_rcu_suspicious+0xd7/0x110
   pids_can_fork+0x1c7/0x1d0
   cgroup_can_fork+0x67/0xc0
   copy_process.part.58+0x1709/0x1e90
   _do_fork+0xe6/0x6e0
   SyS_clone+0x19/0x20
   do_syscall_64+0x5c/0x140
   entry_SYSCALL64_slow_path+0x25/0x25
  RIP: 0033:0x7f7853fab93a
  RSP: 002b:00007ffc12d05c90 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7853fab93a
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
  RBP: 00007ffc12d05cc0 R08: 0000000000000000 R09: 00007f78548db700
  R10: 00007f78548db9d0 R11: 0000000000000246 R12: 00000000000006d4
  R13: 0000000000000001 R14: 0000000000000000 R15: 000055e3ebe2c04d
  /asdf

There's no reason to dereference task_css again here when the
associated css is already available.  Fix it by replacing the
task_cgroup() call with css->cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mike Galbraith <efault@gmx.de>
Fixes: 135b8b37bd ("cgroup: Add pids controller event when fork fails because of pid limit")
Cc: Kenny Yu <kennyyu@fb.com>
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-06 15:11:29 -05:00
Elena Reshetova
387ad9674b kernel: convert cgroup_namespace.count from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-06 14:55:22 -05:00
Alexei Starovoitov
f38837b08d bpf: add get_next_key callback to LPM map
map_get_next_key callback is mandatory. Supply dummy handler.

Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-05 17:55:29 -08:00
Stephen Smalley
791ec491c3 prlimit,security,selinux: add a security hook for prlimit
When SELinux was first added to the kernel, a process could only get
and set its own resource limits via getrlimit(2) and setrlimit(2), so no
MAC checks were required for those operations, and thus no security hooks
were defined for them. Later, SELinux introduced a hook for setlimit(2)
with a check if the hard limit was being changed in order to be able to
rely on the hard limit value as a safe reset point upon context
transitions.

Later on, when prlimit(2) was added to the kernel with the ability to get
or set resource limits (hard or soft) of another process, LSM/SELinux was
not updated other than to pass the target process to the setrlimit hook.
This resulted in incomplete control over both getting and setting the
resource limits of another process.

Add a new security_task_prlimit() hook to the check_prlimit_permission()
function to provide complete mediation.  The hook is only called when
acting on another task, and only if the existing DAC/capability checks
would allow access.  Pass flags down to the hook to indicate whether the
prlimit(2) call will read, write, or both read and write the resource
limits of the target process.

The existing security_task_setrlimit() hook is left alone; it continues
to serve a purpose in supporting the ability to make decisions based on
the old and/or new resource limit values when setting limits.  This
is consistent with the DAC/capability logic, where
check_prlimit_permission() performs generic DAC/capability checks for
acting on another task, while do_prlimit() performs a capability check
based on a comparison of the old and new resource limits.  Fix the
inline documentation for the hook to match the code.

Implement the new hook for SELinux.  For setting resource limits, we
reuse the existing setrlimit permission.  Note that this does overload
the setrlimit permission to mean the ability to set the resource limit
(soft or hard) of another process or the ability to change one's own
hard limit.  For getting resource limits, a new getrlimit permission
is defined.  This was not originally defined since getrlimit(2) could
only be used to obtain a process' own limits.

Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <james.l.morris@oracle.com>
2017-03-06 10:43:47 +11:00
Viresh Kumar
655cb1ebff cpufreq: schedutil: Pass sg_policy to get_next_freq()
get_next_freq() uses sg_cpu only to get sg_policy, which the callers of
get_next_freq() already have. Pass sg_policy instead of sg_cpu to
get_next_freq(), to make it more efficient.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-05 23:58:48 +01:00
Viresh Kumar
6c4f0fa643 cpufreq: schedutil: move cached_raw_freq to struct sugov_policy
cached_raw_freq applies to the entire cpufreq policy and not individual
CPUs. Apart from wasting per-cpu memory, it is actually wrong to keep it
in struct sugov_cpu as we may end up comparing next_freq with a stale
cached_raw_freq of a random CPU.

Move cached_raw_freq to struct sugov_policy.

Fixes: 5cbea46984 (cpufreq: schedutil: map raw required frequency to driver frequency)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-05 23:58:48 +01:00
Linus Torvalds
8d70eeb84a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix double-free in batman-adv, from Sven Eckelmann.

 2) Fix packet stats for fast-RX path, from Joannes Berg.

 3) Netfilter's ip_route_me_harder() doesn't handle request sockets
    properly, fix from Florian Westphal.

 4) Fix sendmsg deadlock in rxrpc, from David Howells.

 5) Add missing RCU locking to transport hashtable scan, from Xin Long.

 6) Fix potential packet loss in mlxsw driver, from Ido Schimmel.

 7) Fix race in NAPI handling between poll handlers and busy polling,
    from Eric Dumazet.

 8) TX path in vxlan and geneve need proper RCU locking, from Jakub
    Kicinski.

 9) SYN processing in DCCP and TCP need to disable BH, from Eric
    Dumazet.

10) Properly handle net_enable_timestamp() being invoked from IRQ
    context, also from Eric Dumazet.

11) Fix crash on device-tree systems in xgene driver, from Alban Bedel.

12) Do not call sk_free() on a locked socket, from Arnaldo Carvalho de
    Melo.

13) Fix use-after-free in netvsc driver, from Dexuan Cui.

14) Fix max MTU setting in bonding driver, from WANG Cong.

15) xen-netback hash table can be allocated from softirq context, so use
    GFP_ATOMIC. From Anoob Soman.

16) Fix MAC address change bug in bgmac driver, from Hari Vyas.

17) strparser needs to destroy strp_wq on module exit, from WANG Cong.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits)
  strparser: destroy workqueue on module exit
  sfc: fix IPID endianness in TSOv2
  sfc: avoid max() in array size
  rds: remove unnecessary returned value check
  rxrpc: Fix potential NULL-pointer exception
  nfp: correct DMA direction in XDP DMA sync
  nfp: don't tell FW about the reserved buffer space
  net: ethernet: bgmac: mac address change bug
  net: ethernet: bgmac: init sequence bug
  xen-netback: don't vfree() queues under spinlock
  xen-netback: keep a local pointer for vif in backend_disconnect()
  netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() fails
  netfilter: nft_set_rbtree: incorrect assumption on lower interval lookups
  netfilter: nf_conntrack_sip: fix wrong memory initialisation
  can: flexcan: fix typo in comment
  can: usb_8dev: Fix memory leak of priv->cmd_msg_buffer
  can: gs_usb: fix coding style
  can: gs_usb: Don't use stack memory for USB transfers
  ixgbe: Limit use of 2K buffers on architectures with 256B or larger cache lines
  ixgbe: update the rss key on h/w, when ethtool ask for it
  ...
2017-03-04 17:31:39 -08:00
Steven Rostedt (VMware)
d0e02579c2 trace/kprobes: Add back warning about offset in return probes
Let's not remove the warning about offsets and return probes when the
offset is invalid.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20170227115204.00f92846@gandalf.local.home
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-03-03 19:07:19 -03:00
Naveen N. Rao
35b6f55aa9 trace/kprobes: Allow return probes with offsets and absolute addresses
Since the kernel includes many non-global functions with same names, we
will need to use offsets from other symbols (typically _text/_stext) or
absolute addresses to place return probes on specific functions. Also,
the core register_kretprobe() API never forbid use of offsets or
absolute addresses with kretprobes.

Allow its use with the trace infrastructure. To distinguish kernels that
support this, update ftrace README to explicitly call this out.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/183e7ce2921a08c9c755ee9a5da3134febc6695b.1487770934.git.naveen.n.rao@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-03-03 19:07:18 -03:00
Naveen N. Rao
90ec5e89e3 kretprobes: Ensure probe location is at function entry
kretprobes can be registered by specifying an absolute address or by
specifying offset to a symbol. However, we need to ensure this falls at
function entry so as to be able to determine the return address.

Validate the same during kretprobe registration. By default, there
should not be any offset from a function entry, as determined through a
kallsyms_lookup(). Introduce arch_function_offset_within_entry() as a
way for architectures to override this.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/f1583bc4839a3862cfc2acefcc56f9c8837fa2ba.1487770934.git.naveen.n.rao@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-03-03 19:07:17 -03:00
Linus Torvalds
1827adb11a Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull sched.h split-up from Ingo Molnar:
 "The point of these changes is to significantly reduce the
  <linux/sched.h> header footprint, to speed up the kernel build and to
  have a cleaner header structure.

  After these changes the new <linux/sched.h>'s typical preprocessed
  size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K
  lines), which is around 40% faster to build on typical configs.

  Not much changed from the last version (-v2) posted three weeks ago: I
  eliminated quirks, backmerged fixes plus I rebased it to an upstream
  SHA1 from yesterday that includes most changes queued up in -next plus
  all sched.h changes that were pending from Andrew.

  I've re-tested the series both on x86 and on cross-arch defconfigs,
  and did a bisectability test at a number of random points.

  I tried to test as many build configurations as possible, but some
  build breakage is probably still left - but it should be mostly
  limited to architectures that have no cross-compiler binaries
  available on kernel.org, and non-default configurations"

* 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
  sched/headers: Clean up <linux/sched.h>
  sched/headers: Remove #ifdefs from <linux/sched.h>
  sched/headers: Remove the <linux/topology.h> include from <linux/sched.h>
  sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h>
  sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h>
  sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h>
  sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/init.h>
  sched/core: Remove unused prefetch_stack()
  sched/headers: Remove <linux/rculist.h> from <linux/sched.h>
  sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h>
  sched/headers: Remove <linux/signal.h> from <linux/sched.h>
  sched/headers: Remove <linux/rwsem.h> from <linux/sched.h>
  sched/headers: Remove the runqueue_is_locked() prototype
  sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h>
  sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h>
  sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h>
  ...
2017-03-03 10:16:38 -08:00
Todd Brandt
65a50c6562 ftrace/graph: Add ftrace_graph_max_depth kernel parameter
Early trace callgraphs can be extremely large on systems with
several seconds of boot time. The max_depth parameter limits how
deep the graph trace goes and reduces the output size. This
parameter is the same as the max_graph_depth file in tracefs.

Link: http://lkml.kernel.org/r/1488499935-23216-1-git-send-email-todd.e.brandt@linux.intel.com

Signed-off-by: Todd Brandt <todd.e.brandt@linux.intel.com>
[ changed comments about debugfs to tracefs ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-03 09:45:01 -05:00
Steven Rostedt (VMware)
92ad18ec26 ftrace/graph: Do not modify the EMPTY_HASH for the function_graph filter
On boot up, if the kernel command line sets a graph funtion with the kernel
command line options "ftrace_graph_filter" or "ftrace_graph_notrace" then it
updates the corresponding function graph hash, ftrace_graph_hash or
ftrace_graph_notrace_hash respectively. Unfortunately, at boot up, these
variables are pointers to the "EMPTY_HASH" which is a constant used as a
placeholder when a hash has no entities. The problem was that the comand
line version to set the hashes updated the actual EMPTY_HASH instead of
creating a new hash for the function graph. This broke the EMPTY_HASH
because not only did it modify a constant (not sure how that was allowed to
happen, except maybe because it was done at early boot, const variables were
still mutable), but it made the filters have functions listed in them when
they were actually empty.

The kernel command line function needs to allocate a new hash for the
function graph filters and assign the necessary variables to that new hash
instead.

Link: http://lkml.kernel.org/r/1488420091.7212.17.camel@linux.intel.com

Cc: Namhyung Kim <namhyung@kernel.org>
Fixes: b9b0c831be ("ftrace: Convert graph filter to use hash tables")
Reported-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-03 09:44:17 -05:00
Linus Torvalds
080e4168c0 More power management updates for v4.11-rc1
- Fix for a cpuidle menu governor problem that started to take an
    unnecessary spinlock after one of the recent updates and that
    did not play well with the RT patch (Rafael Wysocki).
 
  - Fix for the new intel_pstate operation mode switching feature
    added recently that did not reinitialize P-state limits properly
    when switching operation modes (Rafael Wysocki).
 
  - Removal of unused global notifiers from the PM QoS framework
    (Viresh Kumar).
 
  - Generic power domains framework update to make it handle
    asynchronous invocations of PM callbacks in the "noirq" phases
    of system suspend/hibernation correctly (Ulf Hansson).
 
  - Two hibernation core cleanups (Rafael Wysocki).
 
  - intel_idle cleanup related to the sysfs interface (Len Brown).
 
  - Off-by-one bug fix in the OPP (Operating Performance Points)
    framework (Andrzej Hajda).
 
  - OPP framework's documentation fix (Viresh Kumar).
 
  - cpufreq qoriq driver cleanup (Tang Yuantian).
 
  - Fixes for typos in comments in the device runtime PM framework
    (Christophe Jaillet).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJYuLeGAAoJEILEb/54YlRxJvcP/0BRmh8Hn4Itx/NIWNg71X6j
 U+v8Pn8T3MP33gCcleLYlgre2JUIAUDmhdK99+UOx+/abhjMhQSaF3HhTOwYaPtQ
 6njoHVS0NnfqUf+x5kp+EpRxBVNucYVbdRTVd1DsHIeLLz/96DFOzb/R7tko/pKx
 pFMWvNdotHLLgXOG1UvdRimwTDlFMffxFzD8Se53LPjRXS0S73A5VWfqZOye44Re
 j3W1AJ0Idgq5uduA6J8x1MWbaxDq1h+j6CSUm05yvqrINzxXwXt0Hv6stCQTo+Gb
 YMdiBd8MujNyAgcchw3jiDQ8Vp+zmfLPcHrfPe//SSefj26eB8LyVNSYelvbUdOz
 cNjvyErva37MmaegCL9QC7WbLM+A7VE6bU6YzDCi/rR8jYMJ51Fb9jGiYb/oimry
 OLlblEekikUsskWv4hGV1JVt5VhmUMlagWtexxn+lMszATcZro0tfXu/vgQWksYs
 noUnwuWJWxvj2aNMsvbzW3HLlTGSmYl2UxJ7IymQQaTDblwF9Kg61rm3+5coUctd
 ifceynDVp9Gju25faYgZ+Dq9+o8ktlOGOHRRPdLIRNJ/T+4tUDnlGkdbPb+Tfn03
 XUIzYCu74U8/oW8gOk6t0WpmWzvxEXNgdirdEIR6y3loYIC0Jr3v4gyD975Eug74
 Hzfrdg7ignAmWV+nf6UY
 =SeUF
 -----END PGP SIGNATURE-----

Merge tag 'pm-extra-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates deom Rafael Wysocki:
 "These fix two bugs introduced by recent power management updates (in
  the cpuidle menu governor and intel_pstate) and a few other issues,
  clean up things and remove unused code.

  Specifics:

   - Fix for a cpuidle menu governor problem that started to take an
     unnecessary spinlock after one of the recent updates and that did
     not play well with the RT patch (Rafael Wysocki).

   - Fix for the new intel_pstate operation mode switching feature added
     recently that did not reinitialize P-state limits properly when
     switching operation modes (Rafael Wysocki).

   - Removal of unused global notifiers from the PM QoS framework
     (Viresh Kumar).

   - Generic power domains framework update to make it handle
     asynchronous invocations of PM callbacks in the "noirq" phases of
     system suspend/hibernation correctly (Ulf Hansson).

   - Two hibernation core cleanups (Rafael Wysocki).

   - intel_idle cleanup related to the sysfs interface (Len Brown).

   - Off-by-one bug fix in the OPP (Operating Performance Points)
     framework (Andrzej Hajda).

   - OPP framework's documentation fix (Viresh Kumar).

   - cpufreq qoriq driver cleanup (Tang Yuantian).

   - Fixes for typos in comments in the device runtime PM framework
     (Christophe Jaillet)"

* tag 'pm-extra-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / OPP: Documentation: Fix opp-microvolt in examples
  intel_idle: stop exposing platform acronyms in sysfs
  cpufreq: intel_pstate: Fix limits issue with operation mode switching
  PM / hibernate: Define pr_fmt() and use pr_*() instead of printk()
  PM / hibernate: Untangle power_down()
  cpuidle: menu: Avoid taking spinlock for accessing QoS values
  PM / QoS: Remove global notifiers
  PM / runtime: Fix some typos
  cpufreq: qoriq: clean up unused code
  PM / OPP: fix off-by-one bug in dev_pm_opp_get_max_volt_latency loop
  PM / Domains: Power off masters immediately in the power off sequence
  PM / Domains: Rename is_async to one_dev_on for genpd_power_off()
  PM / Domains: Move genpd_power_off() above genpd_power_on()
2017-03-02 17:33:52 -08:00
Ingo Molnar
50ff9d1300 sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>
It's not used by any of the scheduler methods, but <linux/sched/task_stack.h>
needs it to pick up STACK_END_MAGIC.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03 01:45:39 +01:00
Ingo Molnar
cd9c513be3 sched/headers: Remove <linux/rwsem.h> from <linux/sched.h>
This is a stray header that is not needed by anything in sched.h,
so remove it.

Update files that relied on the stray inclusion.

This reduces the size of the header dependency graph.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03 01:45:36 +01:00
Ingo Molnar
1050b27c52 sched/headers: Move cputime functionality from <linux/sched.h> and <linux/cputime.h> into <linux/sched/cputime.h>
Move cputime related functionality out of <linux/sched.h>, as most code
that includes <linux/sched.h> does not use that functionality.

Move data types that are not included in task_struct directly to
the signal definitions, into <linux/sched/signal.h>.

Also merge the (small) existing <linux/cputime.h> header into <linux/sched/cputime.h>.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03 01:45:22 +01:00
Ingo Molnar
56cd697366 sched/headers: Move the task_lock()/unlock() APIs to <linux/sched/task.h>
The task_lock()/task_unlock() APIs are not realated to core scheduling,
they are task lifetime APIs, i.e. they belong into <linux/sched/task.h>.

Move them.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03 01:45:22 +01:00
Ingo Molnar
6bfbaa51ed sched/headers, RCU: Move rcu_copy_process() from <linux/sched/task.h> to kernel/fork.c
Move rcu_copy_process() into kernel/fork.c, which is the only
user of this inline function.

This simplifies <linux/sched/task.h> to the level that <linux/sched.h>
does not have to be included in it anymore - which change is done
in a subsequent patch.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03 01:43:46 +01:00
Ingo Molnar
c3edc4010e sched/headers: Move task_struct::signal and task_struct::sighand types and accessors into <linux/sched/signal.h>
task_struct::signal and task_struct::sighand are pointers, which would normally make it
straightforward to not define those types in sched.h.

That is not so, because the types are accompanied by a myriad of APIs (macros and inline
functions) that dereference them.

Split the types and the APIs out of sched.h and move them into a new header, <linux/sched/signal.h>.

With this change sched.h does not know about 'struct signal' and 'struct sighand' anymore,
trying to put accessors into sched.h as a test fails the following way:

  ./include/linux/sched.h: In function ‘test_signal_types’:
  ./include/linux/sched.h:2461:18: error: dereferencing pointer to incomplete type ‘struct signal_struct’
                    ^

This reduces the size and complexity of sched.h significantly.

Update all headers and .c code that relied on getting the signal handling
functionality from <linux/sched.h> to include <linux/sched/signal.h>.

The list of affected files in the preparatory patch was partly generated by
grepping for the APIs, and partly by doing coverage build testing, both
all[yes|mod|def|no]config builds on 64-bit and 32-bit x86, and an array of
cross-architecture builds.

Nevertheless some (trivial) build breakage is still expected related to rare
Kconfig combinations and in-flight patches to various kernel code, but most
of it should be handled by this patch.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-03 01:43:37 +01:00
Rafael J. Wysocki
9b5e9cb164 Merge branches 'pm-cpuidle', 'pm-cpufreq' and 'pm-sleep'
* pm-cpuidle:
  intel_idle: stop exposing platform acronyms in sysfs
  cpuidle: menu: Avoid taking spinlock for accessing QoS values

* pm-cpufreq:
  cpufreq: intel_pstate: Fix limits issue with operation mode switching
  cpufreq: qoriq: clean up unused code

* pm-sleep:
  PM / hibernate: Define pr_fmt() and use pr_*() instead of printk()
  PM / hibernate: Untangle power_down()
2017-03-03 00:43:11 +01:00
Boqun Feng
857811a371 locking/ww_mutex: Adjust the lock number for stress test
Because there are only 12 bits in held_lock::references, so we only
support 4095 nested lock held in the same time, adjust the lock number
for ww_mutex stress test to kill one lockdep splat:

  [ ] [ BUG: bad unlock balance detected! ]
  [ ] kworker/u2:0/5 is trying to release lock (ww_class_mutex) at:
  [ ] ww_mutex_unlock()
  [ ] but there are no more locks to release!
  ...

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170301150138.hdixnmafzfsox7nn@tardis.cn.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 09:00:39 +01:00
Peter Zijlstra
7fb4a2cea6 locking/lockdep: Add nest_lock integrity test
Boqun reported that hlock->references can overflow. Add a debug test
for that to generate a clear error when this happens.

Without this, lockdep is likely to report a mysterious failure on
unlock.

Reported-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 09:00:38 +01:00
Chris Wilson
2b232e0c3b locking/ww_mutex: Replace cpu_relax() with cond_resched() for tests
When busy-spinning on a ww_mutex_trylock(), we depend upon the other
thread advancing and releasing the lock. This can not happen on a single
CPU unless we relinquish it:

  [ ] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:18]
  ...
  [ ] Call Trace:
  [ ]  mutex_trylock()
  [ ]  test_mutex_work+0x31/0x56
  [ ]  process_one_work+0x1b4/0x2f9
  [ ]  worker_thread+0x1b0/0x27c
  [ ]  kthread+0xd1/0xd3
  [ ]  ret_from_fork+0x19/0x30

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: f2a5fec173 ("locking/ww_mutex: Begin kselftests for ww_mutex")
Link: http://lkml.kernel.org/r/20170228094011.2595-1-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 09:00:38 +01:00
Peter Zijlstra
0ba87bb27d sched/core: Fix pick_next_task() for RT,DL
Pavan noticed that the following commit:

  49ee576809 ("sched/core: Optimize pick_next_task() for idle_sched_class")

... broke RT,DL balancing by robbing them of the opportinty to do new-'idle'
balancing when their last runnable task (on that runqueue) goes away.

Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 49ee576809 ("sched/core: Optimize pick_next_task() for idle_sched_class")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:50:17 +01:00
Peter Zijlstra
4c77b18cf8 sched/fair: Make select_idle_cpu() more aggressive
Kitsunyan reported desktop latency issues on his Celeron 887 because
of commit:

  1b568f0aab ("sched/core: Optimize SCHED_SMT")

... even though his CPU doesn't do SMT.

The effect of running the SMT code on a !SMT part is basically a more
aggressive select_idle_cpu(). Removing the avg condition fixed things
for him.

I also know FB likes this test gone, even though other workloads like
having it.

For now, take it out by default, until we get a better idea.

Reported-by: kitsunyan <kitsunyan@inbox.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:50:17 +01:00
Ingo Molnar
0881e7bd34 sched/headers: Prepare to move the get_task_struct()/put_task_struct() and related APIs from <linux/sched.h> to <linux/sched/task.h>
But first update usage sites with the new header dependency.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:40 +01:00
Ingo Molnar
1777e46355 sched/headers: Prepare to move _init() prototypes from <linux/sched.h> to <linux/sched/init.h>
But first introduce a trivial header and update usage sites.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:40 +01:00
Ingo Molnar
61855b6b03 sched/headers: Prepare to move exit_files() and exit_itimers() from <linux/sched.h> to <linux/sched/task.h>
But first update the usage site.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:40 +01:00
Ingo Molnar
5c2c5c5514 sched/headers, vfs/execve: Prepare to move the do_execve*() prototypes from <linux/sched.h> to <linux/binfmts.h>
But first update the usage sites with the new header dependency.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:39 +01:00
Ingo Molnar
3905f9ad45 sched/headers: Prepare to move sched_info_on() and force_schedstat_enabled() from <linux/sched.h> to <linux/sched/stat.h>
But first update usage sites with the new header dependency.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:39 +01:00
Ingo Molnar
32ef5517c2 sched/headers: Prepare to move cputime functionality from <linux/sched.h> into <linux/sched/cputime.h>
Introduce a trivial, mostly empty <linux/sched/cputime.h> header
to prepare for the moving of cputime functionality out of sched.h.

Update all code that relies on these facilities.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:39 +01:00
Ingo Molnar
f719ff9bce sched/headers: Prepare to move the task_lock()/unlock() APIs to <linux/sched/task.h>
But first update the code that uses these facilities with the
new header.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:38 +01:00
Ingo Molnar
b2d0910310 sched/headers: Prepare to use <linux/rcuupdate.h> instead of <linux/rculist.h> in <linux/sched.h>
We don't actually need the full rculist.h header in sched.h anymore,
we will be able to include the smaller rcupdate.h header instead.

But first update code that relied on the implicit header inclusion.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:38 +01:00
Ingo Molnar
9164bb4a18 sched/headers: Prepare to move 'init_task' and 'init_thread_union' from <linux/sched.h> to <linux/sched/task.h>
Update all usage sites first.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:38 +01:00
Ingo Molnar
589ee62844 sched/headers: Prepare to remove the <linux/mm_types.h> dependency from <linux/sched.h>
Update code that relied on sched.h including various MM types for them.

This will allow us to remove the <linux/mm_types.h> include from <linux/sched.h>.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:37 +01:00
Ingo Molnar
f361bf4a66 sched/headers: Prepare for the reduction of <linux/sched.h>'s signal API dependency
Instead of including the full <linux/signal.h>, we are going to include the
types-only <linux/signal_types.h> header in <linux/sched.h>, to further
decouple the scheduler header from the signal headers.

This means that various files which relied on the full <linux/signal.h> need
to be updated to gain an explicit dependency on it.

Update the code that relies on sched.h's inclusion of the <linux/signal.h> header.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:37 +01:00
Ingo Molnar
dfc3401a33 sched/headers: Prepare to move the 'root_task_group' declaration to <linux/sched/autogroup.h>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:36 +01:00
Ingo Molnar
68db0cf106 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task_stack.h>
We are going to split <linux/sched/task_stack.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/task_stack.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:36 +01:00
Ingo Molnar
299300258d sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task.h>
We are going to split <linux/sched/task.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/task.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:35 +01:00
Ingo Molnar
ef8bd77f33 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/hotplug.h>
We are going to split <linux/sched/hotplug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/hotplug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:35 +01:00
Ingo Molnar
b17b01533b sched/headers: Prepare for new header dependencies before moving code to <linux/sched/debug.h>
We are going to split <linux/sched/debug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/debug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:34 +01:00
Ingo Molnar
370c91355c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/nohz.h>
We are going to split <linux/sched/nohz.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/nohz.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:34 +01:00
Ingo Molnar
03441a3482 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/stat.h>
We are going to split <linux/sched/stat.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/stat.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:34 +01:00
Ingo Molnar
174cd4b1e5 sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>
Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:32 +01:00
Ingo Molnar
037741a6d4 sched/headers: Prepare for the removal of <linux/rtmutex.h> from <linux/sched.h>
Fix up missing #includes in other places that rely on sched.h doing that for them.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:32 +01:00
Ingo Molnar
5b825c3af1 sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h>
Add #include <linux/cred.h> dependencies to all .c files rely on sched.h
doing that for them.

Note that even if the count where we need to add extra headers seems high,
it's still a net win, because <linux/sched.h> is included in over
2,200 files ...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:31 +01:00
Ingo Molnar
7fce777cd4 sched/headers: Prepare header dependency changes, move the <asm/paravirt.h> include to kernel/sched/sched.h
Recent header reorganizations unearthed this hidden dependency:

  kernel/sched/core.c:199:25: error: 'paravirt_steal_rq_enabled' undeclared (first use in this function)
  kernel/sched/core.c:200:11: error: implicit declaration of function 'paravirt_steal_clock' [-Werror=implicit-function-declaration]

So move the asm/paravirt.h include from kernel/sched/cpuclock.c to kernel/sched/sched.h.

( NOTE: We do this change before doing the changes that introduce the build failure,
        so the series remains fully bisectable. )

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:31 +01:00
Ingo Molnar
6a3827d750 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/numa_balancing.h>
We are going to split <linux/sched/numa_balancing.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/numa_balancing.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:30 +01:00
Ingo Molnar
55687da166 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/cpufreq.h>
We are going to split <linux/sched/cpufreq.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/cpufreq.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:30 +01:00
Ingo Molnar
38b8d208a4 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/nmi.h>
We are going to move softlockup APIs out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

<linux/nmi.h> already includes <linux/sched.h>.

Include the <linux/nmi.h> header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:30 +01:00
Ingo Molnar
8703e8a465 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/user.h>
We are going to split <linux/sched/user.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/user.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:29 +01:00
Ingo Molnar
3f07c01441 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:29 +01:00
Ingo Molnar
f7ccbae45c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/coredump.h>
We are going to split <linux/sched/coredump.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/coredump.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:28 +01:00
Ingo Molnar
6e84f31522 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h>
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/mm.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

The APIs that are going to be moved first are:

   mm_alloc()
   __mmdrop()
   mmdrop()
   mmdrop_async_fn()
   mmdrop_async()
   mmget_not_zero()
   mmput()
   mmput_async()
   get_task_mm()
   mm_access()
   mm_release()

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:28 +01:00
Ingo Molnar
4eb5aaa3af sched/headers: Prepare for new header dependencies before moving code to <linux/sched/autogroup.h>
We are going to split <linux/sched/autogroup.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/autogroup.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:28 +01:00
Ingo Molnar
4f17722c72 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/loadavg.h>
We are going to split <linux/sched/loadavg.h> out of <linux/sched.h>, which
will have to be picked up from a couple of .c files.

Create a trivial placeholder <linux/sched/topology.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:27 +01:00
Ingo Molnar
ae7e81c077 sched/headers: Prepare for new header dependencies before moving code to <uapi/linux/sched/types.h>
We are going to move scheduler ABI details to <uapi/linux/sched/types.h>,
which will be used from a number of .c files.

Create empty placeholder header that maps to <linux/types.h>.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:27 +01:00
Ingo Molnar
e601757102 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h>
We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which
will have to be picked up from other headers and .c files.

Create a trivial placeholder <linux/sched/clock.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:27 +01:00
Ingo Molnar
84f001e157 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/wake_q.h>
We are going to split <linux/sched/wake_q.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/wake_q.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:26 +01:00
Ingo Molnar
4c822698cb sched/headers: Prepare for new header dependencies before moving code to <linux/sched/idle.h>
We are going to split  <linux/sched/idle.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/idle.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:26 +01:00
Ingo Molnar
105ab3d8ce sched/headers: Prepare for new header dependencies before moving code to <linux/sched/topology.h>
We are going to split <linux/sched/topology.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/topology.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:26 +01:00
Ingo Molnar
314ff7851f mm/vmacache, sched/headers: Introduce 'struct vmacache' and move it from <linux/sched.h> to <linux/mm_types>
The <linux/sched.h> header includes various vmacache related defines,
which are arguably misplaced.

Move them to mm_types.h and minimize the sched.h impact by putting
all task vmacache state into a new 'struct vmacache' structure.

No change in functionality.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:25 +01:00
Ingo Molnar
780de9dd27 sched/headers, cgroups: Remove the threadgroup_change_*() wrappery
threadgroup_change_begin()/end() is a pointless wrapper around
cgroup_threadgroup_change_begin()/end(), minus a might_sleep()
in the !CONFIG_CGROUPS=y case.

Remove the wrappery, move the might_sleep() (the down_read()
already has a might_sleep() check).

This debloats <linux/sched.h> a bit and simplifies this API.

Update all call sites.

No change in functionality.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:25 +01:00
Ingo Molnar
f9411ebe3d rcu: Separate the RCU synchronization types and APIs into <linux/rcupdate_wait.h>
So rcupdate.h is a pretty complex header, in particular it includes
<linux/completion.h> which includes <linux/wait.h> - creating a
dependency that includes <linux/wait.h> in <linux/sched.h>,
which prevents the isolation of <linux/sched.h> from the derived
<linux/wait.h> header.

Solve part of the problem by decoupling rcupdate.h from completions:
this can be done by separating out the rcu_synchronize types and APIs,
and updating their usage sites.

Since this is a mostly RCU-internal types this will not just simplify
<linux/sched.h>'s dependencies, but will make all the hundreds of
.c files that include rcupdate.h but not completions or wait.h build
faster.

( For rcutiny this means that two dependent APIs have to be uninlined,
  but that shouldn't be much of a problem as they are rare variants. )

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:24 +01:00
Ingo Molnar
4b53a3412d sched/core: Remove the tsk_nr_cpus_allowed() wrapper
tsk_nr_cpus_allowed() too is a pretty pointless wrapper that
is not used consistently and which makes the code both harder
to read and longer as well.

So remove it - this also shrinks <linux/sched.h> a bit.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:24 +01:00
Ingo Molnar
0c98d344fe sched/core: Remove the tsk_cpus_allowed() wrapper
So the original intention of tsk_cpus_allowed() was to 'future-proof'
the field - but it's pretty ineffectual at that, because half of
the code uses ->cpus_allowed directly ...

Also, the wrapper makes the code longer than the original expression!

So just get rid of it. This also shrinks <linux/sched.h> a bit.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:24 +01:00
Ingo Molnar
59ddbcb2f4 sched/core: Move the get_preempt_disable_ip() inline to sched/core.c
It's defined in <linux/sched.h>, but nothing outside the scheduler
uses it - so move it to the sched/core.c usage site.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:23 +01:00
Ingo Molnar
c930b2c0de sched/core: Convert ___assert_task_state() link time assert to BUILD_BUG_ON()
The length of TASK_STATE_TO_CHAR_STR was still checked using the old
link-time manual error method - convert it to BUILD_BUG_ON(). This
has a couple of advantages:

 - it's more obvious what's going on

 - it reduces the size and complexity of <linux/sched.h>

 - BUILD_BUG_ON() will fail during compilation, with a clearer
   error message than the link time assert.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:23 +01:00
Gary Lin
eba38a9682 bpf: update the comment about the length of analysis
Commit 07016151a4 ("bpf, verifier: further improve search
pruning") increased the limit of processed instructions from
32k to 64k, but the comment still mentioned the 32k limit.
This commit updates the comment to reflect the change.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Gary Lin <glin@suse.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01 14:56:50 -08:00
Anton Blanchard
6b0b755142 perf/core: Rename CONFIG_[UK]PROBE_EVENT to CONFIG_[UK]PROBE_EVENTS
We have uses of CONFIG_UPROBE_EVENT and CONFIG_KPROBE_EVENT as
well as CONFIG_UPROBE_EVENTS and CONFIG_KPROBE_EVENTS.

Consistently use the plurals.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: davem@davemloft.net
Cc: sparclinux@vger.kernel.org
Link: http://lkml.kernel.org/r/20170216060050.20866-1-anton@ozlabs.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-01 10:26:39 +01:00
Linus Torvalds
65314ed08e Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Two rq-clock warnings related fixes, plus a cgroups related crash fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/cgroup: Move sched_online_group() back into css_online() to fix crash
  sched/fair: Update rq clock before changing a task's CPU affinity
  sched/core: Fix update_rq_clock() splat on hotplug (and suspend/resume)
2017-02-28 11:44:01 -08:00
Linus Torvalds
3f26b0c876 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixes on the kernel and tooling side - nothing in particular
  stands out"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  perf/core: Fix the perf_cpu_time_max_percent check
  perf/core: Fix perf_event_enable_on_exec() timekeeping (again)
  perf/core: Remove confusing comment and move put_ctx()
  perf record: Honor --quiet option properly
  perf annotate: Add -q/--quiet option
  perf diff: Add -q/--quiet option
  perf report: Add -q/--quiet option
  perf utils: Check verbose flag properly
  perf utils: Add perf_quiet_option()
  perf record: Add -a as default target
  perf stat: Add -a as default target
  perf tools: Fail on using multiple bits long terms without value
  perf tools: Move new_term arguments into struct parse_events_term template
  perf build: Add special fixdep cleaning rule
  perf tools: Replace _SC_NPROCESSORS_CONF with max_present_cpu in cpu_topology_map
  perf header: Make build_cpu_topology skip offline/absent CPUs
  perf cpumap: Add cpu__max_present_cpu()
  perf session: Fix DEBUG=1 build with clang
  tools lib traceevent: It's preempt not prempt
  perf python: Filter out -specs=/a/b/c from the python binding cc options
  ...
2017-02-28 11:38:18 -08:00
Linus Torvalds
86292b33d4 Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

 - a few MM remainders

 - misc things

 - autofs updates

 - signals

 - affs updates

 - ipc

 - nilfs2

 - spelling.txt updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (78 commits)
  mm, x86: fix HIGHMEM64 && PARAVIRT build config for native_pud_clear()
  mm: add arch-independent testcases for RODATA
  hfs: atomically read inode size
  mm: clarify mm_struct.mm_{users,count} documentation
  mm: use mmget_not_zero() helper
  mm: add new mmget() helper
  mm: add new mmgrab() helper
  checkpatch: warn when formats use %Z and suggest %z
  lib/vsprintf.c: remove %Z support
  scripts/spelling.txt: add some typo-words
  scripts/spelling.txt: add "followings" pattern and fix typo instances
  scripts/spelling.txt: add "therfore" pattern and fix typo instances
  scripts/spelling.txt: add "overwriten" pattern and fix typo instances
  scripts/spelling.txt: add "overwritting" pattern and fix typo instances
  scripts/spelling.txt: add "deintialize(d)" pattern and fix typo instances
  scripts/spelling.txt: add "disassocation" pattern and fix typo instances
  scripts/spelling.txt: add "omited" pattern and fix typo instances
  scripts/spelling.txt: add "explictely" pattern and fix typo instances
  scripts/spelling.txt: add "applys" pattern and fix typo instances
  scripts/spelling.txt: add "configuartion" pattern and fix typo instances
  ...
2017-02-27 23:09:29 -08:00
Linus Torvalds
f7878dc3a9 Merge branch 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Several noteworthy changes.

   - Parav's rdma controller is finally merged. It is very straight
     forward and can limit the abosolute numbers of common rdma
     constructs used by different cgroups.

   - kernel/cgroup.c got too chubby and disorganized. Created
     kernel/cgroup/ subdirectory and moved all cgroup related files
     under kernel/ there and reorganized the core code. This hurts for
     backporting patches but was long overdue.

   - cgroup v2 process listing reimplemented so that it no longer
     depends on allocating a buffer large enough to cache the entire
     result to sort and uniq the output. v2 has always mangled the sort
     order to ensure that users don't depend on the sorted output, so
     this shouldn't surprise anybody. This makes the pid listing
     functions use the same iterators that are used internally, which
     have to have the same iterating capabilities anyway.

   - perf cgroup filtering now works automatically on cgroup v2. This
     patch was posted a long time ago but somehow fell through the
     cracks.

   - misc fixes asnd documentation updates"

* 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (27 commits)
  kernfs: fix locking around kernfs_ops->release() callback
  cgroup: drop the matching uid requirement on migration for cgroup v2
  cgroup, perf_event: make perf_event controller work on cgroup2 hierarchy
  cgroup: misc cleanups
  cgroup: call subsys->*attach() only for subsystems which are actually affected by migration
  cgroup: track migration context in cgroup_mgctx
  cgroup: cosmetic update to cgroup_taskset_add()
  rdmacg: Fixed uninitialized current resource usage
  cgroup: Add missing cgroup-v2 PID controller documentation.
  rdmacg: Added documentation for rdmacg
  IB/core: added support to use rdma cgroup controller
  rdmacg: Added rdma cgroup controller
  cgroup: fix a comment typo
  cgroup: fix RCU related sparse warnings
  cgroup: move namespace code to kernel/cgroup/namespace.c
  cgroup: rename functions for consistency
  cgroup: move v1 mount functions to kernel/cgroup/cgroup-v1.c
  cgroup: separate out cgroup1_kf_syscall_ops
  cgroup: refactor mount path and clearly distinguish v1 and v2 paths
  cgroup: move cgroup v1 specific code to kernel/cgroup/cgroup-v1.c
  ...
2017-02-27 21:41:08 -08:00
Vegard Nossum
388f793455 mm: use mmget_not_zero() helper
We already have the helper, we can convert the rest of the kernel
mechanically using:

  git grep -l 'atomic_inc_not_zero.*mm_users' | xargs sed -i 's/atomic_inc_not_zero(&\(.*\)->mm_users)/mmget_not_zero\(\1\)/'

This is needed for a later patch that hooks into the helper, but might
be a worthwhile cleanup on its own.

Link: http://lkml.kernel.org/r/20161218123229.22952-3-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:48 -08:00
Vegard Nossum
3fce371bfa mm: add new mmget() helper
Apart from adding the helper function itself, the rest of the kernel is
converted mechanically using:

  git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_users);/mmget\(\1\);/'
  git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_users);/mmget\(\&\1\);/'

This is needed for a later patch that hooks into the helper, but might
be a worthwhile cleanup on its own.

(Michal Hocko provided most of the kerneldoc comment.)

Link: http://lkml.kernel.org/r/20161218123229.22952-2-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:48 -08:00
Vegard Nossum
f1f1007644 mm: add new mmgrab() helper
Apart from adding the helper function itself, the rest of the kernel is
converted mechanically using:

  git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
  git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'

This is needed for a later patch that hooks into the helper, but might
be a worthwhile cleanup on its own.

(Michal Hocko provided most of the kerneldoc comment.)

Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:48 -08:00
Alexey Dobriyan
5b5e0928f7 lib/vsprintf.c: remove %Z support
Now that %z is standartised in C99 there is no reason to support %Z.
Unlike %L it doesn't even make format strings smaller.

Use BUILD_BUG_ON in a couple ATM drivers.

In case anyone didn't notice lib/vsprintf.o is about half of SLUB which
is in my opinion is quite an achievement.  Hopefully this patch inspires
someone else to trim vsprintf.c more.

Link: http://lkml.kernel.org/r/20170103230126.GA30170@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:47 -08:00
Masahiro Yamada
b564d62e67 scripts/spelling.txt: add "varible" pattern and fix typo instances
Fix typos and add the following to the scripts/spelling.txt:

  varible||variable

While we are here, tidy up the comment blocks that fit in a single line
for drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c and
net/sctp/transport.c.

Link: http://lkml.kernel.org/r/1481573103-11329-11-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:47 -08:00
Masahiro Yamada
9332ef9dbd scripts/spelling.txt: add "an user" pattern and fix typo instances
Fix typos and add the following to the scripts/spelling.txt:

  an user||a user
  an userspace||a userspace

I also added "userspace" to the list since it is a common word in Linux.
I found some instances for "an userfaultfd", but I did not add it to the
list.  I felt it is endless to find words that start with "user" such as
"userland" etc., so must draw a line somewhere.

Link: http://lkml.kernel.org/r/1481573103-11329-4-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:46 -08:00
Amit Pundir
0d31a194d5 config: android-base: enable hardened usercopy and kernel ASLR
Enable CONFIG_HARDENED_USERCOPY and CONFIG_RANDOMIZE_BASE in Android
base config fragment.

Reviewed at https://android-review.googlesource.com/#/c/283659/
Reviewed at https://android-review.googlesource.com/#/c/278133/

Link: http://lkml.kernel.org/r/1481113148-29204-2-git-send-email-amit.pundir@linaro.org
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Cc: Rob Herring <rob.herring@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:46 -08:00
Daniel Micay
2d75cb59a5 config: android-recommended: disable aio support
The aio interface adds substantial attack surface for a feature that's
not being exposed by Android at all.  It's unlikely that anyone is using
the kernel feature directly either.  This feature is rarely used even on
servers.  The glibc POSIX aio calls really use thread pools.  The lack
of widespread usage also means this is relatively poorly audited/tested.

The kernel's aio rarely provides performance benefits over using a
thread pool and is quite incomplete in terms of system call coverage
along with having edge cases where blocking can occur.  Part of the
performance issue is the fact that it only supports direct io, not
buffered io.  The existing API is considered fundamentally flawed and
it's unlikely it will be expanded, but rather replaced:

  https://marc.info/?l=linux-aio&m=145255815216051&w=2

Since ext4 encryption means no direct io support, kernel aio isn't even
going to work properly on Android devices using file-based encryption.

Reviewed-at: https://android-review.googlesource.com/#/c/292158/

Link: http://lkml.kernel.org/r/1481113148-29204-1-git-send-email-amit.pundir@linaro.org
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Cc: Rob Herring <rob.herring@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:46 -08:00
Stas Sergeev
441398d378 sigaltstack: support SS_AUTODISARM for CONFIG_COMPAT
Currently SS_AUTODISARM is not supported in compatibility mode, but does
not return -EINVAL either.  This makes dosemu built with -m32 on x86_64
to crash.  Also the kernel's sigaltstack selftest fails if compiled with
-m32.

This patch adds the needed support.

Link: http://lkml.kernel.org/r/20170205101213.8163-2-stsp@list.ru
Signed-off-by: Stas Sergeev <stsp@users.sourceforge.net>
Cc: Milosz Tanski <milosz@adfin.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Waiman Long <Waiman.Long@hpe.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:45 -08:00
Linus Torvalds
45554b2357 Commit 79c6f448c8 ("tracing: Fix hwlat kthread migration") fixed a
bug that was caused by a race condition in initializing the hwlat
 thread. When fixing this code, I realized that it should have been done
 differently. Instead of doing the rewrite and sending that to stable,
 I just sent the above commit to fix the bug that should be back ported.
 
 This commit is on top of the quick fix commit to rewrite the code the
 way it should have been written in the first place.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJYtDsNFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 +CQH/0BhSwjiCnlHAkHNFKn47O0yDtxBLj8ar4bUQRacDXeQyAGDP13hn3q3LvG9
 CRzDXaYrusA3fjGcgmtyU33am6LK84dPn5u2HSyEalDZNBel8l6oYLUZVWLgef02
 x43949nOeBy+KO02Y118zGyxFEPtYBnCVpguMa4vdVgnr04gECo2VH5FjnLMslKM
 W1j/WrbVaO8WObh7X01JZzozWwp3McW4x6H8PUWaHjnhN/Iv6+YGtNN/Sa4cq4V/
 CyCfxrZvN/Y/uMSGzVlhuXxeRc2PRsjjmAqN+8P4KZIGW5BstdiWbrVj+KaBLR9z
 6QERD3atiEIYI/QGEep7ZH795PI=
 =Bdwk
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull another tracing update from Steven Rostedt:
 "Commit 79c6f448c8 ("tracing: Fix hwlat kthread migration") fixed a
  bug that was caused by a race condition in initializing the hwlat
  thread. When fixing this code, I realized that it should have been
  done differently. Instead of doing the rewrite and sending that to
  stable, I just sent the above commit to fix the bug that should be
  back ported.

  This commit is on top of the quick fix commit to rewrite the code the
  way it should have been written in the first place"

* tag 'trace-v4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Clean up the hwlat binding code
2017-02-27 13:36:19 -08:00
Linus Torvalds
79b17ea740 This release has no new tracing features, just clean ups, minor fixes
and small optimizations.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJYtDiAFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 KygH/3sxuM9MCeJ29JsjmV49fHcNqryNZdvSadmnysPm+dFPiI6IgIIbh5R8H89b
 2V2gfQSmOTKHu3/wvJr/MprkGP275sWlZPORYFLDl/+NE/3q7g0NKOMWunLcv6dH
 QQRJIFjSMeGawA3KYBEcwBYMlgNd2VgtTxqLqSBhWth5omV6UevJNHhe3xzZ4nEE
 YbRX2mxwOuRHOyFp0Hem+Bqro4z1VXJ6YDxOvae2PP8krrIhIHYw9EI22GK68a2g
 EyKqKPPaEzfU8IjHIQCqIZta5RufnCrDbfHU0CComPANBRGO7g+ZhLO11a/Z316N
 lyV7JqtF680iem7NKcQlwEwhlLE=
 =HJnl
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "This release has no new tracing features, just clean ups, minor fixes
  and small optimizations"

* tag 'trace-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (25 commits)
  tracing: Remove outdated ring buffer comment
  tracing/probes: Fix a warning message to show correct maximum length
  tracing: Fix return value check in trace_benchmark_reg()
  tracing: Use modern function declaration
  jump_label: Reduce the size of struct static_key
  tracing/probe: Show subsystem name in messages
  tracing/hwlat: Update old comment about migration
  timers: Make flags output in the timer_start tracepoint useful
  tracing: Have traceprobe_probes_write() not access userspace unnecessarily
  tracing: Have COMM event filter key be treated as a string
  ftrace: Have set_graph_function handle multiple functions in one write
  ftrace: Do not hold references of ftrace_graph_{notrace_}hash out of graph_lock
  tracing: Reset parser->buffer to allow multiple "puts"
  ftrace: Have set_graph_functions handle write with RDWR
  ftrace: Reset fgd->hash in ftrace_graph_write()
  ftrace: Replace (void *)1 with a meaningful macro name FTRACE_GRAPH_EMPTY
  ftrace: Create a slight optimization on searching the ftrace_hash
  tracing: Add ftrace_hash_key() helper function
  ftrace: Convert graph filter to use hash tables
  ftrace: Expose ftrace_hash_empty and ftrace_lookup_ip
  ...
2017-02-27 13:26:17 -08:00
Chunyu Hu
3a150df945 tracing: Fix code comment for ftrace_ops_get_func()
There is no function 'ftrace_ops_recurs_func' existing in the current code,
it was renamed to ftrace_ops_assist_func() in commit c68c0fa293
("ftrace: Have ftrace_ops_get_func() handle RCU and PER_CPU flags too").
Update the comment to the correct function name.

Link: http://lkml.kernel.org/r/1487723366-14463-1-git-send-email-chuhu@redhat.com

Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-27 11:11:26 -05:00
Rafael J. Wysocki
2872de1382 PM / hibernate: Define pr_fmt() and use pr_*() instead of printk()
Define a pr_fmt() for hibernate.c and convert all of the explicit
printk() calls into corresponding pr_*() so that they use the
pr_fmt() format.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-02-27 15:10:20 +01:00
Rafael J. Wysocki
81d45bdf89 PM / hibernate: Untangle power_down()
The power_down() routine in the core hibernation code is not exactly
straightforward (to put it lightly), so clean it up to make it avoid
invoking itself recursively, among other things.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-02-27 15:09:28 +01:00
Linus Torvalds
7b46588f36 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - almost all of the rest of MM

 - misc bits

 - KASAN updates

 - procfs

 - lib/ updates

 - checkpatch updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (124 commits)
  checkpatch: remove false unbalanced braces warning
  checkpatch: notice unbalanced else braces in a patch
  checkpatch: add another old address for the FSF
  checkpatch: update $logFunctions
  checkpatch: warn on logging continuations
  checkpatch: warn on embedded function names
  lib/lz4: remove back-compat wrappers
  fs/pstore: fs/squashfs: change usage of LZ4 to work with new LZ4 version
  crypto: change LZ4 modules to work with new LZ4 module version
  lib/decompress_unlz4: change module to work with new LZ4 module version
  lib: update LZ4 compressor module
  lib/test_sort.c: make it explicitly non-modular
  lib: add CONFIG_TEST_SORT to enable self-test of sort()
  rbtree: use designated initializers
  linux/kernel.h: fix DIV_ROUND_CLOSEST to support negative divisors
  lib/find_bit.c: micro-optimise find_next_*_bit
  lib: add module support to atomic64 tests
  lib: add module support to glob tests
  lib: add module support to crc32 tests
  kernel/ksysfs.c: add __ro_after_init to bin_attribute structure
  ...
2017-02-25 10:29:09 -08:00
Bhumika Goyal
738bc38d49 kernel/ksysfs.c: add __ro_after_init to bin_attribute structure
The object notes_attr of type bin_attribute is not modified after
getting initailized by ksysfs_init.  Apart from initialization in
ksysfs_init it is also passed as an argument to the function
sysfs_create_bin_file but this argument is of type const.  Therefore,
add __ro_after_init to its declaration.

Link: http://lkml.kernel.org/r/1486839969-16891-1-git-send-email-bhumirks@gmail.com
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:56 -08:00
Viresh Kumar
3e6daded1f kernel/notifier.c: simplify expression
NOTIFY_STOP_MASK (0x8000) has only one bit set and there is no need to
compare output of "ret & NOTIFY_STOP_MASK" to NOTIFY_STOP_MASK.  We just
need to make sure the output is non-zero, that's it.

Link: http://lkml.kernel.org/r/88ee58264a2bfab1c97ffc8ac753e25f55f57c10.1483593065.git.viresh.kumar@linaro.org
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:56 -08:00
Mike Rapoport
ca49ca7114 userfaultfd: non-cooperative: add event for exit() notification
Allow userfaultfd monitor track termination of the processes that have
memory backed by the uffd.

[rppt@linux.vnet.ibm.com: add comment]
  Link: http://lkml.kernel.org/r/20170202135448.GB19804@rapoport-lnxLink: http://lkml.kernel.org/r/1485542673-24387-4-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:55 -08:00
Kirill A. Shutemov
14fa2daa15 mm, uprobes: convert __replace_page() to use page_vma_mapped_walk()
For consistency, it worth converting all page_check_address() to
page_vma_mapped_walk(), so we could drop the former.

Link: http://lkml.kernel.org/r/20170129173858.45174-10-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:55 -08:00
Kirill A. Shutemov
c8394812e5 uprobes: split THPs before trying to replace them
Patch series "Fix few rmap-related THP bugs", v3.

The patchset fixes handing PTE-mapped THPs in page_referenced() and
page_idle_clear_pte_refs().

To achieve that I've intrdocued new helper -- page_vma_mapped_walk() --
which replaces all page_check_address{,_transhuge}() and covers all THP
cases.

Patchset overview:
  - First patch fixes one uprobe bug (unrelated to the rest of the
    patchset, just spotted it at the same time);

  - Patches 2-5 fix handling PTE-mapped THPs in page_referenced(),
    page_idle_clear_pte_refs() and rmap core;

  - Patches 6-12 convert all page_check_address{,_transhuge}() users
    (plus remove_migration_pte()) to page_vma_mapped_walk() and drop
    unused helpers.

I think the fixes are not critical enough for stable@ as they don't lead
to crashes or hangs, only suboptimal behaviour.

This patch (of 12):

For THPs page_check_address() always fails.  It leads to endless loop in
uprobe_write_opcode().

Testcase with huge-tmpfs (uprobes cannot probe anonymous memory).

	mount -t debugfs none /sys/kernel/debug
	mount -t tmpfs -o huge=always none /mnt
	gcc -Wall -O2 -o /mnt/test -x c - <<EOF
	int main(void)
	{
		return 0;
	}
	/* Padding to map the code segment with huge pmd */
	asm (".zero 2097152");
	EOF
	echo 'p /mnt/test:0' > /sys/kernel/debug/tracing/uprobe_events
	echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
	/mnt/test

Let's split THPs before trying to replace.

Link: http://lkml.kernel.org/r/20170129173858.45174-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:55 -08:00
Dave Jiang
11bac80004 mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmf
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
take a vma and vmf parameter when the vma already resides in vmf.

Remove the vma parameter to simplify things.

[arnd@arndb.de: fix ARM build]
  Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:54 -08:00
Dan Williams
b5d24fda9c mm, devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin, done}
The mem_hotplug_{begin,done} lock coordinates with {get,put}_online_mems()
to hold off "readers" of the current state of memory from new hotplug
actions.  mem_hotplug_begin() expects exclusive access, via the
device_hotplug lock, to set mem_hotplug.active_writer.  Calling
mem_hotplug_begin() without locking device_hotplug can lead to
corrupting mem_hotplug.refcount and missed wakeups / soft lockups.

[dan.j.williams@intel.com: v2]
  Link: http://lkml.kernel.org/r/148728203365.38457.17804568297887708345.stgit@dwillia2-desk3.amr.corp.intel.com
Link: http://lkml.kernel.org/r/148693885680.16345.17802627926777862337.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: f931ab479d ("mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:53 -08:00
Linus Torvalds
f8e6859ea9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
Pull sparc updates from David Miller:

 1) Support multiple huge page sizes, from Nitin Gupta.

 2) Improve boot time on large memory configurations, from Pavel
    Tatashin.

 3) Make BRK handling more consistent and documented, from Vijay Kumar.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc64: Fix build error in flush_tsb_user_page
  sparc64: memblock resizes are not handled properly
  sparc64: use latency groups to improve add_node_ranges speed
  sparc64: Add 64K page size support
  sparc64: Multi-page size support
  Documentation/sparc: Steps for sending break on sunhv console
  sparc64: Send break twice from console to return to boot prom
  sparc64: Migrate hvcons irq to panicked cpu
  sparc64: Set cpu state to offline when stopped
  sunvdc: Add support for setting physical sector size
  sparc64: fix for user probes in high memory
  sparc: topology_64.h: Fix condition for including cpudata.h
  sparc32: mm: srmmu: add __ro_after_init to sparc32_cachetlb_ops structures
2017-02-24 15:01:17 -08:00
Konstantin Khlebnikov
96b777452d sched/cgroup: Move sched_online_group() back into css_online() to fix crash
Commit:

  2f5177f0fd ("sched/cgroup: Fix/cleanup cgroup teardown/init")

.. moved sched_online_group() from css_online() to css_alloc().
It exposes half-baked task group into global lists before initializing
generic cgroup stuff.

LTP testcase (third in cgroup_regression_test) written for testing
similar race in kernels 2.6.26-2.6.28 easily triggers this oops:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  IP: kernfs_path_from_node_locked+0x260/0x320
  CPU: 1 PID: 30346 Comm: cat Not tainted 4.10.0-rc5-test #4
  Call Trace:
  ? kernfs_path_from_node+0x4f/0x60
  kernfs_path_from_node+0x3e/0x60
  print_rt_rq+0x44/0x2b0
  print_rt_stats+0x7a/0xd0
  print_cpu+0x2fc/0xe80
  ? __might_sleep+0x4a/0x80
  sched_debug_show+0x17/0x30
  seq_read+0xf2/0x3b0
  proc_reg_read+0x42/0x70
  __vfs_read+0x28/0x130
  ? security_file_permission+0x9b/0xc0
  ? rw_verify_area+0x4e/0xb0
  vfs_read+0xa5/0x170
  SyS_read+0x46/0xa0
  entry_SYSCALL_64_fastpath+0x1e/0xad

Here the task group is already linked into the global RCU-protected 'task_groups'
list, but the css->cgroup pointer is still NULL.

This patch reverts this chunk and moves online back to css_online().

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2f5177f0fd ("sched/cgroup: Fix/cleanup cgroup teardown/init")
Link: http://lkml.kernel.org/r/148655324740.424917.5302984537258726349.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-24 09:25:28 +01:00
Wanpeng Li
a499c3ead8 sched/fair: Update rq clock before changing a task's CPU affinity
This is triggered during boot when CONFIG_SCHED_DEBUG is enabled:

 ------------[ cut here ]------------
 WARNING: CPU: 6 PID: 81 at kernel/sched/sched.h:812 set_next_entity+0x11d/0x380
 rq->clock_update_flags < RQCF_ACT_SKIP
 CPU: 6 PID: 81 Comm: torture_shuffle Not tainted 4.10.0+ #1
 Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
 Call Trace:
  dump_stack+0x85/0xc2
  __warn+0xcb/0xf0
  warn_slowpath_fmt+0x5f/0x80
  set_next_entity+0x11d/0x380
  set_curr_task_fair+0x2b/0x60
  do_set_cpus_allowed+0x139/0x180
  __set_cpus_allowed_ptr+0x113/0x260
  set_cpus_allowed_ptr+0x10/0x20
  torture_shuffle+0xfd/0x180
  kthread+0x10f/0x150
  ? torture_shutdown_init+0x60/0x60
  ? kthread_create_on_node+0x60/0x60
  ret_from_fork+0x31/0x40
 ---[ end trace dd94d92344cea9c6 ]---

The task is running && !queued, so there is no rq clock update before calling
set_curr_task().

This patch fixes it by updating rq clock after holding rq->lock/pi_lock
just as what other dequeue + put_prev + enqueue + set_curr story does.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1487749975-5994-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-24 08:58:33 +01:00
Peter Zijlstra
8cb68b343a sched/core: Fix update_rq_clock() splat on hotplug (and suspend/resume)
The hotplug code still triggers the warning about using a stale
rq->clock value.

Fix things up to actually run update_rq_clock() in a place where we
record the 'UPDATED' flag, and then modify the annotation to retain
this flag over the rq->lock fiddling that happens as a result of
actually migrating all the tasks elsewhere.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Mike Galbraith <efault@gmx.de>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ross Zwisler <zwisler@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 4d25b35ea3 ("sched/fair: Restore previous rq_flags when migrating tasks in hotplug")
Link: http://lkml.kernel.org/r/20170202155506.GX6515@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-24 08:58:33 +01:00
Tan Xiaojun
1572e45a92 perf/core: Fix the perf_cpu_time_max_percent check
Use "proc_dointvec_minmax" instead of "proc_dointvec" to check the input
value from user-space.

If not, we can set a big value and some vars will overflow like
"sysctl_perf_event_sample_rate" which will cause a lot of unexpected
problems.

Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <acme@kernel.org>
Cc: <alexander.shishkin@linux.intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1487829879-56237-1-git-send-email-tanxiaojun@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-24 08:56:33 +01:00
Peter Zijlstra
7bbba0eb1a perf/core: Fix perf_event_enable_on_exec() timekeeping (again)
Where commit:

  7fce250915 ("perf: Fix scaling vs.  perf_event_enable_on_exec()")

disabled the ctx-time a-priory, such that all events get enabled and
scheduled at the time point in time, there is one hole in that patch,
when no events do get enabled nothing re-enables the ctx-time.

Reported-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 7fce250915 ("perf: Fix scaling vs.  perf_event_enable_on_exec()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-24 08:56:32 +01:00
Peter Zijlstra
279b5165ff perf/core: Remove confusing comment and move put_ctx()
Since commit:

  321027c1fe ("perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race")

... the code looks like (assuming move_group==1):

  gctx = __perf_event_ctx_lock_double(group_leader, ctx);

  perf_remove_from_context(group_leader, 0);
  list_for_each_entry(sibling, &group_leader->sibling_list, group_entry) {
	perf_remove_from_context(sibling, 0);
	put_ctx(gctx);
  }

  /* ... */

  /* misleading comment about how this is the last reference */
  put_ctx(gctx);

  perf_event_ctx_unlock(group_leader, gctx);

What that 'last' put_ctx() does is drop @group_leader's reference on
gctx after having dropped all its potential sibling references.

But the thing is that __perf_event_ctx_lock_double() returns with a
reference _and_ a held lock, and perf_event_ctx_unlock() unlocks that
lock and drops that reference. Therefore that put_ctx() cannot be the
'last' of anything, nor is there an unbalance in puts.

To reduce confusion, remove the comment and place the put_ctx() next
to the remove_from_context() call.

Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-24 08:56:32 +01:00
Linus Torvalds
f1ef09fde1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "There is a lot here. A lot of these changes result in subtle user
  visible differences in kernel behavior. I don't expect anything will
  care but I will revert/fix things immediately if any regressions show
  up.

  From Seth Forshee there is a continuation of the work to make the vfs
  ready for unpriviled mounts. We had thought the previous changes
  prevented the creation of files outside of s_user_ns of a filesystem,
  but it turns we missed the O_CREAT path. Ooops.

  Pavel Tikhomirov and Oleg Nesterov worked together to fix a long
  standing bug in the implemenation of PR_SET_CHILD_SUBREAPER where only
  children that are forked after the prctl are considered and not
  children forked before the prctl. The only known user of this prctl
  systemd forks all children after the prctl. So no userspace
  regressions will occur. Holding earlier forked children to the same
  rules as later forked children creates a semantic that is sane enough
  to allow checkpoing of processes that use this feature.

  There is a long delayed change by Nikolay Borisov to limit inotify
  instances inside a user namespace.

  Michael Kerrisk extends the API for files used to maniuplate
  namespaces with two new trivial ioctls to allow discovery of the
  hierachy and properties of namespaces.

  Konstantin Khlebnikov with the help of Al Viro adds code that when a
  network namespace exits purges it's sysctl entries from the dcache. As
  in some circumstances this could use a lot of memory.

  Vivek Goyal fixed a bug with stacked filesystems where the permissions
  on the wrong inode were being checked.

  I continue previous work on ptracing across exec. Allowing a file to
  be setuid across exec while being ptraced if the tracer has enough
  credentials in the user namespace, and if the process has CAP_SETUID
  in it's own namespace. Proc files for setuid or otherwise undumpable
  executables are now owned by the root in the user namespace of their
  mm. Allowing debugging of setuid applications in containers to work
  better.

  A bug I introduced with permission checking and automount is now
  fixed. The big change is to mark the mounts that the kernel initiates
  as a result of an automount. This allows the permission checks in sget
  to be safely suppressed for this kind of mount. As the permission
  check happened when the original filesystem was mounted.

  Finally a special case in the mount namespace is removed preventing
  unbounded chains in the mount hash table, and making the semantics
  simpler which benefits CRIU.

  The vfs fix along with related work in ima and evm I believe makes us
  ready to finish developing and merge fully unprivileged mounts of the
  fuse filesystem. The cleanups of the mount namespace makes discussing
  how to fix the worst case complexity of umount. The stacked filesystem
  fixes pave the way for adding multiple mappings for the filesystem
  uids so that efficient and safer containers can be implemented"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc/sysctl: Don't grab i_lock under sysctl_lock.
  vfs: Use upper filesystem inode in bprm_fill_uid()
  proc/sysctl: prune stale dentries during unregistering
  mnt: Tuck mounts under others instead of creating shadow/side mounts.
  prctl: propagate has_child_subreaper flag to every descendant
  introduce the walk_process_tree() helper
  nsfs: Add an ioctl() to return owner UID of a userns
  fs: Better permission checking for submounts
  exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
  vfs: open() with O_CREAT should not create inodes with unknown ids
  nsfs: Add an ioctl() to return the namespace type
  proc: Better ownership of files for non-dumpable tasks in user namespaces
  exec: Remove LSM_UNSAFE_PTRACE_CAP
  exec: Test the ptracer's saved cred to see if the tracee can gain caps
  exec: Don't reset euid and egid when the tracee has CAP_SETUID
  inotify: Convert to using per-namespace limits
2017-02-23 20:33:51 -08:00
Linus Torvalds
190c3ee06a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Some 'const'ing in qlogic networking drivers, from Bhumika Goyal.

 2) Fix scheduling while atomic in l2tp network namespace exit by
    deferring the work to the workqueue. From Ridge Kennedy.

 3) Fix use after free in dccp timewait handling, from Andrey Ryabinin.

 4) mlx5e CQE compression engine not initialized properly, from Tariq
    Toukan.

 5) Some UAPI header fixes from Dmitry V. Levin.

 6) Don't overwrite module parameter value in mlx4 driver, from Majd
    Dibbiny.

 7) Fix divide by zero in xt_hashlimit netfilter module, from Alban
    Browaeys.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits)
  bpf: Fix bpf_xdp_event_output
  net/mlx4_en: Use __skb_fill_page_desc()
  net/mlx4_core: Use cq quota in SRIOV when creating completion EQs
  net/mlx4_core: Fix VF overwrite of module param which disables DMFS on new probed PFs
  net/mlx4: Spoofcheck and zero MAC can't coexist
  net/mlx4: Change ENOTSUPP to EOPNOTSUPP
  uapi: fix linux/rds.h userspace compilation errors
  uapi: fix linux/seg6.h and linux/seg6_iptunnel.h userspace compilation errors
  lib: Remove string from parman config selection
  forcedeth: Remove return from a void function
  bpf: fix spelling mistake: "proccessed" -> "processed"
  uapi: fix linux/llc.h userspace compilation error
  uapi: fix linux/ip6_tunnel.h userspace compilation errors
  net/mlx5e: Fix wrong CQE decompression
  net/mlx5e: Update MPWQE stride size when modifying CQE compress state
  net/mlx5e: Fix broken CQE compression initialization
  net/mlx5e: Do not reduce LRO WQE size when not using build_skb
  net/mlx5e: Register/unregister vport representors on interface attach/detach
  net/mlx5e: s390 system compilation fix
  tcp: account for ts offset only if tsecr not zero
  ...
2017-02-23 11:36:06 -08:00
Vijay Kumar
7db60d05e5 sparc64: Send break twice from console to return to boot prom
Now we can also jump to boot prom from sunhv console by sending
break twice on console for both running and panicked kernel
cases.

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23 08:27:24 -08:00
Colin Ian King
bc1750f366 bpf: fix spelling mistake: "proccessed" -> "processed"
trivial fix to spelling mistake in verbose log message

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23 10:46:08 -05:00
Linus Torvalds
bc49a7831b Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 "142 patches:

   - DAX updates

   - various misc bits

   - OCFS2 updates

   - most of MM"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (142 commits)
  mm/z3fold.c: limit first_num to the actual range of possible buddy indexes
  mm: fix <linux/pagemap.h> stray kernel-doc notation
  zram: remove obsolete sysfs attrs
  mm/memblock.c: remove unnecessary log and clean up
  oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA
  mm: drop unused argument of zap_page_range()
  mm: drop zap_details::check_swap_entries
  mm: drop zap_details::ignore_dirty
  mm, page_alloc: warn_alloc nodemask is NULL when cpusets are disabled
  mm: help __GFP_NOFAIL allocations which do not trigger OOM killer
  mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
  mm: consolidate GFP_NOFAIL checks in the allocator slowpath
  lib/show_mem.c: teach show_mem to work with the given nodemask
  arch, mm: remove arch specific show_mem
  mm, page_alloc: warn_alloc print nodemask
  mm, page_alloc: do not report all nodes in show_mem
  Revert "mm: bail out in shrink_inactive_list()"
  mm, vmscan: consider eligible zones in get_scan_count
  mm, vmscan: cleanup lru size claculations
  mm, vmscan: do not count freed pages as PGDEACTIVATE
  ...
2017-02-22 19:29:24 -08:00
Linus Torvalds
b4642c109f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull seccomp fix from James Morris:
 "A fix for a regression in the seccomp code (it was supposed to be in
  the first pull req but I had it queued in the wrong branch)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  seccomp: Only dump core when single-threaded
2017-02-22 18:08:50 -08:00
Linus Torvalds
7d91de7443 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk updates from Petr Mladek:

 - Add Petr Mladek, Sergey Senozhatsky as printk maintainers, and Steven
   Rostedt as the printk reviewer. This idea came up after the
   discussion about printk issues at Kernel Summit. It was formulated
   and discussed at lkml[1].

 - Extend a lock-less NMI per-cpu buffers idea to handle recursive
   printk() calls by Sergey Senozhatsky[2]. It is the first step in
   sanitizing printk as discussed at Kernel Summit.

   The change allows to see messages that would normally get ignored or
   would cause a deadlock.

   Also it allows to enable lockdep in printk(). This already paid off.
   The testing in linux-next helped to discover two old problems that
   were hidden before[3][4].

 - Remove unused parameter by Sergey Senozhatsky. Clean up after a past
   change.

[1] http://lkml.kernel.org/r/1481798878-31898-1-git-send-email-pmladek@suse.com
[2] http://lkml.kernel.org/r/20161227141611.940-1-sergey.senozhatsky@gmail.com
[3] http://lkml.kernel.org/r/20170215044332.30449-1-sergey.senozhatsky@gmail.com
[4] http://lkml.kernel.org/r/20170217015932.11898-1-sergey.senozhatsky@gmail.com

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  printk: drop call_console_drivers() unused param
  printk: convert the rest to printk-safe
  printk: remove zap_locks() function
  printk: use printk_safe buffers in printk
  printk: report lost messages in printk safe/nmi contexts
  printk: always use deferred printk when flush printk_safe lines
  printk: introduce per-cpu safe_print seq buffer
  printk: rename nmi.c and exported api
  printk: use vprintk_func in vprintk()
  MAINTAINERS: Add printk maintainers
2017-02-22 17:33:34 -08:00
Linus Torvalds
6ef192f225 Modules updates for v4.11
Summary of modules changes for the 4.11 merge window:
 
 - A few small code cleanups
 
 - Add modules git tree url to MAINTAINERS
 
 Signed-off-by: Jessica Yu <jeyu@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYrKzPAAoJEMBFfjjOO8Fy8B4QAJpwYokr7a7irVaRt9+c+So5
 iRoQ+ZGB7r0oiJOpuUeVwvr/h5WUCc8kZctGG249Gg9WrT7ypVCKNGOuGv9KUi4g
 lZZ3SkebBfAkzqRABa4VA4uoz6lC4KgaxVeMBZOu0HUmcM4fGmjv8iONj/6eqkpv
 jWSsO2iyTHa5c6L08I2M2unOMG4PqAS7ZS1S58A3A3vG7py9vJhq7gnom4dYHYQW
 2sOGyNvs3RaTyyb/Gvsx/hcs5TPLyr+fzIruqFWzepGcafBSxQy/TdThJw5x5oGU
 QLjP/EqSKQWGyJ/Pzx8UE9bGxxStyJOEEhniyigQvIq1ERkPeXZx+1nllWvBXZ9f
 v+OplyWAzvQNNv+MZEE6s0l7EQDiowOmnpyfHZOQTHky4JwAZ/WwKcjzLsLaRENW
 ePWLsM8F7Hhg9rpXBBEK5USTh0brcaNs6ox0CjlMqme8aNxYBoOUB7KDzlyWQCLd
 rtY6F9upfYmG13J6cDV6qbSirHt5L2aErgOFTbl9ZGQb9rXdsv4VjYUMaVrFgz/5
 9dUNqoZUd4jvLAZT7XSUJsqUKqIUWTE9vFhPiKUDGyptynCwk23VMcd3p/SDjhyL
 kuuIOxYEAi/F+claors16UN6psGb9yHYHsmuTsSbJcBHA+VyavqIuCcVVbKIScVR
 nR+VSRH0vnx3M7hR/OEq
 =zs2F
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull modules updates from Jessica Yu:
 "Summary of modules changes for the 4.11 merge window:

   - A few small code cleanups

   - Add modules git tree url to MAINTAINERS"

* tag 'modules-for-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  MAINTAINERS: add tree for modules
  module: fix memory leak on early load_module() failures
  module: Optimize search_module_extables()
  modules: mark __inittest/__exittest as __maybe_unused
  livepatch/module: print notice of TAINT_LIVEPATCH
  module: Drop redundant declaration of struct module
2017-02-22 17:08:33 -08:00
Pavel Emelyanov
893e26e61d userfaultfd: non-cooperative: Add fork() event
When the mm with uffd-ed vmas fork()-s the respective vmas notify their
uffds with the event which contains a descriptor with new uffd.  This
new descriptor can then be used to get events from the child and
populate its mm with data.  Note, that there can be different uffd-s
controlling different vmas within one mm, so first we should collect all
those uffds (and ctx-s) in a list and then notify them all one by one
but only once per fork().

The context is created at fork() time but the descriptor, file struct
and anon inode object is created at event read time.  So some trickery
is added to the userfaultfd_ctx_read() to handle the ctx queues' locking
vs file creation.

Another thing worth noticing is that the task that fork()-s waits for
the uffd event to get processed WITHOUT the mmap sem.

[aarcange@redhat.com: build warning fix]
  Link: http://lkml.kernel.org/r/20161216144821.5183-10-aarcange@redhat.com
Link: http://lkml.kernel.org/r/20161216144821.5183-9-aarcange@redhat.com
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:28 -08:00
Prarit Bhargava
8dcde9def5 kernel/watchdog.c: do not hardcode CPU 0 as the initial thread
When CONFIG_BOOTPARAM_HOTPLUG_CPU0 is enabled, the socket containing the
boot cpu can be replaced.  During the hot add event, the message

NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.

is output implying that the NMI watchdog was disabled at some point.  This
is not the case and the message has caused confusion for users of systems
that support the removal of the boot cpu socket.

The watchdog code is coded to assume that cpu 0 is always the first cpu to
initialize the watchdog, and the last to stop its watchdog thread.  That
is not the case for initializing if cpu 0 has been removed and added.  The
removal case has never been correct because the smpboot code will remove
the watchdog threads starting with the lowest cpu number.

This patch adds watchdog_cpus to track the number of cpus with active NMI
watchdog threads so that the first and last thread can be used to set and
clear the value of firstcpu_err.  firstcpu_err is set when the first
watchdog thread is enabled, and cleared when the last watchdog thread is
disabled.

Link: http://lkml.kernel.org/r/1480425321-32296-1-git-send-email-prarit@redhat.com
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Joshua Hunt <johunt@akamai.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Babu Moger <babu.moger@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:27 -08:00
Ross Zwisler
d3213e8fd4 tracing: add __print_flags_u64()
Patch series "DAX tracepoints, mm argument simplification", v4.

This contains both my DAX tracepoint code and Dave Jiang's MM argument
simplifications.  Dave's code was written with my tracepoint code as a
baseline, so it seemed simplest to keep them together in a single series.

This patch (of 7):

Add __print_flags_u64() and the helper trace_print_flags_seq_u64() in the
same spirit as __print_symbolic_u64() and trace_print_symbols_seq_u64().
These functions allow us to print symbols associated with flags that are
64 bits wide even on 32 bit machines.

These will be used by the DAX code so that we can print the flags set in a
pfn_t such as PFN_SG_CHAIN, PFN_SG_LAST, PFN_DEV and PFN_MAP.

Without this new function I was getting errors like the following when
compiling for i386:

  include/linux/pfn_t.h:13:22: warning: large integer implicitly truncated to unsigned type [-Woverflow]
   #define PFN_SG_CHAIN (1ULL << (BITS_PER_LONG_LONG - 1))
    ^

Link: http://lkml.kernel.org/r/1484085142-2297-2-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:26 -08:00
Kees Cook
d7276e321f seccomp: Only dump core when single-threaded
The SECCOMP_RET_KILL filter return code has always killed the current
thread, not the entire process. Changing this as a side-effect of dumping
core isn't a safe thing to do (a few test suites have already flagged this
behavioral change). Instead, restore the RET_KILL semantics, but still
dump core when a RET_KILL delivers SIGSYS to a single-threaded process.

Fixes: b25e67161c ("seccomp: dump core when using SECCOMP_RET_KILL")
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
2017-02-23 09:42:35 +11:00
Linus Torvalds
b2064617c7 driver core patches for 4.11-rc1
Here is the "small" driver core patches for 4.11-rc1.
 
 Not much here, some firmware documentation and self-test updates, a
 debugfs code formatting issue, and a new feature for call_usermodehelper
 to make it more robust on systems that want to lock it down in a more
 secure way.
 
 All of these have been linux-next for a while now with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWK2jKg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymCEACgozYuqZZ/TUGW0P3xVNi7fbfUWCEAn3nYExrc
 XgevqeYOSKp2We6X/2JX
 =aZ+5
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the "small" driver core patches for 4.11-rc1.

  Not much here, some firmware documentation and self-test updates, a
  debugfs code formatting issue, and a new feature for call_usermodehelper
  to make it more robust on systems that want to lock it down in a more
  secure way.

  All of these have been linux-next for a while now with no reported
  issues"

* tag 'driver-core-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  kernfs: handle null pointers while printing node name and path
  Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()
  Make static usermode helper binaries constant
  kmod: make usermodehelper path a const string
  firmware: revamp firmware documentation
  selftests: firmware: send expected errors to /dev/null
  selftests: firmware: only modprobe if driver is missing
  platform: Print the resource range if device failed to claim
  kref: prefer atomic_inc_not_zero to atomic_add_unless
  debugfs: improve formatting of debugfs_real_fops()
2017-02-22 11:44:32 -08:00
Linus Torvalds
ca78d3173c arm64 updates for 4.11:
- Errata workarounds for Qualcomm's Falkor CPU
 - Qualcomm L2 Cache PMU driver
 - Qualcomm SMCCC firmware quirk
 - Support for DEBUG_VIRTUAL
 - CPU feature detection for userspace via MRS emulation
 - Preliminary work for the Statistical Profiling Extension
 - Misc cleanups and non-critical fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABCgAGBQJYpIxqAAoJELescNyEwWM0xdwH/AsTYAXPZDMdRnrQUyV0Fd2H
 /9pMzww6dHXEmCMKkImf++otUD6S+gTCJTsj7kEAXT5sZzLk27std5lsW7R9oPjc
 bGQMalZy+ovLR1gJ6v072seM3In4xph/qAYOpD8Q0AfYCLHjfMMArQfoLa8Esgru
 eSsrAgzVAkrK7XHi3sYycUjr9Hac9tvOOuQ3SaZkDz4MfFIbI4b43+c1SCF7wgT9
 tQUHLhhxzGmgxjViI2lLYZuBWsIWsE+algvOe1qocvA9JEIXF+W8NeOuCjdL8WwX
 3aoqYClC+qD/9+/skShFv5gM5fo0/IweLTUNIHADXpB6OkCYDyg+sxNM+xnEWQU=
 =YrPg
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 - Errata workarounds for Qualcomm's Falkor CPU
 - Qualcomm L2 Cache PMU driver
 - Qualcomm SMCCC firmware quirk
 - Support for DEBUG_VIRTUAL
 - CPU feature detection for userspace via MRS emulation
 - Preliminary work for the Statistical Profiling Extension
 - Misc cleanups and non-critical fixes

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (74 commits)
  arm64/kprobes: consistently handle MRS/MSR with XZR
  arm64: cpufeature: correctly handle MRS to XZR
  arm64: traps: correctly handle MRS/MSR with XZR
  arm64: ptrace: add XZR-safe regs accessors
  arm64: include asm/assembler.h in entry-ftrace.S
  arm64: fix warning about swapper_pg_dir overflow
  arm64: Work around Falkor erratum 1003
  arm64: head.S: Enable EL1 (host) access to SPE when entered at EL2
  arm64: arch_timer: document Hisilicon erratum 161010101
  arm64: use is_vmalloc_addr
  arm64: use linux/sizes.h for constants
  arm64: uaccess: consistently check object sizes
  perf: add qcom l2 cache perf events driver
  arm64: remove wrong CONFIG_PROC_SYSCTL ifdef
  ARM: smccc: Update HVC comment to describe new quirk parameter
  arm64: do not trace atomic operations
  ACPI/IORT: Fix the error return code in iort_add_smmu_platform_device()
  ACPI/IORT: Fix iort_node_get_id() mapping entries indexing
  arm64: mm: enable CONFIG_HOLES_IN_ZONE for NUMA
  perf: xgene: Include module.h
  ...
2017-02-22 10:46:44 -08:00
Linus Torvalds
38705613b7 powerpc updates for 4.11 part 1.
Highlights include:
 
  - Support for direct mapped LPC on POWER9, giving Linux direct access to
    devices that may be on there such as a UART.
 
  - Memory hotplug support for the Power9 Radix MMU.
 
  - Add new AUX vectors describing the processor's cache geometry, to be used by
    glibc.
 
  - The ability for a guest to ask the hypervisor to resize the guest's hash
    table, and in addition support for doing so automatically when memory is
    hotplugged into/out-of the guest. This allows the hash table to be sized
    based on the current memory usage of the guest, rather than the maximum
    possible memory usage.
 
  - Implementation of optprobes (kprobe optimisation) for powerpc.
 
 In addition there's the topic branch shared with the KVM tree, which includes
 support for guests to use the Radix MMU on Power9.
 
 Thanks to:
   Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T, Anton Blanchard,
   Benjamin Herrenschmidt, Chris Packham, Daniel Axtens, Daniel Borkmann, David
   Gibson, Finn Thain, Gautham R. Shenoy, Gavin Shan, Greg Kurz, Joel Stanley,
   John Allen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Michael
   Neuling, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Ravi
   Bangoria, Reza Arbab, Shailendra Singh, Vaibhav Jain, Wei Yongjun.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYrWj0AAoJEFHr6jzI4aWAsn4P/08Kz3TtOvDuuPGVNoO7fWOn
 ag5/zVt8R4FuCALqWpAZbVqMuUU4wLxG0RuWlmBNYYhrjMC6JxHpOSjQXxM2D7YT
 CdGTJxG414r6HMOeToL9i/z33o0m+KT07tscer+QMKlXVKCR2z0fEJch+zoPBHYA
 5eVBFpLLTtbNiX9UnrcM/vYz61d56kT4YJey9/8qbAkTAc1rMPa8ucU5UiKYJ7yX
 mF8cd7WE+7aqif9V8yN59G2rcbz+h3pbMw/gzImiYsYrUj4fLjU+VTKL5PPT+UFy
 WWBXD3MAMm1dksMMZi3hgoo2BZhDn3RkymeYi6Jo4kDknNMPZzkMxGyvaJ8Eq5H/
 bIYXdS1AbTtvaaEEuWqDFjpnChOEvj/8IeqitU0jjlql8BVjNKg/ESaaKucZr+pO
 pk2Mfvw0Tb/lxJT5qj27yq4aRsxJwdFOoPYCN7MquPp/wV2Tg5M6h4nVQ4T6Wl0S
 tMFQeCqXflhDWh0Xgr2UXpF66/YTj3Du5LasOTkgGeU30Z8TcNGFEmDWShKP3cEm
 e0oQE+OHhIGN4WSBXBAto/Gw8/0v3pXlMs+VEfeHqenwPss2sWtSwXWe8khmiy9e
 DJ48sTzj75/Zx1fiqRldw9YEnrL+NK0eOOpvzxeyKpfvdUytc+chFfEqUmO6kl3Z
 DW2UmlZxmW+b0SfexCHL
 =Icle
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights include:

   - Support for direct mapped LPC on POWER9, giving Linux direct access
     to devices that may be on there such as a UART.

   - Memory hotplug support for the Power9 Radix MMU.

   - Add new AUX vectors describing the processor's cache geometry, to
     be used by glibc.

   - The ability for a guest to ask the hypervisor to resize the guest's
     hash table, and in addition support for doing so automatically when
     memory is hotplugged into/out-of the guest. This allows the hash
     table to be sized based on the current memory usage of the guest,
     rather than the maximum possible memory usage.

   - Implementation of optprobes (kprobe optimisation) for powerpc.

  In addition there's the topic branch shared with the KVM tree, which
  includes support for guests to use the Radix MMU on Power9.

  Thanks to:
    Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T, Anton
    Blanchard, Benjamin Herrenschmidt, Chris Packham, Daniel Axtens,
    Daniel Borkmann, David Gibson, Finn Thain, Gautham R. Shenoy, Gavin
    Shan, Greg Kurz, Joel Stanley, John Allen, Madhavan Srinivasan,
    Mahesh Salgaonkar, Markus Elfring, Michael Neuling, Nathan Fontenot,
    Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Ravi Bangoria, Reza
    Arbab, Shailendra Singh, Vaibhav Jain, Wei Yongjun"

* tag 'powerpc-4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (129 commits)
  powerpc/mm/radix: Skip ptesync in pte update helpers
  powerpc/mm/radix: Use ptep_get_and_clear_full when clearing pte for full mm
  powerpc/mm/radix: Update pte update sequence for pte clear case
  powerpc/mm: Update PROTFAULT handling in the page fault path
  powerpc/xmon: Fix data-breakpoint
  powerpc/mm: Fix build break with BOOK3S_64=n and MEMORY_HOTPLUG=y
  powerpc/mm: Fix build break when CMA=n && SPAPR_TCE_IOMMU=y
  powerpc/mm: Fix build break with RADIX=y & HUGETLBFS=n
  powerpc/pseries: Fix typo in parameter description
  powerpc/kprobes: Remove kprobe_exceptions_notify()
  kprobes: Introduce weak variant of kprobe_exceptions_notify()
  powerpc/ftrace: Fix confusing help text for DISABLE_MPROFILE_KERNEL
  powerpc/powernv: Fix opal_exit tracepoint opcode
  powerpc: Add a prototype for mcount() so it can be versioned
  powerpc: Drop GPL from of_node_to_nid() export to match other arches
  powerpc/kprobes: Optimize kprobe in kretprobe_trampoline()
  powerpc/kprobes: Implement Optprobes
  powerpc/kprobes: Fixes for kprobe_lookup_name() on BE
  powerpc: Add helper to check if offset is within relative branch range
  powerpc/bpf: Introduce __PPC_SH64()
  ...
2017-02-22 10:30:38 -08:00
Linus Torvalds
3051bf36c2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Support TX_RING in AF_PACKET TPACKET_V3 mode, from Sowmini
      Varadhan.

   2) Simplify classifier state on sk_buff in order to shrink it a bit.
      From Willem de Bruijn.

   3) Introduce SIPHASH and it's usage for secure sequence numbers and
      syncookies. From Jason A. Donenfeld.

   4) Reduce CPU usage for ICMP replies we are going to limit or
      suppress, from Jesper Dangaard Brouer.

   5) Introduce Shared Memory Communications socket layer, from Ursula
      Braun.

   6) Add RACK loss detection and allow it to actually trigger fast
      recovery instead of just assisting after other algorithms have
      triggered it. From Yuchung Cheng.

   7) Add xmit_more and BQL support to mvneta driver, from Simon Guinot.

   8) skb_cow_data avoidance in esp4 and esp6, from Steffen Klassert.

   9) Export MPLS packet stats via netlink, from Robert Shearman.

  10) Significantly improve inet port bind conflict handling, especially
      when an application is restarted and changes it's setting of
      reuseport. From Josef Bacik.

  11) Implement TX batching in vhost_net, from Jason Wang.

  12) Extend the dummy device so that VF (virtual function) features,
      such as configuration, can be more easily tested. From Phil
      Sutter.

  13) Avoid two atomic ops per page on x86 in bnx2x driver, from Eric
      Dumazet.

  14) Add new bpf MAP, implementing a longest prefix match trie. From
      Daniel Mack.

  15) Packet sample offloading support in mlxsw driver, from Yotam Gigi.

  16) Add new aquantia driver, from David VomLehn.

  17) Add bpf tracepoints, from Daniel Borkmann.

  18) Add support for port mirroring to b53 and bcm_sf2 drivers, from
      Florian Fainelli.

  19) Remove custom busy polling in many drivers, it is done in the core
      networking since 4.5 times. From Eric Dumazet.

  20) Support XDP adjust_head in virtio_net, from John Fastabend.

  21) Fix several major holes in neighbour entry confirmation, from
      Julian Anastasov.

  22) Add XDP support to bnxt_en driver, from Michael Chan.

  23) VXLAN offloads for enic driver, from Govindarajulu Varadarajan.

  24) Add IPVTAP driver (IP-VLAN based tap driver) from Sainath Grandhi.

  25) Support GRO in IPSEC protocols, from Steffen Klassert"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1764 commits)
  Revert "ath10k: Search SMBIOS for OEM board file extension"
  net: socket: fix recvmmsg not returning error from sock_error
  bnxt_en: use eth_hw_addr_random()
  bpf: fix unlocking of jited image when module ronx not set
  arch: add ARCH_HAS_SET_MEMORY config
  net: napi_watchdog() can use napi_schedule_irqoff()
  tcp: Revert "tcp: tcp_probe: use spin_lock_bh()"
  net/hsr: use eth_hw_addr_random()
  net: mvpp2: enable building on 64-bit platforms
  net: mvpp2: switch to build_skb() in the RX path
  net: mvpp2: simplify MVPP2_PRS_RI_* definitions
  net: mvpp2: fix indentation of MVPP2_EXT_GLOBAL_CTRL_DEFAULT
  net: mvpp2: remove unused register definitions
  net: mvpp2: simplify mvpp2_bm_bufs_add()
  net: mvpp2: drop useless fields in mvpp2_bm_pool and related code
  net: mvpp2: remove unused 'tx_skb' field of 'struct mvpp2_tx_queue'
  net: mvpp2: release reference to txq_cpu[] entry after unmapping
  net: mvpp2: handle too large value in mvpp2_rx_time_coal_set()
  net: mvpp2: handle too large value handling in mvpp2_rx_pkts_coal_set()
  net: mvpp2: remove useless arguments in mvpp2_rx_{pkts, time}_coal_set
  ...
2017-02-22 10:15:09 -08:00
Linus Torvalds
7bb033829e This renames the (now inaccurate) CONFIG_DEBUG_RODATA and related config
CONFIG_SET_MODULE_RONX to the more sensible CONFIG_STRICT_KERNEL_RWX and
 CONFIG_STRICT_MODULE_RWX.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJYrJ2ZAAoJEIly9N/cbcAmb4UQAIDnJYF4xecUfxofypQwt7ey
 DcR8SH+g/Rkm3v2bUOrVdlP333ePRUEs6C47PgYSLlKsZiQA3H6bsTILHJZGHZ3j
 laNH4sjQ0j+Sr2rHXk8fLz3YpHHwIy49bfu2ERXFH92BMnTMCv1h9IWFgOMH+4y5
 09n16TPHMUj1k0DGjHO/n03qLIKOo3Xy/Va5dhQ/6dGU4zR4KhOBnhLlG3IU7Atd
 KTR+ba/qym7bDQbTezMuaajTiZctr6a45yBKDWu4Knu+ot2a7K7fYvfRT3YVb5SU
 aTSYps7NKQbewcQSqNdek1zytoy2Ck7CH511e+3ypwNmao5KQwRgH7OX1pDEXyZv
 rGDaVzKMTSddH23jLEKUbpR847Lza9+V3h5YtbMG8GgiCKs91Ec666iEE3NVZBO8
 1hiiYhE2iDxi10B/EZZcn2gOt2JaB2m2GxWIrJOz4txtDAWbUYlhUpWEUynBTPQ0
 cYBZVnge81awipZJTWUv57LyufnTnMSK3i8Q8t0woj4C7pFbPYfjnKCrgwTQyAvr
 mD4uFBrgFb1lftbc3kfTdeoZmXerzvubsstWdr3rU3nsiJFzY1SwJZe8n0THyL4g
 DzURFrj/8UXb32Kavysz6FTxFO9u87mJm6yqHn/Y3bEK7Y7cch/NYjRC9Q6dpH+4
 ld9apHF6iRrqgf+x6oOh
 =7KhR
 -----END PGP SIGNATURE-----

Merge tag 'rodata-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull rodata updates from Kees Cook:
 "This renames the (now inaccurate) DEBUG_RODATA and related
  SET_MODULE_RONX configs to the more sensible STRICT_KERNEL_RWX and
  STRICT_MODULE_RWX"

* tag 'rodata-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  arch: Rename CONFIG_DEBUG_RODATA and CONFIG_DEBUG_MODULE_RONX
  arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common
2017-02-21 17:56:45 -08:00
Linus Torvalds
6d1c42d9b9 Final extable.h related changes.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYrFn+AAoJEOvOhAQsB9HWbDAP/i1bMxYJSnwD2D/PBvzpY2AW
 uinEigaRr6kpRCOfa9FrfgxokKfosZOx5h7Se3f6O3mPwgpsU+dqbaE18Z5XSgxh
 +a9+HvAv3/XNZg7SvBtBaoYDblHWJ6AJ9rN9fuKg3e8btE3rSFG147vj1atlVz1+
 iRsXcCPb1p5db2+wZdsYJPI5Zwt4N0nR6cxPX4RQ6jseiVqPpt/FDtB60RYCjbID
 J0cOk1VV1Jn2H1Rfl+hjNQjIPMNx3zftOLQ2usr/kwuEqeuTKZR06yLXFOT6bdXU
 6JBdfL+e2kHKbaLyJGr6MCjTokaMgN3SGZJWJqHgk5Nggq5BD+2c4AOs8t6URnE0
 KThGiyY+YI5C/W6kMlEozLARiMKe4IIQpx1uj2Hv+YkndntvqjCqvfdQQJKnzm0G
 YWfPnsG2dysiovwEOBoBwyFVFLFzzJ1o3uyRGkCzVGaLQVzD5ktAJM6ynMOxwcIn
 zSN+agzdTAD7QJIDaa1p2r5fAqy7i4xIn2+ts1s9c410fdUTB4A2QJzTywjPAdCp
 IRxcLLpYDeBZ5cbhqjR677WgPtteYFTljoX+/8BOFO2PI+HjKHrxfW02WSiCS0iu
 CUndrlpmuyKlIrpw7mYpDTbORcSQSiUkB1pRGT7poh2p0KKAGSo9ZrRbx+qRvPdH
 AxO+ZR6Jjj5LAMk5MkRz
 =RK2h
 -----END PGP SIGNATURE-----

Merge tag 'extable-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux

Pull exception table module split from Paul Gortmaker:
 "Final extable.h related changes.

  This completes the separation of exception table content from the
  module.h header file. This is achieved with the final commit that
  removes the one line back compatible change that sourced extable.h
  into the module.h file.

  The commits are unchanged since January, with the exception of a
  couple Acks that came in for the last two commits a bit later. The
  changes have been in linux-next for quite some time[1] and have got
  widespread arch coverage via toolchains I have and also from
  additional ones the kbuild bot has.

  Maintaners of the various arch were Cc'd during the postings to
  lkml[2] and informed that the intention was to take the remaining arch
  specific changes and lump them together with the final two non-arch
  specific changes and submit for this merge window.

  The ia64 diffstat stands out and probably warrants a mention. In an
  earlier review, Al Viro made a valid comment that the original header
  separation of content left something to be desired, and that it get
  fixed as a part of this change, hence the larger diffstat"

* tag 'extable-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (21 commits)
  module.h: remove extable.h include now users have migrated
  core: migrate exception table users off module.h and onto extable.h
  cris: migrate exception table users off module.h and onto extable.h
  hexagon: migrate exception table users off module.h and onto extable.h
  microblaze: migrate exception table users off module.h and onto extable.h
  unicore32: migrate exception table users off module.h and onto extable.h
  score: migrate exception table users off module.h and onto extable.h
  metag: migrate exception table users off module.h and onto extable.h
  arc: migrate exception table users off module.h and onto extable.h
  nios2: migrate exception table users off module.h and onto extable.h
  sparc: migrate exception table users onto extable.h
  openrisc: migrate exception table users off module.h and onto extable.h
  frv: migrate exception table users off module.h and onto extable.h
  sh: migrate exception table users off module.h and onto extable.h
  xtensa: migrate exception table users off module.h and onto extable.h
  mn10300: migrate exception table users off module.h and onto extable.h
  alpha: migrate exception table users off module.h and onto extable.h
  arm: migrate exception table users off module.h and onto extable.h
  m32r: migrate exception table users off module.h and onto extable.h
  ia64: ensure exception table search users include extable.h
  ...
2017-02-21 14:28:55 -08:00
Linus Torvalds
b8989bccd6 Merge branch 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit
Pull audit updates from Paul Moore:
 "The audit changes for v4.11 are relatively small compared to what we
  did for v4.10, both in terms of size and impact.

   - two patches from Steve tweak the formatting for some of the audit
     records to make them more consistent with other audit records.

   - three patches from Richard record the name of a module on module
     load, fix the logging of sockaddr information when using
     socketcall() on 32-bit systems, and add the ability to reset
     audit's lost record counter.

   - my lone patch just fixes an annoying style nit that I was reminded
     about by one of Richard's patches.

  All these patches pass our test suite"

* 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit:
  audit: remove unnecessary curly braces from switch/case statements
  audit: log module name on init_module
  audit: log 32-bit socketcalls
  audit: add feature audit_lost reset
  audit: Make AUDIT_ANOM_ABEND event normalized
  audit: Make AUDIT_KERNEL event conform to the specification
2017-02-21 13:25:50 -08:00
Linus Torvalds
c9341ee0af Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security layer updates from James Morris:
 "Highlights:

   - major AppArmor update: policy namespaces & lots of fixes

   - add /sys/kernel/security/lsm node for easy detection of loaded LSMs

   - SELinux cgroupfs labeling support

   - SELinux context mounts on tmpfs, ramfs, devpts within user
     namespaces

   - improved TPM 2.0 support"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (117 commits)
  tpm: declare tpm2_get_pcr_allocation() as static
  tpm: Fix expected number of response bytes of TPM1.2 PCR Extend
  tpm xen: drop unneeded chip variable
  tpm: fix misspelled "facilitate" in module parameter description
  tpm_tis: fix the error handling of init_tis()
  KEYS: Use memzero_explicit() for secret data
  KEYS: Fix an error code in request_master_key()
  sign-file: fix build error in sign-file.c with libressl
  selinux: allow changing labels for cgroupfs
  selinux: fix off-by-one in setprocattr
  tpm: silence an array overflow warning
  tpm: fix the type of owned field in cap_t
  tpm: add securityfs support for TPM 2.0 firmware event log
  tpm: enhance read_log_of() to support Physical TPM event log
  tpm: enhance TPM 2.0 PCR extend to support multiple banks
  tpm: implement TPM 2.0 capability to get active PCR banks
  tpm: fix RC value check in tpm2_seal_trusted
  tpm_tis: fix iTPM probe via probe_itpm() function
  tpm: Begin the process to deprecate user_read_timer
  tpm: remove tpm_read_index and tpm_write_index from tpm.h
  ...
2017-02-21 12:49:56 -08:00
Luis R. Rodriguez
a5544880af module: fix memory leak on early load_module() failures
While looking for early possible module loading failures I was
able to reproduce a memory leak possible with kmemleak. There
are a few rare ways to trigger a failure:

  o we've run into a failure while processing kernel parameters
    (parse_args() returns an error)
  o mod_sysfs_setup() fails
  o we're a live patch module and copy_module_elf() fails

Chances of running into this issue is really low.

kmemleak splat:

unreferenced object 0xffff9f2c4ada1b00 (size 32):
  comm "kworker/u16:4", pid 82, jiffies 4294897636 (age 681.816s)
  hex dump (first 32 bytes):
    6d 65 6d 73 74 69 63 6b 30 00 00 00 00 00 00 00  memstick0.......
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8c6cfeba>] kmemleak_alloc+0x4a/0xa0
    [<ffffffff8c200046>] __kmalloc_track_caller+0x126/0x230
    [<ffffffff8c1bc581>] kstrdup+0x31/0x60
    [<ffffffff8c1bc5d4>] kstrdup_const+0x24/0x30
    [<ffffffff8c3c23aa>] kvasprintf_const+0x7a/0x90
    [<ffffffff8c3b5481>] kobject_set_name_vargs+0x21/0x90
    [<ffffffff8c4fbdd7>] dev_set_name+0x47/0x50
    [<ffffffffc07819e5>] memstick_check+0x95/0x33c [memstick]
    [<ffffffff8c09c893>] process_one_work+0x1f3/0x4b0
    [<ffffffff8c09cb98>] worker_thread+0x48/0x4e0
    [<ffffffff8c0a2b79>] kthread+0xc9/0xe0
    [<ffffffff8c6dab5f>] ret_from_fork+0x1f/0x40
    [<ffffffffffffffff>] 0xffffffffffffffff

Cc: stable <stable@vger.kernel.org> # v2.6.30
Fixes: e180a6b775 ("param: fix charp parameters set via sysfs")
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Reviewed-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Signed-off-by: Jessica Yu <jeyu@redhat.com>
2017-02-21 12:34:38 -08:00
Linus Torvalds
772c8f6f3b for-4.11/linus-merge-signed
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJYqeb8AAoJEPfTWPspceCmB3UP/3UtcPrzEm8w2cxB9MaWhZN3
 J+jiwlO4vaqhm2HVzQtoJqfaqRlud/iDx5cIXE2S7FnIM54ZKs3CANbKu8X+b1zm
 eJije3zMI8A8qyftigbz6a/Y2kWE4ZqFEc9WU5CWawfTl3ImCVUi8+F5X0wOLU/h
 r50zAQOEyURH4G5usNl9q0olF6FonJ82AcYm1iJ0QP2wYWZRJauC0rRn8IT93tyK
 bZPHnGKdkd7km8yi3zr2GNWOfuZZuA0HWAaF4qfrHPZQ883gITFAUIlFb1f+2TNl
 DkQzRrBB2wPWPnlbfb9KejMkvL94hflzsLb5rHt835DyVXFRyjxsgyAI8A+LPGSz
 vqZ3rsbWj6H4F9z2CkZ+T+AP/ZSWDNjwc0RXPm9HYdR5CDeTxIUVvnFQ44YNsmTv
 Xd5BKrUJ2oKegAxQG6zcuFx23p8JzhT70l+mNrMdtyeKnDD9FRdDvhKG9AHeTipn
 o/DnGivhS3UMQoQ7D68KOO+kuhLDeo7my5XGsnjzMO/iHqg++7IP2HyYYs/Ba4qZ
 cYaCtSDQW71Zt0vsqa6dvPuXBveu4h8Qh8R7uAGjSGS9IAFFb4Cab2tiUdISE6PE
 YnMWzY+G6pT8imlLVOL5/QFuo2Q4pUsaL0AHpXMCN9TZnQtbqXa8eqwnKnQ0m2KN
 7ut0IYYEPaYUX5xFn1K6
 =z7AL
 -----END PGP SIGNATURE-----

Merge tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:

 - blk-mq scheduling framework from me and Omar, with a port of the
   deadline scheduler for this framework. A port of BFQ from Paolo is in
   the works, and should be ready for 4.12.

 - Various fixups and improvements to the above scheduling framework
   from Omar, Paolo, Bart, me, others.

 - Cleanup of the exported sysfs blk-mq data into debugfs, from Omar.
   This allows us to export more information that helps debug hangs or
   performance issues, without cluttering or abusing the sysfs API.

 - Fixes for the sbitmap code, the scalable bitmap code that was
   migrated from blk-mq, from Omar.

 - Removal of the BLOCK_PC support in struct request, and refactoring of
   carrying SCSI payloads in the block layer. This cleans up the code
   nicely, and enables us to kill the SCSI specific parts of struct
   request, shrinking it down nicely. From Christoph mainly, with help
   from Hannes.

 - Support for ranged discard requests and discard merging, also from
   Christoph.

 - Support for OPAL in the block layer, and for NVMe as well. Mainly
   from Scott Bauer, with fixes/updates from various others folks.

 - Error code fixup for gdrom from Christophe.

 - cciss pci irq allocation cleanup from Christoph.

 - Making the cdrom device operations read only, from Kees Cook.

 - Fixes for duplicate bdi registrations and bdi/queue life time
   problems from Jan and Dan.

 - Set of fixes and updates for lightnvm, from Matias and Javier.

 - A few fixes for nbd from Josef, using idr to name devices and a
   workqueue deadlock fix on receive. Also marks Josef as the current
   maintainer of nbd.

 - Fix from Josef, overwriting queue settings when the number of
   hardware queues is updated for a blk-mq device.

 - NVMe fix from Keith, ensuring that we don't repeatedly mark and IO
   aborted, if we didn't end up aborting it.

 - SG gap merging fix from Ming Lei for block.

 - Loop fix also from Ming, fixing a race and crash between setting loop
   status and IO.

 - Two block race fixes from Tahsin, fixing request list iteration and
   fixing a race between device registration and udev device add
   notifiations.

 - Double free fix from cgroup writeback, from Tejun.

 - Another double free fix in blkcg, from Hou Tao.

 - Partition overflow fix for EFI from Alden Tondettar.

* tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block: (156 commits)
  nvme: Check for Security send/recv support before issuing commands.
  block/sed-opal: allocate struct opal_dev dynamically
  block/sed-opal: tone down not supported warnings
  block: don't defer flushes on blk-mq + scheduling
  blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers
  blk-mq: don't special case flush inserts for blk-mq-sched
  blk-mq-sched: don't add flushes to the head of requeue queue
  blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not
  block: do not allow updates through sysfs until registration completes
  lightnvm: set default lun range when no luns are specified
  lightnvm: fix off-by-one error on target initialization
  Maintainers: Modify SED list from nvme to block
  Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
  uapi: sed-opal fix IOW for activate lsp to use correct struct
  cdrom: Make device operations read-only
  elevator: fix loading wrong elevator type for blk-mq devices
  cciss: switch to pci_irq_alloc_vectors
  block/loop: fix race between I/O and set_status
  blk-mq-sched: don't hold queue_lock when calling exit_icq
  block: set make_request_fn manually in blk_mq_update_nr_hw_queues
  ...
2017-02-21 10:57:33 -08:00
Mark Brown
fd4a61e08a sched/core: Fix build paravirt build on arm and arm64
Commit 004172bdad ("sched/core: Remove unnecessary #include headers")
removed the inclusion of asm/paravirt.h which is used to get
declarations of paravirt_steal_rq_enabled and paravirt_steal_clock.

It is implicitly included on x86 but not on arm and arm64 breaking the
build if paravirtualization is used.  Since things from that header are
used directly fix the build by putting the direct inclusion back.

Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-21 10:54:02 -08:00
Linus Torvalds
02c3de1105 Power management updates for v4.11-rc1
- Operating Performance Points (OPP) framework fixes, cleanups and
    switch over from RCU-based synchronization to reference counting
    using krefs (Viresh Kumar, Wei Yongjun, Dave Gerlach).
 
  - cpufreq core cleanups and documentation updates (Viresh Kumar,
    Rafael Wysocki).
 
  - New cpufreq driver for Broadcom BMIPS SoCs (Markus Mayer).
 
  - New cpufreq-dt sub-driver for TI SoCs requiring special handling,
    like in the AM335x, AM437x, DRA7x, and AM57x families, along with
    new DT bindings for it (Dave Gerlach, Paul Gortmaker).
 
  - ARM64 SoCs support for the qoriq cpufreq driver (Tang Yuantian).
 
  - intel_pstate driver updates including a new sysfs knob to control
    the driver's operation mode and fixes related to the no_turbo
    sysfs knob and the hardware-managed P-states feature support
    (Rafael Wysocki, Srinivas Pandruvada).
 
  - New interface to export ultra-turbo frequencies for the powernv
    cpufreq driver (Shilpasri Bhat).
 
  - Assorted fixes for cpufreq drivers (Arnd Bergmann, Dan Carpenter,
    Wei Yongjun).
 
  - devfreq core fixes, mostly related to the sysfs interface exported
    by it (Chanwoo Choi, Chris Diamand).
 
  - Updates of the exynos-bus and exynos-ppmu devfreq drivers (Chanwoo
    Choi).
 
  - Device PM QoS extension to support CPUs and support for per-CPU
    wakeup (device resume) latency constraints in the cpuidle menu
    governor (Alex Shi).
 
  - Wakeup IRQs framework fixes (Grygorii Strashko).
 
  - Generic power domains framework update including a fix to make
    it handle asynchronous invocations of *noirq suspend/resume
    callbacks correctly (Ulf Hansson, Geert Uytterhoeven).
 
  - Assorted fixes and cleanups in the core suspend/hibernate code,
    PM QoS framework and x86 ACPI idle support code (Corentin Labbe,
    Geert Uytterhoeven, Geliang Tang, John Keeping, Nick Desaulniers).
 
  - Update of the analyze_suspend.py script is updated to version 4.5
    offering multiple improvements (Todd Brandt).
 
  - New tool for intel_pstate diagnostics using the pstate_sample
    tracepoint (Doug Smythies).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJYq3IjAAoJEILEb/54YlRx/lYP+gNXhfETSzjd4kWSHy3FVEDb
 gc5rMiE2j0OYgVSXwBI7p4EqMPy56lSWBASvbF2o6v9CIxb880KLFEsMDCVHwn46
 6xfEnIRxf1oeRqn7EG9ZPIcTgNsUyvK+gah7zgLXu/0KU7ceXxygvNk47qpeOZ8f
 dKYgIk/TOSGPC8H2nsg8VBKlK/ZOj5hID4F3MmFw6yDuWVCYuh2EokYXS4Nx0JwY
 UQGpWtz+FWWs71vhgVl33GbPXWvPqA7OMe0btZ3RCnhnz4tA/mH+jDWiaspCdS3J
 vKGeZyZptjIMJcufm3X7s7ghYjELheqQusMODDXk4AaWQ5nz8V5/h7NThYfa9J1b
 M93Tb0rMb2MqUhBpv/M6D3qQroZmhq55QKfQrul3QWSOiQUzTWJcbbpyeBQ7nkrI
 F1qNqQfuCnBL/r9y7HpW8P2iFg9kCHkwTtXMdp/lzGXdKzSGtAUSkYg5ohnUzQTp
 2WCPTEk+5DxLVPjW5rDoZOotr5p1kdcdWBk6r3MEWRokZK6PJo7rJBcnTtXSo2mO
 lLRba006q+fTlI5wZtjAI0rOiS3JgtT6cRx7uPjZlze9TGjklJhdsCPJbM5gcOT+
 YiOxvqD+9if5QRSxiEZNj3bQ43wYhXmpctfIanyxziq09BPIPxvgfRR/BkUzc34R
 ps4CIvImim5v5xc8Zsbk
 =57xJ
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "The majority of changes go into the Operating Performance Points (OPP)
  framework and cpufreq this time, followed by devfreq and some
  scattered updates all over.

  The OPP changes are mostly related to switching over from RCU-based
  synchronization, that turned out to be overly complicated and
  problematic, to reference counting using krefs.

  In the cpufreq land there are core cleanups, documentation updates, a
  new driver for Broadcom BMIPS SoCs, a new cpufreq-dt sub-driver for TI
  SoCs that require special handling, ARM64 SoCs support for the qoriq
  driver, intel_pstate updates, powernv driver update and assorted
  fixes.

  The devfreq changes are mostly fixes related to the sysfs interface
  and some Exynos drivers updates.

  Apart from that, the cpuidle menu governor will support per-CPU PM QoS
  constraints for the wakeup latency now, some bugs in the wakeup IRQs
  framework are fixed, the generic power domains framework should handle
  asynchronous invocations of *noirq suspend/resume callbacks from now
  on, the analyze_suspend.py script is updated and there is a new tool
  for intel_pstate diagnostics.

  Specifics:

   - Operating Performance Points (OPP) framework fixes, cleanups and
     switch over from RCU-based synchronization to reference counting
     using krefs (Viresh Kumar, Wei Yongjun, Dave Gerlach)

   - cpufreq core cleanups and documentation updates (Viresh Kumar,
     Rafael Wysocki)

   - New cpufreq driver for Broadcom BMIPS SoCs (Markus Mayer)

   - New cpufreq-dt sub-driver for TI SoCs requiring special handling,
     like in the AM335x, AM437x, DRA7x, and AM57x families, along with
     new DT bindings for it (Dave Gerlach, Paul Gortmaker)

   - ARM64 SoCs support for the qoriq cpufreq driver (Tang Yuantian)

   - intel_pstate driver updates including a new sysfs knob to control
     the driver's operation mode and fixes related to the no_turbo sysfs
     knob and the hardware-managed P-states feature support (Rafael
     Wysocki, Srinivas Pandruvada)

   - New interface to export ultra-turbo frequencies for the powernv
     cpufreq driver (Shilpasri Bhat)

   - Assorted fixes for cpufreq drivers (Arnd Bergmann, Dan Carpenter,
     Wei Yongjun)

   - devfreq core fixes, mostly related to the sysfs interface exported
     by it (Chanwoo Choi, Chris Diamand)

   - Updates of the exynos-bus and exynos-ppmu devfreq drivers (Chanwoo
     Choi)

   - Device PM QoS extension to support CPUs and support for per-CPU
     wakeup (device resume) latency constraints in the cpuidle menu
     governor (Alex Shi)

   - Wakeup IRQs framework fixes (Grygorii Strashko)

   - Generic power domains framework update including a fix to make it
     handle asynchronous invocations of *noirq suspend/resume callbacks
     correctly (Ulf Hansson, Geert Uytterhoeven)

   - Assorted fixes and cleanups in the core suspend/hibernate code, PM
     QoS framework and x86 ACPI idle support code (Corentin Labbe, Geert
     Uytterhoeven, Geliang Tang, John Keeping, Nick Desaulniers)

   - Update of the analyze_suspend.py script is updated to version 4.5
     offering multiple improvements (Todd Brandt)

   - New tool for intel_pstate diagnostics using the pstate_sample
     tracepoint (Doug Smythies)"

* tag 'pm-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (85 commits)
  MAINTAINERS: cpufreq: add bmips-cpufreq.c
  PM / QoS: Fix memory leak on resume_latency.notifiers
  PM / Documentation: Spelling s/wrtie/write/
  PM / sleep: Fix test_suspend after sleep state rework
  cpufreq: CPPC: add ACPI_PROCESSOR dependency
  cpufreq: make ti-cpufreq explicitly non-modular
  cpufreq: Do not clear real_cpus mask on policy init
  tools/power/x86: Debug utility for intel_pstate driver
  AnalyzeSuspend: fix drag and zoom bug in javascript
  PM / wakeirq: report a wakeup_event on dedicated wekup irq
  PM / wakeirq: Fix spurious wake-up events for dedicated wakeirqs
  PM / wakeirq: Enable dedicated wakeirq for suspend
  cpufreq: dt: Don't use generic platdev driver for ti-cpufreq platforms
  cpufreq: ti: Add cpufreq driver to determine available OPPs at runtime
  Documentation: dt: add bindings for ti-cpufreq
  PM / OPP: Expose _of_get_opp_desc_node as dev_pm_opp API
  cpufreq: qoriq: Don't look at clock implementation details
  cpufreq: qoriq: add ARM64 SoCs support
  PM / Domains: Provide dummy governors if CONFIG_PM_GENERIC_DOMAINS=n
  cpufreq: brcmstb-avs-cpufreq: remove unnecessary platform_set_drvdata()
  ...
2017-02-20 17:41:31 -08:00
Linus Torvalds
ebb4949eb3 IOMMU Updates for Linux v4.11
The changes include:
 
 	* KVM PCIe/MSI passthrough support on ARM/ARM64
 
 	* Introduction of a core representation for individual hardware
 	  iommus
 
 	* Support for IOMMU privileged mappings as supported by some
 	  ARM IOMMUS
 
 	* 16-bit SID support for ARM-SMMUv2
 
 	* Stream table optimization for ARM-SMMUv3
 
 	* Various fixes and other small improvements
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJYqw3hAAoJECvwRC2XARrjPy0P/35ykfHIAESJuF+72ziaoAYA
 ZvMrli8rGq7n+ntaIGPx9rV+hZTUSF8V2bfsYV7SAn5iYuViXZqvOtC3BAEp6GNC
 cdMeQfqXoHiWVMdXdOihzk+6YCQvBxqPOvUtYFqVhOo3Yrz8Dc71KsKvrTndEUVY
 f7bXHKssVONkWMga9lIVDgEefG5VyJPEQaxJXB9ymLHXbwWOcISe1lgtkrzFSxSH
 H9YNI07Tfcxfn6rN8jGmcYFYM58xwBicpB4HBw5uytMBYAsxqTEsx4X5dGpOF6RH
 cFW9nby+9ZlcTMyuWXKAck3o8df2ZC1xiSjnz+DHQdBPFiFNqIL3PVUcaz9PnF2e
 e6Y+DA3s+jykeiCvi2K0Z9RwTg7t8S5spel+UCeNVSnIjE9pqZNLF8vsDjF17zuR
 +zcFm7RVI397QVQGp0dbqhtxnwqt/3CX/wlzpvuNdEZa4vwujpcnM9tfl6gyFrF8
 awK9Fj5ryAn4DEiM+8yiRHwLrU5ij1cfc8jQdqleEB2ca7Wv3g1uhhS0QTXOFY9u
 A7ygOna25U1EcOwjC6ebjiEL115ZEOrXo+eChhzCHoUEHCVxL+L/NAMEsUcMqPIw
 3XsHhru0HbXgd5O5wHX39s2je8G3+ElqQwy8Ja3DimV6tvon7yaKCXy9QU+2aa1u
 3r53R/0mW1ijtOfK+I0b
 =5b3I
 -----END PGP SIGNATURE-----

Merge tag 'iommu-updates-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull IOMMU UPDATES from Joerg Roedel:

 - KVM PCIe/MSI passthrough support on ARM/ARM64

 - introduction of a core representation for individual hardware iommus

 - support for IOMMU privileged mappings as supported by some ARM IOMMUS

 - 16-bit SID support for ARM-SMMUv2

 - stream table optimization for ARM-SMMUv3

 - various fixes and other small improvements

* tag 'iommu-updates-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (61 commits)
  vfio/type1: Fix error return code in vfio_iommu_type1_attach_group()
  iommu: Remove iommu_register_instance interface
  iommu/exynos: Make use of iommu_device_register interface
  iommu/mediatek: Make use of iommu_device_register interface
  iommu/msm: Make use of iommu_device_register interface
  iommu/arm-smmu: Make use of the iommu_register interface
  iommu: Add iommu_device_set_fwnode() interface
  iommu: Make iommu_device_link/unlink take a struct iommu_device
  iommu: Add sysfs bindings for struct iommu_device
  iommu: Introduce new 'struct iommu_device'
  iommu: Rename struct iommu_device
  iommu: Rename iommu_get_instance()
  iommu: Fix static checker warning in iommu_insert_device_resv_regions
  iommu: Avoid unnecessary assignment of dev->iommu_fwspec
  iommu/mediatek: Remove bogus 'select' statements
  iommu/dma: Remove bogus dma_supported() implementation
  iommu/ipmmu-vmsa: Restrict IOMMU Domain Geometry to 32-bit address space
  iommu/vt-d: Don't over-free page table directories
  iommu/vt-d: Tylersburg isoch identity map check is done too late.
  iommu/vt-d: Fix some macros that are incorrectly specified in intel-iommu
  ...
2017-02-20 16:42:43 -08:00
Linus Torvalds
42e1b14b6e Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Implement wraparound-safe refcount_t and kref_t types based on
     generic atomic primitives (Peter Zijlstra)

   - Improve and fix the ww_mutex code (Nicolai Hähnle)

   - Add self-tests to the ww_mutex code (Chris Wilson)

   - Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr
     Bueso)

   - Micro-optimize the current-task logic all around the core kernel
     (Davidlohr Bueso)

   - Tidy up after recent optimizations: remove stale code and APIs,
     clean up the code (Waiman Long)

   - ... plus misc fixes, updates and cleanups"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
  fork: Fix task_struct alignment
  locking/spinlock/debug: Remove spinlock lockup detection code
  lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
  lkdtm: Convert to refcount_t testing
  kref: Implement 'struct kref' using refcount_t
  refcount_t: Introduce a special purpose refcount type
  sched/wake_q: Clarify queue reinit comment
  sched/wait, rcuwait: Fix typo in comment
  locking/mutex: Fix lockdep_assert_held() fail
  locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
  locking/rwsem: Reinit wake_q after use
  locking/rwsem: Remove unnecessary atomic_long_t casts
  jump_labels: Move header guard #endif down where it belongs
  locking/atomic, kref: Implement kref_put_lock()
  locking/ww_mutex: Turn off __must_check for now
  locking/atomic, kref: Avoid more abuse
  locking/atomic, kref: Use kref_get_unless_zero() more
  locking/atomic, kref: Kill kref_sub()
  locking/atomic, kref: Add kref_read()
  locking/atomic, kref: Add KREF_INIT()
  ...
2017-02-20 13:23:30 -08:00
Linus Torvalds
828cad8ea0 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this (fairly busy) cycle were:

   - There was a class of scheduler bugs related to forgetting to update
     the rq-clock timestamp which can cause weird and hard to debug
     problems, so there's a new debug facility for this: which uncovered
     a whole lot of bugs which convinced us that we want to keep the
     debug facility.

     (Peter Zijlstra, Matt Fleming)

   - Various cputime related updates: eliminate cputime and use u64
     nanoseconds directly, simplify and improve the arch interfaces,
     implement delayed accounting more widely, etc. - (Frederic
     Weisbecker)

   - Move code around for better structure plus cleanups (Ingo Molnar)

   - Move IO schedule accounting deeper into the scheduler plus related
     changes to improve the situation (Tejun Heo)

   - ... plus a round of sched/rt and sched/deadline fixes, plus other
     fixes, updats and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits)
  sched/core: Remove unlikely() annotation from sched_move_task()
  sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]
  sched/topology: Split out scheduler topology code from core.c into topology.c
  sched/core: Remove unnecessary #include headers
  sched/rq_clock: Consolidate the ordering of the rq_clock methods
  delayacct: Include <uapi/linux/taskstats.h>
  sched/core: Clean up comments
  sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
  sched/clock: Add dummy clear_sched_clock_stable() stub function
  sched/cputime: Remove generic asm headers
  sched/cputime: Remove unused nsec_to_cputime()
  s390, sched/cputime: Remove unused cputime definitions
  powerpc, sched/cputime: Remove unused cputime definitions
  s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs
  ia64, sched/cputime: Remove unused cputime definitions
  ia64: Convert vtime to use nsec units directly
  ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it
  sched/cputime: Remove jiffies based cputime
  sched/cputime, vtime: Return nsecs instead of cputime_t to account
  sched/cputime: Complete nsec conversion of tick based accounting
  ...
2017-02-20 12:52:55 -08:00
Linus Torvalds
7f4eb0a6d5 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "On the kernel side the main changes in this cycle were:

   - Add Intel Kaby Lake CPU support (Srinivas Pandruvada)

   - AMD uncore driver updates for fam17 (Janakarajan Natarajan)

   - Intel/PT updates and core events optimizations and cleanups
     (Alexander Shishkin)

   - cgroups events fixes (David Carrillo-Cisneros)

   - kprobes improvements (Masami Hiramatsu)

   - ... plus misc fixes and updates.

  On the tooling side the main changes were:

   - Support clang build in tools/{perf,lib/{bpf,traceevent,api}} with
     CC=clang, to, for instance, take advantage of better warnings
     (Arnaldo Carvalho de Melo):

   - Introduce the 'delta-abs' 'perf diff' compute method, that orders
     the histogram entries by the absolute value of the percentage delta
     for a function in two perf.data files, i.e. the functions that
     changed the most (increase or decrease in samples) comes first
     (Namhyung Kim)

   - Add support for parsing Intel uncore vendor event files and add
     uncore vendor events for the Intel server processors (Haswell,
     Broadwell, IvyBridge), Xeon Phi (Knights Landing) and Broadwell DE
     (Andi Kleen)

   - Introduce 'perf ftrace' a perf front end to the kernel's ftrace
     function and function_graph tracer, defaulting to the
     "function_graph" tracer, more work will be done in reviving this
     effort, forward porting it from its initial patch submission
     (Namhyung Kim)

   - Add 'e' and 'c' hotkeys to expand/collapse call chains for a single
     hist entry in the 'perf report' and 'perf top' TUI (Jiri Olsa)

   - Account thread wait time (off CPU time) separately: sleep, iowait
     and preempt, based on the prev_state of the last event, show the
     breakdown when using "perf sched timehist --state" (Namhyumg Kim)

   - Add more triggers to switch the output file (perf.data.TIMESTAMP).

     Now, in addition to switching to a different output file when
     receiving a SIGUSR2, one can also specify file size and time based
     triggers:

           perf record -a --switch-output=signal

     is equivalent to what we had before:

           perf record -a --switch-output

     While we can also ask for the file to be "sliced" by size, taking
     into account that that will happen only when we get woken up by the
     kernel, i.e. one has to take into account the --mmap-pages (the
     size of the perf mmap ring buffer):

           perf record -a --switch-output=2G

     will break the perf.data output into multiple files limited to 2GB
     of samples, right when generating the output.

     For time based samples, alert() will be used, so to have 1 minute
     limited perf.data output files:

          perf record -a --switch-output=1m

     (Jiri Olsa)

   - Improve 'perf trace' (Arnaldo Carvalho de Melo)

   - 'perf kallsyms' toy tool to look for extended symbol information on
     the running kernel and demonstrate the machine/thread/symbol APIs
     for use in other tools, such as 'perf probe' (Arnaldo Carvalho de
     Melo)

   - ... plus tons of other changes, see the shortlog and Git log for
     details"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (131 commits)
  perf tools: Add missing parse_events_error() prototype
  perf pmu: Fix check for unset alias->unit array
  perf tools: Be consistent on the type of map->symbols[] interator
  perf intel pt decoder: clang has no -Wno-override-init
  perf evsel: Do not put a variable sized type not at the end of a struct
  perf probe: Avoid accessing uninitialized 'map' variable
  perf tools: Do not put a variable sized type not at the end of a struct
  perf record: Do not put a variable sized type not at the end of a struct
  perf tests: Synthesize struct instead of using field after variable sized type
  perf bench numa: Make sure dprintf() is not defined
  Revert "perf bench futex: Sanitize numeric parameters"
  tools lib subcmd: Make it an error to pass a signed value to OPTION_UINTEGER
  tools: Set the maximum optimization level according to the compiler being used
  tools: Suppress request for warning options not existent in clang
  samples/bpf: Reset global variables
  samples/bpf: Ignore already processed ELF sections
  samples/bpf: Add missing header
  perf symbols: dso->name is an array, no need to check it against NULL
  perf tests record: No need to test an array against NULL
  perf symbols: No need to check if sym->name is NULL
  ...
2017-02-20 12:21:13 -08:00
Linus Torvalds
f7458a5d63 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The RCU changes in this cycle are:

   - Dynticks updates, consolidating open-coded counter accesses into a
     well-defined API

   - SRCU updates: Simplify algorithm, add formal verification

   - Documentation updates

   - Miscellaneous fixes

   - Torture-test updates

  Most of the diffstat comes from the relatively large documentation
  update"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
  srcu: Reduce probability of SRCU ->unlock_count[] counter overflow
  rcutorture: Add CBMC-based formal verification for SRCU
  srcu: Force full grace-period ordering
  srcu: Implement more-efficient reader counts
  rcu: Adjust FQS offline checks for exact online-CPU detection
  rcu: Check cond_resched_rcu_qs() state less often to reduce GP overhead
  rcu: Abstract extended quiescent state determination
  rcu: Abstract dynticks extended quiescent state enter/exit operations
  rcu: Add lockdep checks to synchronous expedited primitives
  rcu: Eliminate unused expedited_normal counter
  llist: Clarify comments about when locking is needed
  rcu: Fix comment in rcu_organize_nocb_kthreads()
  rcu: Enable RCU tracepoints by default to aid in debugging
  rcu: Make rcu_cpu_starting() use its "cpu" argument
  rcu: Add comment headers to expedited-grace-period counter functions
  rcu: Don't wake rcuc/X kthreads on NOCB CPUs
  rcu: Re-enable TASKS_RCU for User Mode Linux
  rcu: Once again use NMI-based stack traces in stall warnings
  rcu: Remove short-term CPU kicking
  rcu: Add long-term CPU kicking
  ...
2017-02-20 11:21:17 -08:00
Linus Torvalds
1cd4027cfe Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "This update provides:

   - Yet another two irq controller chip drivers

   - A few updates and fixes for GICV3

   - A resource managed function for interrupt allocation

   - Fixes, updates and enhancements all over the place"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/qcom: Fix error handling
  genirq: Clarify logic calculating bogus irqreturn_t values
  genirq/msi: Add stubs for get_cached_msi_msg/pci_write_msi_msg
  genirq/devres: Use dev_name(dev) as default for devname
  genirq: Fix /proc/interrupts output alignment
  irqdesc: Add a resource managed version of irq_alloc_descs()
  irqchip/gic-v3-its: Zero command on allocation
  irqchip/gic-v3-its: Fix command buffer allocation
  irqchip/mips-gic: Fix local interrupts
  irqchip: Add a driver for Cortina Gemini
  irqchip: DT bindings for Cortina Gemini irqchip
  irqchip/gic-v3: Remove duplicate definition of GICD_TYPER_LPIS
  irqchip/gic-v3-its: Rename MAPVI to MAPTI
  irqchip/gic-v3-its: Drop deprecated GITS_BASER_TYPE_CPU
  irqchip/gic-v3-its: Refactor command encoding
  irqchip/gic-v3-its: Enable cacheable attribute Read-allocate hints
  irqchip/qcom: Add IRQ combiner driver
  ACPI: Add support for ResourceSource/IRQ domain mapping
  ACPI: Generic GSI: Do not attempt to map non-GSI IRQs during bus scan
  irq/platform-msi: Fix comment about maximal MSIs
2017-02-20 10:52:23 -08:00
Linus Torvalds
20dcfe1b7d Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "Nothing exciting, just the usual pile of fixes, updates and cleanups:

   - A bunch of clocksource driver updates

   - Removal of CONFIG_TIMER_STATS and the related /proc file

   - More posix timer slim down work

   - A scalability enhancement in the tick broadcast code

   - Math cleanups"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  hrtimer: Catch invalid clockids again
  math64, tile: Fix build failure
  clocksource/drivers/arm_arch_timer:: Mark cyclecounter __ro_after_init
  timerfd: Protect the might cancel mechanism proper
  timer_list: Remove useless cast when printing
  time: Remove CONFIG_TIMER_STATS
  clocksource/drivers/arm_arch_timer: Work around Hisilicon erratum 161010101
  clocksource/drivers/arm_arch_timer: Introduce generic errata handling infrastructure
  clocksource/drivers/arm_arch_timer: Remove fsl-a008585 parameter
  clocksource/drivers/arm_arch_timer: Add dt binding for hisilicon-161010101 erratum
  clocksource/drivers/ostm: Add renesas-ostm timer driver
  clocksource/drivers/ostm: Document renesas-ostm timer DT bindings
  clocksource/drivers/tcb_clksrc: Use 32 bit tcb as sched_clock
  clocksource/drivers/gemini: Add driver for the Cortina Gemini
  clocksource: add DT bindings for Cortina Gemini
  clockevents: Add a clkevt-of mechanism like clksrc-of
  tick/broadcast: Reduce lock cacheline contention
  timers: Omit POSIX timer stuff from task_struct when disabled
  x86/timer: Make delay() work during early bootup
  delay: Add explanation of udelay() inaccuracy
  ...
2017-02-20 10:06:32 -08:00
Rafael J. Wysocki
fccddb25e1 Merge branch 'pm-sleep'
* pm-sleep:
  PM / Documentation: Spelling s/wrtie/write/
  PM / sleep: Fix test_suspend after sleep state rework
  PM / Hibernate: Use rb_entry() instead of container_of()
2017-02-20 14:26:13 +01:00
Peter Zijlstra
95cb64c1fe fork: Fix task_struct alignment
Stupid bug that wrecked the alignment of task_struct and causes WARN()s
in the x86 FPU code on some platforms.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: e274795ea7 ("locking/mutex: Fix mutex handoff")
Link: http://lkml.kernel.org/r/20170218142645.GH6500@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-20 11:22:37 +01:00
David S. Miller
f787d1debf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-19 11:18:46 -05:00
Linus Torvalds
244ff16fb4 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Thomas Gleixner:
 "Move the futex init function to core initcall so user mode helper does
  not run into an uninitialized futex syscall"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Move futex_init() to core_initcall
2017-02-18 17:33:17 -08:00
Linus Torvalds
e602e70084 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "Two small fixes::

   - Prevent deadlock on the tick broadcast lock. Found and fixed by
     Mike.

   - Stop using printk() in the timekeeping debug code to prevent a
     deadlock against the scheduler"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Use deferred printk() in debug code
  tick/broadcast: Prevent deadlock on tick_broadcast_lock
2017-02-18 17:30:36 -08:00
Sergey Senozhatsky
fc98c3c8c9 printk: use rcuidle console tracepoint
Use rcuidle console tracepoint because, apparently, it may be issued
from an idle CPU:

  hw-breakpoint: Failed to enable monitor mode on CPU 0.
  hw-breakpoint: CPU 0 failed to disable vector catch

  ===============================
  [ ERR: suspicious RCU usage.  ]
  4.10.0-rc8-next-20170215+ #119 Not tainted
  -------------------------------
  ./include/trace/events/printk.h:32 suspicious rcu_dereference_check() usage!

  other info that might help us debug this:

  RCU used illegally from idle CPU!
  rcu_scheduler_active = 2, debug_locks = 0
  RCU used illegally from extended quiescent state!
  2 locks held by swapper/0/0:
   #0:  (cpu_pm_notifier_lock){......}, at: [<c0237e2c>] cpu_pm_exit+0x10/0x54
   #1:  (console_lock){+.+.+.}, at: [<c01ab350>] vprintk_emit+0x264/0x474

  stack backtrace:
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-rc8-next-20170215+ #119
  Hardware name: Generic OMAP4 (Flattened Device Tree)
    console_unlock
    vprintk_emit
    vprintk_default
    printk
    reset_ctrl_regs
    dbg_cpu_pm_notify
    notifier_call_chain
    cpu_pm_exit
    omap_enter_idle_coupled
    cpuidle_enter_state
    cpuidle_enter_state_coupled
    do_idle
    cpu_startup_entry
    start_kernel

This RCU warning, however, is suppressed by lockdep_off() in printk().
lockdep_off() increments the ->lockdep_recursion counter and thus
disables RCU_LOCKDEP_WARN() and debug_lockdep_rcu_enabled(), which want
lockdep to be enabled "current->lockdep_recursion == 0".

Link: http://lkml.kernel.org/r/20170217015932.11898-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Tony Lindgren <tony@atomide.com>
Tested-by: Tony Lindgren <tony@atomide.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Russell King <rmk@armlinux.org.uk>
Cc: <stable@vger.kernel.org> [3.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-18 17:27:00 -08:00
Marc Zyngier
336a9cde10 hrtimer: Catch invalid clockids again
commit 82e88ff1ea ("hrtimer: Revert CLOCK_MONOTONIC_RAW support") removed
unfortunately a sanity check in the hrtimer code which was part of that
MONOTONIC_RAW patch series.

It would have caught the bogus usage of CLOCK_MONOTONIC_RAW in the wireless
code. So bring it back.

It is way too easy to take any random clockid and feed it to the hrtimer
subsystem. At best, it gets mapped to a monotonic base, but it would be
better to just catch illegal values as early as possible.
    
Detect invalid clockids, map them to CLOCK_MONOTONIC and emit a warning.

[ tglx: Replaced the BUG by a WARN and gracefully map to CLOCK_MONOTONIC ]

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Tomasz Nowicki <tn@semihalf.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Link: http://lkml.kernel.org/r/1452879670-16133-3-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-18 10:58:39 +01:00
Geert Uytterhoeven
8cff66791a PM / sleep: Fix test_suspend after sleep state rework
When passing "test_suspend=mem" to the kernel:

    PM: can't test 'mem' suspend state

and the suspend test is not run.

Commit 406e79385f ("PM / sleep: System sleep state selection
interface rework") changed pm_labels[] from a contiguous NULL-terminated
array to a sparse array (with the first element unpopulated), breaking
the assumptions of the iterator in setup_test_suspend().

Iterate from PM_SUSPEND_MIN to PM_SUSPEND_MAX - 1 to fix this.

Fixes: 406e79385f (PM / sleep: System sleep state selection interface rework)
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-02-18 02:16:27 +01:00
Jens Axboe
818551e2b2 Merge branch 'for-4.11/next' into for-4.11/linus-merge
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-17 14:08:19 -07:00
Daniel Borkmann
74451e66d5 bpf: make jited programs visible in traces
Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.

This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_<tag>, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.

Before:

  7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
         f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)

After:

  7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
  7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
  7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
  7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
  7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
  7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
  7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
  7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
  7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
  7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
  7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
  7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
  7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
  7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
  7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
  [...]
  7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
  7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
         f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 13:40:05 -05:00
Daniel Borkmann
9383191da4 bpf: remove stubs for cBPF from arch code
Remove the dummy bpf_jit_compile() stubs for eBPF JITs and make
that a single __weak function in the core that can be overridden
similarly to the eBPF one. Also remove stale pr_err() mentions
of bpf_jit_compile.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 13:40:04 -05:00
Daniel Borkmann
c78f8bdfa1 bpf: mark all registered map/prog types as __ro_after_init
All map types and prog types are registered to the BPF core through
bpf_register_map_type() and bpf_register_prog_type() during init and
remain unchanged thereafter. As by design we don't (and never will)
have any pluggable code that can register to that at any later point
in time, lets mark all the existing bpf_{map,prog}_type_list objects
in the tree as __ro_after_init, so they can be moved to read-only
section from then onwards.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 13:40:04 -05:00
Joel Fernandes
67d04bb2bc tracing: Remove outdated ring buffer comment
The comment about ring buffer's organization is outdated and the code sits
elsewhere, remove the comment.
Link: http://lkml.kernel.org/r/20170217041058.23904-1-joelaf@google.com

Cc: Ingo Molnar <mingo@redhat.com>

Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-17 09:56:51 -05:00
David S. Miller
3f64116a83 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-16 19:34:01 -05:00
Linus Torvalds
558e8e27e7 Revert "nohz: Fix collision between tick and other hrtimers"
This reverts commit 24b91e360e and commit
7bdb59f1ad ("tick/nohz: Fix possible missing clock reprog after tick
soft restart") that depends on it,

Pavel reports that it causes occasional boot hangs for him that seem to
depend on just how the machine was booted.  In particular, his machine
hangs at around the PCI fixups of the EHCI USB host controller, but only
hangs from cold boot, not from a warm boot.

Thomas Gleixner suspecs it's a CPU hotplug interaction, particularly
since Pavel also saw suspend/resume issues that seem to be related.
We're reverting for now while trying to figure out the root cause.

Reported-bisected-and-tested-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org  # reverted commits were marked for stable
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-16 12:19:18 -08:00
Linus Torvalds
3c7a9f32f9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) In order to avoid problems in the future, make cgroup bpf overriding
    explicit using BPF_F_ALLOW_OVERRIDE. From Alexei Staovoitov.

 2) LLC sets skb->sk without proper skb->destructor and this explodes,
    fix from Eric Dumazet.

 3) Make sure when we have an ipv4 mapped source address, the
    destination is either also an ipv4 mapped address or
    ipv6_addr_any(). Fix from Jonathan T. Leighton.

 4) Avoid packet loss in fec driver by programming the multicast filter
    more intelligently. From Rui Sousa.

 5) Handle multiple threads invoking fanout_add(), fix from Eric
    Dumazet.

 6) Since we can invoke the TCP input path in process context, without
    BH being disabled, we have to accomodate that in the locking of the
    TCP probe. Also from Eric Dumazet.

 7) Fix erroneous emission of NETEVENT_DELAY_PROBE_TIME_UPDATE when we
    aren't even updating that sysctl value. From Marcus Huewe.

 8) Fix endian bugs in ibmvnic driver, from Thomas Falcon.

[ This is the second version of the pull that reverts the nested
  rhashtable changes that looked a bit too scary for this late in the
  release  - Linus ]

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (27 commits)
  rhashtable: Revert nested table changes.
  ibmvnic: Fix endian errors in error reporting output
  ibmvnic: Fix endian error when requesting device capabilities
  net: neigh: Fix netevent NETEVENT_DELAY_PROBE_TIME_UPDATE notification
  net: xilinx_emaclite: fix freezes due to unordered I/O
  net: xilinx_emaclite: fix receive buffer overflow
  bpf: kernel header files need to be copied into the tools directory
  tcp: tcp_probe: use spin_lock_bh()
  uapi: fix linux/if_pppol2tp.h userspace compilation errors
  packet: fix races in fanout_add()
  ibmvnic: Fix initial MTU settings
  net: ethernet: ti: cpsw: fix cpsw assignment in resume
  kcm: fix a null pointer dereference in kcm_sendmsg()
  net: fec: fix multicast filtering hardware setup
  ipv6: Handle IPv4-mapped src to in6addr_any dst.
  ipv6: Inhibit IPv4-mapped src address on the wire.
  net/mlx5e: Disable preemption when doing TC statistics upcall
  rhashtable: Add nested tables
  tipc: Fix tipc_sk_reinit race conditions
  gfs2: Use rhashtable walk interface in glock_hash_walk
  ...
2017-02-16 08:37:18 -08:00
Jeremy Kerr
5d4bac9a5f genirq: Clarify logic calculating bogus irqreturn_t values
Although irqreturn_t is an enum, we treat it (and its enumeration
constants) as a bitmask.

However, bad_action_ret() uses a less-than operator to determine whether
an irqreturn_t falls within allowable bit values, which means we need to
know the signededness of an enum type to read the logic, which is
implementation-dependent.

This change explicitly uses an unsigned type for the comparison. We do
this instead of changing to a bitwise test, as the latter compiles to
increased instructions in this hot path.

It looks like we get the correct behaviour currently (bad_action_ret(-1)
returns 1), so this is purely a readability fix.

Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Link: http://lkml.kernel.org/r/1487219049-4061-1-git-send-email-jk@ozlabs.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-16 15:32:19 +01:00
Masami Hiramatsu
bef5da6059 tracing/probes: Fix a warning message to show correct maximum length
Since tracing/*probe_events will accept a probe definition
up to 4096 - 2 ('\n' and '\0') bytes, it must show 4094 instead
of 4096 in warning message.

Note that there is one possible case of exceed 4094. If user
prepare 4096 bytes null-terminated string and syscall write
it with the count == 4095, then it can be accepted. However,
if user puts a '\n' after that, it must rejected.
So IMHO, the warning message should indicate shorter one,
since it is safer.

Link: http://lkml.kernel.org/r/148673290462.2579.7966778294009665632.stgit@devbox

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-15 10:32:48 -05:00
Wei Yongjun
8f0994bb8c tracing: Fix return value check in trace_benchmark_reg()
In case of error, the function kthread_run() returns ERR_PTR() and never
returns NULL. The NULL test in the return value check should be replaced
with IS_ERR().

Link: http://lkml.kernel.org/r/20170112135502.28556-1-weiyj.lk@gmail.com

Cc: stable@vger.kernel.org
Fixes: 81dc9f0e ("tracing: Add tracepoint benchmark tracepoint")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-15 09:02:27 -05:00
Arnd Bergmann
eb583cd484 tracing: Use modern function declaration
We get a lot of harmless warnings about this header file at W=1 level
because of an unusual function declaration:

kernel/trace/trace.h:766:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]

This moves the inline statement where it normally belongs, avoiding the
warning.

Link: http://lkml.kernel.org/r/20170123122521.3389010-1-arnd@arndb.de

Fixes: 4046bf023b ("ftrace: Expose ftrace_hash_empty and ftrace_lookup_ip")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-15 09:02:27 -05:00
Jason Baron
3821fd35b5 jump_label: Reduce the size of struct static_key
The static_key->next field goes mostly unused. The field is used for
associating module uses with a static key. Most uses of struct static_key
define a static key in the core kernel and make use of it entirely within
the core kernel, or define the static key in a module and make use of it
only from within that module. In fact, of the ~3,000 static keys defined,
I found only about 5 or so that did not fit this pattern.

Thus, we can remove the static_key->next field entirely and overload
the static_key->entries field. That is, when all the static_key uses
are contained within the same module, static_key->entries continues
to point to those uses. However, if the static_key uses are not contained
within the module where the static_key is defined, then we allocate a
struct static_key_mod, store a pointer to the uses within that
struct static_key_mod, and have the static key point at the static_key_mod.
This does incur some extra memory usage when a static_key is used in a
module that does not define it, but since there are only a handful of such
cases there is a net savings.

In order to identify if the static_key->entries pointer contains a
struct static_key_mod or a struct jump_entry pointer, bit 1 of
static_key->entries is set to 1 if it points to a struct static_key_mod and
is 0 if it points to a struct jump_entry. We were already using bit 0 in a
similar way to store the initial value of the static_key. This does mean
that allocations of struct static_key_mod and that the struct jump_entry
tables need to be at least 4-byte aligned in memory. As far as I can tell
all arches meet this criteria.

For my .config, the patch increased the text by 778 bytes, but reduced
the data + bss size by 14912, for a net savings of 14,134 bytes.

   text	   data	    bss	    dec	    hex	filename
8092427	5016512	 790528	13899467	 d416cb	vmlinux.pre
8093205	5001600	 790528	13885333	 d3df95	vmlinux.post

Link: http://lkml.kernel.org/r/1486154544-4321-1-git-send-email-jbaron@akamai.com

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-15 09:02:26 -05:00
Masami Hiramatsu
7257634135 tracing/probe: Show subsystem name in messages
Show "trace_probe:", "trace_kprobe:" and "trace_uprobe:"
headers for each warning/error/info message. This will
help people to notice that kprobe/uprobe events caused
those messages.

Link: http://lkml.kernel.org/r/148646647813.24658.16705315294927615333.stgit@devbox

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-15 09:02:25 -05:00
Luiz Capitulino
8e0f11429a tracing/hwlat: Update old comment about migration
The ftrace hwlat does support a cpumask.

Link: http://lkml.kernel.org/r/20170213122517.6e211955@redhat.com

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-15 09:02:25 -05:00
Steven Rostedt (VMware)
1f9b3546cf tracing: Have traceprobe_probes_write() not access userspace unnecessarily
The code in traceprobe_probes_write() reads up to 4096 bytes from userpace
for each line. If userspace passes in several lines to execute, the code
will do a large read for each line, even though, it is highly likely that
the first read from userspace received all of the lines at once.

I changed the logic to do a single read from userspace, and to only read
from userspace again if not all of the read from userspace made it in.

I tested this by adding printk()s and writing files that would test -1, ==,
and +1 the buffer size, to make sure that there's no overflows and that if a
single line is written with +1 the buffer size, that it fails properly.

Link: http://lkml.kernel.org/r/20170209180458.5c829ab2@gandalf.local.home

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-15 09:00:55 -05:00
Sergey Senozhatsky
f222449c9d timekeeping: Use deferred printk() in debug code
We cannot do printk() from tk_debug_account_sleep_time(), because
tk_debug_account_sleep_time() is called under tk_core seq lock.
The reason why printk() is unsafe there is that console_sem may
invoke scheduler (up()->wake_up_process()->activate_task()), which,
in turn, can return back to timekeeping code, for instance, via
get_time()->ktime_get(), deadlocking the system on tk_core seq lock.

[   48.950592] ======================================================
[   48.950622] [ INFO: possible circular locking dependency detected ]
[   48.950622] 4.10.0-rc7-next-20170213+ #101 Not tainted
[   48.950622] -------------------------------------------------------
[   48.950622] kworker/0:0/3 is trying to acquire lock:
[   48.950653]  (tk_core){----..}, at: [<c01cc624>] retrigger_next_event+0x4c/0x90
[   48.950683]
               but task is already holding lock:
[   48.950683]  (hrtimer_bases.lock){-.-...}, at: [<c01cc610>] retrigger_next_event+0x38/0x90
[   48.950714]
               which lock already depends on the new lock.

[   48.950714]
               the existing dependency chain (in reverse order) is:
[   48.950714]
               -> #5 (hrtimer_bases.lock){-.-...}:
[   48.950744]        _raw_spin_lock_irqsave+0x50/0x64
[   48.950775]        lock_hrtimer_base+0x28/0x58
[   48.950775]        hrtimer_start_range_ns+0x20/0x5c8
[   48.950775]        __enqueue_rt_entity+0x320/0x360
[   48.950805]        enqueue_rt_entity+0x2c/0x44
[   48.950805]        enqueue_task_rt+0x24/0x94
[   48.950836]        ttwu_do_activate+0x54/0xc0
[   48.950836]        try_to_wake_up+0x248/0x5c8
[   48.950836]        __setup_irq+0x420/0x5f0
[   48.950836]        request_threaded_irq+0xdc/0x184
[   48.950866]        devm_request_threaded_irq+0x58/0xa4
[   48.950866]        omap_i2c_probe+0x530/0x6a0
[   48.950897]        platform_drv_probe+0x50/0xb0
[   48.950897]        driver_probe_device+0x1f8/0x2cc
[   48.950897]        __driver_attach+0xc0/0xc4
[   48.950927]        bus_for_each_dev+0x6c/0xa0
[   48.950927]        bus_add_driver+0x100/0x210
[   48.950927]        driver_register+0x78/0xf4
[   48.950958]        do_one_initcall+0x3c/0x16c
[   48.950958]        kernel_init_freeable+0x20c/0x2d8
[   48.950958]        kernel_init+0x8/0x110
[   48.950988]        ret_from_fork+0x14/0x24
[   48.950988]
               -> #4 (&rt_b->rt_runtime_lock){-.-...}:
[   48.951019]        _raw_spin_lock+0x40/0x50
[   48.951019]        rq_offline_rt+0x9c/0x2bc
[   48.951019]        set_rq_offline.part.2+0x2c/0x58
[   48.951049]        rq_attach_root+0x134/0x144
[   48.951049]        cpu_attach_domain+0x18c/0x6f4
[   48.951049]        build_sched_domains+0xba4/0xd80
[   48.951080]        sched_init_smp+0x68/0x10c
[   48.951080]        kernel_init_freeable+0x160/0x2d8
[   48.951080]        kernel_init+0x8/0x110
[   48.951080]        ret_from_fork+0x14/0x24
[   48.951110]
               -> #3 (&rq->lock){-.-.-.}:
[   48.951110]        _raw_spin_lock+0x40/0x50
[   48.951141]        task_fork_fair+0x30/0x124
[   48.951141]        sched_fork+0x194/0x2e0
[   48.951141]        copy_process.part.5+0x448/0x1a20
[   48.951171]        _do_fork+0x98/0x7e8
[   48.951171]        kernel_thread+0x2c/0x34
[   48.951171]        rest_init+0x1c/0x18c
[   48.951202]        start_kernel+0x35c/0x3d4
[   48.951202]        0x8000807c
[   48.951202]
               -> #2 (&p->pi_lock){-.-.-.}:
[   48.951232]        _raw_spin_lock_irqsave+0x50/0x64
[   48.951232]        try_to_wake_up+0x30/0x5c8
[   48.951232]        up+0x4c/0x60
[   48.951263]        __up_console_sem+0x2c/0x58
[   48.951263]        console_unlock+0x3b4/0x650
[   48.951263]        vprintk_emit+0x270/0x474
[   48.951293]        vprintk_default+0x20/0x28
[   48.951293]        printk+0x20/0x30
[   48.951324]        kauditd_hold_skb+0x94/0xb8
[   48.951324]        kauditd_thread+0x1a4/0x56c
[   48.951324]        kthread+0x104/0x148
[   48.951354]        ret_from_fork+0x14/0x24
[   48.951354]
               -> #1 ((console_sem).lock){-.....}:
[   48.951385]        _raw_spin_lock_irqsave+0x50/0x64
[   48.951385]        down_trylock+0xc/0x2c
[   48.951385]        __down_trylock_console_sem+0x24/0x80
[   48.951385]        console_trylock+0x10/0x8c
[   48.951416]        vprintk_emit+0x264/0x474
[   48.951416]        vprintk_default+0x20/0x28
[   48.951416]        printk+0x20/0x30
[   48.951446]        tk_debug_account_sleep_time+0x5c/0x70
[   48.951446]        __timekeeping_inject_sleeptime.constprop.3+0x170/0x1a0
[   48.951446]        timekeeping_resume+0x218/0x23c
[   48.951477]        syscore_resume+0x94/0x42c
[   48.951477]        suspend_enter+0x554/0x9b4
[   48.951477]        suspend_devices_and_enter+0xd8/0x4b4
[   48.951507]        enter_state+0x934/0xbd4
[   48.951507]        pm_suspend+0x14/0x70
[   48.951507]        state_store+0x68/0xc8
[   48.951538]        kernfs_fop_write+0xf4/0x1f8
[   48.951538]        __vfs_write+0x1c/0x114
[   48.951538]        vfs_write+0xa0/0x168
[   48.951568]        SyS_write+0x3c/0x90
[   48.951568]        __sys_trace_return+0x0/0x10
[   48.951568]
               -> #0 (tk_core){----..}:
[   48.951599]        lock_acquire+0xe0/0x294
[   48.951599]        ktime_get_update_offsets_now+0x5c/0x1d4
[   48.951629]        retrigger_next_event+0x4c/0x90
[   48.951629]        on_each_cpu+0x40/0x7c
[   48.951629]        clock_was_set_work+0x14/0x20
[   48.951660]        process_one_work+0x2b4/0x808
[   48.951660]        worker_thread+0x3c/0x550
[   48.951660]        kthread+0x104/0x148
[   48.951690]        ret_from_fork+0x14/0x24
[   48.951690]
               other info that might help us debug this:

[   48.951690] Chain exists of:
                 tk_core --> &rt_b->rt_runtime_lock --> hrtimer_bases.lock

[   48.951721]  Possible unsafe locking scenario:

[   48.951721]        CPU0                    CPU1
[   48.951721]        ----                    ----
[   48.951721]   lock(hrtimer_bases.lock);
[   48.951751]                                lock(&rt_b->rt_runtime_lock);
[   48.951751]                                lock(hrtimer_bases.lock);
[   48.951751]   lock(tk_core);
[   48.951782]
                *** DEADLOCK ***

[   48.951782] 3 locks held by kworker/0:0/3:
[   48.951782]  #0:  ("events"){.+.+.+}, at: [<c0156590>] process_one_work+0x1f8/0x808
[   48.951812]  #1:  (hrtimer_work){+.+...}, at: [<c0156590>] process_one_work+0x1f8/0x808
[   48.951843]  #2:  (hrtimer_bases.lock){-.-...}, at: [<c01cc610>] retrigger_next_event+0x38/0x90
[   48.951843]   stack backtrace:
[   48.951873] CPU: 0 PID: 3 Comm: kworker/0:0 Not tainted 4.10.0-rc7-next-20170213+
[   48.951904] Workqueue: events clock_was_set_work
[   48.951904] [<c0110208>] (unwind_backtrace) from [<c010c224>] (show_stack+0x10/0x14)
[   48.951934] [<c010c224>] (show_stack) from [<c04ca6c0>] (dump_stack+0xac/0xe0)
[   48.951934] [<c04ca6c0>] (dump_stack) from [<c019b5cc>] (print_circular_bug+0x1d0/0x308)
[   48.951965] [<c019b5cc>] (print_circular_bug) from [<c019d2a8>] (validate_chain+0xf50/0x1324)
[   48.951965] [<c019d2a8>] (validate_chain) from [<c019ec18>] (__lock_acquire+0x468/0x7e8)
[   48.951995] [<c019ec18>] (__lock_acquire) from [<c019f634>] (lock_acquire+0xe0/0x294)
[   48.951995] [<c019f634>] (lock_acquire) from [<c01d0ea0>] (ktime_get_update_offsets_now+0x5c/0x1d4)
[   48.952026] [<c01d0ea0>] (ktime_get_update_offsets_now) from [<c01cc624>] (retrigger_next_event+0x4c/0x90)
[   48.952026] [<c01cc624>] (retrigger_next_event) from [<c01e4e24>] (on_each_cpu+0x40/0x7c)
[   48.952056] [<c01e4e24>] (on_each_cpu) from [<c01cafc4>] (clock_was_set_work+0x14/0x20)
[   48.952056] [<c01cafc4>] (clock_was_set_work) from [<c015664c>] (process_one_work+0x2b4/0x808)
[   48.952087] [<c015664c>] (process_one_work) from [<c0157774>] (worker_thread+0x3c/0x550)
[   48.952087] [<c0157774>] (worker_thread) from [<c015d644>] (kthread+0x104/0x148)
[   48.952087] [<c015d644>] (kthread) from [<c0107830>] (ret_from_fork+0x14/0x24)

Replace printk() with printk_deferred(), which does not call into
the scheduler.

Fixes: 0bf43f15db ("timekeeping: Prints the amounts of time spent during suspend")
Reported-and-tested-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J . Wysocki" <rjw@rjwysocki.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: "[4.9+]" <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170215044332.30449-1-sergey.senozhatsky@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-15 11:46:08 +01:00
Alexander Alemayhu
7e57fbb2a3 bpf: reduce compiler warnings by adding fallthrough comments
Fixes the following warnings:

kernel/bpf/verifier.c: In function ‘may_access_direct_pkt_data’:
kernel/bpf/verifier.c:702:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if (t == BPF_WRITE)
      ^
kernel/bpf/verifier.c:704:2: note: here
  case BPF_PROG_TYPE_SCHED_CLS:
  ^~~~
kernel/bpf/verifier.c: In function ‘reg_set_min_max_inv’:
kernel/bpf/verifier.c:2057:23: warning: this statement may fall through [-Wimplicit-fallthrough=]
   true_reg->min_value = 0;
   ~~~~~~~~~~~~~~~~~~~~^~~
kernel/bpf/verifier.c:2058:2: note: here
  case BPF_JSGT:
  ^~~~
kernel/bpf/verifier.c:2068:23: warning: this statement may fall through [-Wimplicit-fallthrough=]
   true_reg->min_value = 0;
   ~~~~~~~~~~~~~~~~~~~~^~~
kernel/bpf/verifier.c:2069:2: note: here
  case BPF_JSGE:
  ^~~~
kernel/bpf/verifier.c: In function ‘reg_set_min_max’:
kernel/bpf/verifier.c:2009:24: warning: this statement may fall through [-Wimplicit-fallthrough=]
   false_reg->min_value = 0;
   ~~~~~~~~~~~~~~~~~~~~~^~~
kernel/bpf/verifier.c:2010:2: note: here
  case BPF_JSGT:
  ^~~~
kernel/bpf/verifier.c:2019:24: warning: this statement may fall through [-Wimplicit-fallthrough=]
   false_reg->min_value = 0;
   ~~~~~~~~~~~~~~~~~~~~~^~~
kernel/bpf/verifier.c:2020:2: note: here
  case BPF_JSGE:
  ^~~~

Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-14 14:32:12 -05:00
Paul Moore
fe8e52b9b9 audit: remove unnecessary curly braces from switch/case statements
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-02-14 13:32:12 -05:00
Ingo Molnar
210f400d68 Linux 4.10-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJYoM2fAAoJEHm+PkMAQRiGr9MH/izEAMri7rJ0QMc3ejt+WmD0
 8pkZw3+MVn71z6cIEgpzk4QkEWJd5rfhkETCeCp7qQ9V6cDW1FDE9+0OmPjiphDt
 nnzKs7t7skEBwH5Mq5xygmIfkv+Z0QGHZ20gfQWY3F56Uxo+ARF88OBHBLKhqx3v
 98C7YbMFLKBslKClA78NUEIdx0UfBaRqerlERx0Lfl9aoOrbBS6WI3iuREiylpih
 9o7HTrwaGKkU4Kd6NdgJP2EyWPsd1LGalxBBjeDSpm5uokX6ALTdNXDZqcQscHjE
 RmTqJTGRdhSThXOpNnvUJvk9L442yuNRrVme/IqLpxMdHPyjaXR3FGSIDb2SfjY=
 =VMy8
 -----END PGP SIGNATURE-----

Merge tag 'v4.10-rc8' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-14 07:29:14 +01:00
Richard Guy Briggs
ca86cad738 audit: log module name on init_module
This adds a new auxiliary record MODULE_INIT to the SYSCALL event.

We get finit_module for free since it made most sense to hook this in to
load_module().

https://github.com/linux-audit/audit-kernel/issues/7
https://github.com/linux-audit/audit-kernel/wiki/RFE-Module-Load-Record-Format

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Jessica Yu <jeyu@redhat.com>
[PM: corrected links in the commit description]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-02-13 16:17:13 -05:00
Yang Yang
25f71d1c3e futex: Move futex_init() to core_initcall
The UEVENT user mode helper is enabled before the initcalls are executed
and is available when the root filesystem has been mounted.

The user mode helper is triggered by device init calls and the executable
might use the futex syscall.

futex_init() is marked __initcall which maps to device_initcall, but there
is no guarantee that futex_init() is invoked _before_ the first device init
call which triggers the UEVENT user mode helper.

If the user mode helper uses the futex syscall before futex_init() then the
syscall crashes with a NULL pointer dereference because the futex subsystem
has not been initialized yet.

Move futex_init() to core_initcall so futexes are initialized before the
root filesystem is mounted and the usermode helper becomes available.

[ tglx: Rewrote changelog ]

Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Cc: jiang.biao2@zte.com.cn
Cc: jiang.zhengxiong@zte.com.cn
Cc: zhong.weidong@zte.com.cn
Cc: deng.huali@zte.com.cn
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1483085875-6130-1-git-send-email-yang.yang29@zte.com.cn
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-13 16:12:22 +01:00
Mike Galbraith
202461e2f3 tick/broadcast: Prevent deadlock on tick_broadcast_lock
tick_broadcast_lock is taken from interrupt context, but the following call
chain takes the lock without disabling interrupts:

[   12.703736]  _raw_spin_lock+0x3b/0x50
[   12.703738]  tick_broadcast_control+0x5a/0x1a0
[   12.703742]  intel_idle_cpu_online+0x22/0x100
[   12.703744]  cpuhp_invoke_callback+0x245/0x9d0
[   12.703752]  cpuhp_thread_fun+0x52/0x110
[   12.703754]  smpboot_thread_fn+0x276/0x320

So the following deadlock can happen:

   lock(tick_broadcast_lock);
   <Interrupt>
      lock(tick_broadcast_lock);

intel_idle_cpu_online() is the only place which violates the calling
convention of tick_broadcast_control(). This was caused by the removal of
the smp function call in course of the cpu hotplug rework.

Instead of slapping local_irq_disable/enable() at the call site, we can
relax the calling convention and handle it in the core code, which makes
the whole machinery more robust.

Fixes: 29d7bbada9 ("intel_idle: Remove superfluous SMP fuction call")
Reported-by: Gabriel C <nix.or.die@gmail.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ruslan Ruslichenko <rruslich@cisco.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: lwn@lwn.net
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: stable <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1486953115.5912.4.camel@gmx.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-13 09:49:31 +01:00
Alexei Starovoitov
7f67763337 bpf: introduce BPF_F_ALLOW_OVERRIDE flag
If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
to the given cgroup the descendent cgroup will be able to override
effective bpf program that was inherited from this cgroup.
By default it's not passed, therefore override is disallowed.

Examples:
1.
prog X attached to /A with default
prog Y fails to attach to /A/B and /A/B/C
Everything under /A runs prog X

2.
prog X attached to /A with allow_override.
prog Y fails to attach to /A/B with default (non-override)
prog M attached to /A/B with allow_override.
Everything under /A/B runs prog M only.

3.
prog X attached to /A with allow_override.
prog Y fails to attach to /A with default.
The user has to detach first to switch the mode.

In the future this behavior may be extended with a chain of
non-overridable programs.

Also fix the bug where detach from cgroup where nothing is attached
was not throwing error. Return ENOENT in such case.

Add several testcases and adjust libbpf.

Fixes: 3007098494 ("cgroup: add support for eBPF programs")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Daniel Mack <daniel@zonque.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-12 21:52:19 -05:00
Heiner Kallweit
899b5fbf9d genirq/devres: Use dev_name(dev) as default for devname
Allow the devname parameter to be NULL and use dev_name(dev) in this case.
This should be an appropriate default for most use cases.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: http://lkml.kernel.org/r/05c63d67-30b4-7026-02d5-ce7fb7bc185f@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-12 19:49:25 +01:00
Linus Torvalds
fdb0ee7c65 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
 "Fix a sporadic missed timer hw reprogramming bug that can result in
  random delays"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/nohz: Fix possible missing clock reprog after tick soft restart
2017-02-11 10:24:16 -08:00
Linus Torvalds
d5b76bef01 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "A kernel crash fix plus three tooling fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix crash in perf_event_read()
  perf callchain: Reference count maps
  perf diff: Fix -o/--order option behavior (again)
  perf diff: Fix segfault on 'perf diff -o N' option
2017-02-11 10:20:06 -08:00
Linus Torvalds
4e4f74a7ee Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull lockdep fix from Ingo Molnar:
 "This fixes an ugly lockdep stack trace output regression. (But also
  affects other stacktrace users such as kmemleak, KASAN, etc)"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  stacktrace, lockdep: Fix address, newline ugliness
2017-02-11 10:16:05 -08:00
David S. Miller
35eeacf182 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-11 02:31:11 -05:00
Peter Zijlstra
5ff22646d2 module: Optimize search_module_extables()
While looking through the __ex_table stuff I found that we do a linear
lookup of the module. Also fix up a comment.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Jessica Yu <jeyu@redhat.com>
2017-02-10 19:21:10 -08:00
Steven Rostedt (VMware)
4c7384131c tracing: Have COMM event filter key be treated as a string
The GLOB operation "~" should be able to work with the COMM filter key in
order to trace programs with a glob. For example

  echo 'COMM ~ "systemd*"' > events/syscalls/filter

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-10 14:19:45 -05:00
H Hartley Sweeten
f435da416b genirq: Fix /proc/interrupts output alignment
If the irq_desc being output does not have a domain associated the
information following the 'name' is not aligned correctly.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Link: http://lkml.kernel.org/r/20170210165416.5629-1-hsweeten@visionengravers.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-10 20:17:52 +01:00
Joerg Roedel
8d2932dd06 Merge branches 'iommu/fixes', 'arm/exynos', 'arm/renesas', 'arm/smmu', 'arm/mediatek', 'arm/core', 'x86/vt-d' and 'core' into next 2017-02-10 15:13:10 +01:00
Bartosz Golaszewski
2b5e77308f irqdesc: Add a resource managed version of irq_alloc_descs()
Add a devres flavor of __devm_irq_alloc_descs() and corresponding
helper macros.

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-doc@vger.kernel.org
Cc: Jonathan Corbet <corbet@lwn.net>
Link: http://lkml.kernel.org/r/1486729403-21132-1-git-send-email-bgolaszewski@baylibre.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-10 14:39:20 +01:00
Mars Cheng
7551b02b94 timer_list: Remove useless cast when printing
hrtimer_resolution is already unsigned int, not necessary to cast
it when printing.

Signed-off-by: Mars Cheng <mars.cheng@mediatek.com>
Cc: CC Hwang <cc.hwang@mediatek.com>
Cc: wsd_upstream@mediatek.com
Cc: Loda Chou <loda.chou@mediatek.com>
Cc: Jades Shih <jades.shih@mediatek.com>
Cc: Miles Chen <miles.chen@mediatek.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: My Chuang <my.chuang@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Link: http://lkml.kernel.org/r/1486626615-5879-1-git-send-email-mars.cheng@mediatek.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-10 11:15:08 +01:00
Kees Cook
dfb4357da6 time: Remove CONFIG_TIMER_STATS
Currently CONFIG_TIMER_STATS exposes process information across namespaces:

kernel/time/timer_list.c print_timer():

        SEQ_printf(m, ", %s/%d", tmp, timer->start_pid);

/proc/timer_list:

 #11: <0000000000000000>, hrtimer_wakeup, S:01, do_nanosleep, cron/2570

Given that the tracer can give the same information, this patch entirely
removes CONFIG_TIMER_STATS.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: linux-doc@vger.kernel.org
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Xing Gao <xgao01@email.wm.edu>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jessica Frazelle <me@jessfraz.com>
Cc: kernel-hardening@lists.openwall.com
Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Marek <mmarek@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-api@vger.kernel.org
Cc: Arjan van de Ven <arjan@linux.intel.com>
Link: http://lkml.kernel.org/r/20170208192659.GA32582@beast
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-10 11:15:08 +01:00
Frederic Weisbecker
7bdb59f1ad tick/nohz: Fix possible missing clock reprog after tick soft restart
ts->next_tick keeps track of the next tick deadline in order to optimize
clock programmation on irq exit and avoid redundant clock device writes.

Now if ts->next_tick missed an update, we may spuriously miss a clock
reprog later as the nohz code is fooled by an obsolete next_tick value.

This is what happens here on a specific path: when we observe an
expired timer from the nohz update code on irq exit, we perform a soft
tick restart which simply fires the closest possible tick without
actually exiting the nohz mode and restoring a periodic state. But we
forget to update ts->next_tick accordingly.

As a result, after the next tick resulting from such soft tick restart,
the nohz code sees a stale value on ts->next_tick which doesn't match
the clock deadline that just expired. If that obsolete ts->next_tick
value happens to collide with the actual next tick deadline to be
scheduled, we may spuriously bypass the clock reprogramming. In the
worst case, the tick may never fire again.

Fix this with a ts->next_tick reset on soft tick restart.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed: Wanpeng Li <wanpeng.li@hotmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1486485894-29173-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-10 09:43:48 +01:00
Waiman Long
bc88c10d7e locking/spinlock/debug: Remove spinlock lockup detection code
The current spinlock lockup detection code can sometimes produce false
positives because of the unfairness of the locking algorithm itself.

So the lockup detection code is now removed. Instead, we are relying
on the NMI watchdog to detect potential lockup. We won't have lockup
detection if the watchdog isn't running.

The commented-out read-write lock lockup detection code are also
removed.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1486583208-11038-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10 09:09:49 +01:00
Byungchul Park
f9af456a61 lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
Bug messages and stack dump for MAX_LOCKDEP_CHAIN_HLOCKS should only
be printed once.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1484275324-28192-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10 09:09:48 +01:00
Alexander Shishkin
6ce77bfd6c perf/core: Allow kernel filters on CPU events
While supporting file-based address filters for CPU events requires some
extra context switch handling, kernel address filters are easy, since the
kernel mapping is preserved across address spaces. It is also useful as
it permits tracing scheduling paths of the kernel.

This patch allows setting up kernel filters for CPU events.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170126094057.13805-4-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10 09:08:09 +01:00
Alexander Shishkin
9ccbfbb157 perf/core: Do error out on a kernel filter on an exclude_filter event
It is currently possible to configure a kernel address filter for a
event that excludes kernel from its traces (attr.exclude_kernel==1).

While in reality this doesn't make sense, the SET_FILTER ioctl() should
return a error in such case, currently it does not. Furthermore, it
will still silently discard the filter and any potentially valid filters
that came with it.

This patch makes the SET_FILTER ioctl() error out in such cases.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170126094057.13805-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10 09:08:09 +01:00
Steven Rostedt (VMware)
bb3bac2ca9 sched/core: Remove unlikely() annotation from sched_move_task()
The check for 'running' in sched_move_task() has an unlikely() around it. That
is, it is unlikely that the task being moved is running. That use to be
true. But with a couple of recent updates, it is now likely that the task
will be running.

The first change came from ea86cb4b76 ("sched/cgroup: Fix
cpu_cgroup_fork() handling") that moved around the use case of
sched_move_task() in do_fork() where the call is now done after the task is
woken (hence it is running).

The second change came from 8e5bfa8c1f ("sched/autogroup: Do not use
autogroup->tg in zombie threads") where sched_move_task() is called by the
exit path, by the task that is exiting. Hence it too is running.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20170206110426.27ca6426@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10 09:05:42 +01:00
Peter Zijlstra
451d24d1e5 perf/core: Fix crash in perf_event_read()
Alexei had his box explode because doing read() on a package
(rapl/uncore) event that isn't currently scheduled in ends up doing an
out-of-bounds load.

Rework the code to more explicitly deal with event->oncpu being -1.

Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: eranian@google.com
Fixes: d6a2f9035b ("perf/core: Introduce PMU_EV_CAP_READ_ACTIVE_PKG")
Link: http://lkml.kernel.org/r/20170131102710.GL6515@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10 09:04:50 +01:00
Naveen N. Rao
fc62d0207a kprobes: Introduce weak variant of kprobe_exceptions_notify()
kprobe_exceptions_notify() is not used on some of the architectures such
as arm[64] and powerpc anymore. Introduce a weak variant for such
architectures.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-10 14:42:49 +11:00
James Morris
a2a15479d6 Merge branch 'stable-4.11' of git://git.infradead.org/users/pcmoore/selinux into next 2017-02-10 10:28:49 +11:00
Paul Gortmaker
8a293be0d6 core: migrate exception table users off module.h and onto extable.h
These files were including module.h for exception table related
functions.  We've now separated that content out into its own file
"extable.h" so now move over to that and where possible, avoid all
the extra header content in module.h that we don't really need to
compile these non-modular files.

Note:
   init/main.c still needs module.h for __init_or_module
   kernel/extable.c still needs module.h for is_module_text_address

...and so we don't get the benefit of removing module.h from the cpp
feed for these two files, unlike the almost universal 1:1 exchange
of module.h for extable.h we were able to do in the arch dirs.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Jessica Yu <jeyu@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-02-09 16:38:53 -05:00
Luis R. Rodriguez
ed5bd7dc88 kernel/ucount.c: mark user_header with kmemleak_ignore()
The user_header gets caught by kmemleak with the following splat as
missing a free:

  unreferenced object 0xffff99667a733d80 (size 96):
  comm "swapper/0", pid 1, jiffies 4294892317 (age 62191.468s)
  hex dump (first 32 bytes):
    a0 b6 92 b4 ff ff ff ff 00 00 00 00 01 00 00 00  ................
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
     kmemleak_alloc+0x4a/0xa0
     __kmalloc+0x144/0x260
     __register_sysctl_table+0x54/0x5e0
     register_sysctl+0x1b/0x20
     user_namespace_sysctl_init+0x17/0x34
     do_one_initcall+0x52/0x1a0
     kernel_init_freeable+0x173/0x200
     kernel_init+0xe/0x100
     ret_from_fork+0x2c/0x40

The BUG_ON()s are intended to crash so no need to clean up after
ourselves on error there.  This is also a kernel/ subsys_init() we don't
need a respective exit call here as this is never modular, so just white
list it.

Link: http://lkml.kernel.org/r/20170203211404.31458-1-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-08 15:41:43 -08:00
Daniel Borkmann
c502faf941 bpf, lpm: fix overflows in trie_alloc checks
Cap the maximum (total) value size and bail out if larger than KMALLOC_MAX_SIZE
as otherwise it doesn't make any sense to proceed further, since we're
guaranteed to fail to allocate elements anyway in lpm_trie_node_alloc();
likleyhood of failure is still high for large values, though, similarly
as with htab case in non-prealloc.

Next, make sure that cost vars are really u64 instead of size_t, so that we
don't overflow on 32 bit and charge only tiny map.pages against memlock while
allowing huge max_entries; cap also the max cost like we do with other map
types.

Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-08 14:40:03 -05:00
Sergey Senozhatsky
d9c23523ed printk: drop call_console_drivers() unused param
We do suppress_message_printing() check before we call
call_console_drivers() now, so `level' param is not needed
anymore.

Link: http://lkml.kernel.org/r/20161224140902.1962-2-sergey.senozhatsky@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2017-02-08 14:01:36 +01:00
Sergey Senozhatsky
de6fcbdb68 printk: convert the rest to printk-safe
This patch converts the rest of logbuf users (which are
out of printk recursion case, but can deadlock in printk).
To make printk-safe usage easier the patch introduces 4
helper macros:
- logbuf_lock_irq()/logbuf_unlock_irq()
  lock/unlock the logbuf lock and disable/enable local IRQ

- logbuf_lock_irqsave(flags)/logbuf_unlock_irqrestore(flags)
  lock/unlock the logbuf lock and saves/restores local IRQ state

Link: http://lkml.kernel.org/r/20161227141611.940-9-sergey.senozhatsky@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2017-02-08 13:58:44 +01:00
Sergey Senozhatsky
8b1742c9c2 printk: remove zap_locks() function
We use printk-safe now which makes printk-recursion detection code
in vprintk_emit() unreachable. The tricky thing here is that, apart
from detecting and reporting printk recursions, that code also used
to zap_locks() in case of panic() from the same CPU. However,
zap_locks() does not look to be needed anymore:

1) Since commit 08d78658f3 ("panic: release stale console lock to
   always get the logbuf printed out") panic flushing of `logbuf' to
   console ignores the state of `console_sem' by doing
   	panic()
		console_trylock();
		console_unlock();

2) Since commit cf9b1106c8 ("printk/nmi: flush NMI messages on the
   system panic") panic attempts to zap the `logbuf_lock' spin_lock to
   successfully flush nmi messages to `logbuf'.

Basically, it seems that we either already do what zap_locks() used to
do but in other places or we ignore the state of the lock. The only
reaming difference is that we don't re-init the console semaphore in
printk_safe_flush_on_panic(), but this is not necessary because we
don't call console drivers from printk_safe_flush_on_panic() due to
the fact that we are using a deferred printk() version (as was
suggested by Petr Mladek).

Link: http://lkml.kernel.org/r/20161227141611.940-8-sergey.senozhatsky@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2017-02-08 13:54:27 +01:00
Sergey Senozhatsky
f975237b76 printk: use printk_safe buffers in printk
Use printk_safe per-CPU buffers in printk recursion-prone blocks:
-- around logbuf_lock protected sections in vprintk_emit() and
   console_unlock()
-- around down_trylock_console_sem() and up_console_sem()

Note that this solution addresses deadlocks caused by printk()
recursive calls only. That is vprintk_emit() and console_unlock().
The rest will be converted in a followup patch.

Another thing to note is that we now keep lockdep enabled in printk,
because we are protected against the printk recursion caused by
lockdep in vprintk_emit() by the printk-safe mechanism - we first
switch to per-CPU buffers and only then access the deadlock-prone
locks.

Examples:

1) printk() from logbuf_lock spin_lock section

Assume the following code:
  printk()
    raw_spin_lock(&logbuf_lock);
    WARN_ON(1);
    raw_spin_unlock(&logbuf_lock);

which now produces:

 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 366 at kernel/printk/printk.c:1811 vprintk_emit
 CPU: 0 PID: 366 Comm: bash
 Call Trace:
   warn_slowpath_null+0x1d/0x1f
   vprintk_emit+0x1cd/0x438
   vprintk_default+0x1d/0x1f
   printk+0x48/0x50
  [..]

2) printk() from semaphore sem->lock spin_lock section

Assume the following code

  printk()
    console_trylock()
      down_trylock()
        raw_spin_lock_irqsave(&sem->lock, flags);
        WARN_ON(1);
        raw_spin_unlock_irqrestore(&sem->lock, flags);

which now produces:

 ------------[ cut here ]------------
 WARNING: CPU: 1 PID: 363 at kernel/locking/semaphore.c:141 down_trylock
 CPU: 1 PID: 363 Comm: bash
 Call Trace:
   warn_slowpath_null+0x1d/0x1f
   down_trylock+0x3d/0x62
   ? vprintk_emit+0x3f9/0x414
   console_trylock+0x31/0xeb
   vprintk_emit+0x3f9/0x414
   vprintk_default+0x1d/0x1f
   printk+0x48/0x50
  [..]

3) printk() from console_unlock()

Assume the following code:

  printk()
    console_unlock()
      raw_spin_lock(&logbuf_lock);
      WARN_ON(1);
      raw_spin_unlock(&logbuf_lock);

which now produces:

 ------------[ cut here ]------------
 WARNING: CPU: 1 PID: 329 at kernel/printk/printk.c:2384 console_unlock
 CPU: 1 PID: 329 Comm: bash
 Call Trace:
   warn_slowpath_null+0x18/0x1a
   console_unlock+0x12d/0x559
   ? trace_hardirqs_on_caller+0x16d/0x189
   ? trace_hardirqs_on+0xd/0xf
   vprintk_emit+0x363/0x374
   vprintk_default+0x18/0x1a
   printk+0x43/0x4b
  [..]

4) printk() from try_to_wake_up()

Assume the following code:

  printk()
    console_unlock()
      up()
        try_to_wake_up()
          raw_spin_lock_irqsave(&p->pi_lock, flags);
          WARN_ON(1);
          raw_spin_unlock_irqrestore(&p->pi_lock, flags);

which now produces:

 ------------[ cut here ]------------
 WARNING: CPU: 3 PID: 363 at kernel/sched/core.c:2028 try_to_wake_up
 CPU: 3 PID: 363 Comm: bash
 Call Trace:
   warn_slowpath_null+0x1d/0x1f
   try_to_wake_up+0x7f/0x4f7
   wake_up_process+0x15/0x17
   __up.isra.0+0x56/0x63
   up+0x32/0x42
   __up_console_sem+0x37/0x55
   console_unlock+0x21e/0x4c2
   vprintk_emit+0x41c/0x462
   vprintk_default+0x1d/0x1f
   printk+0x48/0x50
  [..]

5) printk() from call_console_drivers()

Assume the following code:
  printk()
    console_unlock()
      call_console_drivers()
      ...
          WARN_ON(1);

which now produces:

 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 305 at kernel/printk/printk.c:1604 call_console_drivers
 CPU: 2 PID: 305 Comm: bash
 Call Trace:
   warn_slowpath_null+0x18/0x1a
   call_console_drivers.isra.6.constprop.16+0x3a/0xb0
   console_unlock+0x471/0x48e
   vprintk_emit+0x1f4/0x206
   vprintk_default+0x18/0x1a
   vprintk_func+0x6e/0x70
   printk+0x3e/0x46
  [..]

6) unsupported placeholder in printk() format now prints an actual
   warning from vscnprintf(), instead of
   	'BUG: recent printk recursion!'.

 ------------[ cut here ]------------
 WARNING: CPU: 5 PID: 337 at lib/vsprintf.c:1900 format_decode
 Please remove unsupported %
  in format string
 CPU: 5 PID: 337 Comm: bash
 Call Trace:
   dump_stack+0x4f/0x65
   __warn+0xc2/0xdd
   warn_slowpath_fmt+0x4b/0x53
   format_decode+0x22c/0x308
   vsnprintf+0x89/0x3b7
   vscnprintf+0xd/0x26
   vprintk_emit+0xb4/0x238
   vprintk_default+0x1d/0x1f
   vprintk_func+0x6c/0x73
   printk+0x43/0x4b
  [..]

Link: http://lkml.kernel.org/r/20161227141611.940-7-sergey.senozhatsky@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-08 13:51:49 +01:00
Sergey Senozhatsky
ddb9baa822 printk: report lost messages in printk safe/nmi contexts
Account lost messages in pritk-safe and printk-safe-nmi
contexts and report those numbers during printk_safe_flush().

The patch also moves lost message counter to struct
`printk_safe_seq_buf' instead of having dedicated static
counters - this simplifies the code.

Link: http://lkml.kernel.org/r/20161227141611.940-6-sergey.senozhatsky@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2017-02-08 13:50:05 +01:00
Sergey Senozhatsky
7acac3445a printk: always use deferred printk when flush printk_safe lines
Always use printk_deferred() in printk_safe_flush_line().
Flushing can be done from NMI or printk_safe contexts (when
we are in panic), so we can't call console drivers, yet still
want to store the messages in the logbuf buffer. Therefore we
use a deferred printk version.

Link: http://lkml.kernel.org/r/20170206164253.GA463@tigerII.localdomain
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-08 11:19:10 +01:00
Sergey Senozhatsky
099f1c84c0 printk: introduce per-cpu safe_print seq buffer
This patch extends the idea of NMI per-cpu buffers to regions
that may cause recursive printk() calls and possible deadlocks.
Namely, printk() can't handle printk calls from schedule code
or printk() calls from lock debugging code (spin_dump() for instance);
because those may be called with `sem->lock' already taken or any
other `critical' locks (p->pi_lock, etc.). An example of deadlock
can be

 vprintk_emit()
  console_unlock()
   up()                        << raw_spin_lock_irqsave(&sem->lock, flags);
    wake_up_process()
     try_to_wake_up()
      ttwu_queue()
       ttwu_activate()
        activate_task()
         enqueue_task()
          enqueue_task_fair()
           cfs_rq_of()
            task_of()
             WARN_ON_ONCE(!entity_is_task(se))
              vprintk_emit()
               console_trylock()
                down_trylock()
                 raw_spin_lock_irqsave(&sem->lock, flags)
                 ^^^^ deadlock

and some other cases.

Just like in NMI implementation, the solution uses a per-cpu
`printk_func' pointer to 'redirect' printk() calls to a 'safe'
callback, that store messages in a per-cpu buffer and flushes
them back to logbuf buffer later.

Usage example:

 printk()
  printk_safe_enter_irqsave(flags)
  //
  //  any printk() call from here will endup in vprintk_safe(),
  //  that stores messages in a special per-CPU buffer.
  //
  printk_safe_exit_irqrestore(flags)

The 'redirection' mechanism, though, has been reworked, as suggested
by Petr Mladek. Instead of using a per-cpu @print_func callback we now
keep a per-cpu printk-context variable and call either default or nmi
vprintk function depending on its value. printk_nmi_entrer/exit and
printk_safe_enter/exit, thus, just set/celar corresponding bits in
printk-context functions.

The patch only adds printk_safe support, we don't use it yet.

Link: http://lkml.kernel.org/r/20161227141611.940-4-sergey.senozhatsky@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-08 11:07:11 +01:00
Sergey Senozhatsky
f92bac3b14 printk: rename nmi.c and exported api
A preparation patch for printk_safe work. No functional change.
- rename nmi.c to print_safe.c
- add `printk_safe' prefix to some (which used both by printk-safe
  and printk-nmi) of the exported functions.

Link: http://lkml.kernel.org/r/20161227141611.940-3-sergey.senozhatsky@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2017-02-08 11:02:33 +01:00
Sergey Senozhatsky
bd66a89249 printk: use vprintk_func in vprintk()
vprintk(), just like printk(), better be using per-cpu printk_func
instead of direct vprintk_emit() call. Just in case if vprintk()
will ever be called from NMI, or from any other context that can
deadlock in printk().

Link: http://lkml.kernel.org/r/20161227141611.940-2-sergey.senozhatsky@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2017-02-08 10:57:32 +01:00
Ingo Molnar
1051408f7e sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]
The names are all 'autogroup', not 'auto_group' - so rename
the kernel/sched/auto_group.[ch] to match the existing
nomenclature.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-08 09:01:11 +01:00
Omar Sandoval
bfeda41d06 stacktrace, lockdep: Fix address, newline ugliness
Since KERN_CONT became meaningful again, lockdep stack traces have had
annoying extra newlines, like this:

[    5.561122] -> #1 (B){+.+...}:
[    5.561528]
[    5.561532] [<ffffffff810d8873>] lock_acquire+0xc3/0x210
[    5.562178]
[    5.562181] [<ffffffff816f6414>] mutex_lock_nested+0x74/0x6d0
[    5.562861]
[    5.562880] [<ffffffffa01aa3c3>] init_btrfs_fs+0x21/0x196 [btrfs]
[    5.563717]
[    5.563721] [<ffffffff81000472>] do_one_initcall+0x52/0x1b0
[    5.564554]
[    5.564559] [<ffffffff811a3af6>] do_init_module+0x5f/0x209
[    5.565357]
[    5.565361] [<ffffffff81122f4d>] load_module+0x218d/0x2b80
[    5.566020]
[    5.566021] [<ffffffff81123beb>] SyS_finit_module+0xeb/0x120
[    5.566694]
[    5.566696] [<ffffffff816fd241>] entry_SYSCALL_64_fastpath+0x1f/0xc2

That's happening because each printk() call now gets printed on its own
line, and we do a separate call to print the spaces before the symbol.
Fix it by doing the printk() directly instead of using the
print_ip_sym() helper.

Additionally, the symbol address isn't very helpful, so let's get rid of
that, too. The final result looks like this:

[    5.194518] -> #1 (B){+.+...}:
[    5.195002]        lock_acquire+0xc3/0x210
[    5.195439]        mutex_lock_nested+0x74/0x6d0
[    5.196491]        do_one_initcall+0x52/0x1b0
[    5.196939]        do_init_module+0x5f/0x209
[    5.197355]        load_module+0x218d/0x2b80
[    5.197792]        SyS_finit_module+0xeb/0x120
[    5.198251]        entry_SYSCALL_64_fastpath+0x1f/0xc2

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Fixes: 4bcc595ccd ("printk: reinstate KERN_CONT for printing continuation lines")
Link: http://lkml.kernel.org/r/43b4e114724b2bdb0308fa86cb33aa07d3d67fad.1486510315.git.osandov@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-08 08:21:31 +01:00
David S. Miller
3efa70d78f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The conflict was an interaction between a bug fix in the
netvsc driver in 'net' and an optimization of the RX path
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-07 16:29:30 -05:00
Laura Abbott
0f5bf6d0af arch: Rename CONFIG_DEBUG_RODATA and CONFIG_DEBUG_MODULE_RONX
Both of these options are poorly named. The features they provide are
necessary for system security and should not be considered debug only.
Change the names to CONFIG_STRICT_KERNEL_RWX and
CONFIG_STRICT_MODULE_RWX to better describe what these options do.

Signed-off-by: Laura Abbott <labbott@redhat.com>
Acked-by: Jessica Yu <jeyu@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-02-07 12:32:52 -08:00
Ingo Molnar
f2cb13609d sched/topology: Split out scheduler topology code from core.c into topology.c
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-07 10:58:12 +01:00
Ingo Molnar
004172bdad sched/core: Remove unnecessary #include headers
Over the years sched/core.c accumulated over 50 #include lines,
40 of which are superfluous. (!)

Removing them decreases the preprocessed .c file (.i) size noticeably:

            triton:~/tip> wc -l kernel/sched/core.i

  Before:   76387 kernel/sched/core.i
  After:    75896 kernel/sched/core.i

Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-07 10:58:07 +01:00
Ingo Molnar
535b9552bb sched/rq_clock: Consolidate the ordering of the rq_clock methods
update_rq_clock_task() and update_rq_clock() we unnecessarily
spread across core.c, requiring an extra prototype line.

Move them next to each other and in the proper order.

Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-07 10:58:01 +01:00
Ingo Molnar
d1ccc66df8 sched/core: Clean up comments
Refresh the comments in the core scheduler code:

 - Capitalize sentences consistently

 - Capitalize 'CPU' consistently

 - ... and other small details.

Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-07 10:57:41 +01:00
William Tu
63dfef75ed bpf: enable verifier to add 0 to packet ptr
The patch fixes the case when adding a zero value to the packet
pointer.  The zero value could come from src_reg equals type
BPF_K or CONST_IMM.  The patch fixes both, otherwise the verifer
reports the following error:
  [...]
    R0=imm0,min_value=0,max_value=0
    R1=pkt(id=0,off=0,r=4)
    R2=pkt_end R3=fp-12
    R4=imm4,min_value=4,max_value=4
    R5=pkt(id=0,off=4,r=4)
  269: (bf) r2 = r0     // r2 becomes imm0
  270: (77) r2 >>= 3
  271: (bf) r4 = r1     // r4 becomes pkt ptr
  272: (0f) r4 += r2    // r4 += 0
  addition of negative constant to packet pointer is not allowed

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: Mihai Budiu <mbudiu@vmware.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-06 22:50:04 -05:00
Ingo Molnar
0fc6d1ca14 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-06 11:04:39 +01:00
Linus Torvalds
a572a1b999 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:

 - Prevent double activation of interrupt lines, which causes problems
   on certain interrupt controllers

 - Handle the fallout of the above because x86 (ab)uses the activation
   function to reconfigure interrupts under the hood.

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/irq: Make irq activate operations symmetric
  irqdomain: Avoid activating interrupts more than once
2017-02-04 12:18:01 -08:00
Waiman Long
668802c257 tick/broadcast: Reduce lock cacheline contention
It was observed that on an Intel x86 system without the ARAT (Always
running APIC timer) feature and with fairly large number of CPUs as
well as CPUs coming in and out of intel_idle frequently, the lock
contention on the tick_broadcast_lock can become significant.

To reduce contention, the lock is put into its own cacheline and all
the cpumask_var_t variables are put into the __read_mostly section.

Running the SP benchmark of the NAS Parallel Benchmarks on a 4-socket
16-core 32-thread Nehalam system, the performance number improved
from 3353.94 Mop/s to 3469.31 Mop/s when this patch was applied on
a 4.9.6 kernel.  This is a 3.4% improvement.

Signed-off-by: Waiman Long <longman@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1485799063-20857-1-git-send-email-longman@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-04 08:54:46 +01:00
Daniel Borkmann
3898fac1f4 trace: rename trace_print_hex_seq arg and add kdoc
Steven suggested to improve trace_print_hex_seq() a bit after commit
2acae0d5b0 ("trace: add variant without spacing in trace_print_hex_seq")
in two ways: i) by adding a kdoc comment for the helper function
itself and ii) by renaming 'spacing' argument into 'concatenate'
to better denote that we don't add spaces between each hex bytes.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03 15:50:18 -05:00
Linus Torvalds
2d47b8aac7 Simple fix of s/static struct __init/static __init struct/
-----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJYk7V0FBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 maMH/3squMpptgiSzWn79Ay0e/K69sBbKNNjLVA5YWeadDrVodUOkwv9wX1FMJSp
 kQKWfm2PUcy1TKK+Au9YZ94qHEcbvktR6UG9VX0Bbz9UiT6GywPpRExdUO2wo5rS
 uiiZaF2Qfvpr4bGXwUVuqxMBsKHPjRwv9WyokpdHZWF3mJIRa3WqL/ZEZYcsHaqc
 0J8ReWASVjVev1su42pj7gjWyI5Wm8K4vu5DNSfGPmfxA4zVQC7UgnvkjBFi14jF
 6NWu03t50VOxYix50SfIZHyONn347IanSJT68efAAgPLlaTcSUDgz8tbQ6PrZ5CP
 R4MHal7+L9qkiy/GEtg+85dfH+k=
 =TtxB
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.10-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Simple fix of s/static struct __init/static __init struct/"

* tag 'trace-v4.10-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/kprobes: Fix __init annotation
2017-02-03 11:06:59 -08:00
Ard Biesheuvel
71810db27c modversions: treat symbol CRCs as 32 bit quantities
The modversion symbol CRCs are emitted as ELF symbols, which allows us
to easily populate the kcrctab sections by relying on the linker to
associate each kcrctab slot with the correct value.

This has a couple of downsides:

 - Given that the CRCs are treated as memory addresses, we waste 4 bytes
   for each CRC on 64 bit architectures,

 - On architectures that support runtime relocation, a R_<arch>_RELATIVE
   relocation entry is emitted for each CRC value, which identifies it
   as a quantity that requires fixing up based on the actual runtime
   load offset of the kernel. This results in corrupted CRCs unless we
   explicitly undo the fixup (and this is currently being handled in the
   core module code)

 - Such runtime relocation entries take up 24 bytes of __init space
   each, resulting in a x8 overhead in [uncompressed] kernel size for
   CRCs.

Switching to explicit 32 bit values on 64 bit architectures fixes most
of these issues, given that 32 bit values are not treated as quantities
that require fixing up based on the actual runtime load offset.  Note
that on some ELF64 architectures [such as PPC64], these 32-bit values
are still emitted as [absolute] runtime relocatable quantities, even if
the value resolves to a build time constant.  Since relative relocations
are always resolved at build time, this patch enables MODULE_REL_CRCS on
powerpc when CONFIG_RELOCATABLE=y, which turns the absolute CRC
references into relative references into .rodata where the actual CRC
value is stored.

So redefine all CRC fields and variables as u32, and redefine the
__CRC_SYMBOL() macro for 64 bit builds to emit the CRC reference using
inline assembler (which is necessary since 64-bit C code cannot use
32-bit types to hold memory addresses, even if they are ultimately
resolved using values that do not exceed 0xffffffff).  To avoid
potential problems with legacy 32-bit architectures using legacy
toolchains, the equivalent C definition of the kcrctab entry is retained
for 32-bit architectures.

Note that this mostly reverts commit d4703aefdb ("module: handle ppc64
relocating kcrctabs when CONFIG_RELOCATABLE=y")

Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-03 08:28:25 -08:00
Steven Rostedt (VMware)
e704eff3ff ftrace: Have set_graph_function handle multiple functions in one write
Currently, only one function can be written to set_graph_function and
set_graph_notrace. The last function in the list will have saved, even
though other functions will be added then removed.

Change the behavior to be the same as set_ftrace_function as to allow
multiple functions to be written. If any one fails, none of them will be
added. The addition of the functions are done at the end when the file is
closed.

Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-03 10:59:52 -05:00
Steven Rostedt (VMware)
649b988b12 ftrace: Do not hold references of ftrace_graph_{notrace_}hash out of graph_lock
The hashs ftrace_graph_hash and ftrace_graph_notrace_hash are modified
within the graph_lock being held. Holding a pointer to them and passing them
along can lead to a use of a stale pointer (fgd->hash). Move assigning the
pointer and its use to within the holding of the lock. Note, it's an
rcu_sched protected data, and other instances of referencing them are done
with preemption disabled. But the file manipuation code must be protected by
the lock.

The fgd->hash pointer is set to NULL when the lock is being released.

Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-03 10:59:42 -05:00
Steven Rostedt (VMware)
0e684b6578 tracing: Reset parser->buffer to allow multiple "puts"
trace_parser_put() simply frees the allocated parser buffer. But it does not
reset the pointer that was freed. This means that if trace_parser_put() is
called on the same parser more than once, it will corrupt the allocation
system. Setting parser->buffer to NULL after free allows it to be called
more than once without any ill effect.

Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-03 10:59:31 -05:00
Steven Rostedt (VMware)
ae98d27afc ftrace: Have set_graph_functions handle write with RDWR
Since reading the set_graph_functions uses seq functions, which sets the
file->private_data pointer to a seq_file descriptor. On writes the
ftrace_graph_data descriptor is set to file->private_data. But if the file
is opened for RDWR, the ftrace_graph_write() will incorrectly use the
file->private_data descriptor instead of
((struct seq_file *)file->private_data)->private pointer, and this can crash
the kernel.

Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-03 10:59:23 -05:00
Steven Rostedt (VMware)
d4ad9a1cca ftrace: Reset fgd->hash in ftrace_graph_write()
fgd->hash is saved and then freed, but is never reset to either
ftrace_graph_hash nor ftrace_graph_notrace_hash. But if multiple writes are
performed, then the freed hash could be accessed again.

 # cd /sys/kernel/debug/tracing
 # head -1000 available_filter_functions > /tmp/funcs
 # cat /tmp/funcs > set_graph_function

Causes:

 general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
 Modules linked in:  [...]
 CPU: 2 PID: 1337 Comm: cat Not tainted 4.10.0-rc2-test-00010-g6b052e9 #32
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
 task: ffff880113a12200 task.stack: ffffc90001940000
 RIP: 0010:free_ftrace_hash+0x7c/0x160
 RSP: 0018:ffffc90001943db0 EFLAGS: 00010246
 RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: 6b6b6b6b6b6b6b6b
 RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff8800ce1e1d40
 RBP: ffff8800ce1e1d50 R08: 0000000000000000 R09: 0000000000006400
 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
 R13: ffff8800ce1e1d40 R14: 0000000000004000 R15: 0000000000000001
 FS:  00007f9408a07740(0000) GS:ffff88011e500000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000aee1f0 CR3: 0000000116bb4000 CR4: 00000000001406e0
 Call Trace:
  ? ftrace_graph_write+0x150/0x190
  ? __vfs_write+0x1f6/0x210
  ? __audit_syscall_entry+0x17f/0x200
  ? rw_verify_area+0xdb/0x210
  ? _cond_resched+0x2b/0x50
  ? __sb_start_write+0xb4/0x130
  ? vfs_write+0x1c8/0x330
  ? SyS_write+0x62/0xf0
  ? do_syscall_64+0xa3/0x1b0
  ? entry_SYSCALL64_slow_path+0x25/0x25
 Code: 01 48 85 db 0f 84 92 00 00 00 b8 01 00 00 00 d3 e0 85 c0 7e 3f 83 e8 01 48 8d 6f 10 45 31 e4 4c 8d 34 c5 08 00 00 00 49 8b 45 08 <4a> 8b 34 20 48 85 f6 74 13 48 8b 1e 48 89 ef e8 20 fa ff ff 48
 RIP: free_ftrace_hash+0x7c/0x160 RSP: ffffc90001943db0
 ---[ end trace 999b48216bf4b393 ]---

Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-03 10:59:06 -05:00
Steven Rostedt (VMware)
555fc7813e ftrace: Replace (void *)1 with a meaningful macro name FTRACE_GRAPH_EMPTY
When the set_graph_function or set_graph_notrace contains no records, a
banner is displayed of either "#### all functions enabled ####" or
"#### all functions disabled ####" respectively. To tell the seq operations
to do this, (void *)1 is passed as a return value. Instead of using a
hardcoded meaningless variable, define it as a macro.

Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-03 10:58:48 -05:00
Steven Rostedt (VMware)
2b2c279c81 ftrace: Create a slight optimization on searching the ftrace_hash
This is a micro-optimization, but as it has to deal with a fast path of the
function tracer, these optimizations can be noticed.

The ftrace_lookup_ip() returns true if the given ip is found in the hash. If
it's not found or the hash is NULL, it returns false. But there's some cases
that a NULL hash is a true, and the ftrace_hash_empty() is tested before
calling ftrace_lookup_ip() in those cases. But as ftrace_lookup_ip() tests
that first, that adds a few extra unneeded instructions in those cases.

A new static "always_inlined" function is created that does not perform the
hash empty test. This most only be used by callers that do the check first
anyway, as an empty or NULL hash could cause a crash if a lookup is
performed on it.

Also add kernel doc for the ftrace_lookup_ip() main function.

Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-03 10:58:22 -05:00
Steven Rostedt (VMware)
2b0cce0e19 tracing: Add ftrace_hash_key() helper function
Replace the couple of use cases that has small logic to produce the ftrace
function key id with a helper function. No need for duplicate code.

Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-03 10:58:05 -05:00
Pavel Tikhomirov
749860ce24 prctl: propagate has_child_subreaper flag to every descendant
If process forks some children when it has is_child_subreaper
flag enabled they will inherit has_child_subreaper flag - first
group, when is_child_subreaper is disabled forked children will
not inherit it - second group. So child-subreaper does not reparent
all his descendants when their parents die. Having these two
differently behaving groups can lead to confusion. Also it is
a problem for CRIU, as when we restore process tree we need to
somehow determine which descendants belong to which group and
much harder - to put them exactly to these group.

To simplify these we can add a propagation of has_child_subreaper
flag on PR_SET_CHILD_SUBREAPER, walking all descendants of child-
subreaper to setup has_child_subreaper flag.

In common cases when process like systemd first sets itself to
be a child-subreaper and only after that forks its services, we will
have zero-length list of descendants to walk. Testing with binary
subtree of 2^15 processes prctl took < 0.007 sec and has shown close
to linear dependency(~0.2 * n * usec) on lower numbers of processes.

Moreover, I doubt someone intentionaly pre-forks the children whitch
should reparent to init before becoming subreaper, because some our
ancestor migh have had is_child_subreaper flag while forking our
sub-tree and our childs will all inherit has_child_subreaper flag,
and we have no way to influence it. And only way to check if we have
no has_child_subreaper flag is to create some childs, kill them and
see where they will reparent to.

Using walk_process_tree helper to walk subtree, thanks to Oleg! Timing
seems to be the same.

Optimize:

a) When descendant already has has_child_subreaper flag all his subtree
has it too already.

* for a) to be true need to move has_child_subreaper inheritance under
the same tasklist_lock with adding task to its ->real_parent->children
as without it process can inherit zero has_child_subreaper, then we
set 1 to it's parent flag, check that parent has no more children, and
only after child with wrong flag is added to the tree.

* Also make these inheritance more clear by using real_parent instead of
current, as on clone(CLONE_PARENT) if current has is_child_subreaper
and real_parent has no is_child_subreaper or has_child_subreaper, child
will have has_child_subreaper flag set without actually having a
subreaper in it's ancestors.

b) When some descendant is child_reaper, it's subtree is in different
pidns from us(original child-subreaper) and processes from other pidns
will never reparent to us.

So we can skip their(a,b) subtree from walk.

v2: switch to walk_process_tree() general helper, move
has_child_subreaper inheritance
v3: remove csr_descendant leftover, change current to real_parent
in has_child_subreaper inheritance
v4: small commit message fix

Fixes: ebec18a6d3 ("prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-02-03 16:55:15 +13:00
Oleg Nesterov
0f1b92cbdd introduce the walk_process_tree() helper
Add the new helper to walk the process tree, the next patch adds a user.
Note that it visits the group leaders only, proc_visitor can do
for_each_thread itself or we can trivially extend walk_process_tree() to
do this.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-02-03 16:34:41 +13:00
David S. Miller
e2160156bf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All merge conflicts were simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02 16:54:00 -05:00
Linus Torvalds
891aa1e0f1 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Five kernel fixes:

   - an mmap tracing ABI fix for certain mappings

   - a use-after-free fix, found via KASAN

   - three CPU hotplug related x86 PMU driver fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Make package handling more robust
  perf/x86/intel/uncore: Clean up hotplug conversion fallout
  perf/x86/intel/rapl: Make package handling more robust
  perf/core: Fix PERF_RECORD_MMAP2 prot/flags for anonymous memory
  perf/core: Fix use-after-free bug
2017-02-02 13:30:19 -08:00
Tejun Heo
63f1ca5945 Merge branch 'cgroup/for-4.11-rdmacg' into cgroup/for-4.11
Merge in to resolve conflicts in Documentation/cgroup-v2.txt.  The
conflicts are from multiple section additions and trivial to resolve.

Signed-off-by: Tejun Heo <tj@kernel.org>
2017-02-02 13:50:35 -05:00
Tejun Heo
576dd46450 cgroup: drop the matching uid requirement on migration for cgroup v2
Along with the write access to the cgroup.procs or tasks file, cgroup
has required the writer's euid, unless root, to match [s]uid of the
target process or task.  On cgroup v1, this is necessary because
there's nothing preventing a delegatee from pulling in tasks or
processes from all over the system.

If a user has a cgroup subdirectory delegated to it, the user would
have write access to the cgroup.procs or tasks file.  If there are no
further checks than file write access check, the user would be able to
pull processes from all over the system into its subhierarchy which is
clearly not the intended behavior.  The matching [s]uid requirement
partially prevents this problem by allowing a delegatee to pull in the
processes that belongs to it.  This isn't a sufficient protection
however, because a user would still be able to jump processes across
two disjoint sub-hierarchies that has been delegated to them.

cgroup v2 resolves the issue by requiring the writer to have access to
the common ancestor of the cgroup.procs file of the source and target
cgroups.  This confines each delegatee to their own sub-hierarchy
proper and bases all permission decisions on the cgroup filesystem
rather than having to pull in explicit uid matching.

cgroup v2 has still been applying the matching [s]uid requirement just
for historical reasons.  On cgroup2, the requirement doesn't serve any
purpose while unnecessarily complicating the permission model.  Let's
drop it.

Signed-off-by: Tejun Heo <tj@kernel.org>
2017-02-02 13:47:56 -05:00
Tejun Heo
968ebff1ef cgroup, perf_event: make perf_event controller work on cgroup2 hierarchy
perf_event is a utility controller whose primary role is identifying
cgroup membership to filter perf events; however, because it also
tracks some per-css state, it can't be replaced by pure cgroup
membership test.  Mark the controller as implicitly enabled on the
default hierarchy so that perf events can always be filtered based on
cgroup v2 path as long as the controller is not mounted on a legacy
hierarchy.

"perf record" is updated accordingly so that it searches for both v1
and v2 hierarchies.  A v1 hierarchy is used if perf_event is mounted
on it; otherwise, it uses the v2 hierarchy.

v2: Doc updated to reflect more flexible rebinding behavior.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
2017-02-02 13:47:02 -05:00
Omar Sandoval
6ac93117ab blktrace: use existing disk debugfs directory
We may already have a directory to put the blktrace stuff in if

1. The disk uses blk-mq
2. CONFIG_BLK_DEBUG_FS is enabled
3. We are tracing the whole disk and not a partition

Instead of hardcoding this very specific case, let's use the new
debugfs_lookup(). If the directory exists, we use it, otherwise we
create one and clean it up later.

Fixes: 07e4fead45 ("blk-mq: create debugfs directory tree")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-02 10:20:16 -07:00
Omar Sandoval
18fbda91c6 block: use same block debugfs directory for blk-mq and blktrace
When I added the blk-mq debugging information to debugfs, I didn't
notice that blktrace also creates a "block" directory in debugfs. Make
them use the same dentry, now created in the core block code. Based on a
patch from Jens.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-02 10:20:16 -07:00
Omar Sandoval
a428d314eb blktrace: make do_blk_trace_setup() static
This isn't used outside of blktrace.c anymore.

Fixes: 62c2a7d969 ("block: push BKL into blktrace ioctls")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-02 10:20:16 -07:00
Arnd Bergmann
26a346f23c tracing/kprobes: Fix __init annotation
clang complains about "__init" being attached to a struct name:

kernel/trace/trace_kprobe.c:1375:15: error: '__section__' attribute only applies to functions and global variables

The intention must have been to mark the function as __init instead of
the type, so move the attribute there.

Link: http://lkml.kernel.org/r/20170201165826.2625888-1-arnd@arndb.de

Fixes: f18f97ac43 ("tracing/kprobes: Add a helper method to return number of probe hits")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-02 10:48:35 -05:00
Eric W. Biederman
93faccbbfa fs: Better permission checking for submounts
To support unprivileged users mounting filesystems two permission
checks have to be performed: a test to see if the user allowed to
create a mount in the mount namespace, and a test to see if
the user is allowed to access the specified filesystem.

The automount case is special in that mounting the original filesystem
grants permission to mount the sub-filesystems, to any user who
happens to stumble across the their mountpoint and satisfies the
ordinary filesystem permission checks.

Attempting to handle the automount case by using override_creds
almost works.  It preserves the idea that permission to mount
the original filesystem is permission to mount the sub-filesystem.
Unfortunately using override_creds messes up the filesystems
ordinary permission checks.

Solve this by being explicit that a mount is a submount by introducing
vfs_submount, and using it where appropriate.

vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let
sget and friends know that a mount is a submount so they can take appropriate
action.

sget and sget_userns are modified to not perform any permission checks
on submounts.

follow_automount is modified to stop using override_creds as that
has proven problemantic.

do_mount is modified to always remove the new MS_SUBMOUNT flag so
that we know userspace will never by able to specify it.

autofs4 is modified to stop using current_real_cred that was put in
there to handle the previous version of submount permission checking.

cifs is modified to pass the mountpoint all of the way down to vfs_submount.

debugfs is modified to pass the mountpoint all of the way down to
trace_automount by adding a new parameter.  To make this change easier
a new typedef debugfs_automount_t is introduced to capture the type of
the debugfs automount function.

Cc: stable@vger.kernel.org
Fixes: 069d5ac9ae ("autofs:  Fix automounts by using current_real_cred()->uid")
Fixes: aeaa4a79ff ("fs: Call d_automount with the filesystems creds")
Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-02-02 04:36:12 +13:00
Shile Zhang
975e155ed8 sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
We added the 'sched_rr_timeslice_ms' SCHED_RR tuning knob in this commit:

  ce0dbbbb30 ("sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice")

... which name suggests to users that it's in milliseconds, while in reality
it's being set in milliseconds but the result is shown in jiffies.

This is obviously confusing when HZ is not 1000, it makes it appear like the
value set failed, such as HZ=100:

  root# echo 100 > /proc/sys/kernel/sched_rr_timeslice_ms
  root# cat /proc/sys/kernel/sched_rr_timeslice_ms
  10

Fix this to be milliseconds all around.

Signed-off-by: Shile Zhang <shile.zhang@nokia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1485612049-20923-1-git-send-email-shile.zhang@nokia.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 11:01:30 +01:00
Frederic Weisbecker
bfce1d6006 sched/cputime, vtime: Return nsecs instead of cputime_t to account
Turn the full dynticks cputime clock source to return nsec while keeping
its very internals jiffies based for performance reasons.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-27-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:59 +01:00
Frederic Weisbecker
2b1f967d80 sched/cputime: Complete nsec conversion of tick based accounting
This is the final step toward tick based cputime conversion. Now that
the whole cputime accounting engine accounts in nsecs, we can convert the
very source of the cputime to account in nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-26-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:59 +01:00
Frederic Weisbecker
fb8b049c98 sched/cputime: Push time to account_system_time() in nsecs
This is one more step toward converting cputime accounting to pure nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-25-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:58 +01:00
Frederic Weisbecker
18b43a9bd7 sched/cputime: Push time to account_idle_time() in nsecs
This is one more step toward converting cputime accounting to pure nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-24-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:57 +01:00
Frederic Weisbecker
be9095ed4f sched/cputime: Push time to account_steal_time() in nsecs
This is one more step toward converting cputime accounting to pure nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-23-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:57 +01:00
Frederic Weisbecker
23244a5c80 sched/cputime: Push time to account_user_time() in nsecs
This is one more step toward converting cputime accounting to pure nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-22-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:56 +01:00
Frederic Weisbecker
858cf3a8c5 timers/itimer: Convert internal cputime_t units to nsec
Use the new nsec based cputime accessors as part of the whole cputime
conversion from cputime_t to nsecs.

Also convert itimers to use nsec based internal counters. This simplifies
it and removes the whole game with error/inc_error which served to deal
with cputime_t random granularity.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-20-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:55 +01:00
Frederic Weisbecker
ebd7e7fc4b timers/posix-timers: Convert internals to use nsecs
Use the new nsec based cputime accessors as part of the whole cputime
conversion from cputime_t to nsecs.

Also convert posix-cpu-timers to use nsec based internal counters to
simplify it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-19-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:54 +01:00
Frederic Weisbecker
715eb7a924 timers/posix-timers: Use TICK_NSEC instead of a dynamically ad-hoc calculated version
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-18-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:54 +01:00
Frederic Weisbecker
a499a5a14d sched/cputime: Increment kcpustat directly on irqtime account
The irqtime is accounted is nsecs and stored in
cpu_irq_time.hardirq_time and cpu_irq_time.softirq_time. Once the
accumulated amount reaches a new jiffy, this one gets accounted to the
kcpustat.

This was necessary when kcpustat was stored in cputime_t, which could at
worst have jiffies granularity. But now kcpustat is stored in nsecs
so this whole discretization game with temporary irqtime storage has
become unnecessary.

We can now directly account the irqtime to the kcpustat.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-17-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:53 +01:00
Frederic Weisbecker
bde8285e5c signal: Convert obsolete cputime type to nsecs
Use the new nsec based cputime accessors as part of the whole cputime
conversion from cputime_t to nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-16-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:53 +01:00
Frederic Weisbecker
605dc2b31a tsacct: Convert obsolete cputime type to nsecs
Use the new nsec based cputime accessors as part of the whole cputime
conversion from cputime_t to nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-15-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:52 +01:00
Frederic Weisbecker
dbf3da1c2d delaycct: Convert obsolete cputime type to nsecs
Use the new nsec based cputime accessors as part of the whole cputime
conversion from cputime_t to nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-14-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:52 +01:00
Frederic Weisbecker
d4bc42af73 acct: Convert obsolete cputime type to nsecs
Use the new nsec based cputime accessors as part of the whole cputime
conversion from cputime_t to nsecs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-13-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:51 +01:00
Frederic Weisbecker
5613fda9a5 sched/cputime: Convert task/group cputime to nsecs
Now that most cputime readers use the transition API which return the
task cputime in old style cputime_t, we can safely store the cputime in
nsecs. This will eventually make cputime statistics less opaque and more
granular. Back and forth convertions between cputime_t and nsecs in order
to deal with cputime_t random granularity won't be needed anymore.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:49 +01:00
Frederic Weisbecker
a1cecf2ba7 sched/cputime: Introduce special task_cputime_t() API to return old-typed cputime
This API returns a task's cputime in cputime_t in order to ease the
conversion of cputime internals to use nsecs units instead. Blindly
converting all cputime readers to use this API now will later let us
convert more smoothly and step by step all these places to use the
new nsec based cputime.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-7-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:48 +01:00
Frederic Weisbecker
16a6d9be90 sched/cputime: Convert guest time accounting to nsecs (u64)
cputime_t is being obsolete and replaced by nsecs units in order to make
internal timestamps less opaque and more granular.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:48 +01:00
Frederic Weisbecker
7fb1327ee9 sched/cputime: Convert kcpustat to nsecs
Kernel CPU stats are stored in cputime_t which is an architecture
defined type, and hence a bit opaque and requiring accessors and mutators
for any operation.

Converting them to nsecs simplifies the code and is one step toward
the removal of cputime_t in the core code.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:47 +01:00
Frederic Weisbecker
07e5f5e353 time: Introduce jiffies64_to_nsecs()
This will be needed for the cputime_t to nsec conversion.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:45 +01:00
Frederic Weisbecker
93825f2ec7 jiffies: Reuse TICK_NSEC instead of NSEC_PER_JIFFY
NSEC_PER_JIFFY is an ad-hoc redefinition of TICK_NSEC. Let's rather
use a unique and well maintained version.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:45 +01:00
Ingo Molnar
ed5c8c854f Merge branch 'linus' into sched/core, to pick up fixes and refresh the branch
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:12:25 +01:00
Oleg Nesterov
c6c70f4455 exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
find_new_reaper() checks same_thread_group(reaper, child_reaper) to
prevent the cross-namespace reparenting but this is not enough if the
exiting parent was injected by setns() + fork().

Suppose we have a process P in the root namespace and some namespace X.
P does setns() to enter the X namespace, and forks the child C.
C forks a grandchild G and exits.

The grandchild G should be re-parented to X->child_reaper, but in this
case the ->real_parent chain does not lead to ->child_reaper, so it will
be wrongly reparanted to P's sub-reaper or a global init.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-02-01 18:20:48 +13:00
Linus Torvalds
a2ca3d6179 It was reported to me that the thread created by the hwlat tracer does
not migrate after the first instance. I found that there was as small
 bug in the logic, and fixed it. It's minor, but should be fixed regardless.
 There's not much impact outside the hwlat tracer.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJYkP7tFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 srEH/iqi0mC4rJF57yLmwGXVAuwSRsxHlkGfyG3RBRnXOhiPLaM9Y6QZDWj8f99D
 DEnVJ7IMpW7tjJB0pDQBCMBi+NX6GvR9Zu4ub8k69UzDlw81CPdMiJ/HfEFsP5XA
 EOiDH/xGsYmbgGyqbU2VTb4lS8CijStht0jhydriLT1ga/W3bOl0/w6TeQW+AwqJ
 0tcksaKwAcH6iN11FUQfWPlirQ/aCTvj8FelgbT7MjYnwxDdbJSfKG+GwJjytLNQ
 Rj/STWD56OZpl91OuEYA50BZVxp3eF870xJXsLjg3mGIYBseW/T6ZTP8TOM+6WoH
 u2J08p8Owdtdo+9gkw+IwEuoFP4=
 =VCLk
 -----END PGP SIGNATURE-----

Merge tag 'trace-4.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "It was reported to me that the thread created by the hwlat tracer does
  not migrate after the first instance. I found that there was as small
  bug in the logic, and fixed it. It's minor, but should be fixed
  regardless. There's not much impact outside the hwlat tracer"

* tag 'trace-4.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix hwlat kthread migration
2017-01-31 16:32:40 -08:00
Linus Torvalds
f1774f46d4 Merge branch 'for-4.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
 "The cgroup creation path was getting the order of operations wrong and
  exposing cgroups which don't have their names set yet to controllers
  which can lead to NULL derefs.

  This contains the fix for the bug"

* 'for-4.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: don't online subsystems before cgroup_name/path() are operational
2017-01-31 13:54:41 -08:00
Steven Rostedt (VMware)
f447c196fe tracing: Clean up the hwlat binding code
Instead of initializing the affinity of the hwlat kthread in the thread
itself, simply set up the initial affinity at thread creation. This
simplifies the code.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-01-31 16:48:23 -05:00
Christoph Hellwig
57292b58dd block: introduce blk_rq_is_passthrough
This can be used to check for fs vs non-fs requests and basically
removes all knowledge of BLOCK_PC specific from the block layer,
as well as preparing for removing the cmd_type field in struct request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-31 14:00:34 -07:00
Steven Rostedt (VMware)
79c6f448c8 tracing: Fix hwlat kthread migration
The hwlat tracer creates a kernel thread at start of the tracer. It is
pinned to a single CPU and will move to the next CPU after each period of
running. If the user modifies the migration thread's affinity, it will not
change after that happens.

The original code created the thread at the first instance it was called,
but later was changed to destroy the thread after the tracer was finished,
and would not be created until the next instance of the tracer was
established. The code that initialized the affinity was only called on the
initial instantiation of the tracer. After that, it was not initialized, and
the previous affinity did not match the current newly created one, making
it appear that the user modified the thread's affinity when it did not, and
the thread failed to migrate again.

Cc: stable@vger.kernel.org
Fixes: 0330f7aa8e ("tracing: Have hwlat trace migrate across tracing_cpumask CPUs")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-01-31 09:13:49 -05:00
Ingo Molnar
a8709fa4a0 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:

 - Dynticks updates, consolidating open-coded counter accesses into a well-defined API

 - SRCU updates: Simplify algorithm, add formal verification

 - Documentation updates

 - Miscellaneous fixes

 - Torture-test updates

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-31 07:45:42 +01:00
Joe Lawrence
7598d167df livepatch/module: print notice of TAINT_LIVEPATCH
Add back the "tainting kernel with TAINT_LIVEPATCH" kernel log message
that commit 2992ef29ae ("livepatch/module: make TAINT_LIVEPATCH
module-specific") dropped.  Now that it's a module-specific taint flag,
include the module name.

Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Jessica Yu <jeyu@redhat.com>
2017-01-30 17:07:32 -08:00
Tejun Heo
b807421a72 cgroup: misc cleanups
* cgrp_dfl_implicit_ss_mask is ulong instead of u16 unlike other
  ss_masks.  Make it a u16.

* Move have_canfork_callback together with other callback ss_masks.

Signed-off-by: Tejun Heo <tj@kernel.org>
2017-01-30 17:09:07 -05:00
Joerg Roedel
93fa6cf60a Merge branch 'iommu/guest-msi' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into arm/core 2017-01-30 15:58:47 +01:00
Marc Zyngier
08d85f3ea9 irqdomain: Avoid activating interrupts more than once
Since commit f3b0946d62 ("genirq/msi: Make sure PCI MSIs are
activated early"), we can end-up activating a PCI/MSI twice (once
at allocation time, and once at startup time).

This is normally of no consequences, except that there is some
HW out there that may misbehave if activate is used more than once
(the GICv3 ITS, for example, uses the activate callback
to issue the MAPVI command, and the architecture spec says that
"If there is an existing mapping for the EventID-DeviceID
combination, behavior is UNPREDICTABLE").

While this could be worked around in each individual driver, it may
make more sense to tackle the issue at the core level. In order to
avoid getting in that situation, let's have a per-interrupt flag
to remember if we have already activated that interrupt or not.

Fixes: f3b0946d62 ("genirq/msi: Make sure PCI MSIs are activated early")
Reported-and-tested-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1484668848-24361-1-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-01-30 15:18:56 +01:00
Kan Liang
40999312c7 perf/core: Try parent PMU first when initializing a child event
perf has additional overhead when monitoring the task which
frequently generates child tasks.

perf_init_event() is one of the hotspots for the additional overhead:

Currently, to get the PMU, it tries to search the type in pmu_idr at
first. But it is not always successful, especially for the widely used
PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events. So it has to go to the
slow path which go through the whole PMUs list.

It will be a big performance issue, if the PMUs list is long (e.g. server
with many uncore boxes) and the task frequently generates child tasks.

The child event inherits its parent event. So the child event should
try its parent PMU first.

Here is some data from the overhead test on Broadwell server:

  perf record -e $TEST_EVENTS -- ./loop.sh 50000

  loop.sh
    start=$(date +%s%N)
    i=0
    while [ "$i" -le "$1" ]
    do
            date > /dev/null
            i=`expr $i + 1`
    done
    end=$(date +%s%N)
    elapsed=`expr $end - $start`

  Event#	Original elapsed time	Elapsed time with patch		delta
  1		196,573,192,397		189,162,029,998			-3.77%
  2		257,567,753,013		241,620,788,683			-6.19%
  4		398,730,726,971		370,518,938,714			-7.08%
  8		824,983,761,120		740,702,489,329			-10.22%
  16		1,883,411,923,498	1,672,027,508,355		-11.22%

... which shows a nice performance improvement.

Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1484745662-15928-2-git-send-email-kan.liang@intel.com
[ Tidied up the changelog and the code comment. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 12:01:16 +01:00
Alexander Shishkin
487f05e18a perf/core: Optimize event rescheduling on active contexts
When new events are added to an active context, we go and reschedule
all cpu groups and all task groups in order to preserve the priority
(cpu pinned, task pinned, cpu flexible, task flexible), but in
reality we only need to reschedule groups of the same priority as
that of the events being added, and below.

This patch changes the behavior so that only groups that need to be
rescheduled are rescheduled.

Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170119164330.22887-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 12:01:15 +01:00
Alexander Shishkin
fe45bafbd0 perf/core: Don't re-schedule CPU flexible events needlessly
In the sched-in path, we first remove a CPU's flexible events in order to
give priority to the task's pinned events. However, this step can be safely
skipped if the task doesn't have its own pinned events.

This patch implements this skipping.

Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170119164330.22887-2-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 12:01:14 +01:00
David Carrillo-Cisneros
1fd7e41699 perf/core: Remove perf_cpu_context::unique_pmu
cpuctx->unique_pmu was originally introduced as a way to identify cpuctxs
with shared pmus in order to avoid visiting the same cpuctx more than once
in a for_each_pmu loop.

cpuctx->unique_pmu == cpuctx->pmu in non-software task contexts since they
have only one pmu per cpuctx. Since perf_pmu_sched_task() is only called in
hw contexts, this patch replaces cpuctx->unique_pmu by cpuctx->pmu in it.

The change above, together with the previous patch in this series, removed
the remaining uses of cpuctx->unique_pmu, so we remove it altogether.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20170118192454.58008-3-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 12:01:14 +01:00
David Carrillo-Cisneros
058fe1c044 perf/core: Make cgroup switch visit only cpuctxs with cgroup events
This patch follows from a conversation in CQM/CMT's last series about
speeding up the context switch for cgroup events:

  https://patchwork.kernel.org/patch/9478617/

This is a low-hanging fruit optimization. It replaces the iteration over
the "pmus" list in cgroup switch by an iteration over a new list that
contains only cpuctxs with at least one cgroup event.

This is necessary because the number of PMUs have increased over the years
e.g modern x86 server systems have well above 50 PMUs.

The iteration over the full PMU list is unneccessary and can be costly in
heavy cache contention scenarios.

Below are some instrumentation measurements with 10, 50 and 90 percentiles
of the total cost of context switch before and after this optimization for
a simple array read/write microbenchark.

  Contention
    Level    Nr events      Before (us)            After (us)       Median
  L2    L3     types      (10%, 50%, 90%)       (10%, 50%, 90%     Speedup
  --------------------------------------------------------------------------
  Low   Low       1       (1.72, 2.42, 5.85)    (1.35, 1.64, 5.46)     29%
  High  Low       1       (2.08, 4.56, 19.8)    (1720, 2.20, 13.7)     51%
  High  High      1       (2.86, 10.4, 12.7)    (2.54, 4.32, 12.1)     58%

  Low   Low       2       (1.98, 3.20, 6.89)    (1.68, 2.41, 8.89)     24%
  High  Low       2       (2.48, 5.28, 22.4)    (2150, 3.69, 14.6)     30%
  High  High      2       (3.32, 8.09, 13.9)    (2.80, 5.15, 13.7)     36%

where:

  1 event type  = cycles
  2 event types = cycles,intel_cqm/llc_occupancy/

   Contention L2 Low: workset  <  L2 cache size.
                 High:  "     >>  L2   "     " .
   Contention L3 Low: workset of task on all sockets  <  L3 cache size.
                 High:   "     "   "   "   "    "    >>  L3   "     " .

   Median Speedup is (50%ile Before - 50%ile After) /  50%ile Before

Unsurprisingly, the benefits of this optimization decrease with the number
of cpuctxs with a cgroup events, yet, is never detrimental.

Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20170118192454.58008-2-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 12:01:13 +01:00
Ingo Molnar
ae5112a825 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:47:00 +01:00
Sebastian Andrzej Siewior
619bd4a718 sched/rt: Add a missing rescheduling point
Since the change in commit:

  fd7a4bed18 ("sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks")

... we don't reschedule a task under certain circumstances:

Lets say task-A, SCHED_OTHER, is running on CPU0 (and it may run only on
CPU0) and holds a PI lock. This task is removed from the CPU because it
used up its time slice and another SCHED_OTHER task is running. Task-B on
CPU1 runs at RT priority and asks for the lock owned by task-A. This
results in a priority boost for task-A. Task-B goes to sleep until the
lock has been made available. Task-A is already runnable (but not active),
so it receives no wake up.

The reality now is that task-A gets on the CPU once the scheduler decides
to remove the current task despite the fact that a high priority task is
enqueued and waiting. This may take a long time.

The desired behaviour is that CPU0 immediately reschedules after the
priority boost which made task-A the task with the lowest priority.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: fd7a4bed18 ("sched, rt: Convert switched_{from, to}_rt() prio_changed_rt() to balance callbacks")
Link: http://lkml.kernel.org/r/20170124144006.29821-1-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:46:37 +01:00
Mathieu Poirier
4b12db9391 sched/core: Fix &rd->cpudl memory leak
While in the process of initialising a root domain, if function
cpupri_init() fails the memory allocated in cpudl_init() is not
reclaimed.

Adding a new goto target to cleanup the previous initialistion of
the root_domain's dl_bw structure reclaims said memory.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1485292295-21298-2-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:46:37 +01:00
Mathieu Poirier
92c99ac829 sched/core: Fix &rd->rto_mask memory leak
If function cpudl_init() fails the memory allocated for &rd->rto_mask
needs to be freed, something this patch is addressing.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1485292295-21298-1-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:46:36 +01:00
Matt Fleming
4d25b35ea3 sched/fair: Restore previous rq_flags when migrating tasks in hotplug
__migrate_task() can return with a different runqueue locked than the
one we passed as an argument. So that we can repin the lock in
migrate_tasks() (and keep the update_rq_clock() bit) we need to
restore the old rq_flags before repinning.

Note that it wouldn't be correct to change move_queued_task() to repin
because of the change of runqueue and the fact that having an
up-to-date clock on the initial rq doesn't mean the new rq has one
too.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:46:35 +01:00
Peter Zijlstra
1b1d62254d sched/core: Add missing update_rq_clock() call in sched_move_task()
Bug was noticed via this warning:

  WARNING: CPU: 6 PID: 1 at kernel/sched/sched.h:804 detach_task_cfs_rq+0x8e8/0xb80
  rq->clock_update_flags < RQCF_ACT_SKIP
  Modules linked in:
  CPU: 6 PID: 1 Comm: systemd Not tainted 4.10.0-rc5-00140-g0874170baf55-dirty #1
  Hardware name: Supermicro SYS-4048B-TRFT/X10QBi, BIOS 1.0 04/11/2014
  Call Trace:
   dump_stack+0x4d/0x65
   __warn+0xcb/0xf0
   warn_slowpath_fmt+0x5f/0x80
   detach_task_cfs_rq+0x8e8/0xb80
   ? allocate_cgrp_cset_links+0x59/0x80
   task_change_group_fair+0x27/0x150
   sched_change_group+0x48/0xf0
   sched_move_task+0x53/0x150
   cpu_cgroup_attach+0x36/0x70
   cgroup_taskset_migrate+0x175/0x300
   cgroup_migrate+0xab/0xd0
   cgroup_attach_task+0xf0/0x190
   __cgroup_procs_write+0x1ed/0x2f0
   cgroup_procs_write+0x14/0x20
   cgroup_file_write+0x3f/0x100
   kernfs_fop_write+0x104/0x180
   __vfs_write+0x37/0x140
   vfs_write+0xb8/0x1b0
   SyS_write+0x55/0xc0
   do_syscall_64+0x61/0x170
   entry_SYSCALL64_slow_path+0x25/0x25

Reported-by: Ingo Molnar <mingo@kernel.org>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:46:34 +01:00
Peter Zijlstra
49ee576809 sched/core: Optimize pick_next_task() for idle_sched_class
Steve noticed that when we switch from IDLE to SCHED_OTHER we fail to
take the shortcut, even though all runnable tasks are of the fair
class, because prev->sched_class != &fair_sched_class.

Since I reworked the put_prev_task() stuff, we don't really care about
prev->class here, so removing that condition will allow this case.

This increases the likely case from 78% to 98% correct for Steve's
workload.

Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170119174408.GN6485@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:46:34 +01:00
Peter Zijlstra
b9c16a0e1f locking/mutex: Fix lockdep_assert_held() fail
In commit:

  659cf9f582 ("locking/ww_mutex: Optimize ww-mutexes by waking at most one waiter for backoff when acquiring the lock")

I replaced a comment with a lockdep_assert_held(). However it turns out
we hide that lock from lockdep for hysterical raisins, which results
in the assertion always firing.

Remove the old debug code as lockdep will easily spot the abuse it was
meant to catch, which will make the lock visible to lockdep and make
the assertion work as intended.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicolai Haehnle <Nicolai.Haehnle@amd.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 659cf9f582 ("locking/ww_mutex: Optimize ww-mutexes by waking at most one waiter for backoff when acquiring the lock")
Link: http://lkml.kernel.org/r/20170117150609.GB32474@worktop
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:42:59 +01:00
Steven Rostedt (VMware)
4009f4b3a9 locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
Running my likely/unlikely profiler for 3 weeks on two production
machines, I discovered that the unlikely() test in
__rt_mutex_slowlock() checking if state is TASK_INTERRUPTIBLE is hit
100% of the time, making it a very likely case.

The reason is, on a vanilla kernel, the majority case of calling
rt_mutex() is from the futex code. This code is always called as
TASK_INTERRUPTIBLE. In the -rt patch, this code is commonly called when
PREEMPT_RT is enabled with TASK_UNINTERRUPTIBLE. But that's not the
likely scenario.

The rt_mutex() code should be optimized for the common vanilla case,
and that is from a futex, with TASK_INTERRUPTIBLE as the state.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170119113234.1efeedd1@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:42:59 +01:00
Peter Zijlstra
0b3589be9b perf/core: Fix PERF_RECORD_MMAP2 prot/flags for anonymous memory
Andres reported that MMAP2 records for anonymous memory always have
their protection field 0.

Turns out, someone daft put the prot/flags generation code in the file
branch, leaving them unset for anonymous memory.

Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Don Zickus <dzickus@redhat.com
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@gmail.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: anton@ozlabs.org
Cc: namhyung@kernel.org
Cc: stable@vger.kernel.org # v3.16+
Fixes: f972eb63b1 ("perf: Pass protection and flags bits through mmap2 interface")
Link: http://lkml.kernel.org/r/20170126221508.GF6536@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:41:26 +01:00
Peter Zijlstra
a76a82a3e3 perf/core: Fix use-after-free bug
Dmitry reported a KASAN use-after-free on event->group_leader.

It turns out there's a hole in perf_remove_from_context() due to
event_function_call() not calling its function when the task
associated with the event is already dead.

In this case the event will have been detached from the task, but the
grouping will have been retained, such that group operations might
still work properly while there are live child events etc.

This does however mean that we can miss a perf_group_detach() call
when the group decomposes, this in turn can then lead to
use-after-free.

Fix it by explicitly doing the group detach if its still required.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org # v4.5+
Cc: syzkaller <syzkaller@googlegroups.com>
Fixes: 63b6da39bb ("perf: Fix perf_event_exit_task() race")
Link: http://lkml.kernel.org/r/20170126153955.GD6515@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:41:25 +01:00
Thomas Gleixner
9556ad6ad0 Merge branch 'fortglx/4.11/time' of https://git.linaro.org/people/john.stultz/linux into timers/core
- Remove unused functions
 - Document udelay inaccuracy
 - Remove posix timer data from task struct when posix timers are off
2017-01-30 11:22:39 +01:00
Rafael J. Wysocki
858a0d7eb5 Merge back earlier suspend/hibernation changes for v4.11. 2017-01-30 09:00:02 +01:00
David S. Miller
4e8f2fc1a5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two trivial overlapping changes conflicts in MPLS and mlx5.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-28 10:33:06 -05:00
Christoph Hellwig
48b77ad608 block: cleanup tracing
A couple tweaks to the tracing code:

 - trace the request size for all requests
 - trace request sector and nr_sectors only for fs requests, enforced by
   helpers
 - drop SCSI CDB tracing - we have SCSI tracing for this and are going
   to me the CDB out of the generic struct request soon.

With this the tracing code stops to know about BLOCK_PC requests entirely,
it's just FS vs passthrough requests now, where the latter includes any
driver-private requests.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27 15:08:35 -07:00
Nicolas Pitre
b18b6a9cef timers: Omit POSIX timer stuff from task_struct when disabled
When CONFIG_POSIX_TIMERS is disabled, it is preferable to remove related
structures from struct task_struct and struct signal_struct as they
won't contain anything useful and shouldn't be relied upon by mistake.
Code still referencing those structures is also disabled here.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2017-01-27 13:05:26 -08:00
Linus Torvalds
1b1bc42c16 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) GTP fixes from Andreas Schultz (missing genl module alias, clear IP
    DF on transmit).

 2) Netfilter needs to reflect the fwmark when sending resets, from Pau
    Espin Pedrol.

 3) nftable dump OOPS fix from Liping Zhang.

 4) Fix erroneous setting of VIRTIO_NET_HDR_F_DATA_VALID on transmit,
    from Rolf Neugebauer.

 5) Fix build error of ipt_CLUSTERIP when procfs is disabled, from Arnd
    Bergmann.

 6) Fix regression in handling of NETIF_F_SG in harmonize_features(),
    from Eric Dumazet.

 7) Fix RTNL deadlock wrt. lwtunnel module loading, from David Ahern.

 8) tcp_fastopen_create_child() needs to setup tp->max_window, from
    Alexey Kodanev.

 9) Missing kmemdup() failure check in ipv6 segment routing code, from
    Eric Dumazet.

10) Don't execute unix_bind() under the bindlock, otherwise we deadlock
    with splice. From WANG Cong.

11) ip6_tnl_parse_tlv_enc_lim() potentially reallocates the skb buffer,
    therefore callers must reload cached header pointers into that skb.
    Fix from Eric Dumazet.

12) Fix various bugs in legacy IRQ fallback handling in alx driver, from
    Tobias Regnery.

13) Do not allow lwtunnel drivers to be unloaded while they are
    referenced by active instances, from Robert Shearman.

14) Fix truncated PHY LED trigger names, from Geert Uytterhoeven.

15) Fix a few regressions from virtio_net XDP support, from John
    Fastabend and Jakub Kicinski.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (102 commits)
  ISDN: eicon: silence misleading array-bounds warning
  net: phy: micrel: add support for KSZ8795
  gtp: fix cross netns recv on gtp socket
  gtp: clear DF bit on GTP packet tx
  gtp: add genl family modules alias
  tcp: don't annotate mark on control socket from tcp_v6_send_response()
  ravb: unmap descriptors when freeing rings
  virtio_net: reject XDP programs using header adjustment
  virtio_net: use dev_kfree_skb for small buffer XDP receive
  r8152: check rx after napi is enabled
  r8152: re-schedule napi for tx
  r8152: avoid start_xmit to schedule napi when napi is disabled
  r8152: avoid start_xmit to call napi_schedule during autosuspend
  net: dsa: Bring back device detaching in dsa_slave_suspend()
  net: phy: leds: Fix truncated LED trigger names
  net: phy: leds: Break dependency of phy.h on phy_led_triggers.h
  net: phy: leds: Clear phy_num_led_triggers on failure to avoid crash
  net-next: ethernet: mediatek: change the compatible string
  Documentation: devicetree: change the mediatek ethernet compatible string
  bnxt_en: Fix RTNL lock usage on bnxt_get_port_module_status().
  ...
2017-01-27 12:54:16 -08:00
Geliang Tang
47087eeb74 PM / Hibernate: Use rb_entry() instead of container_of()
To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-01-27 11:31:12 +01:00
Linus Torvalds
7d3a0fa52e Power management fixes for v4.10-rc6
- Revert the recent change that caused suspend-to-idle to be used
    as the default suspend method on systems where it is indicated to
    be efficient by the ACPI tables, as that turned out to be premature
    and introduced suspend regressions on some systems with missing
    power management support in device drivers (Rafael Wysocki).
 
  - Fix up the intel_pstate driver to take changes of the global
    limits via sysfs correctly when the performance policy is used
    which has been broken by a recent change in it (Srinivas Pandruvada).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJYiomwAAoJEILEb/54YlRxgGEP/i2Z+MVZbIwifod6wz6Yt0mj
 Rm2uri24qEdJJSWCejDAngySiU+ymNJIYfVx2q5l99x1W4WTwyNWuEdWhGMTBDwG
 tI3eeLSCZcAmPeMb7V/l8lPvJiKGBf8tM162j09zpZpgO4LFsunPX9PA5IWbK55M
 U4hYeho3OlLjT9yiS8Yc9iLAZPrf7MRDBqtM0PeCklJHHyYmberbmSirH/TPm1Sq
 TPHHbBk6d2sFxA5mkUEItm5y7g9Wq9kN/E08a6TA0HthQBnjEaJ9GTQCpSBOEGHF
 MgEu/MOxm1Kou9YvOQMN3B1L9/VOb5JatV8RkMDltEctJseymQejdg8gFMLIKZ5j
 lDAfC/tSpXAXvxUPl/ObYloKJP3HV4ly8urxZ8rqqUWLPq7vK/jo5OwA97Kjp5a0
 /qW0LACoK8B96WOYYaNR1hWullH7+hDItKkbbBSBKKSNPXCgOmzqkGjCqvze4yYl
 yS2PeOgr6cA3D0iAMIFhmiRkaauAv/Dl++yiF7oVEpn8JI0UfpjOKBegPOdAysFw
 ADb5wLAHZ1LCyTn2CnLz1i2F87HxrYrFrKdTwjjBLyEu2aIw9sFuWSvbxuzZ7asf
 u2JaUA+KmvMhkZMUkgX2jhzR3siHLEFOqnfWQh2rnZV2ukNCWxL/YjjcWkoUuLTc
 H+Kx+VxcU6wKJPVWsnsK
 =YF+u
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix two regressions introduced recently, one by reverting the
  problematic commit and one by fixing up the behavior in an overlooked
  case.

  Specifics:

   - Revert the recent change that caused suspend-to-idle to be used as
     the default suspend method on systems where it is indicated to be
     efficient by the ACPI tables, as that turned out to be premature
     and introduced suspend regressions on some systems with missing
     power management support in device drivers (Rafael Wysocki).

   - Fix up the intel_pstate driver to take changes of the global limits
     via sysfs correctly when the performance policy is used which has
     been broken by a recent change in it (Srinivas Pandruvada)"

* tag 'pm-4.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: intel_pstate: Fix sysfs limits enforcement for performance policy
  Revert "PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flag"
2017-01-26 17:14:17 -08:00
Rafael J. Wysocki
ff7e593c9c Merge branches 'pm-sleep' and 'pm-cpufreq'
* pm-sleep:
  Revert "PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flag"

* pm-cpufreq:
  cpufreq: intel_pstate: Fix sysfs limits enforcement for performance policy
2017-01-27 00:08:59 +01:00
Tejun Heo
bdf3d06bed Merge branch 'for-4.10-fixes' into for-4.11 2017-01-26 16:47:42 -05:00
Tejun Heo
07cd129455 cgroup: don't online subsystems before cgroup_name/path() are operational
While refactoring cgroup creation, a5bca21520 ("cgroup: factor out
cgroup_create() out of cgroup_mkdir()") incorrectly onlined subsystems
before the new cgroup is associated with it kernfs_node.  This is fine
for cgroup proper but cgroup_name/path() depend on the associated
kernfs_node and if a subsystem makes the new cgroup_subsys_state
visible, which they're allowed to after onlining, it can lead to NULL
dereference.

The current code performs cgroup creation and subsystem onlining in
cgroup_create() and cgroup_mkdir() makes the cgroup and subsystems
visible afterwards.  There's no reason to online the subsystems early
and we can simply drop cgroup_apply_control_enable() call from
cgroup_create() so that the subsystems are onlined and made visible at
the same time.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: a5bca21520 ("cgroup: factor out cgroup_create() out of cgroup_mkdir()") 
Cc: stable@vger.kernel.org # v4.6+
2017-01-26 16:47:28 -05:00
Eric Dumazet
ff9f8a7cf9 sysctl: fix proc_doulongvec_ms_jiffies_minmax()
We perform the conversion between kernel jiffies and ms only when
exporting kernel value to user space.

We need to do the opposite operation when value is written by user.

Only matters when HZ != 1000

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-26 09:21:24 -08:00
Paul E. McKenney
31945aa9f1 Merge branches 'doc.2017.01.15b', 'dyntick.2017.01.23a', 'fixes.2017.01.23a', 'srcu.2017.01.25a' and 'torture.2017.01.15b' into HEAD
doc.2017.01.15b: Documentation updates
dyntick.2017.01.23a: Dyntick tracking consolidation
fixes.2017.01.23a: Miscellaneous fixes
srcu.2017.01.25a: SRCU rewrite, fixes, and verification
torture.2017.01.15b: Torture-test updates
2017-01-25 12:56:05 -08:00
Paul E. McKenney
7f554a3d05 srcu: Reduce probability of SRCU ->unlock_count[] counter overflow
Because there are no memory barriers between the srcu_flip() ->completed
increment and the summation of the read-side ->unlock_count[] counters,
both the compiler and the CPU can reorder the summation with the
->completed increment.  If the updater is preempted long enough during
this process, the read-side counters could overflow, resulting in a
too-short grace period.

This commit therefore adds a memory barrier just after the ->completed
increment, ensuring that if the summation misses an increment of
->unlock_count[] from __srcu_read_unlock(), the next __srcu_read_lock()
will see the new value of ->completed, thus bounding the number of
->unlock_count[] increments that can be missed to NR_CPUS.  The actual
overflow computation is more complex due to the possibility of nesting
of __srcu_read_lock().

Reported-by: Lance Roy <ldr709@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-01-25 12:54:22 -08:00
Paul E. McKenney
d85b62f18d srcu: Force full grace-period ordering
If a process invokes synchronize_srcu(), is delayed just the right amount
of time, and thus does not sleep when waiting for the grace period to
complete, there is no ordering between the end of the grace period and
the code following the synchronize_srcu().  Similarly, there can be a
lack of ordering between the end of the SRCU grace period and callback
invocation.

This commit adds the necessary ordering.

Reported-by: Lance Roy <ldr709@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Further smp_mb() adjustment per email with Lance Roy. ]
2017-01-25 12:54:22 -08:00
Lance Roy
f2c4689640 srcu: Implement more-efficient reader counts
SRCU uses two per-cpu counters: a nesting counter to count the number of
active critical sections, and a sequence counter to ensure that the nesting
counters don't change while they are being added together in
srcu_readers_active_idx_check().

This patch instead uses per-cpu lock and unlock counters. Because both
counters only increase and srcu_readers_active_idx_check() reads the unlock
counter before the lock counter, this achieves the same end without having
to increment two different counters in srcu_read_lock(). This also saves a
smp_mb() in srcu_readers_active_idx_check().

Possible bug: There is no guarantee that the lock counter won't overflow
during srcu_readers_active_idx_check(), as there are no memory barriers
around srcu_flip() (see comment in srcu_readers_active_idx_check() for
details). However, this problem was already present before this patch.

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Lance Roy <ldr709@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-01-25 12:53:20 -08:00
Daniel Borkmann
a67edbf4fb bpf: add initial bpf tracepoints
This work adds a number of tracepoints to paths that are either
considered slow-path or exception-like states, where monitoring or
inspecting them would be desirable.

For bpf(2) syscall, tracepoints have been placed for main commands
when they succeed. In XDP case, tracepoint is for exceptions, that
is, f.e. on abnormal BPF program exit such as unknown or XDP_ABORTED
return code, or when error occurs during XDP_TX action and the packet
could not be forwarded.

Both have been split into separate event headers, and can be further
extended. Worst case, if they unexpectedly should get into our way in
future, they can also removed [1]. Of course, these tracepoints (like
any other) can be analyzed by eBPF itself, etc. Example output:

  # ./perf record -a -e bpf:* sleep 10
  # ./perf script
  sock_example  6197 [005]   283.980322:      bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
  sock_example  6197 [005]   283.980721:       bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
  sock_example  6197 [005]   283.988423:   bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
  sock_example  6197 [005]   283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
  [...]
  sock_example  6197 [005]   288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
       swapper     0 [005]   289.338243:    bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER

  [1] https://lwn.net/Articles/705270/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:17:47 -05:00
Daniel Borkmann
2acae0d5b0 trace: add variant without spacing in trace_print_hex_seq
For upcoming tracepoint support for BPF, we want to dump the program's
tag. Format should be similar to __print_hex(), but without spacing.
Add a __print_hex_str() variant for exactly that purpose that reuses
trace_print_hex_seq().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:17:47 -05:00
Ingo Molnar
47cd95a632 Merge branch 'linus' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-25 15:52:46 +01:00
Linus Torvalds
883af14e67 Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "26 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits)
  MAINTAINERS: add Dan Streetman to zbud maintainers
  MAINTAINERS: add Dan Streetman to zswap maintainers
  mm: do not export ioremap_page_range symbol for external module
  mn10300: fix build error of missing fpu_save()
  romfs: use different way to generate fsid for BLOCK or MTD
  frv: add missing atomic64 operations
  mm, page_alloc: fix premature OOM when racing with cpuset mems update
  mm, page_alloc: move cpuset seqcount checking to slowpath
  mm, page_alloc: fix fast-path race with cpuset update or removal
  mm, page_alloc: fix check for NULL preferred_zone
  kernel/panic.c: add missing \n
  fbdev: color map copying bounds checking
  frv: add atomic64_add_unless()
  mm/mempolicy.c: do not put mempolicy before using its nodemask
  radix-tree: fix private list warnings
  Documentation/filesystems/proc.txt: add VmPin
  mm, memcg: do not retry precharge charges
  proc: add a schedule point in proc_pid_readdir()
  mm: alloc_contig: re-allow CMA to compact FS pages
  mm/slub.c: trace free objects at KERN_INFO
  ...
2017-01-24 16:54:39 -08:00
Jiri Slaby
ff7a28a074 kernel/panic.c: add missing \n
When a system panics, the "Rebooting in X seconds.." message is never
printed because it lacks a new line.  Fix it.

Link: http://lkml.kernel.org/r/20170119114751.2724-1-jslaby@suse.cz
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24 16:26:14 -08:00
Don Zickus
b94f51183b kernel/watchdog: prevent false hardlockup on overloaded system
On an overloaded system, it is possible that a change in the watchdog
threshold can be delayed long enough to trigger a false positive.

This can easily be achieved by having a cpu spinning indefinitely on a
task, while another cpu updates watchdog threshold.

What happens is while trying to park the watchdog threads, the hrtimers
on the other cpus trigger and reprogram themselves with the new slower
watchdog threshold.  Meanwhile, the nmi watchdog is still programmed
with the old faster threshold.

Because the one cpu is blocked, it prevents the thread parking on the
other cpus from completing, which is needed to shutdown the nmi watchdog
and reprogram it correctly.  As a result, a false positive from the nmi
watchdog is reported.

Fix this by setting a park_in_progress flag to block all lockups until
the parking is complete.

Fix provided by Ulrich Obergfell.

[akpm@linux-foundation.org: s/park_in_progress/watchdog_park_in_progress/]
Link: http://lkml.kernel.org/r/1481041033-192236-1-git-send-email-dzickus@redhat.com
Signed-off-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24 16:26:14 -08:00
Linus Torvalds
19ca2c8fec Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace fix from Eric Biederman:
 "This has a single brown bag fix.

  The possible deadlock with dec_pid_namespaces that I had thought was
  fixed earlier turned out only to have been moved. So instead of being
  cleaver this change takes ucounts_lock with irqs disabled. So
  dec_ucount can be used from any context without fear of deadlock.

  The items accounted for dec_ucount and inc_ucount are all
  comparatively heavy weight objects so I don't exepct this will have
  any measurable performance impact"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  userns: Make ucounts lock irq-safe
2017-01-24 12:21:51 -08:00
Daniel Borkmann
3fadc80115 bpf: enable verifier to better track const alu ops
William reported couple of issues in relation to direct packet
access. Typical scheme is to check for data + [off] <= data_end,
where [off] can be either immediate or coming from a tracked
register that contains an immediate, depending on the branch, we
can then access the data. However, in case of calculating [off]
for either the mentioned test itself or for access after the test
in a more "complex" way, then the verifier will stop tracking the
CONST_IMM marked register and will mark it as UNKNOWN_VALUE one.

Adding that UNKNOWN_VALUE typed register to a pkt() marked
register, the verifier then bails out in check_packet_ptr_add()
as it finds the registers imm value below 48. In the first below
example, that is due to evaluate_reg_imm_alu() not handling right
shifts and thus marking the register as UNKNOWN_VALUE via helper
__mark_reg_unknown_value() that resets imm to 0.

In the second case the same happens at the time when r4 is set
to r4 &= r5, where it transitions to UNKNOWN_VALUE from
evaluate_reg_imm_alu(). Later on r4 we shift right by 3 inside
evaluate_reg_alu(), where the register's imm turns into 3. That
is, for registers with type UNKNOWN_VALUE, imm of 0 means that
we don't know what value the register has, and for imm > 0 it
means that the value has [imm] upper zero bits. F.e. when shifting
an UNKNOWN_VALUE register by 3 to the right, no matter what value
it had, we know that the 3 upper most bits must be zero now.
This is to make sure that ALU operations with unknown registers
don't overflow. Meaning, once we know that we have more than 48
upper zero bits, or, in other words cannot go beyond 0xffff offset
with ALU ops, such an addition will track the target register
as a new pkt() register with a new id, but 0 offset and 0 range,
so for that a new data/data_end test will be required. Is the source
register a CONST_IMM one that is to be added to the pkt() register,
or the source instruction is an add instruction with immediate
value, then it will get added if it stays within max 0xffff bounds.
>From there, pkt() type, can be accessed should reg->off + imm be
within the access range of pkt().

  [...]
  from 28 to 30: R0=imm1,min_value=1,max_value=1
    R1=pkt(id=0,off=0,r=22) R2=pkt_end
    R3=imm144,min_value=144,max_value=144
    R4=imm0,min_value=0,max_value=0
    R5=inv48,min_value=2054,max_value=2054 R10=fp
  30: (bf) r5 = r3
  31: (07) r5 += 23
  32: (77) r5 >>= 3
  33: (bf) r6 = r1
  34: (0f) r6 += r5
  cannot add integer value with 0 upper zero bits to ptr_to_packet

  [...]
  from 52 to 80: R0=imm1,min_value=1,max_value=1
    R1=pkt(id=0,off=0,r=34) R2=pkt_end R3=inv
    R4=imm272 R5=inv56,min_value=17,max_value=17
    R6=pkt(id=0,off=26,r=34) R10=fp
  80: (07) r4 += 71
  81: (18) r5 = 0xfffffff8
  83: (5f) r4 &= r5
  84: (77) r4 >>= 3
  85: (0f) r1 += r4
  cannot add integer value with 3 upper zero bits to ptr_to_packet

Thus to get above use-cases working, evaluate_reg_imm_alu() has
been extended for further ALU ops. This is fine, because we only
operate strictly within realm of CONST_IMM types, so here we don't
care about overflows as they will happen in the simulated but also
real execution and interaction with pkt() in check_packet_ptr_add()
will check actual imm value once added to pkt(), but it's irrelevant
before.

With regards to 06c1c04972 ("bpf: allow helpers access to variable
memory") that works on UNKNOWN_VALUE registers, the verifier becomes
now a bit smarter as it can better resolve ALU ops, so we need to
adapt two test cases there, as min/max bound tracking only becomes
necessary when registers were spilled to stack. So while mask was
set before to track upper bound for UNKNOWN_VALUE case, it's now
resolved directly as CONST_IMM, and such contructs are only necessary
when f.e. registers are spilled.

For commit 6b17387307 ("bpf: recognize 64bit immediate loads as
consts") that initially enabled dw load tracking only for nfp jit/
analyzer, I did couple of tests on large, complex programs and we
don't increase complexity badly (my tests were in ~3% range on avg).
I've added a couple of tests similar to affected code above, and
it works fine with verifier now.

Reported-by: William Tu <u9012063@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Gianluca Borello <g.borello@gmail.com>
Cc: William Tu <u9012063@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 14:46:06 -05:00
Daniel Borkmann
d140199af5 bpf, lpm: fix kfree of im_node in trie_update_elem
We need to initialize im_node to NULL, otherwise in case of error path
it gets passed to kfree() as uninitialized pointer.

Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-23 21:17:35 -05:00
Nikolay Borisov
1cce1eea0a inotify: Convert to using per-namespace limits
This patchset converts inotify to using the newly introduced
per-userns sysctl infrastructure.

Currently the inotify instances/watches are being accounted in the
user_struct structure. This means that in setups where multiple
users in unprivileged containers map to the same underlying
real user (i.e. pointing to the same user_struct) the inotify limits
are going to be shared as well, allowing one user(or application) to exhaust
all others limits.

Fix this by switching the inotify sysctls to using the
per-namespace/per-user limits. This will allow the server admin to
set sensible global limits, which can further be tuned inside every
individual user namespace. Additionally, in order to preserve the
sysctl ABI make the existing inotify instances/watches sysctls
modify the values of the initial user namespace.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-01-24 12:03:07 +13:00
Daniel Mack
b95a5c4db0 bpf: add a longest prefix match trie map implementation
This trie implements a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges.

Internally, data is stored in an unbalanced trie of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.

Tries may be created with prefix lengths that are multiples of 8, in
the range from 8 to 2048. The key used for lookup and update operations
is a struct bpf_lpm_trie_key, and the value is a uint64_t.

The code carries more information about the internal implementation.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-23 16:10:38 -05:00
Paul E. McKenney
38d30b336c rcu: Adjust FQS offline checks for exact online-CPU detection
Commit 7ec99de36f ("rcu: Provide exact CPU-online tracking for RCU"),
as its title suggests, got rid of RCU's remaining CPU-hotplug timing
guesswork.  This commit therefore removes the one-jiffy kludge that was
used to paper over this guesswork.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:44:18 -08:00
Paul E. McKenney
3a19b46a5c rcu: Check cond_resched_rcu_qs() state less often to reduce GP overhead
Commit 4a81e8328d ("rcu: Reduce overhead of cond_resched() checks
for RCU") moved quiescent-state generation out of cond_resched()
and commit bde6c3aa99 ("rcu: Provide cond_resched_rcu_qs() to force
quiescent states in long loops") introduced cond_resched_rcu_qs(), and
commit 5cd37193ce ("rcu: Make cond_resched_rcu_qs() apply to normal RCU
flavors") introduced the per-CPU rcu_qs_ctr variable, which is frequently
polled by the RCU core state machine.

This frequent polling can increase grace-period rate, which in turn
increases grace-period overhead, which is visible in some benchmarks
(for example, the "open1" benchmark in Anton Blanchard's "will it scale"
suite).  This commit therefore reduces the rate at which rcu_qs_ctr
is polled by moving that polling into the force-quiescent-state (FQS)
machinery, and by further polling it only after the grace period has
been in effect for at least jiffies_till_sched_qs jiffies.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:44:18 -08:00
Paul E. McKenney
02a5c550b2 rcu: Abstract extended quiescent state determination
This commit is the fourth step towards full abstraction of all accesses
to the ->dynticks counter, implementing previously open-coded checks and
comparisons in new rcu_dynticks_in_eqs() and rcu_dynticks_in_eqs_since()
functions.  This abstraction will ease changes to the ->dynticks counter
operation.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:44:18 -08:00
Paul E. McKenney
2625d469ba rcu: Abstract dynticks extended quiescent state enter/exit operations
This commit is the third step towards full abstraction of all accesses
to the ->dynticks counter, implementing the previously open-coded atomic
add of 1 and entry checks in a new rcu_dynticks_eqs_enter() function, and
the same but with exit checks in a new rcu_dynticks_eqs_exit() function.
This abstraction will ease changes to the ->dynticks counter operation.

Note that this commit gets rid of the smp_mb__before_atomic() and the
smp_mb__after_atomic() calls that were previously present.  The reason
that this is OK from a memory-ordering perspective is that the atomic
operation is now atomic_add_return(), which, as a value-returning atomic,
guarantees full ordering.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fixed RCU_TRACE() statements added by this commit. ]
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:42:43 -08:00
Paul E. McKenney
8dc79888a7 rcu: Add lockdep checks to synchronous expedited primitives
The non-expedited synchronize_*rcu() primitives have lockdep checks, but
their expedited counterparts lack these checks.  This commit therefore
adds these checks to the expedited synchronize_*rcu() primitives.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:37:14 -08:00
Paul E. McKenney
bb4e2c08bb rcu: Eliminate unused expedited_normal counter
Expedited grace periods no longer fall back to normal grace periods
in response to lock contention, given that expedited grace periods
now use the rcu_node tree so as to avoid contention.  This commit
therfore removes the expedited_normal counter.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:37:14 -08:00
Paul E. McKenney
9831ce3bb4 rcu: Fix comment in rcu_organize_nocb_kthreads()
It used to be that the rcuo callback-offload kthreads were spawned
in rcu_organize_nocb_kthreads(), and the comment before the "for"
loop says as much.  However, this spawning has long since moved to
the CPU-hotplug code, so this commit fixes this comment.

Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:37:13 -08:00
Paul E. McKenney
fdbb9b315c rcu: Make rcu_cpu_starting() use its "cpu" argument
The rcu_cpu_starting() function uses this_cpu_ptr() to locate the
incoming CPU's rcu_data structure.  This works for the boot CPU and for
all CPUs onlined after rcu_init() executes (during very early boot).
Currently, this is the full set of CPUs, so all is well.  But if
anyone ever parallelizes boot before rcu_init() time, it will fail.
This commit therefore substitutes the rcu_cpu_starting() function's
this_cpu_pointer() for per_cpu_ptr(), future-proofing the code and
(arguably) improving readability.

This commit inadvertently fixes a latent bug: If there ever had been
more than just the boot CPU online at rcu_init() time, the old code
would not initialize the non-boot CPUs, but rather would repeatedly
initialize the boot CPU.

Reported-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:37:13 -08:00
Paul E. McKenney
09e2db37ec rcu: Add comment headers to expedited-grace-period counter functions
These functions (rcu_exp_gp_seq_start(), rcu_exp_gp_seq_end(),
rcu_exp_gp_seq_snap(), and rcu_exp_gp_seq_done() seemed too obvious
to comment when written, but not so much when being documented.
This commit therefore adds header comments to each of them.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:37:13 -08:00
Paul E. McKenney
630c7ed9ca rcu: Don't wake rcuc/X kthreads on NOCB CPUs
Chris Friesen notice that rcuc/X kthreads were consuming CPU even on
NOCB CPUs.  This makes no sense because the only purpose or these
kthreads is to invoke normal (non-offloaded) callbacks, of which there
will never be any on NOCB CPUs.  This problem was due to a bug in
cpu_has_callbacks_ready_to_invoke(), which should have been checking
->nxttail[RCU_NEXT_TAIL] for NULL, but which was instead (incorrectly)
checking ->nxttail[RCU_DONE_TAIL].  Because ->nxttail[RCU_DONE_TAIL] is
never NULL, the only effect is to cause the rcuc/X kthread to execute
when it should not do so.

This commit therefore checks ->nxttail[RCU_NEXT_TAIL], which is NULL
for NOCB CPUs.

Reported-by: Chris Friesen <chris.friesen@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:37:13 -08:00
Paul E. McKenney
7aa92230c9 rcu: Once again use NMI-based stack traces in stall warnings
This commit is for all intents and purposes a revert of bc1dce514e
("rcu: Don't use NMIs to dump other CPUs' stacks").  The reason to suppose
that this can now safely be reverted is the presence of 42a0bb3f71
("printk/nmi: generic solution for safe printk in NMI"), which is said
to have made NMI-based stack dumps safe.

However, this reversion keeps one nice property of bc1dce514e
("rcu: Don't use NMIs to dump other CPUs' stacks"), namely that
only those CPUs blocking the grace period are dumped.  The new
trigger_single_cpu_backtrace() is used to make this happen, as
suggested by Josh Poimboeuf.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:37:12 -08:00
Paul E. McKenney
b201fa6737 rcu: Remove short-term CPU kicking
Commit 4914950aaa ("rcu: Stop treating in-kernel CPU-bound workloads
as errors") added a (relatively) short-timeout call to resched_cpu().
This was inspired by as issue that was fixed by b7e7ade34e ("sched/core:
Fix remote wakeups").  But given that this issue was fixed, it is time
for the current commit to remove this call to resched_cpu().

Reported-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:37:12 -08:00
Paul E. McKenney
28053bc72c rcu: Add long-term CPU kicking
This commit prepares for the removal of short-term CPU kicking (in a
subsequent commit).  It does so by starting to invoke resched_cpu()
for each holdout at each force-quiescent-state interval that is more
than halfway through the stall-warning interval.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:33:02 -08:00
Tobias Klauser
94060d2235 rcu: Remove unused but set variable
Since commit 7ec99de36f ("rcu: Provide exact CPU-online tracking for
RCU"), the variable mask in rcu_init_percpu_data is set but no longer
used. Remove it to fix the following warning when building with 'W=1':

  kernel/rcu/tree.c: In function ‘rcu_init_percpu_data’:
  kernel/rcu/tree.c:3765:16: warning: variable ‘mask’ set but not used [-Wunused-but-set-variable]

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:32:35 -08:00
Paul E. McKenney
2535db485c rcu: Remove unneeded rcu_process_callbacks() declarations
The declarations of __rcu_process_callbacks() and rcu_process_callbacks()
are not needed, as the definition of both of these functions appear before
any uses.  This commit therefore removes both declarations.

Reported-by: "Ahmed, Iftekhar" <ahmedi@oregonstate.edu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:32:29 -08:00
Byungchul Park
c4402b27f1 rcu: Only dump stalled-tasks stacks if there was a real stall
The print_other_cpu_stall() function currently unconditionally invokes
rcu_print_detail_task_stall().  This is OK because if there was a stall
sufficient to cause print_other_cpu_stall() to be invoked, that stall
is very likely to persist through the entire print_other_cpu_stall()
execution.  However, if the stall did not persist, the variable ndetected
will be zero, and that variable is already tested in an "if" statement.
Therefore, this commit moves the call to rcu_print_detail_task_stall()
under that pre-existing "if" to improve readability, with a very rare
reduction in overhead.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ paulmck: Reworked commit log. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:32:22 -08:00
Mathieu Desnoyers
907565337e Fix: Disable sys_membarrier when nohz_full is enabled
Userspace applications should be allowed to expect the membarrier system
call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
nohz_full CPUs, but synchronize_sched() does not take those into
account.

Given that we do not want unrelated processes to be able to affect
real-time sensitive nohz_full CPUs, simply return ENOSYS when membarrier
is invoked on a kernel with enabled nohz_full CPUs.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: <stable@vger.kernel.org>	[3.10+]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:32:16 -08:00
Paul E. McKenney
4d4f88fa23 lockdep: Make RCU suspicious-access splats use pr_err
This commit switches RCU suspicious-access splats use pr_err()
instead of the current INFO printk()s.  This change makes it easier
to automatically classify splats.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-01-23 11:31:54 -08:00
Nikolay Borisov
880a38547f userns: Make ucounts lock irq-safe
The ucounts_lock is being used to protect various ucounts lifecycle
management functionalities. However, those services can also be invoked
when a pidns is being freed in an RCU callback (e.g. softirq context).
This can lead to deadlocks. There were already efforts trying to
prevent similar deadlocks in add7c65ca4 ("pid: fix lockdep deadlock
warning due to ucount_lock"), however they just moved the context
from hardirq to softrq. Fix this issue once and for all by explictly
making the lock disable irqs altogether.

Dmitry Vyukov <dvyukov@google.com> reported:

> I've got the following deadlock report while running syzkaller fuzzer
> on eec0d3d065bfcdf9cd5f56dd2a36b94d12d32297 of linux-next (on odroid
> device if it matters):
>
> =================================
> [ INFO: inconsistent lock state ]
> 4.10.0-rc3-next-20170112-xc2-dirty #6 Not tainted
> ---------------------------------
> inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
> swapper/2/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
>  (ucounts_lock){+.?...}, at: [<     inline     >] spin_lock
> ./include/linux/spinlock.h:302
>  (ucounts_lock){+.?...}, at: [<ffff2000081678c8>]
> put_ucounts+0x60/0x138 kernel/ucount.c:162
> {SOFTIRQ-ON-W} state was registered at:
> [<ffff2000081c82d8>] mark_lock+0x220/0xb60 kernel/locking/lockdep.c:3054
> [<     inline     >] mark_irqflags kernel/locking/lockdep.c:2941
> [<ffff2000081c97a8>] __lock_acquire+0x388/0x3260 kernel/locking/lockdep.c:3295
> [<ffff2000081cce24>] lock_acquire+0xa4/0x138 kernel/locking/lockdep.c:3753
> [<     inline     >] __raw_spin_lock ./include/linux/spinlock_api_smp.h:144
> [<ffff200009798128>] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
> [<     inline     >] spin_lock ./include/linux/spinlock.h:302
> [<     inline     >] get_ucounts kernel/ucount.c:131
> [<ffff200008167c28>] inc_ucount+0x80/0x6c8 kernel/ucount.c:189
> [<     inline     >] inc_mnt_namespaces fs/namespace.c:2818
> [<ffff200008481850>] alloc_mnt_ns+0x78/0x3a8 fs/namespace.c:2849
> [<ffff200008487298>] create_mnt_ns+0x28/0x200 fs/namespace.c:2959
> [<     inline     >] init_mount_tree fs/namespace.c:3199
> [<ffff200009bd6674>] mnt_init+0x258/0x384 fs/namespace.c:3251
> [<ffff200009bd60bc>] vfs_caches_init+0x6c/0x80 fs/dcache.c:3626
> [<ffff200009bb1114>] start_kernel+0x414/0x460 init/main.c:648
> [<ffff200009bb01e8>] __primary_switched+0x6c/0x70 arch/arm64/kernel/head.S:456
> irq event stamp: 2316924
> hardirqs last  enabled at (2316924): [<     inline     >] rcu_do_batch
> kernel/rcu/tree.c:2911
> hardirqs last  enabled at (2316924): [<     inline     >]
> invoke_rcu_callbacks kernel/rcu/tree.c:3182
> hardirqs last  enabled at (2316924): [<     inline     >]
> __rcu_process_callbacks kernel/rcu/tree.c:3149
> hardirqs last  enabled at (2316924): [<ffff200008210414>]
> rcu_process_callbacks+0x7a4/0xc28 kernel/rcu/tree.c:3166
> hardirqs last disabled at (2316923): [<     inline     >] rcu_do_batch
> kernel/rcu/tree.c:2900
> hardirqs last disabled at (2316923): [<     inline     >]
> invoke_rcu_callbacks kernel/rcu/tree.c:3182
> hardirqs last disabled at (2316923): [<     inline     >]
> __rcu_process_callbacks kernel/rcu/tree.c:3149
> hardirqs last disabled at (2316923): [<ffff20000820fe80>]
> rcu_process_callbacks+0x210/0xc28 kernel/rcu/tree.c:3166
> softirqs last  enabled at (2316912): [<ffff20000811b4c4>]
> _local_bh_enable+0x4c/0x80 kernel/softirq.c:155
> softirqs last disabled at (2316913): [<     inline     >]
> do_softirq_own_stack ./include/linux/interrupt.h:488
> softirqs last disabled at (2316913): [<     inline     >]
> invoke_softirq kernel/softirq.c:371
> softirqs last disabled at (2316913): [<ffff20000811c994>]
> irq_exit+0x264/0x308 kernel/softirq.c:405
>
> other info that might help us debug this:
>  Possible unsafe locking scenario:
>
>        CPU0
>        ----
>   lock(ucounts_lock);
>   <Interrupt>
>     lock(ucounts_lock);
>
>  *** DEADLOCK ***
>
> 1 lock held by swapper/2/0:
>  #0:  (rcu_callback){......}, at: [<     inline     >] __rcu_reclaim
> kernel/rcu/rcu.h:108
>  #0:  (rcu_callback){......}, at: [<     inline     >] rcu_do_batch
> kernel/rcu/tree.c:2919
>  #0:  (rcu_callback){......}, at: [<     inline     >]
> invoke_rcu_callbacks kernel/rcu/tree.c:3182
>  #0:  (rcu_callback){......}, at: [<     inline     >]
> __rcu_process_callbacks kernel/rcu/tree.c:3149
>  #0:  (rcu_callback){......}, at: [<ffff200008210390>]
> rcu_process_callbacks+0x720/0xc28 kernel/rcu/tree.c:3166
>
> stack backtrace:
> CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.10.0-rc3-next-20170112-xc2-dirty #6
> Hardware name: Hardkernel ODROID-C2 (DT)
> Call trace:
> [<ffff20000808fa60>] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:500
> [<ffff20000808fec0>] show_stack+0x20/0x30 arch/arm64/kernel/traps.c:225
> [<ffff2000088a99e0>] dump_stack+0x110/0x168
> [<ffff2000082fa2b4>] print_usage_bug.part.27+0x49c/0x4bc
> kernel/locking/lockdep.c:2387
> [<     inline     >] print_usage_bug kernel/locking/lockdep.c:2357
> [<     inline     >] valid_state kernel/locking/lockdep.c:2400
> [<     inline     >] mark_lock_irq kernel/locking/lockdep.c:2617
> [<ffff2000081c89ec>] mark_lock+0x934/0xb60 kernel/locking/lockdep.c:3065
> [<     inline     >] mark_irqflags kernel/locking/lockdep.c:2923
> [<ffff2000081c9a60>] __lock_acquire+0x640/0x3260 kernel/locking/lockdep.c:3295
> [<ffff2000081cce24>] lock_acquire+0xa4/0x138 kernel/locking/lockdep.c:3753
> [<     inline     >] __raw_spin_lock ./include/linux/spinlock_api_smp.h:144
> [<ffff200009798128>] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
> [<     inline     >] spin_lock ./include/linux/spinlock.h:302
> [<ffff2000081678c8>] put_ucounts+0x60/0x138 kernel/ucount.c:162
> [<ffff200008168364>] dec_ucount+0xf4/0x158 kernel/ucount.c:214
> [<     inline     >] dec_pid_namespaces kernel/pid_namespace.c:89
> [<ffff200008293dc8>] delayed_free_pidns+0x40/0xe0 kernel/pid_namespace.c:156
> [<     inline     >] __rcu_reclaim kernel/rcu/rcu.h:118
> [<     inline     >] rcu_do_batch kernel/rcu/tree.c:2919
> [<     inline     >] invoke_rcu_callbacks kernel/rcu/tree.c:3182
> [<     inline     >] __rcu_process_callbacks kernel/rcu/tree.c:3149
> [<ffff2000082103d8>] rcu_process_callbacks+0x768/0xc28 kernel/rcu/tree.c:3166
> [<ffff2000080821dc>] __do_softirq+0x324/0x6e0 kernel/softirq.c:284
> [<     inline     >] do_softirq_own_stack ./include/linux/interrupt.h:488
> [<     inline     >] invoke_softirq kernel/softirq.c:371
> [<ffff20000811c994>] irq_exit+0x264/0x308 kernel/softirq.c:405
> [<ffff2000081ecc28>] __handle_domain_irq+0xc0/0x150 kernel/irq/irqdesc.c:636
> [<ffff200008081c80>] gic_handle_irq+0x68/0xd8
> Exception stack(0xffff8000648e7dd0 to 0xffff8000648e7f00)
> 7dc0:                                   ffff8000648d4b3c 0000000000000007
> 7de0: 0000000000000000 1ffff0000c91a967 1ffff0000c91a967 1ffff0000c91a967
> 7e00: ffff20000a4b6b68 0000000000000001 0000000000000007 0000000000000001
> 7e20: 1fffe4000149ae90 ffff200009d35000 0000000000000000 0000000000000002
> 7e40: 0000000000000000 0000000000000000 0000000002624a1a 0000000000000000
> 7e60: 0000000000000000 ffff200009cbcd88 000060006d2ed000 0000000000000140
> 7e80: ffff200009cff000 ffff200009cb6000 ffff200009cc2020 ffff200009d2159d
> 7ea0: 0000000000000000 ffff8000648d4380 0000000000000000 ffff8000648e7f00
> 7ec0: ffff20000820a478 ffff8000648e7f00 ffff20000820a47c 0000000010000145
> 7ee0: 0000000000000140 dfff200000000000 ffffffffffffffff ffff20000820a478
> [<ffff2000080837f8>] el1_irq+0xb8/0x130 arch/arm64/kernel/entry.S:486
> [<     inline     >] arch_local_irq_restore
> ./arch/arm64/include/asm/irqflags.h:81
> [<ffff20000820a47c>] rcu_idle_exit+0x64/0xa8 kernel/rcu/tree.c:1030
> [<     inline     >] cpuidle_idle_call kernel/sched/idle.c:200
> [<ffff2000081bcbfc>] do_idle+0x1dc/0x2d0 kernel/sched/idle.c:243
> [<ffff2000081bd1cc>] cpu_startup_entry+0x24/0x28 kernel/sched/idle.c:345
> [<ffff200008099f8c>] secondary_start_kernel+0x2cc/0x358
> arch/arm64/kernel/smp.c:276
> [<000000000279f1a4>] 0x279f1a4

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: add7c65ca4 ("pid: fix lockdep deadlock warning due to ucount_lock")
Fixes: f333c700c6 ("pidns: Add a limit on the number of pid namespaces")
Cc: stable@vger.kernel.org
Link: https://www.spinics.net/lists/kernel/msg2426637.html
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-01-24 06:23:51 +13:00
Eric Auger
c7b41f0af3 irqdomain: irq_domain_check_msi_remap
This new function checks whether all MSI irq domains
implement IRQ remapping. This is useful to understand
whether VFIO passthrough is safe with respect to interrupts.

On ARM typically an MSI controller can sit downstream
to the IOMMU without preventing VFIO passthrough.
As such any assigned device can write into the MSI doorbell.
In case the MSI controller implements IRQ remapping, assigned
devices will not be able to trigger interrupts towards the
host. On the contrary, the assignment must be emphasized as
unsafe with respect to interrupts.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-01-23 15:00:45 +00:00
Eric Auger
88156f0090 genirq/msi: Set IRQ_DOMAIN_FLAG_MSI on MSI domain creation
Now we have a flag value indicating an IRQ domain implements MSI,
let's set it on msi_create_irq_domain().

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-01-23 15:00:45 +00:00
Eric Auger
631a9639ac irqdomain: Add irq domain MSI and MSI_REMAP flags
We introduce two new enum values for the irq domain flag:
- IRQ_DOMAIN_FLAG_MSI indicates the irq domain corresponds to
  an MSI domain
- IRQ_DOMAIN_FLAG_MSI_REMAP indicates the irq domain has MSI
  remapping capabilities.

Those values will be useful to check all MSI irq domains have
MSI remapping support when assessing the safety of IRQ assignment
to a guest.

irq_domain_hierarchical_is_msi_remap() allows to check if an
irq domain or any parent implements MSI remapping.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-01-23 15:00:44 +00:00
Mike Frysinger
b25e67161c seccomp: dump core when using SECCOMP_RET_KILL
The SECCOMP_RET_KILL mode is documented as immediately killing the
process as if a SIGSYS had been sent and not caught (similar to a
SIGKILL).  However, a SIGSYS is documented as triggering a coredump
which does not happen today.

This has the advantage of being able to more easily debug a process
that fails a seccomp filter.  Today, most apps need to recompile and
change their filter in order to get detailed info out, or manually run
things through strace, or enable detailed kernel auditing.  Now we get
coredumps that fit into existing system-wide crash reporting setups.

From a security pov, this shouldn't be a problem.  Unhandled signals
can already be sent externally which trigger a coredump independent of
the status of the seccomp filter.  The act of dumping core itself does
not cause change in execution of the program.

URL: https://crbug.com/676357
Signed-off-by: Mike Frysinger <vapier@chromium.org>
Acked-by: Jorge Lucangeli Obes <jorgelo@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>
2017-01-23 21:42:42 +11:00
Linus Torvalds
24b86839fa Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp/hotplug fix from Thomas Gleixner:
 "Remove an unused variable which is a leftover from the notifier
  removal"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Remove unused but set variable in _cpu_down()
2017-01-22 12:45:47 -08:00
Waiman Long
bcc9a76d5a locking/rwsem: Reinit wake_q after use
In __rwsem_down_write_failed_common(), the same wake_q variable name
is defined twice, with the inner wake_q hiding the one in outer scope.
We can either use different names for the two wake_q's.

Even better, we can use the same wake_q twice, if necessary.

To enable the latter change, we need to define a new helper function
wake_q_init() to enable reinitalization of wake_q after use.

Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1485052415-9611-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-22 09:54:00 +01:00
Namhyung Kim
b9b0c831be ftrace: Convert graph filter to use hash tables
Use ftrace_hash instead of a static array of a fixed size.  This is
useful when a graph filter pattern matches to a large number of
functions.  Now hash lookup is done with preemption disabled to protect
from the hash being changed/freed.

Link: http://lkml.kernel.org/r/20170120024447.26097-3-namhyung@kernel.org

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-01-20 14:50:58 -05:00
Namhyung Kim
4046bf023b ftrace: Expose ftrace_hash_empty and ftrace_lookup_ip
It will be used when checking graph filter hashes later.

Link: http://lkml.kernel.org/r/20170120024447.26097-2-namhyung@kernel.org

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
[ Moved ftrace_hash dec and functions outside of FUNCTION_GRAPH define ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-01-20 14:50:21 -05:00
Gianluca Borello
a5e8c07059 bpf: add bpf_probe_read_str helper
Provide a simple helper with the same semantics of strncpy_from_unsafe():

int bpf_probe_read_str(void *dst, int size, const void *unsafe_addr)

This gives more flexibility to a bpf program. A typical use case is
intercepting a file name during sys_open(). The current approach is:

SEC("kprobe/sys_open")
void bpf_sys_open(struct pt_regs *ctx)
{
	char buf[PATHLEN]; // PATHLEN is defined to 256
	bpf_probe_read(buf, sizeof(buf), ctx->di);

	/* consume buf */
}

This is suboptimal because the size of the string needs to be estimated
at compile time, causing more memory to be copied than often necessary,
and can become more problematic if further processing on buf is done,
for example by pushing it to userspace via bpf_perf_event_output(),
since the real length of the string is unknown and the entire buffer
must be copied (and defining an unrolled strnlen() inside the bpf
program is a very inefficient and unfeasible approach).

With the new helper, the code can easily operate on the actual string
length rather than the buffer size:

SEC("kprobe/sys_open")
void bpf_sys_open(struct pt_regs *ctx)
{
	char buf[PATHLEN]; // PATHLEN is defined to 256
	int res = bpf_probe_read_str(buf, sizeof(buf), ctx->di);

	/* consume buf, for example push it to userspace via
	 * bpf_perf_event_output(), but this time we can use
	 * res (the string length) as event size, after checking
	 * its boundaries.
	 */
}

Another useful use case is when parsing individual process arguments or
individual environment variables navigating current->mm->arg_start and
current->mm->env_start: using this helper and the return value, one can
quickly iterate at the right offset of the memory area.

The code changes simply leverage the already existent
strncpy_from_unsafe() kernel function, which is safe to be called from a
bpf program as it is used in bpf_trace_printk().

Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 12:08:43 -05:00
Namhyung Kim
3e278c0dc1 ftrace: Factor out __ftrace_hash_move()
The __ftrace_hash_move() is to allocates properly-sized hash and move
entries in the src ftrace_hash.  It will be used to set function graph
filters which has nothing to do with the dyn_ftrace records.

Link: http://lkml.kernel.org/r/20170120024447.26097-1-namhyung@kernel.org

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-01-20 11:40:07 -05:00
Rafael J. Wysocki
e326ce013a Revert "PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flag"
Revert commit 08b98d3291 (PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0
flag) as it caused system suspend (in the default configuration) to fail
on Dell XPS13 (9360) with the Kaby Lake processor.

Fixes: 08b98d3291 (PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flag)
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-01-20 03:33:57 +01:00
Peter Zijlstra
acb04058de sched/clock: Fix hotplug crash
Mike reported that he could trigger the WARN_ON_ONCE() in
set_sched_clock_stable() using hotplug.

This exposed a fundamental problem with the interface, we should never
mark the TSC stable if we ever find it to be unstable. Therefore
set_sched_clock_stable() is a broken interface.

The reason it existed is that not having it is a pain, it means all
relevant architecture code needs to call clear_sched_clock_stable()
where appropriate.

Of the three architectures that select HAVE_UNSTABLE_SCHED_CLOCK ia64
and parisc are trivial in that they never called
set_sched_clock_stable(), so add an unconditional call to
clear_sched_clock_stable() to them.

For x86 the story is a lot more involved, and what this patch tries to
do is ensure we preserve the status quo. So even is Cyrix or Transmeta
have usable TSC they never called set_sched_clock_stable() so they now
get an explicit mark unstable.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 9881b024b7 ("sched/clock: Delay switching sched_clock to stable")
Link: http://lkml.kernel.org/r/20170119133633.GB6536@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-20 02:38:46 +01:00
Steven Rostedt (VMware)
068f530b3f tracing: Add the constant count for branch tracer
The unlikely/likely branch profiler now gets called even if the if statement
is a constant (always goes in one direction without a compare). Add a value
to denote this in the likely/unlikely tracer as well.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-01-19 08:57:41 -05:00
Steven Rostedt (VMware)
134e6a034c tracing: Show number of constants profiled in likely profiler
Now that constants are traced, it is useful to see the number of constants
that are traced in the likely/unlikely profiler in order to know if they
should be ignored or not.

The likely/unlikely will display a number after the "correct" number if a
"constant" count exists.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-01-19 08:57:14 -05:00
Greg Kroah-Hartman
64e90a8acb Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()
Some usermode helper applications are defined at kernel build time, while
others can be changed at runtime.  To provide a sane way to filter these, add a
new kernel option "STATIC_USERMODEHELPER".  This option routes all
call_usermodehelper() calls through this binary, no matter what the caller
wishes to have called.

The new binary (by default set to /sbin/usermode-helper, but can be changed
through the STATIC_USERMODEHELPER_PATH option) can properly filter the
requested programs to be run by the kernel by looking at the first argument
that is passed to it.  All other options should then be passed onto the proper
program if so desired.

To disable all call_usermodehelper() calls by the kernel, set
STATIC_USERMODEHELPER_PATH to an empty string.

Thanks to Neil Brown for the idea of this feature.

Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19 12:59:45 +01:00
Greg Kroah-Hartman
6d2c5d6c46 kmod: make usermodehelper path a const string
This is in preparation for making it so that usermode helper programs
can't be changed, if desired, by userspace.  We will tackle the mess of
cleaning up the write-ability of argv and env later, that's going to
take more work, for much less gain...

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19 12:45:33 +01:00
Daniel Borkmann
d407bd25a2 bpf: don't trigger OOM killer under pressure with map alloc
This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
that are to be used for map allocations. Using kmalloc() for very large
allocations can cause excessive work within the page allocator, so i) fall
back earlier to vmalloc() when the attempt is considered costly anyway,
and even more importantly ii) don't trigger OOM killer with any of the
allocators.

Since this is based on a user space request, for example, when creating
maps with element pre-allocation, we really want such requests to fail
instead of killing other user space processes.

Also, don't spam the kernel log with warnings should any of the allocations
fail under pressure. Given that, we can make backend selection in
bpf_map_area_alloc() generic, and convert all maps over to use this API
for spots with potentially large allocation requests.

Note, replacing the one kmalloc_array() is fine as overflow checks happen
earlier in htab_map_alloc(), since it must also protect the multiplication
for vmalloc() should kmalloc_array() fail.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 17:12:26 -05:00
Richard Guy Briggs
92c82e8a32 audit: add feature audit_lost reset
Add a method to reset the audit_lost value.

An AUDIT_SET message with the AUDIT_STATUS_LOST flag set by itself
will return a positive value repesenting the current audit_lost value
and reset the counter to zero.  If AUDIT_STATUS_LOST is not the
only flag set, the reset command will be ignored.  The value sent with
the command is ignored.  The return value will be the +ve lost value at
reset time.

An AUDIT_CONFIG_CHANGE message will be queued to the listening audit
daemon.  The message will be a standard CONFIG_CHANGE message with the
fields "lost=0" and "old=" with the latter containing the value of
audit_lost at reset time.

See: https://github.com/linux-audit/audit-kernel/issues/3

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-01-18 14:32:52 -05:00
Linus Torvalds
ca92e6c7e6 Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP hotplug update from Thomas Gleixner:
 "This contains a trivial typo fix and an extension to the core code for
  dynamically allocating states in the prepare stage.

  The extension is necessary right now because we need a proper way to
  unbreak LTTNG, which iscurrently non functional due to the removal of
  the notifiers. Surely it's out of tree, but it's widely used by
  distros.

  The simple solution would have been to reserve a state for LTTNG, but
  I'm not fond about unused crap in the kernel and the dynamic range,
  which we admittedly should have done right away, allows us to remove
  quite some of the hardcoded states, i.e. those which have no ordering
  requirements. So doing the right thing now is better than having an
  smaller intermediate solution which needs to be reworked anyway"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Provide dynamic range for prepare stage
  perf/x86/amd/ibs: Fix typo after cleanup state names in cpu/hotplug
2017-01-18 11:13:41 -08:00
Linus Torvalds
49b550fee8 Merge branch 'rcu-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU fixes from Ingo Molnar:
 "This fixes sporadic ACPI related hangs in synchronize_rcu() that were
  caused by the ACPI code mistakenly relying on an aspect of RCU that
  was neither promised to work nor reliable but which happened to work -
  until in v4.9 we changed the RCU implementation, which made the hangs
  more prominent.

  Since the mis-use of the RCU facility wasn't properly detected and
  prevented either, these fixes make the RCU side work reliably instead
  of working around the problem in the ACPI code.

  Hence the slightly larger diffstat that goes beyond the normal scope
  of RCU fixes in -rc kernels"

* 'rcu-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rcu: Narrow early boot window of illegal synchronous grace periods
  rcu: Remove cond_resched() from Tiny synchronize_sched()
2017-01-18 10:47:11 -08:00
Tobias Klauser
0fec9557fd cpu/hotplug: Remove unused but set variable in _cpu_down()
After the recent removal of the hotplug notifiers the variable 'hasdied' in
_cpu_down() is set but no longer read, leading to the following GCC warning
when building with 'make W=1':

  kernel/cpu.c:767:7: warning: variable ‘hasdied’ set but not used [-Wunused-but-set-variable]

Fix it by removing the variable.

Fixes: 530e9b76ae ("cpu/hotplug: Remove obsolete cpu hotplug register/unregister functions")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20170117143501.20893-1-tklauser@distanz.ch
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-01-18 11:55:09 +01:00
Linus Torvalds
0aa0313f9d Modules fixes for v4.10-rc5
- Fix out-of-tree module breakage when it supplies its own
   definitions of true and false
 
 Signed-off-by: Jessica Yu <jeyu@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYfm7sAAoJEMBFfjjOO8FyrmMP/j1Sa179+uBWWPE0Td7ip6yj
 EgvOtZGcnZfMuHbs5Evn8Fnz5K3of3IriiJNPePuQPu/YnoidltxkWOMXYkCj+Fn
 acW8VRtrh2urec70gRapuTmSpxs1I/XLUdNG+Ozm0FFX+L+k0ydCqEPGuVkwyHNK
 Wn31lVTiqx+zWm5PAJBzD6dEchQ0h2uppHRmZ+mIn3GyvYavIGnMMkdjqEEq9v8w
 UYdw52AJFAGMDO8LoSihX5cFbe0E28A58jkJuJ5AKXglaY6Nvl2xWOxfLhFnxO1m
 7KFuf+q2YO10hoJtdItEmPw2iC8pIgoAUGpZ+4h0iSWxyUC5V4QEmrhe4q9CtOLD
 +dfcd+44UekvWiWL4AQUO6IsUzIo8UqsJYf4Tic4/EjAKZtGTseKjQqCgBv3kJA+
 nN3hJ9gMN4NZWOMLihSn7Ml/whrxchdqlEP520nzGTnWUaLOUPp4XhfNlDaAH58K
 WfxiT0L6w+Cbg3xMCZRxQyqlJWWw8x1CM7B6eScHvN67TulC2enIQYTkv6eDOzQX
 DPz4lvcFisjASFP+i+3ouYL2pfLnm/IUG9K1wieqBvPHEdeZBuGr7+VEHjFmhhmG
 f3kKvYsRgUQF8tKeGtI0uPxnZNw4z4QaYOVKf8bISzpsIHOezdaouOm+KGUctHqO
 DNIWMf34W7fE5AVrGKp6
 =ybIc
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v4.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull modules fix from Jessica Yu:

 - fix out-of-tree module breakage when it supplies its own definitions
   of true and false

* tag 'modules-for-v4.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  taint/module: Fix problems when out-of-kernel driver defines true or false
2017-01-17 14:49:21 -08:00
David S. Miller
580bdf5650 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-01-17 15:19:37 -05:00
Steven Rostedt (VMware)
d45ae1f704 tracing: Process constants for (un)likely() profiler
When running the likely/unlikely profiler, one of the results did not look
accurate. It noted that the unlikely() in link_path_walk() was 100%
incorrect. When I added a trace_printk() to see what was happening there, it
became 80% correct! Looking deeper into what whas happening, I found that
gcc split that if statement into two paths. One where the if statement
became a constant, the other path a variable. The other path had the if
statement always hit (making the unlikely there, always false), but since
the #define unlikely() has:

  #define unlikely() (__builtin_constant_p(x) ? !!(x) : __branch_check__(x, 0))

Where constants are ignored by the branch profiler, the "constant" path
made by the compiler was ignored, even though it was hit 80% of the time.

By just passing the constant value to the __branch_check__() function and
tracing it out of line (as always correct, as likely/unlikely isn't a factor
for constants), then we get back the accurate readings of branches that were
optimized by gcc causing part of the execution to become constant.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-01-17 15:13:05 -05:00
Larry Finger
5eb7c0d04f taint/module: Fix problems when out-of-kernel driver defines true or false
Commit 7fd8329ba5 ("taint/module: Clean up global and module taint
flags handling") used the key words true and false as character members
of a new struct. These names cause problems when out-of-kernel modules
such as VirtualBox include their own definitions of true and false.

Fixes: 7fd8329ba5 ("taint/module: Clean up global and module taint flags handling")
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Jessica Yu <jeyu@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jessica Yu <jeyu@redhat.com>
2017-01-17 10:56:45 -08:00
Kenny Yu
6496bb72bf uprobe: Find last occurrence of ':' when parsing uprobe PATH:OFFSET
Previously, `create_trace_uprobe` found the *first* occurence
of the ':' character when parsing `PATH:OFFSET` for a uprobe.
However, if the path contains a ':' character, then the function
would parse the path incorrectly. Even worse, if the path does not
exist, the subsequent call to `kern_path()` would set `ret` to
`ENOENT`, leading to very cryptic errno values in user space.

The fix is to find the *last* occurence of ':'.

How to repro:: The write fails with "No such file or directory", suggesting
incorrectly that the `uprobe_events` file does not exist.

  $ mkdir testing && cd testing
  $ cp /bin/bash .
  $ cp /bin/bash ./bash:with:colon
  $ echo "p:uprobes/p__root_testing_bash_0x6 /root/testing/bash:0x6" > /sys/kernel/debug/tracing/uprobe_events     # this works
  $ echo "p:uprobes/p__root_testing_bash_with_colon_0x6 /root/testing/bash:with:colon:0x6" >> /sys/kernel/debug/tracing/uprobe_events     # this doesn't
  -bash: echo: write error: No such file or directory

With the patch:

  $ echo "p:uprobes/p__root_testing_bash_0x6 /root/testing/bash:0x6" > /sys/kernel/debug/tracing/uprobe_events     # this still works
  $ echo "p:uprobes/p__root_testing_bash_with_colon_0x6 /root/testing/bash:with:colon:0x6" >> /sys/kernel/debug/tracing/uprobe_events     # this works now too!
  $ cat /sys/kernel/debug/tracing/uprobe_events
  p:uprobes/p__root_testing_bash_0x6 /root/testing/bash:0x0000000000000006
  p:uprobes/p__root_testing_bash_with_colon_0x6 /root/testing/bash:with:colon:0x0000000000000006

Link: http://lkml.kernel.org/r/20170113165834.4081016-1-kennyyu@fb.com

Signed-off-by: Kenny Yu <kennyyu@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-01-17 12:57:47 -05:00
Linus Torvalds
4b19a9e20b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Handle multicast packets properly in fast-RX path of mac80211, from
    Johannes Berg.

 2) Because of a logic bug, the user can't actually force SW
    checksumming on r8152 devices. This makes diagnosis of hw
    checksumming bugs really annoying. Fix from Hayes Wang.

 3) VXLAN route lookup does not take the source and destination ports
    into account, which means IPSEC policies cannot be matched properly.
    Fix from Martynas Pumputis.

 4) Do proper RCU locking in netvsc callbacks, from Stephen Hemminger.

 5) Fix SKB leaks in mlxsw driver, from Arkadi Sharshevsky.

 6) If lwtunnel_fill_encap() fails, we do not abort the netlink message
    construction properly in fib_dump_info(), from David Ahern.

 7) Do not use kernel stack for DMA buffers in atusb driver, from Stefan
    Schmidt.

 8) Openvswitch conntack actions need to maintain a correct checksum,
    fix from Lance Richardson.

 9) ax25_disconnect() is missing a check for ax25->sk being NULL, in
    fact it already checks this, but not in all of the necessary spots.
    Fix from Basil Gunn.

10) Action GET operations in the packet scheduler can erroneously bump
    the reference count of the entry, making it unreleasable. Fix from
    Jamal Hadi Salim. Jamal gives a great set of example command lines
    that trigger this in the commit message.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
  net sched actions: fix refcnt when GETing of action after bind
  net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV
  net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT transitions
  net/mlx4_core: Fix racy CQ (Completion Queue) free
  net: stmmac: don't use netdev_[dbg, info, ..] before net_device is registered
  net/mlx5e: Fix a -Wmaybe-uninitialized warning
  ax25: Fix segfault after sock connection timeout
  bpf: rework prog_digest into prog_tag
  tipc: allocate user memory with GFP_KERNEL flag
  net: phy: dp83867: allow RGMII_TXID/RGMII_RXID interface types
  ip6_tunnel: Account for tunnel header in tunnel MTU
  mld: do not remove mld souce list info when set link down
  be2net: fix MAC addr setting on privileged BE3 VFs
  be2net: don't delete MAC on close on unprivileged BE3 VFs
  be2net: fix status check in be_cmd_pmac_add()
  cpmac: remove hopeless #warning
  ravb: do not use zero-length alignment DMA descriptor
  mlx4: do not call napi_schedule() without care
  openvswitch: maintain correct checksum state in conntrack actions
  tcp: fix tcp_fastopen unaligned access complaints on sparc
  ...
2017-01-17 09:33:10 -08:00
Sebastian Andrzej Siewior
7c6094db59 rcu: update: Make RCU_EXPEDITE_BOOT be the default
RCU_EXPEDITE_BOOT should speed up the boot process by enforcing
synchronize_rcu_expedited() instead of synchronize_rcu() during the boot
process. There should be no reason why one does not want this and there
is no need worry about real time latency at this point.
Therefore make it default.

Note that users wishing to avoid expediting entirely, for example when
bringing up new hardware possibly having flaky IPIs, can use the
rcu_normal boot parameter to override boot-time expediting.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
[ paulmck: Reworded commit log. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-16 16:56:39 -08:00
Paul E. McKenney
8b2f63ab05 rcu: Abstract the dynticks snapshot operation
This commit is the second step towards full abstraction of all accesses to
the ->dynticks counter, implementing the previously open-coded atomic
add of zero in a new rcu_dynticks_snap() function.  This abstraction will
ease changes o the ->dynticks counter operation.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-16 15:47:53 -08:00
Paul E. McKenney
6563de9d6f rcu: Abstract the dynticks momentary-idle operation
This commit is the first step towards full abstraction of all accesses to
the ->dynticks counter, implementing the previously open-coded atomic add
of two in a new rcu_dynticks_momentary_idle() function.  This abstraction
will ease changes to the ->dynticks counter operation.

Note that this commit gets rid of the smp_mb__before_atomic() and the
smp_mb__after_atomic() calls that were previously present.  The reason
that this is OK from a memory-ordering perspective is that the atomic
operation is now atomic_add_return(), which, as a value-returning atomic,
guarantees full ordering.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-16 14:59:54 -08:00
Daniel Borkmann
2d071c643f bpf, trace: make ctx access checks more robust
Make sure that ctx cannot potentially be accessed oob by asserting
explicitly that ctx access size into pt_regs for BPF_PROG_TYPE_KPROBE
programs must be within limits. In case some 32bit archs have pt_regs
not being a multiple of 8, then BPF_DW access could cause such access.

BPF_PROG_TYPE_KPROBE progs don't have a ctx conversion function since
there's no extra mapping needed. kprobe_prog_is_valid_access() didn't
enforce sizeof(long) as the only allowed access size, since LLVM can
generate non BPF_W/BPF_DW access to regs from time to time.

For BPF_PROG_TYPE_TRACEPOINT we don't have a ctx conversion either, so
add a BUILD_BUG_ON() check to make sure that BPF_DW access will not be
a similar issue in future (ctx works on event buffer as opposed to
pt_regs there).

Fixes: 2541517c32 ("tracing, perf: Implement BPF programs attached to kprobes")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 14:41:42 -05:00
Daniel Borkmann
f1f7714ea5 bpf: rework prog_digest into prog_tag
Commit 7bd509e311 ("bpf: add prog_digest and expose it via
fdinfo/netlink") was recently discussed, partially due to
admittedly suboptimal name of "prog_digest" in combination
with sha1 hash usage, thus inevitably and rightfully concerns
about its security in terms of collision resistance were
raised with regards to use-cases.

The intended use cases are for debugging resp. introspection
only for providing a stable "tag" over the instruction sequence
that both kernel and user space can calculate independently.
It's not usable at all for making a security relevant decision.
So collisions where two different instruction sequences generate
the same tag can happen, but ideally at a rather low rate. The
"tag" will be dumped in hex and is short enough to introspect
in tracepoints or kallsyms output along with other data such
as stack trace, etc. Thus, this patch performs a rename into
prog_tag and truncates the tag to a short output (64 bits) to
make it obvious it's not collision-free.

Should in future a hash or facility be needed with a security
relevant focus, then we can think about requirements, constraints,
etc that would fit to that situation. For now, rework the exposed
parts for the current use cases as long as nothing has been
released yet. Tested on x86_64 and s390x.

Fixes: 7bd509e311 ("bpf: add prog_digest and expose it via fdinfo/netlink")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 14:03:31 -05:00
Thomas Gleixner
4205e4786d cpu/hotplug: Provide dynamic range for prepare stage
Mathieu reported that the LTTNG modules are broken as of 4.10-rc1 due to
the removal of the cpu hotplug notifiers.

Usually I don't care much about out of tree modules, but LTTNG is widely
used in distros. There are two ways to solve that:

1) Reserve a hotplug state for LTTNG

2) Add a dynamic range for the prepare states.

While #1 is the simplest solution, #2 is the proper one as we can convert
in tree users, which do not care about ordering, to the dynamic range as
well.

Add a dynamic range which allows LTTNG to request states in the prepare
stage.

Reported-and-tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Sewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701101353010.3401@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-01-16 13:20:05 +01:00
Ingo Molnar
3e4f7a4956 Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into rcu/urgent
Pull an urgent RCU fix from Paul E. McKenney:

 "This series contains a pair of commits that permit RCU synchronous grace
  periods (synchronize_rcu() and friends) to work correctly throughout boot.
  This eliminates the current "dead time" starting when the scheduler spawns
  its first taks and ending when the last of RCU's kthreads is spawned
  (this last happens during early_initcall() time).  Although RCU's
  synchronous grace periods have long been documented as not working
  during this time, prior to 4.9, the expedited grace periods worked by
  accident, and some ACPI code came to rely on this unintentional behavior.
  (Note that this unintentional behavior was -not- reliable.  For example,
  failures from ACPI could occur on !SMP systems and on systems booting
  with the rcu_normal kernel boot parameter.)

  Either way, there is a bug that needs fixing, and the 4.9 switch of RCU's
  expedited grace periods to workqueues could be considered to have caused
  a regression.  This series therefore makes RCU's expedited grace periods
  operate correctly throughout the boot process.  This has been demonstrated
  to fix the problems ACPI was encountering, and has the added longer-term
  benefit of simplifying RCU's behavior."

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-16 07:45:44 +01:00
Linus Torvalds
99421c1cb2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace fixes from Eric Biederman:
 "This tree contains 4 fixes.

  The first is a fix for a race that can causes oopses under the right
  circumstances, and that someone just recently encountered.

  Past that are several small trivial correct fixes. A real issue that
  was blocking development of an out of tree driver, but does not appear
  to have caused any actual problems for in-tree code. A potential
  deadlock that was reported by lockdep. And a deadlock people have
  experienced and took the time to track down caused by a cleanup that
  removed the code to drop a reference count"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  sysctl: Drop reference added by grab_header in proc_sys_readdir
  pid: fix lockdep deadlock warning due to ucount_lock
  libfs: Modify mount_pseudo_xattr to be clear it is not a userspace mount
  mnt: Protect the mountpoint hashtable with mount_lock
2017-01-15 16:09:50 -08:00
Tejun Heo
bfc2cf6f61 cgroup: call subsys->*attach() only for subsystems which are actually affected by migration
Currently, subsys->*attach() callbacks are called for all subsystems
which are attached to the hierarchy on which the migration is taking
place.

With cgroup_migrate_prepare_dst() filtering out identity migrations,
v1 hierarchies can avoid spurious ->*attach() callback invocations
where the source and destination csses are identical; however, this
isn't enough on v2 as only a subset of the attached controllers can be
affected on controller enable/disable.

While spurious ->*attach() invocations aren't critically broken,
they're unnecessary overhead and can lead to temporary overcharges on
certain controllers.  Fix it by tracking which subsystems are affected
by a migration and invoking ->*attach() callbacks only on those
subsystems.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2017-01-15 19:03:41 -05:00
Tejun Heo
e595cd7069 cgroup: track migration context in cgroup_mgctx
cgroup migration is performed in four steps - css_set preloading,
addition of target tasks, actual migration, and clean up.  A list
named preloaded_csets is used to track the preloading.  This is a bit
too restricted and the code is already depending on the subtlety that
all source css_sets appear before destination ones.

Let's create struct cgroup_mgctx which keeps track of everything
during migration.  Currently, it has separate preload lists for source
and destination csets and also embeds cgroup_taskset which is used
during the actual migration.  This moves struct cgroup_taskset
definition to cgroup-internal.h.

This patch doesn't cause any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2017-01-15 19:03:41 -05:00
Tejun Heo
d8ebf5191d cgroup: cosmetic update to cgroup_taskset_add()
cgroup_taskset_add() was using list_add_tail() when for source csets
but list_move_tail() for destination.  As the operations are gated by
list_empty() test, list_move_tail() is equivalent to list_add_tail()
here.  Use list_add_tail() too for destination csets too.

This doesn't cause any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2017-01-15 19:03:40 -05:00
Linus Torvalds
a11ce3a4ac Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull NOHZ fix from Ingo Molnar:
 "This fixes an old NOHZ race where we incorrectly calculate the next
  timer interrupt in certain circumstances where hrtimers are pending,
  that can cause hard to reproduce stalled-values artifacts in
  /proc/stat"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  nohz: Fix collision between tick and other hrtimers
2017-01-15 12:00:37 -08:00
Linus Torvalds
79078c53ba Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc race fixes uncovered by fuzzing efforts, a Sparse fix, two PMU
  driver fixes, plus miscellanous tooling fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Reject non sampling events with precise_ip
  perf/x86/intel: Account interrupts for PEBS errors
  perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race
  perf/core: Fix sys_perf_event_open() vs. hotplug
  perf/x86/intel: Use ULL constant to prevent undefined shift behaviour
  perf/x86/intel/uncore: Fix hardcoded socket 0 assumption in the Haswell init code
  perf/x86: Set pmu->module in Intel PMU modules
  perf probe: Fix to probe on gcc generated symbols for offline kernel
  perf probe: Fix --funcs to show correct symbols for offline module
  perf symbols: Robustify reading of build-id from sysfs
  perf tools: Install tools/lib/traceevent plugins with install-bin
  tools lib traceevent: Fix prev/next_prio for deadline tasks
  perf record: Fix --switch-output documentation and comment
  perf record: Make __record_options static
  tools lib subcmd: Add OPT_STRING_OPTARG_SET option
  perf probe: Fix to get correct modname from elf header
  samples/bpf trace_output_user: Remove duplicate sys/ioctl.h include
  samples/bpf sock_example: Avoid getting ethhdr from two includes
  perf sched timehist: Show total scheduling time
2017-01-15 11:37:43 -08:00
Yang Shi
f4dbba5919 locktorture: Fix potential memory leak with rw lock test
When running locktorture module with the below commands with kmemleak enabled:

$ modprobe locktorture torture_type=rw_lock_irq
$ rmmod locktorture

The below kmemleak got caught:

root@10:~# echo scan > /sys/kernel/debug/kmemleak
[  323.197029] kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
root@10:~# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffffffc07592d500 (size 128):
  comm "modprobe", pid 368, jiffies 4294924118 (age 205.824s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 c3 7b 02 00 00 00 00 00  .........{......
    00 00 00 00 00 00 00 00 d7 9b 02 00 00 00 00 00  ................
  backtrace:
    [<ffffff80081e5a88>] create_object+0x110/0x288
    [<ffffff80086c6078>] kmemleak_alloc+0x58/0xa0
    [<ffffff80081d5acc>] __kmalloc+0x234/0x318
    [<ffffff80006fa130>] 0xffffff80006fa130
    [<ffffff8008083ae4>] do_one_initcall+0x44/0x138
    [<ffffff800817e28c>] do_init_module+0x68/0x1cc
    [<ffffff800811c848>] load_module+0x1a68/0x22e0
    [<ffffff800811d340>] SyS_finit_module+0xe0/0xf0
    [<ffffff80080836f0>] el0_svc_naked+0x24/0x28
    [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffffffc07592d480 (size 128):
  comm "modprobe", pid 368, jiffies 4294924118 (age 205.824s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 3b 6f 01 00 00 00 00 00  ........;o......
    00 00 00 00 00 00 00 00 23 6a 01 00 00 00 00 00  ........#j......
  backtrace:
    [<ffffff80081e5a88>] create_object+0x110/0x288
    [<ffffff80086c6078>] kmemleak_alloc+0x58/0xa0
    [<ffffff80081d5acc>] __kmalloc+0x234/0x318
    [<ffffff80006fa22c>] 0xffffff80006fa22c
    [<ffffff8008083ae4>] do_one_initcall+0x44/0x138
    [<ffffff800817e28c>] do_init_module+0x68/0x1cc
    [<ffffff800811c848>] load_module+0x1a68/0x22e0
    [<ffffff800811d340>] SyS_finit_module+0xe0/0xf0
    [<ffffff80080836f0>] el0_svc_naked+0x24/0x28
    [<ffffffffffffffff>] 0xffffffffffffffff

It is because cxt.lwsa and cxt.lrsa don't get freed in module_exit, so free
them in lock_torture_cleanup() and free writer_tasks if reader_tasks is
failed at memory allocation.

Signed-off-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-14 21:36:06 -08:00
Paul E. McKenney
52d7e48b86 rcu: Narrow early boot window of illegal synchronous grace periods
The current preemptible RCU implementation goes through three phases
during bootup.  In the first phase, there is only one CPU that is running
with preemption disabled, so that a no-op is a synchronous grace period.
In the second mid-boot phase, the scheduler is running, but RCU has
not yet gotten its kthreads spawned (and, for expedited grace periods,
workqueues are not yet running.  During this time, any attempt to do
a synchronous grace period will hang the system (or complain bitterly,
depending).  In the third and final phase, RCU is fully operational and
everything works normally.

This has been OK for some time, but there has recently been some
synchronous grace periods showing up during the second mid-boot phase.
This code worked "by accident" for awhile, but started failing as soon
as expedited RCU grace periods switched over to workqueues in commit
8b355e3bc1 ("rcu: Drive expedited grace periods from workqueue").
Note that the code was buggy even before this commit, as it was subject
to failure on real-time systems that forced all expedited grace periods
to run as normal grace periods (for example, using the rcu_normal ksysfs
parameter).  The callchain from the failure case is as follows:

early_amd_iommu_init()
|-> acpi_put_table(ivrs_base);
|-> acpi_tb_put_table(table_desc);
|-> acpi_tb_invalidate_table(table_desc);
|-> acpi_tb_release_table(...)
|-> acpi_os_unmap_memory
|-> acpi_os_unmap_iomem
|-> acpi_os_map_cleanup
|-> synchronize_rcu_expedited

The kernel showing this callchain was built with CONFIG_PREEMPT_RCU=y,
which caused the code to try using workqueues before they were
initialized, which did not go well.

This commit therefore reworks RCU to permit synchronous grace periods
to proceed during this mid-boot phase.  This commit is therefore a
fix to a regression introduced in v4.9, and is therefore being put
forward post-merge-window in v4.10.

This commit sets a flag from the existing rcu_scheduler_starting()
function which causes all synchronous grace periods to take the expedited
path.  The expedited path now checks this flag, using the requesting task
to drive the expedited grace period forward during the mid-boot phase.
Finally, this flag is updated by a core_initcall() function named
rcu_exp_runtime_mode(), which causes the runtime codepaths to be used.

Note that this arrangement assumes that tasks are not sent POSIX signals
(or anything similar) from the time that the first task is spawned
through core_initcall() time.

Fixes: 8b355e3bc1 ("rcu: Drive expedited grace periods from workqueue")
Reported-by: "Zheng, Lv" <lv.zheng@intel.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Stan Kain <stan.kain@gmail.com>
Tested-by: Ivan <waffolz@hotmail.com>
Tested-by: Emanuel Castelo <emanuel.castelo@gmail.com>
Tested-by: Bruno Pesavento <bpesavento@infinito.it>
Tested-by: Borislav Petkov <bp@suse.de>
Tested-by: Frederic Bezies <fredbezies@gmail.com>
Cc: <stable@vger.kernel.org> # 4.9.0-
2017-01-14 21:23:48 -08:00
Paul E. McKenney
f466ae66fa rcu: Remove cond_resched() from Tiny synchronize_sched()
It is now legal to invoke synchronize_sched() at early boot, which causes
Tiny RCU's synchronize_sched() to emit spurious splats.  This commit
therefore removes the cond_resched() from Tiny RCU's synchronize_sched().

Fixes: 8b355e3bc1 ("rcu: Drive expedited grace periods from workqueue")
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> # 4.9.0-
2017-01-14 21:22:20 -08:00
Peter Zijlstra
1e24edca05 locking/atomic, kref: Add KREF_INIT()
Since we need to change the implementation, stop exposing internals.

Provide KREF_INIT() to allow static initialization of struct kref.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:18 +01:00
Chris Wilson
2a0c112828 locking/ww_mutex: Add kselftests for ww_mutex stress
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Nicolai Hähnle <nhaehnle@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161201114711.28697-8-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:17 +01:00
Chris Wilson
d1b42b800e locking/ww_mutex: Add kselftests for resolving ww_mutex cyclic deadlocks
Check that ww_mutexes can detect cyclic deadlocks (generalised ABBA
cycles) and resolve them by lock reordering.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Nicolai Hähnle <nhaehnle@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161201114711.28697-7-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:16 +01:00
Chris Wilson
70207686e4 locking/ww_mutex: Add kselftests for ww_mutex ABBA deadlock detection
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Nicolai Hähnle <nhaehnle@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161201114711.28697-6-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:16 +01:00
Chris Wilson
c22fb3807f locking/ww_mutex: Add kselftests for ww_mutex AA deadlock detection
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Nicolai Hähnle <nhaehnle@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161201114711.28697-5-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:15 +01:00
Chris Wilson
f2a5fec173 locking/ww_mutex: Begin kselftests for ww_mutex
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Nicolai Hähnle <nhaehnle@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161201114711.28697-4-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:14 +01:00
Chris Wilson
0186a6cbdc locking/ww_mutex: Add ww_mutex to locktorture test
Although ww_mutexes degenerate into mutexes, it would be useful to
torture the deadlock handling between multiple ww_mutexes in addition to
torturing the regular mutexes.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Nicolai Hähnle <nhaehnle@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161201114711.28697-3-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:14 +01:00
Tejun Heo
1460cb65a1 locking/mutex, sched/wait: Add mutex_lock_io()
We sometimes end up propagating IO blocking through mutexes; however,
because there currently is no way of annotating mutex sleeps as
iowait, there are cases where iowait and /proc/stat:procs_blocked
report misleading numbers obscuring the actual state of the system.

This patch adds mutex_lock_io() so that mutex sleeps can be marked as
iowait in those cases.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adilger.kernel@dilger.ca
Cc: jack@suse.com
Cc: kernel-team@fb.com
Cc: mingbo@fb.com
Cc: tytso@mit.edu
Link: http://lkml.kernel.org/r/1477673892-28940-4-git-send-email-tj@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:05 +01:00
Tejun Heo
10ab56434f sched/core: Separate out io_schedule_prepare() and io_schedule_finish()
Now that IO schedule accounting is done inside __schedule(),
io_schedule() can be split into three steps - prep, schedule, and
finish - where the schedule part doesn't need any special annotation.
This allows marking a sleep as iowait by simply wrapping an existing
blocking function with io_schedule_prepare() and io_schedule_finish().

Because task_struct->in_iowait is single bit, the caller of
io_schedule_prepare() needs to record and the pass its state to
io_schedule_finish() to be safe regarding nesting.  While this isn't
the prettiest, these functions are mostly gonna be used by core
functions and we don't want to use more space for ->in_iowait.

While at it, as it's simple to do now, reimplement io_schedule()
without unnecessarily going through io_schedule_timeout().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adilger.kernel@dilger.ca
Cc: jack@suse.com
Cc: kernel-team@fb.com
Cc: mingbo@fb.com
Cc: tytso@mit.edu
Link: http://lkml.kernel.org/r/1477673892-28940-3-git-send-email-tj@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:04 +01:00
Tejun Heo
e33a9bba85 sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler
For an interface to support blocking for IOs, it must call
io_schedule() instead of schedule().  This makes it tedious to add IO
blocking to existing interfaces as the switching between schedule()
and io_schedule() is often buried deep.

As we already have a way to mark the task as IO scheduling, this can
be made easier by separating out io_schedule() into multiple steps so
that IO schedule preparation can be performed before invoking a
blocking interface and the actual accounting happens inside the
scheduler.

io_schedule_timeout() does the following three things prior to calling
schedule_timeout().

 1. Mark the task as scheduling for IO.
 2. Flush out plugged IOs.
 3. Account the IO scheduling.

done close to the actual scheduling.  This patch moves #3 into the
scheduler so that later patches can separate out preparation and
finish steps from io_schedule().

Patch-originally-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adilger.kernel@dilger.ca
Cc: akpm@linux-foundation.org
Cc: axboe@kernel.dk
Cc: jack@suse.com
Cc: kernel-team@fb.com
Cc: mingbo@fb.com
Cc: tytso@mit.edu
Link: http://lkml.kernel.org/r/20161207204841.GA22296@htj.duckdns.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:03 +01:00
Dietmar Eggemann
b8fd842369 sched/fair: Explain why MIN_SHARES isn't scaled in calc_cfs_shares()
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e9a4d858-bcf3-36b9-e3a9-449953e34569@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:02 +01:00
Vincent Guittot
89ee048f3c sched/core: Fix group_entity's share update
The update of the share of a cfs_rq is done when its load_avg is updated
but before the group_entity's load_avg has been updated for the past time
slot. This generates wrong load_avg accounting which can be significant
when small tasks are involved in the scheduling.

Let take the example of a task a that is dequeued of its task group A:
   root
  (cfs_rq)
    \
    (se)
     A
    (cfs_rq)
      \
      (se)
       a

Task "a" was the only task in task group A which becomes idle when a is
dequeued.

We have the sequence:

- dequeue_entity a->se
    - update_load_avg(a->se)
    - dequeue_entity_load_avg(A->cfs_rq, a->se)
    - update_cfs_shares(A->cfs_rq)
	A->cfs_rq->load.weight == 0
        A->se->load.weight is updated with the new share (0 in this case)
- dequeue_entity A->se
    - update_load_avg(A->se) but its weight is now null so the last time
      slot (up to a tick) will be accounted with a weight of 0 instead of
      its real weight during the time slot. The last time slot will be
      accounted as an idle one whereas it was a running one.

If the running time of task a is short enough that no tick happens when it
runs, all running time of group entity A->se will be accounted as idle
time.

Instead, we should update the share of a cfs_rq (in fact the weight of its
group entity) only after having updated the load_avg of the group_entity.

update_cfs_shares() now takes the sched_entity as a parameter instead of the
cfs_rq, and the weight of the group_entity is updated only once its load_avg
has been synced with current time.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/1482335426-7664-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:02 +01:00
Peter Zijlstra
da9647e076 sched/completions: Fix complete_all() semantics
Documentation/scheduler/completion.txt says this about complete_all():

  "calls complete_all() to signal all current and future waiters."

Which doesn't strictly match the current semantics. Currently
complete_all() is equivalent to UINT_MAX/2 complete() invocations,
which is distinctly less than 'all current and future waiters'
(enumerable vs innumerable), although it has worked in practice.

However, Dmitry had a weird case where it might matter, so change
completions to use saturation semantics for complete()/complete_all().
Once done hits UINT_MAX (and complete_all() sets it there) it will
never again be decremented.

Requested-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: der.herr@hofr.at
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:01 +01:00
Tommaso Cucinotta
59f8c29892 sched/deadline: Show leftover runtime and abs deadline in /proc/*/sched
This patch allows for reading the current (leftover) runtime and
absolute deadline of a SCHED_DEADLINE task through /proc/*/sched
(entries dl.runtime and dl.deadline), while debugging/testing.

Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Reviewed-by: Luca Abeni <luca.abeni@unitn.it>
Acked-by: Daniel Bistrot de Oliveira <danielbristot@gmail.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1477473437-10346-2-git-send-email-tommaso.cucinotta@sssup.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:00 +01:00
Peter Zijlstra
5680d8094f sched/clock: Provide better clock continuity
When switching between the unstable and stable variants it is
currently possible that clock discontinuities occur.

And while these will mostly be 'small', attempt to do better.

As observed on my IVB-EP, the sched_clock() is ~1.5s ahead of the
ktime_get_ns() based timeline at the point of switchover
(sched_clock_init_late()) after SMP bringup.

Equally, when the TSC is later found to be unstable -- typically
because SMM tries to hide its SMI latencies by mucking with the TSC --
we want to avoid large jumps.

Since the clocksource watchdog reports the issue after the fact we
cannot exactly fix up time, but since SMI latencies are typically
small (~10ns range), the discontinuity is mainly due to drift between
sched_clock() and ktime_get_ns() (which on my desktop is ~79s over
24days).

I dislike this patch because it adds overhead to the good case in
favour of dealing with badness. But given the widespread failure of
TSC stability this is worth it.

Note that in case the TSC makes drastic jumps after SMP bringup we're
still hosed. There's just not much we can do in that case without
stupid overhead.

If we were to somehow expose tsc_clocksource_reliable (which is hard
because this code is also used on ia64 and parisc) we could avoid some
of the newly introduced overhead.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:00 +01:00
Peter Zijlstra
9881b024b7 sched/clock: Delay switching sched_clock to stable
Currently we switch to the stable sched_clock if we guess the TSC is
usable, and then switch back to the unstable path if it turns out TSC
isn't stable during SMP bringup after all.

Delay switching to the stable path until after SMP bringup is
complete. This way we'll avoid switching during the time we detect the
worst of the TSC offences.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:59 +01:00
Peter Zijlstra
555570d744 sched/clock: Update static_key usage
sched_clock was still using the deprecated static_key interface.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:57 +01:00
Thomas Gleixner
12907fbb1a sched/clock, clocksource: Add optional cs::mark_unstable() method
PeterZ reported that we'd fail to mark the TSC unstable when the
clocksource watchdog finds it unsuitable.

Allow a clocksource to run a custom action when its being marked
unstable and hook up the TSC unstable code.

Reported-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:43 +01:00
Matt Fleming
cb42c9a3eb sched/core: Add debugging code to catch missing update_rq_clock() calls
There's no diagnostic checks for figuring out when we've accidentally
missed update_rq_clock() calls. Let's add some by piggybacking on the
rq_*pin_lock() wrappers.

The idea behind the diagnostic checks is that upon pining rq lock the
rq clock should be updated, via update_rq_clock(), before anybody
reads the clock with rq_clock() or rq_clock_task().

The exception to this rule is when updates have explicitly been
disabled with the rq_clock_skip_update() optimisation.

There are some functions that only unpin the rq lock in order to grab
some other lock and avoid deadlock. In that case we don't need to
update the clock again and the previous diagnostic state can be
carried over in rq_repin_lock() by saving the state in the rq_flags
context.

Since this patch adds a new clock update flag and some already exist
in rq::clock_skip_update, that field has now been renamed. An attempt
has been made to keep the flag manipulation code small and fast since
it's used in the heart of the __schedule() fast path.

For the !CONFIG_SCHED_DEBUG case the only object code change (other
than addresses) is the following change to reset RQCF_ACT_SKIP inside
of __schedule(),

  -       c7 83 38 09 00 00 00    movl   $0x0,0x938(%rbx)
  -       00 00 00
  +       83 a3 38 09 00 00 fc    andl   $0xfffffffc,0x938(%rbx)

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-8-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:35 +01:00
Peter Zijlstra
2fb8d36787 sched/core: Add missing update_rq_clock() call in set_user_nice()
Address this rq-clock update bug:

  WARNING: CPU: 30 PID: 195 at ../kernel/sched/sched.h:797 set_next_entity()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    set_next_entity()
    ? _raw_spin_lock()
    set_curr_task_fair()
    set_user_nice.part.85()
    set_user_nice()
    create_worker()
    worker_thread()
    kthread()
    ret_from_fork()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:34 +01:00
Peter Zijlstra
3bed5e2166 sched/core: Add missing update_rq_clock() call for task_hot()
Add the update_rq_clock() call at the top of the callstack instead of
at the bottom where we find it missing, this to aid later effort to
minimize the number of update_rq_lock() calls.

  WARNING: CPU: 30 PID: 194 at ../kernel/sched/sched.h:797 assert_clock_updated()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    assert_clock_updated.isra.63.part.64()
    can_migrate_task()
    load_balance()
    pick_next_task_fair()
    __schedule()
    schedule()
    worker_thread()
    kthread()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:34 +01:00
Peter Zijlstra
80f5c1b84b sched/core: Add missing update_rq_clock() in detach_task_cfs_rq()
Instead of adding the update_rq_clock() all the way at the bottom of
the callstack, add one at the top, this to aid later effort to
minimize update_rq_lock() calls.

  WARNING: CPU: 0 PID: 1 at ../kernel/sched/sched.h:797 detach_task_cfs_rq()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    detach_task_cfs_rq()
    switched_from_fair()
    __sched_setscheduler()
    _sched_setscheduler()
    sched_set_stop_task()
    cpu_stop_create()
    __smpboot_create_thread.part.2()
    smpboot_register_percpu_thread_cpumask()
    cpu_stop_init()
    do_one_initcall()
    ? print_cpu_info()
    kernel_init_freeable()
    ? rest_init()
    kernel_init()
    ret_from_fork()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:33 +01:00
Peter Zijlstra
4126bad671 sched/core: Add missing update_rq_clock() in post_init_entity_util_avg()
Address this rq-clock update bug:

  WARNING: CPU: 0 PID: 0 at ../kernel/sched/sched.h:797 post_init_entity_util_avg()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    __warn()
    post_init_entity_util_avg()
    wake_up_new_task()
    _do_fork()
    kernel_thread()
    rest_init()
    start_kernel()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:32 +01:00
Matt Fleming
46f69fa337 sched/fair: Push rq lock pin/unpin into idle_balance()
Future patches will emit warnings if rq_clock() is called before
update_rq_clock() inside a rq_pin_lock()/rq_unpin_lock() pair.

Since there is only one caller of idle_balance() we can push the
unpin/repin there.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-7-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:32 +01:00
Matt Fleming
92509b732b sched/core: Reset RQCF_ACT_SKIP before unpinning rq->lock
rq_clock() is called from sched_info_{depart,arrive}() after resetting
RQCF_ACT_SKIP but prior to a call to update_rq_clock().

In preparation for pending patches that check whether the rq clock has
been updated inside of a pin context before rq_clock() is called, move
the reset of rq->clock_skip_update immediately before unpinning the rq
lock.

This will avoid the new warnings which check if update_rq_clock() is
being actively skipped.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-6-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:31 +01:00
Matt Fleming
d8ac897137 sched/core: Add wrappers for lockdep_(un)pin_lock()
In preparation for adding diagnostic checks to catch missing calls to
update_rq_clock(), provide wrappers for (re)pinning and unpinning
rq->lock.

Because the pending diagnostic checks allow state to be maintained in
rq_flags across pin contexts, swap the 'struct pin_cookie' arguments
for 'struct rq_flags *'.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-5-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:30 +01:00
Nicolai Hähnle
977625a693 locking/mutex: Initialize mutex_waiter::ww_ctx with poison when debugging
Help catch cases where mutex_lock is used directly on w/w mutexes, which
otherwise result in the w/w tasks reading uninitialized data.

Signed-off-by: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-12-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:53 +01:00
Nicolai Hähnle
c516df978d locking/ww_mutex: Optimize ww-mutexes by yielding to other waiters from optimistic spin
Lock stealing is less beneficial for w/w mutexes since we may just end up
backing off if we stole from a thread with an earlier acquire stamp that
already holds another w/w mutex that we also need. So don't spin
optimistically unless we are sure that there is no other waiter that might
cause us to back off.

Median timings taken of a contention-heavy GPU workload:

Before:

  real    0m52.946s
  user    0m7.272s
  sys     1m55.964s

After:

  real    0m53.086s
  user    0m7.360s
  sys     1m46.204s

This particular workload still spends 20%-25% of CPU in mutex_spin_on_owner
according to perf, but my attempts to further reduce this spinning based on
various heuristics all lead to an increase in measured wall time despite
the decrease in sys time.

Signed-off-by: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-11-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:52 +01:00
Nicolai Hähnle
25f13b4040 locking/ww_mutex: Re-check ww->ctx in the inner optimistic spin loop
In the following scenario, thread #1 should back off its attempt to lock
ww1 and unlock ww2 (assuming the acquire context stamps are ordered
accordingly).

    Thread #0               Thread #1
    ---------               ---------
                            successfully lock ww2
    set ww1->base.owner
                            attempt to lock ww1
                            confirm ww1->ctx == NULL
                            enter mutex_spin_on_owner
    set ww1->ctx

What was likely to happen previously is:

    attempt to lock ww2
    refuse to spin because
      ww2->ctx != NULL
    schedule()
                            detect thread #0 is off CPU
                            stop optimistic spin
                            return -EDEADLK
                            unlock ww2
                            wakeup thread #0
    lock ww2

Now, we are more likely to see:

                            detect ww1->ctx != NULL
                            stop optimistic spin
                            return -EDEADLK
                            unlock ww2
    successfully lock ww2

... because thread #1 will stop its optimistic spin as soon as possible.

The whole scenario is quite unlikely, since it requires thread #1 to get
between thread #0 setting the owner and setting the ctx. But since we're
idling here anyway, the additional check is basically free.

Found by inspection.

Signed-off-by: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-10-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:51 +01:00
Peter Zijlstra
427b18207a locking/mutex: Improve inlining
Instead of inlining __mutex_lock_common() 5 times, once for each
{state,ww} variant. Reduce this to two, ww and !ww.

Then add __always_inline to mutex_optimistic_spin(), so that that will
get inlined all 4 remaining times, for all {waiter,ww} variants.

   text    data     bss     dec     hex filename

   6301       0       0    6301    189d defconfig-build/kernel/locking/mutex.o
   4053       0       0    4053     fd5 defconfig-build/kernel/locking/mutex.o
   4257       0       0    4257    10a1 defconfig-build/kernel/locking/mutex.o

This reduces total text size and better separates the ww and !ww mutex
code generation.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:48 +01:00
Nicolai Hähnle
659cf9f582 locking/ww_mutex: Optimize ww-mutexes by waking at most one waiter for backoff when acquiring the lock
The wait list is sorted by stamp order, and the only waiting task that may
have to back off is the first waiter with a context.

The regular slow path does not have to wake any other tasks at all, since
all other waiters that would have to back off were either woken up when
the waiter was added to the list, or detected the condition before they
added themselves.

Median timings taken of a contention-heavy GPU workload:

Without this series:

  real    0m59.900s
  user    0m7.516s
  sys     2m16.076s

With changes up to and including this patch:

  real    0m52.946s
  user    0m7.272s
  sys     1m55.964s

Signed-off-by: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-9-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:44 +01:00
Nicolai Hähnle
200b187440 locking/ww_mutex: Notify waiters that have to back off while adding tasks to wait list
While adding our task as a waiter, detect if another task should back off
because of us.

With this patch, we establish the invariant that the wait list contains
at most one (sleeping) waiter with ww_ctx->acquired > 0, and this waiter
will be the first waiter with a context.

Since only waiters with ww_ctx->acquired > 0 have to back off, this allows
us to be much more economical with wakeups.

Signed-off-by: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-8-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:43 +01:00
Nicolai Hähnle
6baa5c60a9 locking/ww_mutex: Add waiters in stamp order
Add regular waiters in stamp order. Keep adding waiters that have no
context in FIFO order and take care not to starve them.

While adding our task as a waiter, back off if we detect that there is
a waiter with a lower stamp in front of us.

Make sure to call lock_contended even when we back off early.

For w/w mutexes, being first in the wait list is only stable when
taking the lock without a context. Therefore, the purpose of the first
flag is split into two: 'first' remains to indicate whether we want to
spin optimistically, while 'handoff' indicates that we should be
prepared to accept a handoff.

For w/w locking with a context, we always accept handoffs after the
first schedule(), to handle the following sequence of events:

 1. Task #0 unlocks and hands off to Task #2 which is first in line

 2. Task #1 adds itself in front of Task #2

 3. Task #2 wakes up and must accept the handoff even though it is no
    longer first in line

Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: =?UTF-8?q?Nicolai=20H=C3=A4hnle?= <Nicolai.Haehnle@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-7-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:42 +01:00
Nicolai Hähnle
c5470b22d1 locking/ww_mutex: Remove the __ww_mutex_lock*() inline wrappers
Keep the documentation in the header file since there is no good place
for it in mutex.c: there are two rather different implementations with
different EXPORT_SYMBOLs for each function.

Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: =?UTF-8?q?Nicolai=20H=C3=A4hnle?= <Nicolai.Haehnle@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-6-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:41 +01:00
Nicolai Hähnle
ea9e0fb8fe locking/ww_mutex: Set use_ww_ctx even when locking without a context
We will add a new field to struct mutex_waiter.  This field must be
initialized for all waiters if any waiter uses the ww_use_ctx path.

So there is a trade-off: Keep ww_mutex locking without a context on
the faster non-use_ww_ctx path, at the cost of adding the
initialization to all mutex locks (including non-ww_mutexes), or avoid
the additional cost for non-ww_mutex locks, at the cost of adding
additional checks to the use_ww_ctx path.

We take the latter choice.  It may be worth eliminating the users of
ww_mutex_lock(lock, NULL), but there are a lot of them.

Signed-off-by: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-5-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:40 +01:00
Nicolai Hähnle
3822da3ed0 locking/ww_mutex: Extract stamp comparison to __ww_mutex_stamp_after()
The function will be re-used in subsequent patches.

Signed-off-by: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-4-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:39 +01:00
Peter Zijlstra
e274795ea7 locking/mutex: Fix mutex handoff
While reviewing the ww_mutex patches, I noticed that it was still
possible to (incorrectly) succeed for (incorrect) code like:

	mutex_lock(&a);
	mutex_lock(&a);

This was possible if the second mutex_lock() would block (as expected)
but then receive a spurious wakeup. At that point it would find itself
at the front of the queue, request a handoff and instantly claim
ownership and continue, since owner would point to itself.

Avoid this scenario and simplify the code by introducing a third low
bit to signal handoff pickup. So once we request handoff, unlock
clears the handoff bit and sets the pickup bit along with the new
owner.

This also removes the need for the .handoff argument to
__mutex_trylock(), since that becomes superfluous with PICKUP.

In order to guarantee enough low bits, ensure task_struct alignment is
at least L1_CACHE_BYTES (which seems a good ideal regardless).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 9d659ae14b ("locking/mutex: Add lock handoff to avoid starvation")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:38 +01:00
Davidlohr Bueso
52b94129f2 locking/percpu-rwsem: Replace waitqueue with rcuwait
The use of any kind of wait queue is an overkill for pcpu-rwsems.
While one option would be to use the less heavy simple (swait)
flavor, this is still too much for what pcpu-rwsems needs. For one,
we do not care about any sort of queuing in that the only (rare) time
writers (and readers, for that matter) are queued is when trying to
acquire the regular contended rw_sem. There cannot be any further
queuing as writers are serialized by the rw_sem in the first place.

Given that percpu_down_write() must not be called after exit_notify(),
we can replace the bulky waitqueue with rcuwait such that a writer
can wait for its turn to take the lock. As such, we can avoid the
queue handling and locking overhead.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Link: http://lkml.kernel.org/r/1484148146-14210-3-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:35 +01:00
Davidlohr Bueso
8f95c90ceb sched/wait, RCU: Introduce rcuwait machinery
rcuwait provides support for (single) RCU-safe task wait/wake functionality,
with the caveat that it must not be called after exit_notify(), such that
we avoid racing with rcu delayed_put_task_struct callbacks, task_struct
being rcu unaware in this context -- for which we similarly have
task_rcu_dereference() magic, but with different return semantics, which
can conflict with the wakeup side.

The interfaces are quite straightforward:

  rcuwait_wait_event()
  rcuwait_wake_up()

More details are in the comments, but it's perhaps worth mentioning at least,
that users must provide proper serialization when waiting on a condition, and
avoid corrupting a concurrent waiter. Also care must be taken between the task
and the condition for when calling the wakeup -- we cannot miss wakeups. When
porting users, this is for example, a given when using waitqueues in that
everything is done under the q->lock. As such, it can remove sources of non
preemptable unbounded work for realtime.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Link: http://lkml.kernel.org/r/1484148146-14210-2-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:33 +01:00
Davidlohr Bueso
642fa448ae sched/core: Remove set_task_state()
This is a nasty interface and setting the state of a foreign task must
not be done. As of the following commit:

  be628be095 ("bcache: Make gc wakeup sane, remove set_task_state()")

... everyone in the kernel calls set_task_state() with current, allowing
the helper to be removed.

However, as the comment indicates, it is still around for those archs
where computing current is more expensive than using a pointer, at least
in theory. An important arch that is affected is arm64, however this has
been addressed now [1] and performance is up to par making no difference
with either calls.

Of all the callers, if any, it's the locking bits that would care most
about this -- ie: we end up passing a tsk pointer to a lot of the lock
slowpath, and setting ->state on that. The following numbers are based
on two tests: a custom ad-hoc microbenchmark that just measures
latencies (for ~65 million calls) between get_task_state() vs
get_current_state().

Secondly for a higher overview, an unlink microbenchmark was used,
which pounds on a single file with open, close,unlink combos with
increasing thread counts (up to 4x ncpus). While the workload is quite
unrealistic, it does contend a lot on the inode mutex or now rwsem.

[1] https://lkml.kernel.org/r/1483468021-8237-1-git-send-email-mark.rutland@arm.com

== 1. x86-64 ==

Avg runtime set_task_state():    601 msecs
Avg runtime set_current_state(): 552 msecs

                                            vanilla                 dirty
Hmean    unlink1-processes-2      36089.26 (  0.00%)    38977.33 (  8.00%)
Hmean    unlink1-processes-5      28555.01 (  0.00%)    29832.55 (  4.28%)
Hmean    unlink1-processes-8      37323.75 (  0.00%)    44974.57 ( 20.50%)
Hmean    unlink1-processes-12     43571.88 (  0.00%)    44283.01 (  1.63%)
Hmean    unlink1-processes-21     34431.52 (  0.00%)    38284.45 ( 11.19%)
Hmean    unlink1-processes-30     34813.26 (  0.00%)    37975.17 (  9.08%)
Hmean    unlink1-processes-48     37048.90 (  0.00%)    39862.78 (  7.59%)
Hmean    unlink1-processes-79     35630.01 (  0.00%)    36855.30 (  3.44%)
Hmean    unlink1-processes-110    36115.85 (  0.00%)    39843.91 ( 10.32%)
Hmean    unlink1-processes-141    32546.96 (  0.00%)    35418.52 (  8.82%)
Hmean    unlink1-processes-172    34674.79 (  0.00%)    36899.21 (  6.42%)
Hmean    unlink1-processes-203    37303.11 (  0.00%)    36393.04 ( -2.44%)
Hmean    unlink1-processes-224    35712.13 (  0.00%)    36685.96 (  2.73%)

== 2. ppc64le ==

Avg runtime set_task_state():  938 msecs
Avg runtime set_current_state: 940 msecs

                                            vanilla                 dirty
Hmean    unlink1-processes-2      19269.19 (  0.00%)    30704.50 ( 59.35%)
Hmean    unlink1-processes-5      20106.15 (  0.00%)    21804.15 (  8.45%)
Hmean    unlink1-processes-8      17496.97 (  0.00%)    17243.28 ( -1.45%)
Hmean    unlink1-processes-12     14224.15 (  0.00%)    17240.21 ( 21.20%)
Hmean    unlink1-processes-21     14155.66 (  0.00%)    15681.23 ( 10.78%)
Hmean    unlink1-processes-30     14450.70 (  0.00%)    15995.83 ( 10.69%)
Hmean    unlink1-processes-48     16945.57 (  0.00%)    16370.42 ( -3.39%)
Hmean    unlink1-processes-79     15788.39 (  0.00%)    14639.27 ( -7.28%)
Hmean    unlink1-processes-110    14268.48 (  0.00%)    14377.40 (  0.76%)
Hmean    unlink1-processes-141    14023.65 (  0.00%)    16271.69 ( 16.03%)
Hmean    unlink1-processes-172    13417.62 (  0.00%)    16067.55 ( 19.75%)
Hmean    unlink1-processes-203    15293.08 (  0.00%)    15440.40 (  0.96%)
Hmean    unlink1-processes-234    13719.32 (  0.00%)    16190.74 ( 18.01%)
Hmean    unlink1-processes-265    16400.97 (  0.00%)    16115.22 ( -1.74%)
Hmean    unlink1-processes-296    14388.60 (  0.00%)    16216.13 ( 12.70%)
Hmean    unlink1-processes-320    15771.85 (  0.00%)    15905.96 (  0.85%)

x86-64 (known to be fast for get_current()/this_cpu_read_stable() caching)
and ppc64 (with paca) show similar improvements in the unlink microbenches.
The small delta for ppc64 (2ms), does not represent the gains on the unlink
runs. In the case of x86, there was a decent amount of variation in the
latency runs, but always within a 20 to 50ms increase), ppc was more constant.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: mark.rutland@arm.com
Link: http://lkml.kernel.org/r/1483479794-14013-5-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:16 +01:00
Davidlohr Bueso
d269a8b8c5 kernel/locking: Compute 'current' directly
This patch effectively replaces the tsk pointer dereference
(which is obviously == current), to directly use get_current()
macro. This is to make the removal of setting foreign task
states smoother and painfully obvious. Performance win on some
archs such as x86-64 and ppc64. On a microbenchmark that calls
set_task_state() vs set_current_state() and an inode rwsem
pounding benchmark doing unlink:

== 1. x86-64 ==

Avg runtime set_task_state():    601 msecs
Avg runtime set_current_state(): 552 msecs

                                            vanilla                 dirty
Hmean    unlink1-processes-2      36089.26 (  0.00%)    38977.33 (  8.00%)
Hmean    unlink1-processes-5      28555.01 (  0.00%)    29832.55 (  4.28%)
Hmean    unlink1-processes-8      37323.75 (  0.00%)    44974.57 ( 20.50%)
Hmean    unlink1-processes-12     43571.88 (  0.00%)    44283.01 (  1.63%)
Hmean    unlink1-processes-21     34431.52 (  0.00%)    38284.45 ( 11.19%)
Hmean    unlink1-processes-30     34813.26 (  0.00%)    37975.17 (  9.08%)
Hmean    unlink1-processes-48     37048.90 (  0.00%)    39862.78 (  7.59%)
Hmean    unlink1-processes-79     35630.01 (  0.00%)    36855.30 (  3.44%)
Hmean    unlink1-processes-110    36115.85 (  0.00%)    39843.91 ( 10.32%)
Hmean    unlink1-processes-141    32546.96 (  0.00%)    35418.52 (  8.82%)
Hmean    unlink1-processes-172    34674.79 (  0.00%)    36899.21 (  6.42%)
Hmean    unlink1-processes-203    37303.11 (  0.00%)    36393.04 ( -2.44%)
Hmean    unlink1-processes-224    35712.13 (  0.00%)    36685.96 (  2.73%)

== 2. ppc64le ==

Avg runtime set_task_state():  938 msecs
Avg runtime set_current_state: 940 msecs

                                            vanilla                 dirty
Hmean    unlink1-processes-2      19269.19 (  0.00%)    30704.50 ( 59.35%)
Hmean    unlink1-processes-5      20106.15 (  0.00%)    21804.15 (  8.45%)
Hmean    unlink1-processes-8      17496.97 (  0.00%)    17243.28 ( -1.45%)
Hmean    unlink1-processes-12     14224.15 (  0.00%)    17240.21 ( 21.20%)
Hmean    unlink1-processes-21     14155.66 (  0.00%)    15681.23 ( 10.78%)
Hmean    unlink1-processes-30     14450.70 (  0.00%)    15995.83 ( 10.69%)
Hmean    unlink1-processes-48     16945.57 (  0.00%)    16370.42 ( -3.39%)
Hmean    unlink1-processes-79     15788.39 (  0.00%)    14639.27 ( -7.28%)
Hmean    unlink1-processes-110    14268.48 (  0.00%)    14377.40 (  0.76%)
Hmean    unlink1-processes-141    14023.65 (  0.00%)    16271.69 ( 16.03%)
Hmean    unlink1-processes-172    13417.62 (  0.00%)    16067.55 ( 19.75%)
Hmean    unlink1-processes-203    15293.08 (  0.00%)    15440.40 (  0.96%)
Hmean    unlink1-processes-234    13719.32 (  0.00%)    16190.74 ( 18.01%)
Hmean    unlink1-processes-265    16400.97 (  0.00%)    16115.22 ( -1.74%)
Hmean    unlink1-processes-296    14388.60 (  0.00%)    16216.13 ( 12.70%)
Hmean    unlink1-processes-320    15771.85 (  0.00%)    15905.96 (  0.85%)

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: mark.rutland@arm.com
Link: http://lkml.kernel.org/r/1483479794-14013-4-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:14 +01:00
Davidlohr Bueso
0039962a14 kernel/exit: Compute 'current' directly
This patch effectively replaces the tsk pointer dereference (which is
obviously == current), to directly use get_current() macro. In this
case, do_exit() always passes current to exit_mm(), hence we can
simply get rid of the argument. This is also a performance win on some
archs such as x86-64 and ppc64 -- arm64 is no longer an issue.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: mark.rutland@arm.com
Link: http://lkml.kernel.org/r/1483479794-14013-2-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:11 +01:00
Jiri Olsa
475113d937 perf/x86/intel: Account interrupts for PEBS errors
It's possible to set up PEBS events to get only errors and not
any data, like on SNB-X (model 45) and IVB-EP (model 62)
via 2 perf commands running simultaneously:

    taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10

This leads to a soft lock up, because the error path of the
intel_pmu_drain_pebs_nhm() does not account event->hw.interrupt
for error PEBS interrupts, so in case you're getting ONLY
errors you don't have a way to stop the event when it's over
the max_samples_per_tick limit:

  NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
  ...
  RIP: 0010:[<ffffffff81159232>]  [<ffffffff81159232>] smp_call_function_single+0xe2/0x140
  ...
  Call Trace:
   ? trace_hardirqs_on_caller+0xf5/0x1b0
   ? perf_cgroup_attach+0x70/0x70
   perf_install_in_context+0x199/0x1b0
   ? ctx_resched+0x90/0x90
   SYSC_perf_event_open+0x641/0xf90
   SyS_perf_event_open+0x9/0x10
   do_syscall_64+0x6c/0x1f0
   entry_SYSCALL64_slow_path+0x25/0x25

Add perf_event_account_interrupt() which does the interrupt
and frequency checks and call it from intel_pmu_drain_pebs_nhm()'s
error path.

We keep the pending_kill and pending_wakeup logic only in the
__perf_event_overflow() path, because they make sense only if
there's any data to deliver.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:06:49 +01:00
Peter Zijlstra
321027c1fe perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race
Di Shen reported a race between two concurrent sys_perf_event_open()
calls where both try and move the same pre-existing software group
into a hardware context.

The problem is exactly that described in commit:

  f63a8daa58 ("perf: Fix event->ctx locking")

... where, while we wait for a ctx->mutex acquisition, the event->ctx
relation can have changed under us.

That very same commit failed to recognise sys_perf_event_context() as an
external access vector to the events and thereby didn't apply the
established locking rules correctly.

So while one sys_perf_event_open() call is stuck waiting on
mutex_lock_double(), the other (which owns said locks) moves the group
about. So by the time the former sys_perf_event_open() acquires the
locks, the context we've acquired is stale (and possibly dead).

Apply the established locking rules as per perf_event_ctx_lock_nested()
to the mutex_lock_double() for the 'move_group' case. This obviously means
we need to validate state after we acquire the locks.

Reported-by: Di Shen (Keen Lab)
Tested-by: John Dias <joaodias@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Min Chong <mchong@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: f63a8daa58 ("perf: Fix event->ctx locking")
Link: http://lkml.kernel.org/r/20170106131444.GZ3174@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 10:56:11 +01:00
Peter Zijlstra
63cae12bce perf/core: Fix sys_perf_event_open() vs. hotplug
There is problem with installing an event in a task that is 'stuck' on
an offline CPU.

Blocked tasks are not dis-assosciated from offlined CPUs, after all, a
blocked task doesn't run and doesn't require a CPU etc.. Only on
wakeup do we ammend the situation and place the task on a available
CPU.

If we hit such a task with perf_install_in_context() we'll loop until
either that task wakes up or the CPU comes back online, if the task
waking depends on the event being installed, we're stuck.

While looking into this issue, I also spotted another problem, if we
hit a task with perf_install_in_context() that is in the middle of
being migrated, that is we observe the old CPU before sending the IPI,
but run the IPI (on the old CPU) while the task is already running on
the new CPU, things also go sideways.

Rework things to rely on task_curr() -- outside of rq->lock -- which
is rather tricky. Imagine the following scenario where we're trying to
install the first event into our task 't':

CPU0            CPU1            CPU2

                (current == t)

t->perf_event_ctxp[] = ctx;
smp_mb();
cpu = task_cpu(t);

                switch(t, n);
                                migrate(t, 2);
                                switch(p, t);

                                ctx = t->perf_event_ctxp[]; // must not be NULL

smp_function_call(cpu, ..);

                generic_exec_single()
                  func();
                    spin_lock(ctx->lock);
                    if (task_curr(t)) // false

                    add_event_to_ctx();
                    spin_unlock(ctx->lock);

                                perf_event_context_sched_in();
                                  spin_lock(ctx->lock);
                                  // sees event

So its CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
Because if CPU2's load of that variable were to observe NULL, it would
not try to schedule the ctx and we'd have a task running without its
counter, which would be 'bad'.

As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
first and not see the event yet, then CPU0 must observe task_curr()
and retry. If the install happens first, then we must see the event on
sched-in and all is well.

I think we can translate the first part (until the 'must not be NULL')
of the scenario to a litmus test like:

  C C-peterz

  {
  }

  P0(int *x, int *y)
  {
          int r1;

          WRITE_ONCE(*x, 1);
          smp_mb();
          r1 = READ_ONCE(*y);
  }

  P1(int *y, int *z)
  {
          WRITE_ONCE(*y, 1);
          smp_store_release(z, 1);
  }

  P2(int *x, int *z)
  {
          int r1;
          int r2;

          r1 = smp_load_acquire(z);
	  smp_mb();
          r2 = READ_ONCE(*x);
  }

  exists
  (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)

Where:
  x is perf_event_ctxp[],
  y is our tasks's CPU, and
  z is our task being placed on the rq of CPU2.

The P0 smp_mb() is the one added by this patch, ordering the store to
perf_event_ctxp[] from find_get_context() and the load of task_cpu()
in task_function_call().

The smp_store_release/smp_load_acquire model the RCpc locking of the
rq->lock and the smp_mb() of P2 is the context switch switching from
whatever CPU2 was running to our task 't'.

This litmus test evaluates into:

  Test C-peterz Allowed
  States 7
  0:r1=0; 2:r1=0; 2:r2=0;
  0:r1=0; 2:r1=0; 2:r2=1;
  0:r1=0; 2:r1=1; 2:r2=1;
  0:r1=1; 2:r1=0; 2:r2=0;
  0:r1=1; 2:r1=0; 2:r2=1;
  0:r1=1; 2:r1=1; 2:r2=0;
  0:r1=1; 2:r1=1; 2:r2=1;
  No
  Witnesses
  Positive: 0 Negative: 7
  Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
  Observation C-peterz Never 0 7
  Hash=e427f41d9146b2a5445101d3e2fcaa34

And the strong and weak model agree.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: jeremy.linton@arm.com
Link: http://lkml.kernel.org/r/20161209135900.GU3174@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 10:56:10 +01:00
Frederic Weisbecker
c8d7dabf8f sched/cputime: Rename vtime_account_user() to vtime_flush()
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y used to accumulate user time and
account it on ticks and context switches only through the
vtime_account_user() function.

Now this model has been generalized on the 3 archs for all kind of
cputime (system, irq, ...) and all the cputime flushing happens under
vtime_account_user().

So let's rename this function to better reflect its new role.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-11-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 09:54:13 +01:00
Frederic Weisbecker
1213699ab4 sched/cputime: Export account_guest_time()
In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y to delay
cputime accounting to the tick, let's allow archs to account cputime
directly to gtime.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 09:54:11 +01:00
Frederic Weisbecker
c31cc6a518 sched/cputime: Allow accounting system time using cpustat index
In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y to delay
cputime accounting to the tick, let's provide APIs to account system
time to precise contexts: hardirq, softirq, pure system, ...

Inspired-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 09:54:11 +01:00
Masami Hiramatsu
5b485629ba kprobes, extable: Identify kprobes trampolines as kernel text area
Improve __kernel_text_address()/kernel_text_address() to return
true if the given address is on a kprobe's instruction slot
trampoline.

This can help stacktraces to determine the address is on a
text area or not.

To implement this atomically in is_kprobe_*_slot(), also change
the insn_cache page list to an RCU list.

This changes timings a bit (it delays page freeing to the RCU garbage
collection phase), but none of that is in the hot path.

Note: this change can add small overhead to stack unwinders because
it adds 2 additional checks to __kernel_text_address(). However, the
impact should be very small, because kprobe_insn_pages list has 1 entry
per 256 probes(on x86, on arm/arm64 it will be 1024 probes),
and kprobe_optinsn_pages has 1 entry per 32 probes(on x86).
In most use cases, the number of kprobe events may be less
than 20, which means that is_kprobe_*_slot() will check just one entry.

Tested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/148388747896.6869.6354262871751682264.stgit@devbox
[ Improved the changelog and coding style. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 08:38:05 +01:00
Linus Torvalds
af54efa4f5 VFIO fixes for v4.10-rc4
- Cleanups and bug fixes for the mtty sample driver (Dan Carpenter)
  - Export and make use of has_capability() to fix incorrect use of
    ns_capable() for testing task capabilities (Jike Song)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJYeTWfAAoJECObm247sIsi1PEP/0lIkIQWBUlVWC1QA6bJN0Xx
 9c4pA34kLJwCpEtEoPxf6owjgK7kSBgIUUqBaNNDdKZQYttGgA+qiX3HhuvEigKL
 vEq5/TqwL6vv2aIUp/5uPP4NNTJD8RynwkfDI1B8DVQN6E1GM2zozpFUiZbDUxz/
 sgIuby9nuG3WTVLgOVayyMHlPTXG1+l+quRlAhMAseD7LMx7q/71NIjKggSUFRQG
 fkOVVTqfCnLJmIyq/cWbJt2cDgeWQq2/Ik6gje3SiOFtxi8fRdlzONUL+tHM1KgT
 r0htrq+r3B7BxI0CMZuoHIBt1SK443yu39xDzb0iXDSb5W9gwR14uFMuXv1ftfM0
 qkZnvpsXaT6wpKvK2ztmHgUiKJmOTgYrG77Dhz4oz6Mm0Y1mn6bV4yueoF/rQIn0
 GrM1Af/SVLf3Vhxw6i5a1s7kDgpySw8FfucKO5Xv3cOaIgNtlrrjxbKKa9DZ3wd7
 mnjD30XHwxEim8OCgv7CFswPsc5TiqYJTKGbnSJGo67ZCXWxXFHLIab0cn5yMd8G
 Qgw4mLnIv2rkRZOWpgMy4PedCNjZXNuQbW3I90kDb/VlPvRdCqUIsO0Ty10yaNhe
 s8Gwmxphoi3U/J7Y4T/BsfkCZ4Umut9gAt/WsG4kgWj3v0FOmxLgl39lC0cRigR6
 l7HSf0fOg/D9k6EN1xnc
 =yeS9
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v4.10-rc4' of git://github.com/awilliam/linux-vfio

Pull VFIO fixes from Alex Williamson:

 - Cleanups and bug fixes for the mtty sample driver (Dan Carpenter)

 - Export and make use of has_capability() to fix incorrect use of
   ns_capable() for testing task capabilities (Jike Song)

* tag 'vfio-v4.10-rc4' of git://github.com/awilliam/linux-vfio:
  vfio/type1: Remove pid_namespace.h include
  vfio iommu type1: fix the testing of capability for remote task
  capability: export has_capability
  vfio-mdev: remove some dead code
  vfio-mdev: buffer overflow in ioctl()
  vfio-mdev: return -EFAULT if copy_to_user() fails
2017-01-13 17:35:43 -08:00
Linus Torvalds
406732c932 * fix for module unload vs. deferred jump labels (note: there might be
other buggy modules!)
 * two NULL pointer dereferences from syzkaller
 * CVE from syzkaller, very serious on 4.10-rc, "just" kernel memory
   leak on releases
 * CVE from security@kernel.org, somewhat serious on AMD, less so on
   Intel
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJYd7l5AAoJEL/70l94x66DLWYH/0GUg+lK9J/gj0kwqi6BwsOP
 Rrs5Y7XvyNLsy/piBrrHDHvRa+DfAkrU8nepwgygX/yuGmSDV/zmdIb8XA/dvKht
 MN285NFlVjTyznYlU/LH3etx11CHLMNclishiFHQbcnohtvhOe+fvN6RVNdfeRxm
 d9iBPOum15ikc1xDl2z8Op+ZXVjMxkgLkzIXFcDBpJf4BvUx0X+ZHZXIKdizVhgU
 ZMD2ds/MutMB8X1A52qp6kQvT7xE4rp87M0So4qDMTbAto5G4ZmMaWC5MlK2Oxe/
 o+3qnx4vVz4H6uYzg1N4diHiC+buhgtXCLwwkcUOKKUVqJRP9e0Bh7kw8JA52XU=
 =C+tM
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:

 - fix for module unload vs deferred jump labels (note: there might be
   other buggy modules!)

 - two NULL pointer dereferences from syzkaller

 - also syzkaller: fix emulation of fxsave/fxrstor/sgdt/sidt, problem
   made worse during this merge window, "just" kernel memory leak on
   releases

 - fix emulation of "mov ss" - somewhat serious on AMD, less so on Intel

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: fix emulation of "MOV SS, null selector"
  KVM: x86: fix NULL deref in vcpu_scan_ioapic
  KVM: eventfd: fix NULL deref irqbypass consumer
  KVM: x86: Introduce segmented_write_std
  KVM: x86: flush pending lapic jump label updates on module unload
  jump_labels: API for flushing deferred jump label updates
2017-01-13 17:06:24 -08:00
Stephen Smalley
3a2f5a59a6 security,selinux,smack: kill security_task_wait hook
As reported by yangshukui, a permission denial from security_task_wait()
can lead to a soft lockup in zap_pid_ns_processes() since it only expects
sys_wait4() to return 0 or -ECHILD. Further, security_task_wait() can
in general lead to zombies; in the absence of some way to automatically
reparent a child process upon a denial, the hook is not useful.  Remove
the security hook and its implementations in SELinux and Smack.  Smack
already removed its check from its hook.

Reported-by: yangshukui <yangshukui@huawei.com>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-01-12 11:10:57 -05:00
Will Deacon
42d1a731ff Merge branch 'aarch64/for-next/debug-virtual' into aarch64/for-next/core
Merge core DEBUG_VIRTUAL changes from Laura Abbott. Later arm and arm64
support depends on these.

* aarch64/for-next/debug-virtual:
  drivers: firmware: psci: Use __pa_symbol for kernel symbol
  mm/usercopy: Switch to using lm_alias
  mm/kasan: Switch to using __pa_symbol and lm_alias
  kexec: Switch to __pa_symbol
  mm: Introduce lm_alias
  mm/cma: Cleanup highmem check
  lib/Kconfig.debug: Add ARCH_HAS_DEBUG_VIRTUAL
2017-01-12 15:04:29 +00:00
Daniel Borkmann
62c7989b24 bpf: allow b/h/w/dw access for bpf's cb in ctx
When structs are used to store temporary state in cb[] buffer that is
used with programs and among tail calls, then the generated code will
not always access the buffer in bpf_w chunks. We can ease programming
of it and let this act more natural by allowing for aligned b/h/w/dw
sized access for cb[] ctx member. Various test cases are attached as
well for the selftest suite. Potentially, this can also be reused for
other program types to pass data around.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-12 10:00:31 -05:00
Daniel Borkmann
6b8cc1d11e bpf: pass original insn directly to convert_ctx_access
Currently, when calling convert_ctx_access() callback for the various
program types, we pass in insn->dst_reg, insn->src_reg, insn->off from
the original instruction. This information is needed to rewrite the
instruction that is based on the user ctx structure into a kernel
representation for the ctx. As we'd like to allow access size beyond
just BPF_W, we'd need also insn->code for that in order to decode the
original access size. Given that, lets just pass insn directly to the
convert_ctx_access() callback and work on that to not clutter the
callback with even more arguments we need to pass when everything is
already contained in insn. So lets go through that once, no functional
change.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-12 10:00:31 -05:00
Jike Song
19c816e8e4 capability: export has_capability
has_capability() is sometimes needed by modules to test capability
for specified task other than current, so export it.

Cc: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Jike Song <jike.song@intel.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-01-12 07:01:56 -07:00
David Matlack
b6416e6101 jump_labels: API for flushing deferred jump label updates
Modules that use static_key_deferred need a way to synchronize with
any delayed work that is still pending when the module is unloaded.
Introduce static_key_deferred_flush() which flushes any pending
jump label updates.

Signed-off-by: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-12 14:33:16 +01:00
Pan Xinhui
75437bb304 locking/pvqspinlock: Don't wait if vCPU is preempted
If prev node is not in running state or its vCPU is preempted, we can give
up our vCPU slices in pv_wait_node() ASAP.

Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: longman@redhat.com
Link: http://lkml.kernel.org/r/1484035006-6787-1-git-send-email-xinhui.pan@linux.vnet.ibm.com
[ Fixed typos in the changelog, removed ugly linebreak from the code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-12 09:35:57 +01:00
Waiman Long
607904c357 locking/spinlocks: Remove the unused spin_lock_bh_nested() API
The spin_lock_bh_nested() API is defined but is not used anywhere
in the kernel. So all spin_lock_bh_nested() and related APIs are
now removed.

Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1483975612-16447-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-12 09:33:39 +01:00
David S. Miller
02ac5d1487 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two AF_* families adding entries to the lockdep tables
at the same time.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-11 14:43:39 -05:00
Laura Abbott
b6e92aa810 kexec: Switch to __pa_symbol
__pa_symbol is the correct api to get the physical address of kernel
symbols. Switch to it to allow for better debug checking.

Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-01-11 13:56:49 +00:00
Frederic Weisbecker
24b91e360e nohz: Fix collision between tick and other hrtimers
When the tick is stopped and an interrupt occurs afterward, we check on
that interrupt exit if the next tick needs to be rescheduled. If it
doesn't need any update, we don't want to do anything.

In order to check if the tick needs an update, we compare it against the
clockevent device deadline. Now that's a problem because the clockevent
device is at a lower level than the tick itself if it is implemented
on top of hrtimer.

Every hrtimer share this clockevent device. So comparing the next tick
deadline against the clockevent device deadline is wrong because the
device may be programmed for another hrtimer whose deadline collides
with the tick. As a result we may end up not reprogramming the tick
accidentally.

In a worst case scenario under full dynticks mode, the tick stops firing
as it is supposed to every 1hz, leaving /proc/stat stalled:

      Task in a full dynticks CPU
      ----------------------------

      * hrtimer A is queued 2 seconds ahead
      * the tick is stopped, scheduled 1 second ahead
      * tick fires 1 second later
      * on tick exit, nohz schedules the tick 1 second ahead but sees
        the clockevent device is already programmed to that deadline,
        fooled by hrtimer A, the tick isn't rescheduled.
      * hrtimer A is cancelled before its deadline
      * tick never fires again until an interrupt happens...

In order to fix this, store the next tick deadline to the tick_sched
local structure and reuse that value later to check whether we need to
reprogram the clock after an interrupt.

On the other hand, ts->sleep_length still wants to know about the next
clock event and not just the tick, so we want to improve the related
comment to avoid confusion.

Reported-by: James Hartsock <hartsjc@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/1483539124-5693-1-git-send-email-fweisbec@gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-01-11 10:41:33 +01:00
Jamie Iles
2d39b3cd34 signal: protect SIGNAL_UNKILLABLE from unintentional clearing.
Since commit 00cd5c37af ("ptrace: permit ptracing of /sbin/init") we
can now trace init processes.  init is initially protected with
SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
there are a number of paths during tracing where SIGNAL_UNKILLABLE can
be implicitly cleared.

This can result in init becoming stoppable/killable after tracing.  For
example, running:

  while true; do kill -STOP 1; done &
  strace -p 1

and then stopping strace and the kill loop will result in init being
left in state TASK_STOPPED.  Sending SIGCONT to init will resume it, but
init will now respond to future SIGSTOP signals rather than ignoring
them.

Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
that we don't clear SIGNAL_UNKILLABLE.

Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.com
Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-10 18:31:55 -08:00
Dan Williams
f931ab479d mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}
Both arch_add_memory() and arch_remove_memory() expect a single threaded
context.

For example, arch/x86/mm/init_64.c::kernel_physical_mapping_init() does
not hold any locks over this check and branch:

    if (pgd_val(*pgd)) {
    	pud = (pud_t *)pgd_page_vaddr(*pgd);
    	paddr_last = phys_pud_init(pud, __pa(vaddr),
    				   __pa(vaddr_end),
    				   page_size_mask);
    	continue;
    }

    pud = alloc_low_page();
    paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
    			   page_size_mask);

The result is that two threads calling devm_memremap_pages()
simultaneously can end up colliding on pgd initialization.  This leads
to crash signatures like the following where the loser of the race
initializes the wrong pgd entry:

    BUG: unable to handle kernel paging request at ffff888ebfff0000
    IP: memcpy_erms+0x6/0x10
    PGD 2f8e8fc067 PUD 0 /* <---- Invalid PUD */
    Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
    CPU: 54 PID: 3818 Comm: systemd-udevd Not tainted 4.6.7+ #13
    task: ffff882fac290040 ti: ffff882f887a4000 task.ti: ffff882f887a4000
    RIP: memcpy_erms+0x6/0x10
    [..]
    Call Trace:
      ? pmem_do_bvec+0x205/0x370 [nd_pmem]
      ? blk_queue_enter+0x3a/0x280
      pmem_rw_page+0x38/0x80 [nd_pmem]
      bdev_read_page+0x84/0xb0

Hold the standard memory hotplug mutex over calls to
arch_{add,remove}_memory().

Fixes: 41e94a8513 ("add devm_memremap_pages")
Link: http://lkml.kernel.org/r/148357647831.9498.12606007370121652979.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-10 18:31:54 -08:00
Michal Hocko
7984c27c2c bpf: do not use KMALLOC_SHIFT_MAX
Commit 01b3f52157 ("bpf: fix allocation warnings in bpf maps and
integer overflow") has added checks for the maximum allocateable size.
It (ab)used KMALLOC_SHIFT_MAX for that purpose.

While this is not incorrect it is not very clean because we already have
KMALLOC_MAX_SIZE for this very reason so let's change both checks to use
KMALLOC_MAX_SIZE instead.

The original motivation for using KMALLOC_SHIFT_MAX was to work around
an incorrect KMALLOC_MAX_SIZE which could lead to allocation warnings
but it is no longer needed since "slab: make sure that KMALLOC_MAX_SIZE
will fit into MAX_ORDER".

Link: http://lkml.kernel.org/r/20161220130659.16461-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-10 18:31:54 -08:00
Tobias Klauser
3bf003335b bpf: Make unnecessarily global functions static
Make the functions __local_list_pop_free(), __local_list_pop_pending(),
bpf_common_lru_populate() and bpf_percpu_lru_populate() static as they
are not used outide of bpf_lru_list.c

This fixes the following GCC warnings when building with 'W=1':

  kernel/bpf/bpf_lru_list.c:363:22: warning: no previous prototype for ‘__local_list_pop_free’ [-Wmissing-prototypes]
  kernel/bpf/bpf_lru_list.c:376:22: warning: no previous prototype for ‘__local_list_pop_pending’ [-Wmissing-prototypes]
  kernel/bpf/bpf_lru_list.c:560:6: warning: no previous prototype for ‘bpf_common_lru_populate’ [-Wmissing-prototypes]
  kernel/bpf/bpf_lru_list.c:577:6: warning: no previous prototype for ‘bpf_percpu_lru_populate’ [-Wmissing-prototypes]

Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-10 21:00:59 -05:00
Tobias Klauser
a5ef01aaac bpf: Remove unused but set variable in __bpf_lru_list_shrink_inactive()
Remove the unused but set variable 'first_node' in
__bpf_lru_list_shrink_inactive() to fix the following GCC warning when
building with 'W=1':

  kernel/bpf/bpf_lru_list.c:216:41: warning: variable ‘first_node’ set but not used [-Wunused-but-set-variable]

Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-10 21:00:59 -05:00
Parav Pandit
7896dfb0a6 rdmacg: Fixed uninitialized current resource usage
Fixed warning reported by kbuild test robot.
When reading current resource usage value, when no resources are
allocated, its possible that it can report a uninitialized value
for current resource usage.
This fix avoids it by initializing it to zero as no resource is
allocated.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-01-10 12:52:32 -05:00
Parav Pandit
39d3e7584a rdmacg: Added rdma cgroup controller
Added rdma cgroup controller that does accounting, limit enforcement
on rdma/IB resources.

Added rdma cgroup header file which defines its APIs to perform
charging/uncharging functionality. It also defined APIs for RDMA/IB
stack for device registration. Devices which are registered will
participate in controller functions of accounting and limit
enforcements. It define rdmacg_device structure to bind IB stack
and RDMA cgroup controller.

RDMA resources are tracked using resource pool. Resource pool is per
device, per cgroup entity which allows setting up accounting limits
on per device basis.

Currently resources are defined by the RDMA cgroup.

Resource pool is created/destroyed dynamically whenever
charging/uncharging occurs respectively and whenever user
configuration is done. Its a tradeoff of memory vs little more code
space that creates resource pool object whenever necessary, instead of
creating them during cgroup creation and device registration time.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-01-10 11:14:27 -05:00
Andrei Vagin
add7c65ca4 pid: fix lockdep deadlock warning due to ucount_lock
=========================================================
[ INFO: possible irq lock inversion dependency detected ]
4.10.0-rc2-00024-g4aecec9-dirty #118 Tainted: G        W
---------------------------------------------------------
swapper/1/0 just changed the state of lock:
 (&(&sighand->siglock)->rlock){-.....}, at: [<ffffffffbd0a1bc6>] __lock_task_sighand+0xb6/0x2c0
but this lock took another, HARDIRQ-unsafe lock in the past:
 (ucounts_lock){+.+...}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Chain exists of:                 &(&sighand->siglock)->rlock --> &(&tty->ctrl_lock)->rlock --> ucounts_lock
 Possible interrupt unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(ucounts_lock);
                               local_irq_disable();
                               lock(&(&sighand->siglock)->rlock);
                               lock(&(&tty->ctrl_lock)->rlock);
  <Interrupt>
    lock(&(&sighand->siglock)->rlock);

 *** DEADLOCK ***

This patch removes a dependency between rlock and ucount_lock.

Fixes: f333c700c6 ("pidns: Add a limit on the number of pid namespaces")
Cc: stable@vger.kernel.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-01-10 13:34:56 +13:00
Alexei Starovoitov
39f19ebbf5 bpf: rename ARG_PTR_TO_STACK
since ARG_PTR_TO_STACK is no longer just pointer to stack
rename it to ARG_PTR_TO_MEM and adjust comment.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-09 16:56:27 -05:00
Gianluca Borello
06c1c04972 bpf: allow helpers access to variable memory
Currently, helpers that read and write from/to the stack can do so using
a pair of arguments of type ARG_PTR_TO_STACK and ARG_CONST_STACK_SIZE.
ARG_CONST_STACK_SIZE accepts a constant register of type CONST_IMM, so
that the verifier can safely check the memory access. However, requiring
the argument to be a constant can be limiting in some circumstances.

Since the current logic keeps track of the minimum and maximum value of
a register throughout the simulated execution, ARG_CONST_STACK_SIZE can
be changed to also accept an UNKNOWN_VALUE register in case its
boundaries have been set and the range doesn't cause invalid memory
accesses.

One common situation when this is useful:

int len;
char buf[BUFSIZE]; /* BUFSIZE is 128 */

if (some_condition)
	len = 42;
else
	len = 84;

some_helper(..., buf, len & (BUFSIZE - 1));

The compiler can often decide to assign the constant values 42 or 48
into a variable on the stack, instead of keeping it in a register. When
the variable is then read back from stack into the register in order to
be passed to the helper, the verifier will not be able to recognize the
register as constant (the verifier is not currently tracking all
constant writes into memory), and the program won't be valid.

However, by allowing the helper to accept an UNKNOWN_VALUE register,
this program will work because the bitwise AND operation will set the
range of possible values for the UNKNOWN_VALUE register to [0, BUFSIZE),
so the verifier can guarantee the helper call will be safe (assuming the
argument is of type ARG_CONST_STACK_SIZE_OR_ZERO, otherwise one more
check against 0 would be needed). Custom ranges can be set not only with
ALU operations, but also by explicitly comparing the UNKNOWN_VALUE
register with constants.

Another very common example happens when intercepting system call
arguments and accessing user-provided data of variable size using
bpf_probe_read(). One can load at runtime the user-provided length in an
UNKNOWN_VALUE register, and then read that exact amount of data up to a
compile-time determined limit in order to fit into the proper local
storage allocated on the stack, without having to guess a suboptimal
access size at compile time.

Also, in case the helpers accepting the UNKNOWN_VALUE register operate
in raw mode, disable the raw mode so that the program is required to
initialize all memory, since there is no guarantee the helper will fill
it completely, leaving possibilities for data leak (just relevant when
the memory used by the helper is the stack, not when using a pointer to
map element value or packet). In other words, ARG_PTR_TO_RAW_STACK will
be treated as ARG_PTR_TO_STACK.

Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-09 16:56:27 -05:00
Gianluca Borello
f0318d01b6 bpf: allow adjusted map element values to spill
commit 484611357c ("bpf: allow access into map value arrays")
introduces the ability to do pointer math inside a map element value via
the PTR_TO_MAP_VALUE_ADJ register type.

The current support doesn't handle the case where a PTR_TO_MAP_VALUE_ADJ
is spilled into the stack, limiting several use cases, especially when
generating bpf code from a compiler.

Handle this case by explicitly enabling the register type
PTR_TO_MAP_VALUE_ADJ to be spilled. Also, make sure that min_value and
max_value are reset just for BPF_LDX operations that don't result in a
restore of a spilled register from stack.

Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-09 16:56:27 -05:00
Gianluca Borello
5722569bb9 bpf: allow helpers access to map element values
Enable helpers to directly access a map element value by passing a
register type PTR_TO_MAP_VALUE (or PTR_TO_MAP_VALUE_ADJ) to helper
arguments ARG_PTR_TO_STACK or ARG_PTR_TO_RAW_STACK.

This enables several use cases. For example, a typical tracing program
might want to capture pathnames passed to sys_open() with:

struct trace_data {
	char pathname[PATHLEN];
};

SEC("kprobe/sys_open")
void bpf_sys_open(struct pt_regs *ctx)
{
	struct trace_data data;
	bpf_probe_read(data.pathname, sizeof(data.pathname), ctx->di);

	/* consume data.pathname, for example via
	 * bpf_trace_printk() or bpf_perf_event_output()
	 */
}

Such a program could easily hit the stack limit in case PATHLEN needs to
be large or more local variables need to exist, both of which are quite
common scenarios. Allowing direct helper access to map element values,
one could do:

struct bpf_map_def SEC("maps") scratch_map = {
	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
	.key_size = sizeof(u32),
	.value_size = sizeof(struct trace_data),
	.max_entries = 1,
};

SEC("kprobe/sys_open")
int bpf_sys_open(struct pt_regs *ctx)
{
	int id = 0;
	struct trace_data *p = bpf_map_lookup_elem(&scratch_map, &id);
	if (!p)
		return;
	bpf_probe_read(p->pathname, sizeof(p->pathname), ctx->di);

	/* consume p->pathname, for example via
	 * bpf_trace_printk() or bpf_perf_event_output()
	 */
}

And wouldn't risk exhausting the stack.

Code changes are loosely modeled after commit 6841de8b0d ("bpf: allow
helpers access the packet directly"). Unlike with PTR_TO_PACKET, these
changes just work with ARG_PTR_TO_STACK and ARG_PTR_TO_RAW_STACK (not
ARG_PTR_TO_MAP_KEY, ARG_PTR_TO_MAP_VALUE, ...): adding those would be
trivial, but since there is not currently a use case for that, it's
reasonable to limit the set of changes.

Also, add new tests to make sure accesses to map element values from
helpers never go out of boundary, even when adjusted.

Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-09 16:56:26 -05:00
Gianluca Borello
dbcfe5f76d bpf: split check_mem_access logic for map values
Move the logic to check memory accesses to a PTR_TO_MAP_VALUE_ADJ from
check_mem_access() to a separate helper check_map_access_adj(). This
enables to use those checks in other parts of the verifier as well,
where boundaries on PTR_TO_MAP_VALUE_ADJ might need to be checked, for
example when checking helper function arguments. The same thing is
already happening for other types such as PTR_TO_PACKET and its
check_packet_access() helper.

The code has been copied verbatim, with the only difference of removing
the "off += reg->max_value" statement and moving the sum into the call
statement to check_map_access(), as that was only needed due to the
earlier common check_map_access() call.

Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-09 16:56:26 -05:00
Stephen Boyd
40d9f82750 timekeeping: Remove unused timekeeping_{get,set}_tai_offset()
The last caller to timekeeping_set_tai_offset() was in commit
0b5154fb90 (timekeeping: Simplify tai updating from
do_adjtimex, 2013-03-22) and the last caller to
timekeeping_get_tai_offset() was in commit 76f4108892 (hrtimer:
Cleanup hrtimer accessors to the timekepeing state, 2014-07-16).
Remove these unused functions now that we handle TAI offsets
differently.

Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2017-01-06 16:47:28 -08:00
Linus Torvalds
6989606a72 Merge branch 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit
Pull audit fixes from Paul Moore:
 "Two small fixes relating to audit's use of fsnotify.

  The first patch plugs a leak and the second fixes some lock
  shenanigans. The patches are small and I banged on this for an
  afternoon with our testsuite and didn't see anything odd"

* 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit:
  audit: Fix sleep in atomic
  fsnotify: Remove fsnotify_duplicate_mark()
2017-01-05 23:06:06 -08:00
Jan Kara
be29d20f3f audit: Fix sleep in atomic
Audit tree code was happily adding new notification marks while holding
spinlocks. Since fsnotify_add_mark() acquires group->mark_mutex this can
lead to sleeping while holding a spinlock, deadlocks due to lock
inversion, and probably other fun. Fix the problem by acquiring
group->mark_mutex earlier.

CC: Paul Moore <paul@paul-moore.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-01-03 15:56:38 -05:00
Tejun Heo
e0aed7c74f cgroup: fix RCU related sparse warnings
kn->priv which is a void * is used as a RCU pointer by cgroup.  When
dereferencing it, it was passing kn->priv to rcu_derefreence() without
casting it into a RCU pointer triggering address space mismatch
warning from sparse.  Fix them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27 14:49:09 -05:00
Tejun Heo
dcfe149b9f cgroup: move namespace code to kernel/cgroup/namespace.c
get/put_css_set() get exposed in cgroup-internal.h in the process.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27 14:49:09 -05:00
Tejun Heo
d62beb7f3d cgroup: rename functions for consistency
Now that v1 functions are separated out, rename some functions for
consistency.

 cgroup_dfl_base_files		-> cgroup_base_files
 cgroup_legacy_base_files	-> cgroup1_base_files
 cgroup_ssid_no_v1()		-> cgroup1_ssid_disabled()
 cgroup_pidlist_destroy_all	-> cgroup1_pidlist_destroy_all()
 cgroup_release_agent()		-> cgroup1_release_agent()
 check_for_release()		-> cgroup1_check_for_release()

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27 14:49:08 -05:00
Tejun Heo
1592c9b223 cgroup: move v1 mount functions to kernel/cgroup/cgroup-v1.c
Now that the v1 mount code is split into separate functions, move them
to kernel/cgroup/cgroup-v1.c along with the mount option handling
code.  As this puts all v1-only kernfs_syscall_ops in cgroup-v1.c,
move cgroup1_kf_syscall_ops to cgroup-v1.c too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27 14:49:08 -05:00
Tejun Heo
fa069904dd cgroup: separate out cgroup1_kf_syscall_ops
Currently, cgroup_kf_syscall_ops is shared by v1 and v2 and the
specific methods test the version and take different actions.  Split
out v1 functions and put them in cgroup1_kf_syscall_ops and remove the
now unnecessary explicit branches in specific methods.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27 14:49:07 -05:00
Tejun Heo
633feee310 cgroup: refactor mount path and clearly distinguish v1 and v2 paths
While sharing some mechanisms, the mount paths of v1 and v2 are
substantially different.  Their implementations were mixed in
cgroup_mount().  This patch splits them out so that they're easier to
follow and organize.

This patch causes one functional change - the WARN_ON(new_sb) gets
lost.  This is because the actual mounting gets moved to
cgroup_do_mount() and thus @new_sb is no longer accessible by default
to cgroup1_mount().  While we can add it as an explicit out parameter
to cgroup_do_mount(), this part of code hasn't changed and the warning
hasn't triggered for quite a while.  Dropping it should be fine.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27 14:49:07 -05:00
Tejun Heo
0a268dbd79 cgroup: move cgroup v1 specific code to kernel/cgroup/cgroup-v1.c
cgroup.c is getting too unwieldy.  Let's move out cgroup v1 specific
code along with the debug controller into kernel/cgroup/cgroup-v1.c.

v2: cgroup_mutex and css_set_lock made available in cgroup-internal.h
    regardless of CONFIG_PROVE_RCU.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27 14:49:06 -05:00
Tejun Heo
201af4c0fa cgroup: move cgroup files under kernel/cgroup/
They're growing to be too many and planned to get split further.  Move
them under their own directory.

 kernel/cgroup.c		-> kernel/cgroup/cgroup.c
 kernel/cgroup_freezer.c	-> kernel/cgroup/freezer.c
 kernel/cgroup_pids.c		-> kernel/cgroup/pids.c
 kernel/cpuset.c		-> kernel/cgroup/cpuset.c

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27 14:49:05 -05:00
Tejun Heo
5f617ebbdf cgroup: reorder css_set fields
Reorder css_set fields so that they're roughly in the order of how hot
they are.  The rough order is

1. the actual csses
2. reference counter and the default cgroup pointer.
3. task lists and iterations
4. fields used during merge including css_set lookup
5. the rest

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27 14:49:05 -05:00
Tejun Heo
2fae986343 cgroup: remove cgroup_pid_fry() and friends
cgroup_pid_fry() was added to mangle cgroup.procs pid listing order on
v2 to make it clear that the output is not sorted.  Now that v2 now
uses a separate "cgroup.procs" read method, this is no longer used.
Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27 14:49:05 -05:00
Tejun Heo
b4b90a8e86 cgroup: reimplement reading "cgroup.procs" on cgroup v2
On v1, "tasks" and "cgroup.procs" are expected to be sorted which
makes the implementation expensive and unnecessarily complicated
involving result cache management.

v2 doesn't have the sorting requirement, so it can just iterate and
print processes one by one.  seq_files are either read sequentially or
reset to position zero, so the implementation doesn't even need to
worry about seeking.

This keeps the css_task_iter across multiple read(2) calls and
migrations of new processes always append won't miss processes which
are newly migrated in before each read(2).

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27 14:49:04 -05:00
Tejun Heo
e90cbebc3f cgroup add cftype->open/release() callbacks
Pipe the newly added kernfs->open/release() callbacks through cftype.
While at it, as cleanup operations now can be performed from
->release() instead of ->seq_stop(), make the latter optional.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27 14:49:03 -05:00
Al Viro
f81dc7d7d5 splice_pipe_desc: kill ->flags
no users left

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-26 23:53:38 -05:00
Thomas Gleixner
b9d9d6911b smp/hotplug: Undo tglxs brainfart
The attempt to prevent overwriting an active state resulted in a
disaster which effectively disables all dynamically allocated hotplug
states.

Cleanup the mess.

Fixes: dc280d9362 ("cpu/hotplug: Prevent overwriting of callbacks")
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-26 17:30:24 -08:00
Linus Torvalds
3ddc76dfc7 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer type cleanups from Thomas Gleixner:
 "This series does a tree wide cleanup of types related to
  timers/timekeeping.

   - Get rid of cycles_t and use a plain u64. The type is not really
     helpful and caused more confusion than clarity

   - Get rid of the ktime union. The union has become useless as we use
     the scalar nanoseconds storage unconditionally now. The 32bit
     timespec alike storage got removed due to the Y2038 limitations
     some time ago.

     That leaves the odd union access around for no reason. Clean it up.

  Both changes have been done with coccinelle and a small amount of
  manual mopping up"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ktime: Get rid of ktime_equal()
  ktime: Cleanup ktime_set() usage
  ktime: Get rid of the union
  clocksource: Use a plain u64 instead of cycle_t
2016-12-25 14:30:04 -08:00
Linus Torvalds
b272f732f8 Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP hotplug notifier removal from Thomas Gleixner:
 "This is the final cleanup of the hotplug notifier infrastructure. The
  series has been reintgrated in the last two days because there came a
  new driver using the old infrastructure via the SCSI tree.

  Summary:

   - convert the last leftover drivers utilizing notifiers

   - fixup for a completely broken hotplug user

   - prevent setup of already used states

   - removal of the notifiers

   - treewide cleanup of hotplug state names

   - consolidation of state space

  There is a sphinx based documentation pending, but that needs review
  from the documentation folks"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/armada-xp: Consolidate hotplug state space
  irqchip/gic: Consolidate hotplug state space
  coresight/etm3/4x: Consolidate hotplug state space
  cpu/hotplug: Cleanup state names
  cpu/hotplug: Remove obsolete cpu hotplug register/unregister functions
  staging/lustre/libcfs: Convert to hotplug state machine
  scsi/bnx2i: Convert to hotplug state machine
  scsi/bnx2fc: Convert to hotplug state machine
  cpu/hotplug: Prevent overwriting of callbacks
  x86/msr: Remove bogus cleanup from the error path
  bus: arm-ccn: Prevent hotplug callback leak
  perf/x86/intel/cstate: Prevent hotplug callback leak
  ARM/imx/mmcd: Fix broken cpu hotplug handling
  scsi: qedi: Convert to hotplug state machine
2016-12-25 14:05:56 -08:00
Thomas Gleixner
8b0e195314 ktime: Cleanup ktime_set() usage
ktime_set(S,N) was required for the timespec storage type and is still
useful for situations where a Seconds and Nanoseconds part of a time value
needs to be converted. For anything where the Seconds argument is 0, this
is pointless and can be replaced with a simple assignment.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-12-25 17:21:22 +01:00
Thomas Gleixner
2456e85535 ktime: Get rid of the union
ktime is a union because the initial implementation stored the time in
scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
variant for 32bit machines. The Y2038 cleanup removed the timespec variant
and switched everything to scalar nanoseconds. The union remained, but
become completely pointless.

Get rid of the union and just keep ktime_t as simple typedef of type s64.

The conversion was done with coccinelle and some manual mopping up.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-12-25 17:21:22 +01:00
Thomas Gleixner
a5a1d1c291 clocksource: Use a plain u64 instead of cycle_t
There is no point in having an extra type for extra confusion. u64 is
unambiguous.

Conversion was done with the following coccinelle script:

@rem@
@@
-typedef u64 cycle_t;

@fix@
typedef cycle_t;
@@
-cycle_t
+u64

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
2016-12-25 11:04:12 +01:00
Thomas Gleixner
530e9b76ae cpu/hotplug: Remove obsolete cpu hotplug register/unregister functions
hotcpu_notifier(), cpu_notifier(), __hotcpu_notifier(), __cpu_notifier(),
register_hotcpu_notifier(), register_cpu_notifier(),
__register_hotcpu_notifier(), __register_cpu_notifier(),
unregister_hotcpu_notifier(), unregister_cpu_notifier(),
__unregister_hotcpu_notifier(), __unregister_cpu_notifier()

are unused now. Remove them and all related code.

Remove also the now pointless cpu notifier error injection mechanism. The
states can be executed step by step and error rollback is the same as cpu
down, so any state transition can be tested w/o requiring the notifier
error injection.

Some CPU hotplug states are kept as they are (ab)used for hotplug state
tracking.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161221192112.005642358@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-25 10:47:43 +01:00
Thomas Gleixner
dc280d9362 cpu/hotplug: Prevent overwriting of callbacks
Developers manage to overwrite states blindly without thought. That's fatal
and hard to debug. Add sanity checks to make it fail.

This requries to restructure the code so that the dynamic state allocation
happens in the same lock protected section as the actual store. Otherwise
the previous assignment of 'Reserved' to the name field would trigger the
overwrite check.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20161221192111.675234535@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-25 10:47:42 +01:00
Linus Torvalds
7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00
Linus Torvalds
00198dab3b Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "On the kernel side there's two x86 PMU driver fixes and a uprobes fix,
  plus on the tooling side there's a number of fixes and some late
  updates"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  perf sched timehist: Fix invalid period calculation
  perf sched timehist: Remove hardcoded 'comm_width' check at print_summary
  perf sched timehist: Enlarge default 'comm_width'
  perf sched timehist: Honour 'comm_width' when aligning the headers
  perf/x86: Fix overlap counter scheduling bug
  perf/x86/pebs: Fix handling of PEBS buffer overflows
  samples/bpf: Move open_raw_sock to separate header
  samples/bpf: Remove perf_event_open() declaration
  samples/bpf: Be consistent with bpf_load_program bpf_insn parameter
  tools lib bpf: Add bpf_prog_{attach,detach}
  samples/bpf: Switch over to libbpf
  perf diff: Do not overwrite valid build id
  perf annotate: Don't throw error for zero length symbols
  perf bench futex: Fix lock-pi help string
  perf trace: Check if MAP_32BIT is defined (again)
  samples/bpf: Make perf_event_read() static
  uprobes: Fix uprobes on MIPS, allow for a cache flush after ixol breakpoint creation
  samples/bpf: Make samples more libbpf-centric
  tools lib bpf: Add flags to bpf_create_map()
  tools lib bpf: use __u32 from linux/types.h
  ...
2016-12-23 16:49:12 -08:00
Jan Kara
e3ba730702 fsnotify: Remove fsnotify_duplicate_mark()
There are only two calls sites of fsnotify_duplicate_mark(). Those are
in kernel/audit_tree.c and both are bogus. Vfsmount pointer is unused
for audit tree, inode pointer and group gets set in
fsnotify_add_mark_locked() later anyway, mask and free_mark are already
set in alloc_chunk(). In fact, calling fsnotify_duplicate_mark() is
actively harmful because following fsnotify_add_mark_locked() will leak
group reference by overwriting the group pointer. So just remove the two
calls to fsnotify_duplicate_mark() and the function.

Signed-off-by: Jan Kara <jack@suse.cz>
[PM: line wrapping to fit in 80 chars]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-12-23 16:40:32 -05:00
Linus Torvalds
a307d0a007 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull final vfs updates from Al Viro:
 "Assorted cleanups and fixes all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  sg_write()/bsg_write() is not fit to be called under KERNEL_DS
  ufs: fix function declaration for ufs_truncate_blocks
  fs: exec: apply CLOEXEC before changing dumpable task flags
  seq_file: reset iterator to first record for zero offset
  vfs: fix isize/pos/len checks for reflink & dedupe
  [iov_iter] fix iterate_all_kinds() on empty iterators
  move aio compat to fs/aio.c
  reorganize do_make_slave()
  clone_private_mount() doesn't need to touch namespace_sem
  remove a bogus claim about namespace_sem being held by callers of mnt_alloc_id()
2016-12-23 10:52:43 -08:00
Al Viro
c00d2c7e89 move aio compat to fs/aio.c
... and fix the minor buglet in compat io_submit() - native one
kills ioctx as cleanup when put_user() fails.  Get rid of
bogus compat_... in !CONFIG_AIO case, while we are at it - they
should simply fail with ENOSYS, same as for native counterparts.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-22 22:58:37 -05:00
Boris Ostrovsky
1358e038fa CPU/hotplug: Clarify description of __cpuhp_setup_state() return value
When ivoked with CPUHP_AP_ONLINE_DYN state __cpuhp_setup_state()
is expected to return positive value which is the hotplug state that
the routine assigns.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-12-21 02:52:51 +01:00
Alexander Popov
4983f0ab7f kcov: make kcov work properly with KASLR enabled
Subtract KASLR offset from the kernel addresses reported by kcov.
Tested on x86_64 and AArch64 (Hikey LeMaker).

Link: http://lkml.kernel.org/r/1481417456-28826-3-git-send-email-alex.popov@linux.com
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Jon Masters <jcm@redhat.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Nicolai Stange <nicstange@gmail.com>
Cc: James Morse <james.morse@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Popov <alex.popov@linux.com>
Cc: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-20 09:48:47 -08:00
Mimi Zohar
7b8589cc29 ima: on soft reboot, save the measurement list
The TPM PCRs are only reset on a hard reboot.  In order to validate a
TPM's quote after a soft reboot (eg.  kexec -e), the IMA measurement
list of the running kernel must be saved and restored on boot.

This patch uses the kexec buffer passing mechanism to pass the
serialized IMA binary_runtime_measurements to the next kernel.

Link: http://lkml.kernel.org/r/1480554346-29071-7-git-send-email-zohar@linux.vnet.ibm.com
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Dmitry Kasatkin <dmitry.kasatkin@gmail.com>
Cc: Andreas Steffen <andreas.steffen@strongswan.org>
Cc: Josh Sklar <sklar@linux.vnet.ibm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-20 09:48:44 -08:00
Linus Torvalds
451bb1a6b2 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
 "Prevent NULL pointer dereferencing in the tick broadcast code. Old
  bug, which got unearthed by the hotplug ordering problem"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/broadcast: Prevent NULL pointer dereference
2016-12-18 11:11:01 -08:00
Linus Torvalds
98da295b35 Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP hotplug fixes from Thomas Gleixner:
 "Two fixlets for cpu hotplug:

   - Fix a subtle ordering problem with the dummy timer. This happened
     to work before the conversion by chance due to initcall ordering.

   - Fix the function comment for __cpuhp_setup_state()"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Clarify description of __cpuhp_setup_state() return value
  clocksource/dummy_timer: Move hotplug callback after the real timers
2016-12-18 11:06:05 -08:00
Linus Torvalds
eb3a3c0746 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
 "A fix for the irq affinity spread algorithm so it handles non linear
  node numbering nicely"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/affinity: Fix node generation from cpumask
2016-12-18 11:00:56 -08:00
Marcin Nowakowski
297e765e39 uprobes: Fix uprobes on MIPS, allow for a cache flush after ixol breakpoint creation
Commit:

  72e6ae285a ('ARM: 8043/1: uprobes need icache flush after xol write'

... has introduced an arch-specific method to ensure all caches are
flushed appropriately after an instruction is written to an XOL page.

However, when the XOL area is created and the out-of-line breakpoint
instruction is copied, caches are not flushed at all and stale data may
be found in icache.

Replace a simple copy_to_page() with arch_uprobe_copy_ixol() to allow
the arch to ensure all caches are updated accordingly.

This change fixes uprobes on MIPS InterAptiv (tested on Creator Ci40).

Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Victor Kamensky <victor.kamensky@linaro.org>
Cc: linux-mips@linux-mips.org
Link: http://lkml.kernel.org/r/1481625657-22850-1-git-send-email-marcin.nowakowski@imgtec.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-18 09:42:11 +01:00
Linus Torvalds
52f40e9d65 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes and cleanups from David Miller:

 1) Revert bogus nla_ok() change, from Alexey Dobriyan.

 2) Various bpf validator fixes from Daniel Borkmann.

 3) Add some necessary SET_NETDEV_DEV() calls to hsis_femac and hip04
    drivers, from Dongpo Li.

 4) Several ethtool ksettings conversions from Philippe Reynes.

 5) Fix bugs in inet port management wrt. soreuseport, from Tom Herbert.

 6) XDP support for virtio_net, from John Fastabend.

 7) Fix NAT handling within a vrf, from David Ahern.

 8) Endianness fixes in dpaa_eth driver, from Claudiu Manoil

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (63 commits)
  net: mv643xx_eth: fix build failure
  isdn: Constify some function parameters
  mlxsw: spectrum: Mark split ports as such
  cgroup: Fix CGROUP_BPF config
  qed: fix old-style function definition
  net: ipv6: check route protocol when deleting routes
  r6040: move spinlock in r6040_close as SOFTIRQ-unsafe lock order detected
  irda: w83977af_ir: cleanup an indent issue
  net: sfc: use new api ethtool_{get|set}_link_ksettings
  net: davicom: dm9000: use new api ethtool_{get|set}_link_ksettings
  net: cirrus: ep93xx: use new api ethtool_{get|set}_link_ksettings
  net: chelsio: cxgb3: use new api ethtool_{get|set}_link_ksettings
  net: chelsio: cxgb2: use new api ethtool_{get|set}_link_ksettings
  bpf: fix mark_reg_unknown_value for spilled regs on map value marking
  bpf: fix overflow in prog accounting
  bpf: dynamically allocate digest scratch buffer
  gtp: Fix initialization of Flags octet in GTPv1 header
  gtp: gtp_check_src_ms_ipv4() always return success
  net/x25: use designated initializers
  isdn: use designated initializers
  ...
2016-12-17 20:17:04 -08:00
Linus Torvalds
0110c350c8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "In this pile:

   - autofs-namespace series
   - dedupe stuff
   - more struct path constification"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
  ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
  ocfs2: charge quota for reflinked blocks
  ocfs2: fix bad pointer cast
  ocfs2: always unlock when completing dio writes
  ocfs2: don't eat io errors during _dio_end_io_write
  ocfs2: budget for extent tree splits when adding refcount flag
  ocfs2: prohibit refcounted swapfiles
  ocfs2: add newlines to some error messages
  ocfs2: convert inode refcount test to a helper
  simple_write_end(): don't zero in short copy into uptodate
  exofs: don't mess with simple_write_{begin,end}
  9p: saner ->write_end() on failing copy into non-uptodate page
  fix gfs2_stuffed_write_end() on short copies
  fix ceph_write_end()
  nfs_write_end(): fix handling of short copies
  vfs: refactor clone/dedupe_file_range common functions
  fs: try to clone files first in vfs_copy_file_range
  vfs: misc struct path constification
  namespace.c: constify struct path passed to a bunch of primitives
  quota: constify struct path in quota_on
  ...
2016-12-17 18:44:00 -08:00
Daniel Borkmann
6760bf2ddd bpf: fix mark_reg_unknown_value for spilled regs on map value marking
Martin reported a verifier issue that hit the BUG_ON() for his
test case in the mark_reg_unknown_value() function:

  [  202.861380] kernel BUG at kernel/bpf/verifier.c:467!
  [...]
  [  203.291109] Call Trace:
  [  203.296501]  [<ffffffff811364d5>] mark_map_reg+0x45/0x50
  [  203.308225]  [<ffffffff81136558>] mark_map_regs+0x78/0x90
  [  203.320140]  [<ffffffff8113938d>] do_check+0x226d/0x2c90
  [  203.331865]  [<ffffffff8113a6ab>] bpf_check+0x48b/0x780
  [  203.343403]  [<ffffffff81134c8e>] bpf_prog_load+0x27e/0x440
  [  203.355705]  [<ffffffff8118a38f>] ? handle_mm_fault+0x11af/0x1230
  [  203.369158]  [<ffffffff812d8188>] ? security_capable+0x48/0x60
  [  203.382035]  [<ffffffff811351a4>] SyS_bpf+0x124/0x960
  [  203.393185]  [<ffffffff810515f6>] ? __do_page_fault+0x276/0x490
  [  203.406258]  [<ffffffff816db320>] entry_SYSCALL_64_fastpath+0x13/0x94

This issue got uncovered after the fix in a08dd0da53 ("bpf: fix
regression on verifier pruning wrt map lookups"). The reason why it
wasn't noticed before was, because as mentioned in a08dd0da53,
mark_map_regs() was doing the id matching incorrectly based on the
uncached regs[regno].id. So, in the first loop, we walked all regs
and as soon as we found regno == i, then this reg's id was cleared
when calling mark_reg_unknown_value() thus that every subsequent
register was probed against id of 0 (which, in combination with the
PTR_TO_MAP_VALUE_OR_NULL type is an invalid condition that no other
register state can hold), and therefore wasn't type transitioned such
as in the spilled register case for the second loop.

Now since that got fixed, it turned out that 57a09bf0a4 ("bpf:
Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") used
mark_reg_unknown_value() incorrectly for the spilled regs, and thus
hitting the BUG_ON() in some cases due to regno >= MAX_BPF_REG.

Although spilled regs have the same type as the non-spilled regs
for the verifier state, that is, struct bpf_reg_state, they are
semantically different from the non-spilled regs. In other words,
there can be up to 64 (MAX_BPF_STACK / BPF_REG_SIZE) spilled regs
in the stack, for example, register R<x> could have been spilled by
the program to stack location X, Y, Z, and in mark_map_regs() we
need to scan these stack slots of type STACK_SPILL for potential
registers that we have to transition from PTR_TO_MAP_VALUE_OR_NULL.
Therefore, depending on the location, the spilled_regs regno can
be a lot higher than just MAX_BPF_REG's value since we operate on
stack instead. The reset in mark_reg_unknown_value() itself is
just fine, only that the BUG_ON() was inappropriate for this. Fix
it by making a __mark_reg_unknown_value() version that can be
called from mark_map_reg() generically; we know for the non-spilled
case that the regno is always < MAX_BPF_REG anyway.

Fixes: 57a09bf0a4 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
Reported-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17 21:27:44 -05:00
Daniel Borkmann
5ccb071e97 bpf: fix overflow in prog accounting
Commit aaac3ba95e ("bpf: charge user for creation of BPF maps and
programs") made a wrong assumption of charging against prog->pages.
Unlike map->pages, prog->pages are still subject to change when we
need to expand the program through bpf_prog_realloc().

This can for example happen during verification stage when we need to
expand and rewrite parts of the program. Should the required space
cross a page boundary, then prog->pages is not the same anymore as
its original value that we used to bpf_prog_charge_memlock() on. Thus,
we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
is freed eventually. I noticed this that despite having unlimited
memlock, programs suddenly refused to load with EPERM error due to
insufficient memlock.

There are two ways to fix this issue. One would be to add a cached
variable to struct bpf_prog that takes a snapshot of prog->pages at the
time of charging. The other approach is to also account for resizes. I
chose to go with the latter for a couple of reasons: i) We want accounting
rather to be more accurate instead of further fooling limits, ii) adding
yet another page counter on struct bpf_prog would also be a waste just
for this purpose. We also do want to charge as early as possible to
avoid going into the verifier just to find out later on that we crossed
limits. The only place that needs to be fixed is bpf_prog_realloc(),
since only here we expand the program, so we try to account for the
needed delta and should we fail, call-sites check for outcome anyway.
On cBPF to eBPF migrations, we don't grab a reference to the user as
they are charged differently. With that in place, my test case worked
fine.

Fixes: aaac3ba95e ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17 21:27:44 -05:00
Daniel Borkmann
aafe6ae9ce bpf: dynamically allocate digest scratch buffer
Geert rightfully complained that 7bd509e311 ("bpf: add prog_digest
and expose it via fdinfo/netlink") added a too large allocation of
variable 'raw' from bss section, and should instead be done dynamically:

  # ./scripts/bloat-o-meter kernel/bpf/core.o.1 kernel/bpf/core.o.2
  add/remove: 3/0 grow/shrink: 0/0 up/down: 33291/0 (33291)
  function                                     old     new   delta
  raw                                            -   32832  +32832
  [...]

Since this is only relevant during program creation path, which can be
considered slow-path anyway, lets allocate that dynamically and be not
implicitly dependent on verifier mutex. Move bpf_prog_calc_digest() at
the beginning of replace_map_fd_with_map_ptr() and also error handling
stays straight forward.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17 21:27:44 -05:00
Daniel Borkmann
a08dd0da53 bpf: fix regression on verifier pruning wrt map lookups
Commit 57a09bf0a4 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL
registers") introduced a regression where existing programs stopped
loading due to reaching the verifier's maximum complexity limit,
whereas prior to this commit they were loading just fine; the affected
program has roughly 2k instructions.

What was found is that state pruning couldn't be performed effectively
anymore due to mismatches of the verifier's register state, in particular
in the id tracking. It doesn't mean that 57a09bf0a4 is incorrect per
se, but rather that verifier needs to perform a lot more work for the
same program with regards to involved map lookups.

Since commit 57a09bf0a4 is only about tracking registers with type
PTR_TO_MAP_VALUE_OR_NULL, the id is only needed to follow registers
until they are promoted through pattern matching with a NULL check to
either PTR_TO_MAP_VALUE or UNKNOWN_VALUE type. After that point, the
id becomes irrelevant for the transitioned types.

For UNKNOWN_VALUE, id is already reset to 0 via mark_reg_unknown_value(),
but not so for PTR_TO_MAP_VALUE where id is becoming stale. It's even
transferred further into other types that don't make use of it. Among
others, one example is where UNKNOWN_VALUE is set on function call
return with RET_INTEGER return type.

states_equal() will then fall through the memcmp() on register state;
note that the second memcmp() uses offsetofend(), so the id is part of
that since d2a4dd37f6 ("bpf: fix state equivalence"). But the bisect
pointed already to 57a09bf0a4, where we really reach beyond complexity
limit. What I found was that states_equal() often failed in this
case due to id mismatches in spilled regs with registers in type
PTR_TO_MAP_VALUE. Unlike non-spilled regs, spilled regs just perform
a memcmp() on their reg state and don't have any other optimizations
in place, therefore also id was relevant in this case for making a
pruning decision.

We can safely reset id to 0 as well when converting to PTR_TO_MAP_VALUE.
For the affected program, it resulted in a ~17 fold reduction of
complexity and let the program load fine again. Selftest suite also
runs fine. The only other place where env->id_gen is used currently is
through direct packet access, but for these cases id is long living, thus
a different scenario.

Also, the current logic in mark_map_regs() is not fully correct when
marking NULL branch with UNKNOWN_VALUE. We need to cache the destination
reg's id in any case. Otherwise, once we marked that reg as UNKNOWN_VALUE,
it's id is reset and any subsequent registers that hold the original id
and are of type PTR_TO_MAP_VALUE_OR_NULL won't be marked UNKNOWN_VALUE
anymore, since mark_map_reg() reuses the uncached regs[regno].id that
was just overridden. Note, we don't need to cache it outside of
mark_map_regs(), since it's called once on this_branch and the other
time on other_branch, which are both two independent verifier states.
A test case for this is added here, too.

Fixes: 57a09bf0a4 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17 10:51:31 -05:00
Al Viro
3c55d6bcfe Merge remote-tracking branch 'djwong/ocfs2-vfs-reflink-6' into for-linus 2016-12-16 16:21:05 -05:00
Linus Torvalds
9a19a6db37 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:

 - more ->d_init() stuff (work.dcache)

 - pathname resolution cleanups (work.namei)

 - a few missing iov_iter primitives - copy_from_iter_full() and
   friends. Either copy the full requested amount, advance the iterator
   and return true, or fail, return false and do _not_ advance the
   iterator. Quite a few open-coded callers converted (and became more
   readable and harder to fuck up that way) (work.iov_iter)

 - several assorted patches, the big one being logfs removal

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  logfs: remove from tree
  vfs: fix put_compat_statfs64() does not handle errors
  namei: fold should_follow_link() with the step into not-followed link
  namei: pass both WALK_GET and WALK_MORE to should_follow_link()
  namei: invert WALK_PUT logics
  namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()
  namei: saner calling conventions for mountpoint_last()
  namei.c: get rid of user_path_parent()
  switch getfrag callbacks to ..._full() primitives
  make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success
  [iov_iter] new primitives - copy_from_iter_full() and friends
  don't open-code file_inode()
  ceph: switch to use of ->d_init()
  ceph: unify dentry_operations instances
  lustre: switch to use of ->d_init()
2016-12-16 10:24:44 -08:00
Linus Torvalds
de399813b5 powerpc updates for 4.10
Highlights include:
 
  - Support for the kexec_file_load() syscall, which is a prereq for secure and
    trusted boot.
 
  - Prevent kernel execution of userspace on P9 Radix (similar to SMEP/PXN).
 
  - Sort the exception tables at build time, to save time at boot, and store
    them as relative offsets to save space in the kernel image & memory.
 
  - Allow building the kernel with thin archives, which should allow us to build
    an allyesconfig once some other fixes land.
 
  - Build fixes to allow us to correctly rebuild when changing the kernel endian
    from big to little or vice versa.
 
  - Plumbing so that we can avoid doing a full mm TLB flush on P9 Radix.
 
  - Initial stack protector support (-fstack-protector).
 
  - Support for dumping the radix (aka. Linux) and hash page tables via debugfs.
 
  - Fix an oops in cxl coredump generation when cxl_get_fd() is used.
 
  - Freescale updates from Scott: "Highlights include 8xx hugepage support,
    qbman fixes/cleanup, device tree updates, and some misc cleanup."
 
  - Many and varied fixes and minor enhancements as always.
 
 Thanks to:
   Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual,
   Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz, Christophe Jaillet,
   Christophe Leroy, Denis Kirjanov, Elimar Riesebieter, Frederic Barrat,
   Gautham R. Shenoy, Geliang Tang, Geoff Levand, Jack Miller, Johan Hovold,
   Lars-Peter Clausen, Libin, Madhavan Srinivasan, Michael Neuling, Nathan
   Fontenot, Naveen N. Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin,
   Rashmica Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj
   Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYU4YSAAoJEFHr6jzI4aWAC4gQALtIAqqPon0Cd5b/FVVcMbW7
 mMqB2b/0FGEl5GoRTzGUDaQqElilm6AEVfHO86C7DFji/a6olneFfw87iz+mtWuZ
 JvrNq68ZiSnoeszdUy4MgtXFLb5sTzNMev4skaHfjI9E5CepWBoR0zH4G+kNVnd5
 WSgudv8Cq4Px+MEuTOigt3QYjHzZ3cw/XNOOm9c+oGj+PDW4O9UItVI+S1WLoey4
 rAB2nRcLMDPuwfRQC9XsF3zEbkv4h1dEXo/EBRuRpcF+0lLTzFw1lv1WE8OxlUmS
 kAXbty3dIytBfSbtJT0c0Ps6sfQ4HFhu6ZV2fjnxNTz2KDkBIN7LBYHmBYiqY9oZ
 9zvbUWtfiTu5ocfRtTq7rC/Hcj4Kbr9S9F/FvXR0WyDsKgu4xxAovqC3gcn6YjYK
 Rr1tcCI4nUzyhVJVmd+OEhUvc5JbFy9aGage+YeOyejfvvSbXIunaxWlPjoDkvim
 Vjl+UKU8gw51XFssqY5ZBi/HNlMFKYedLpMFp/fItnLglhj50V0eFWkpDgdSCYom
 vo9ifPLZx8n8m8De3H7TV4E0F4gCHcTeqZdu7tW9AAUVM6iLJcDLm3asGmtNh21t
 snOHNOJ5QSIno6ezUUg29T6VBjbPh46fdJJSlIZrEe8OzLZ1haGyttf0tD00PQvY
 Z2W/m3gxafnOeGgBqvyv
 =xOzf
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights include:

   - Support for the kexec_file_load() syscall, which is a prereq for
     secure and trusted boot.

   - Prevent kernel execution of userspace on P9 Radix (similar to
     SMEP/PXN).

   - Sort the exception tables at build time, to save time at boot, and
     store them as relative offsets to save space in the kernel image &
     memory.

   - Allow building the kernel with thin archives, which should allow us
     to build an allyesconfig once some other fixes land.

   - Build fixes to allow us to correctly rebuild when changing the
     kernel endian from big to little or vice versa.

   - Plumbing so that we can avoid doing a full mm TLB flush on P9
     Radix.

   - Initial stack protector support (-fstack-protector).

   - Support for dumping the radix (aka. Linux) and hash page tables via
     debugfs.

   - Fix an oops in cxl coredump generation when cxl_get_fd() is used.

   - Freescale updates from Scott: "Highlights include 8xx hugepage
     support, qbman fixes/cleanup, device tree updates, and some misc
     cleanup."

   - Many and varied fixes and minor enhancements as always.

  Thanks to:
    Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman
    Khandual, Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz,
    Christophe Jaillet, Christophe Leroy, Denis Kirjanov, Elimar
    Riesebieter, Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff
    Levand, Jack Miller, Johan Hovold, Lars-Peter Clausen, Libin,
    Madhavan Srinivasan, Michael Neuling, Nathan Fontenot, Naveen N.
    Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin, Rashmica
    Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj
    Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain"

[ And thanks to Michael, who took time off from a new baby to get this
  pull request done.   - Linus ]

* tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (174 commits)
  powerpc/fsl/dts: add FMan node for t1042d4rdb
  powerpc/fsl/dts: add sg_2500_aqr105_phy4 alias on t1024rdb
  powerpc/fsl/dts: add QMan and BMan nodes on t1024
  powerpc/fsl/dts: add QMan and BMan nodes on t1023
  soc/fsl/qman: test: use DEFINE_SPINLOCK()
  powerpc/fsl-lbc: use DEFINE_SPINLOCK()
  powerpc/8xx: Implement support of hugepages
  powerpc: get hugetlbpage handling more generic
  powerpc: port 64 bits pgtable_cache to 32 bits
  powerpc/boot: Request no dynamic linker for boot wrapper
  soc/fsl/bman: Use resource_size instead of computation
  soc/fsl/qe: use builtin_platform_driver
  powerpc/fsl_pmc: use builtin_platform_driver
  powerpc/83xx/suspend: use builtin_platform_driver
  powerpc/ftrace: Fix the comments for ftrace_modify_code
  powerpc/perf: macros for power9 format encoding
  powerpc/perf: power9 raw event format encoding
  powerpc/perf: update attribute_group data structure
  powerpc/perf: factor out the event format field
  powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
  ...
2016-12-16 09:26:42 -08:00
Linus Torvalds
179a7ba680 This release has a few updates:
o STM can hook into the function tracer
  o Function filtering now supports more advance glob matching
  o Ftrace selftests updates and added tests
  o Softirq tag in traces now show only softirqs
  o ARM nop added to non traced locations at compile time
  o New trace_marker_raw file that allows for binary input
  o Optimizations to the ring buffer
  o Removal of kmap in trace_marker
  o Wakeup and irqsoff tracers now adhere to the set_graph_notrace file
  o Other various fixes and clean ups
 
 Note, there are two patches marked for stable. These were discovered
 near the end of the 4.9 rc release cycle. By the time I had them tested
 it was just a matter of days before 4.9 would be released, and I
 figured I would just submit them in the merge window. They are old
 bugs and not critical. Nothing non-root could abuse.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJYUrFHFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 2+AIAIr20kSQV/nA5htGAeCTobVk3WUxY6bvjd9mIJDKPP19akNLyREW0G3KnfCr
 yhx4aFRZG98fRu/6F8qieRosyN36lADDVYHelMFHMpcTOpE2aZGjaaOuNGxOEA9v
 FmMPTX+K3+dzKyFP4l68R3+5JuQ1/AqLTioTWeLW8IDQ2OOVsjD8+0BuXrNKMJDY
 o6U4Hk5U/vn+zHc6BmgBzloAXemBd7iJ1t5V3FRRGvm8yv3HU85Twc5ofGeYTWvB
 J8PboEywRlIzxg0Kd8mxnMI5PgaKZSEc2ub8E7cY/CZ5PYpDE2xDA2hJmJgfYp00
 1VW+DHRpRZfElsCcya6S6P4bs5Y=
 =MGZ/
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "This release has a few updates:

   - STM can hook into the function tracer
   - Function filtering now supports more advance glob matching
   - Ftrace selftests updates and added tests
   - Softirq tag in traces now show only softirqs
   - ARM nop added to non traced locations at compile time
   - New trace_marker_raw file that allows for binary input
   - Optimizations to the ring buffer
   - Removal of kmap in trace_marker
   - Wakeup and irqsoff tracers now adhere to the set_graph_notrace file
   - Other various fixes and clean ups"

* tag 'trace-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (42 commits)
  selftests: ftrace: Shift down default message verbosity
  kprobes/trace: Fix kprobe selftest for newer gcc
  tracing/kprobes: Add a helper method to return number of probe hits
  tracing/rb: Init the CPU mask on allocation
  tracing: Use SOFTIRQ_OFFSET for softirq dectection for more accurate results
  tracing/fgraph: Have wakeup and irqsoff tracers ignore graph functions too
  fgraph: Handle a case where a tracer ignores set_graph_notrace
  tracing: Replace kmap with copy_from_user() in trace_marker writing
  ftrace/x86_32: Set ftrace_stub to weak to prevent gcc from using short jumps to it
  tracing: Allow benchmark to be enabled at early_initcall()
  tracing: Have system enable return error if one of the events fail
  tracing: Do not start benchmark on boot up
  tracing: Have the reg function allow to fail
  ring-buffer: Force rb_end_commit() and rb_set_commit_to_write() inline
  ring-buffer: Froce rb_update_write_stamp() to be inlined
  ring-buffer: Force inline of hotpath helper functions
  tracing: Make __buffer_unlock_commit() always_inline
  tracing: Make tracepoint_printk a static_key
  ring-buffer: Always inline rb_event_data()
  ring-buffer: Make rb_reserve_next_event() always inlined
  ...
2016-12-15 13:49:34 -08:00
Geert Uytterhoeven
8fa9a697ab printk: Remove no longer used second struct cont
If CONFIG_PRINTK=n:

    kernel/printk/printk.c:1893: warning: ‘cont’ defined but not used

Note that there are actually two different struct cont definitions and
objects: the first one is used if CONFIG_PRINTK=y, the second one became
unused by removing console_cont_flush().

Fixes: 5c2992ee7f ("printk: remove console flushing special cases for partial buffered lines")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Petr Mladek <pmladek@suse.com>
[ I do the occasional "allnoconfig" builds, but apparently not often
  enough  - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-15 10:52:37 -08:00
Boris Ostrovsky
512f09801b cpu/hotplug: Clarify description of __cpuhp_setup_state() return value
When invoked with CPUHP_AP_ONLINE_DYN state __cpuhp_setup_state()
is expected to return positive value which is the hotplug state that
the routine assigns.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: linux-pm@vger.kernel.org
Cc: viresh.kumar@linaro.org
Cc: bigeasy@linutronix.de
Cc: rjw@rjwysocki.net
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1481814058-4799-2-git-send-email-boris.ostrovsky@oracle.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-15 17:48:20 +01:00
Guilherme G. Piccoli
c0af524372 genirq/affinity: Fix node generation from cpumask
Commit 34c3d9819f ("genirq/affinity: Provide smarter irq spreading
infrastructure") introduced a better IRQ spreading mechanism, taking
account of the available NUMA nodes in the machine.

Problem is that the algorithm of retrieving the nodemask iterates
"linearly" based on the number of online nodes - some architectures
present non-linear node distribution among the nodemask, like PowerPC.
If this is the case, the algorithm lead to a wrong node count number
and therefore to a bad/incomplete IRQ affinity distribution.

For example, this problem were found in a machine with 128 CPUs and two
nodes, namely nodes 0 and 8 (instead of 0 and 1, if it was linearly
distributed). This led to a wrong affinity distribution which then led to
a bad mq allocation for nvme driver.

Finally, we take the opportunity to fix a comment regarding the affinity
distribution when we have _more_ nodes than vectors.

Fixes: 34c3d9819f ("genirq/affinity: Provide smarter irq spreading infrastructure")
Reported-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Cc: linux-pci@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: hch@lst.de
Link: http://lkml.kernel.org/r/1481738472-2671-1-git-send-email-gpiccoli@linux.vnet.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-15 12:32:35 +01:00
Thomas Gleixner
c1a9eeb938 tick/broadcast: Prevent NULL pointer dereference
When a disfunctional timer, e.g. dummy timer, is installed, the tick core
tries to setup the broadcast timer.

If no broadcast device is installed, the kernel crashes with a NULL pointer
dereference in tick_broadcast_setup_oneshot() because the function has no
sanity check.

Reported-by: Mason <slash.tmp@free.fr>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Richard Cochran <rcochran@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
Cc: Sebastian Frias <sf84@laposte.net>
Cc: Thibaud Cornic <thibaud_cornic@sigmadesigns.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Link: http://lkml.kernel.org/r/1147ef90-7877-e4d2-bb2b-5c4fa8d3144b@free.fr
2016-12-15 12:25:13 +01:00
Al Viro
c4364f837c Merge branches 'work.namei', 'work.dcache' and 'work.iov_iter' into for-linus 2016-12-15 01:07:29 -05:00
Linus Torvalds
5cc60aeedf xfs: updates for 4.10-rc1
Contained in this update:
 - DAX PMD vaults via iomap infrastructure
 - Direct-io support in iomap infrastructure
 - removal of now-redundant XFS inode iolock, replaced with VFS i_rwsem
 - synchronisation with fixes and changes in userspace libxfs code
 - extent tree lookup helpers
 - lots of little corruption detection improvements to verifiers
 - optimised CRC calculations
 - faster buffer cache lookups
 - deprecation of barrier/nobarrier mount options - we always use
   REQ_FUA/REQ_FLUSH where appropriate for data integrity now
 - cleanups to speculative preallocation
 - miscellaneous minor bug fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYUgqdAAoJEK3oKUf0dfodQgsP/1dJ4qUc6cRk8kL+f10FoIek
 oFzdViRHZj8cROGe2n2YTBJtPa9KjU5DNHnxaxWZBN4ZpItp/uN1sAQhgtNQ4/cN
 C3JF6B/+/dIbNSbd7DwvSl0dMWknzmrB+Myfs2ZPpMA1S4GInk1MOJSj7AQdYAvJ
 dS0dQWAuIB20cahwuGA4y7zUniYL1IcF/BH8hlmzpcUNUoJ9AkR1hTg5/aVfmga3
 w2p1vZyT2E4xs/Ff4FYW5MzPGxLVQMZVNIAXAcJl+c61z46ndXqidSmVHGvc+Tlt
 ouxftHy/7KqowZlCFss1pSXg9HlXHhjS+iLbZerfcjO2qldriZS+QqQyASmQzPAz
 +PpnMfVOj+yjsXKyIHWuS1G35aV16pPWwdA0ECeU6yv9iZ7tSz5rvSrsPZPLFz4x
 RVhcKbmXR3y8DugkmtznU5ozxPt5hbbstEV3leCzxJpZu5reRJThUW7nYkSd0CEJ
 ZyT/GP6Aq/MM8O/hOgVutAH409dsrYok8m/lq1J7VbNUt8inylcsMWsBeX/0/AHY
 aC7I2Vx8bnbfL+C8wYKYhuShOGSch93O5hDUXdH2K/Sm5cK4y2asWge6MfFsS6Lu
 waVYwd5aYBlNbzkvUMm2I5EV4cCCR3YwWYwfBEP7kPYUDxN14huOz6lVXnQPDLQ1
 qsV1aNfK9PPiw6Fcaop0
 =HwDG
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs updates from Dave Chinner:
 "There is quite a varied bunch of stuff in this update, and some of it
  you will have already merged through the ext4 tree which imported the
  dax-4.10-iomap-pmd topic branch from the XFS tree.

  There is also a new direct IO implementation that uses the iomap
  infrastructure. It's much simpler, faster, and has lower IO latency
  than the existing direct IO infrastructure.

  Summary:
   - DAX PMD faults via iomap infrastructure
   - Direct-io support in iomap infrastructure
   - removal of now-redundant XFS inode iolock, replaced with VFS
     i_rwsem
   - synchronisation with fixes and changes in userspace libxfs code
   - extent tree lookup helpers
   - lots of little corruption detection improvements to verifiers
   - optimised CRC calculations
   - faster buffer cache lookups
   - deprecation of barrier/nobarrier mount options - we always use
     REQ_FUA/REQ_FLUSH where appropriate for data integrity now
   - cleanups to speculative preallocation
   - miscellaneous minor bug fixes and cleanups"

* tag 'xfs-for-linus-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (63 commits)
  xfs: nuke unused tracepoint definitions
  xfs: use GPF_NOFS when allocating btree cursors
  xfs: use xfs_vn_setattr_size to check on new size
  xfs: deprecate barrier/nobarrier mount option
  xfs: Always flush caches when integrity is required
  xfs: ignore leaf attr ichdr.count in verifier during log replay
  xfs: use rhashtable to track buffer cache
  xfs: optimise CRC updates
  xfs: make xfs btree stats less huge
  xfs: don't cap maximum dedupe request length
  xfs: don't allow di_size with high bit set
  xfs: error out if trying to add attrs and anextents > 0
  xfs: don't crash if reading a directory results in an unexpected hole
  xfs: complain if we don't get nextents bmap records
  xfs: check for bogus values in btree block headers
  xfs: forbid AG btrees with level == 0
  xfs: several xattr functions can be void
  xfs: handle cow fork in xfs_bmap_trace_exlist
  xfs: pass state not whichfork to trace_xfs_extlist
  xfs: Move AGI buffer type setting to xfs_read_agi
  ...
2016-12-14 21:35:31 -08:00
Linus Torvalds
5c2992ee7f printk: remove console flushing special cases for partial buffered lines
It actively hurts proper merging, and makes for a lot of special cases.
There was a good(ish) reason for doing it originally, but it's getting
too painful to maintain.  And most of the original reasons for it are
long gone.

So instead of having special code to flush partial lines to the console
(as opposed to the record buffers), do _all_ the console writing from
the record buffer, and be done with it.

If an oops happens (or some other synchronous event), we will flush the
partial lines due to the oops printing activity, so this does not affect
that.  It does mean that if you have a completely hung machine, a
partial preceding line may not have been printed out.

That was some of the original reason for this complexity, in fact, back
when we used to test for the historical i386 "halt" instruction problem
by doing

	pr_info("Checking 'hlt' instruction... ");

	if (!boot_cpu_data.hlt_works_ok) {
		pr_cont("disabled\n");
		return;
	}
	halt();
	halt();
	halt();
	halt();
	pr_cont("OK\n");

and that model no longer works (it the 'hlt' instruction kills the
machine, the partial line won't have been flushed, so you won't even see
it).

Of course, that was also back in the days when people actually had
textual console output rather than a graphical splash-screen at bootup.
How times change..

Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Petr Mladek <pmladek@suse.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 21:08:18 -08:00
Linus Torvalds
5aa068ea40 printk: remove games with previous record flags
The record logging code looks at the previous record flags in various
ways, and they are all wrong.

You can't use the previous record flags to determine anything about the
next record, because they may simply not be related.  In particular, the
reason the previous record was a continuation record may well be exactly
_because_ the new record was printed by a different process, which is
why the previous record was flushed.

So all those games are simply wrong, and make the code hard to
understand (because the code fundamentally cdoes not make sense).

So remove it.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 21:02:49 -08:00
Linus Torvalds
4d98ead183 Modules updates for v4.10
Summary of modules changes for the 4.10 merge window:
 
 * The rodata= cmdline parameter has been extended to additionally
   apply to module mappings
 
 * Fix a hard to hit race between module loader error/clean up
   handling and ftrace registration
 
 * Some code cleanups, notably panic.c and modules code use a
   unified taint_flags table now. This is much cleaner than
   duplicating the taint flag code in modules.c
 
 Signed-off-by: Jessica Yu <jeyu@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYUf6/AAoJEMBFfjjOO8Fy5NoP+gOIus26yWWGymI495jVnX7n
 wCga5JgwOL0SLBIPmiDVI7K+jz4eoQZb94eJcwkWDuw2/IvOdF1kB8ha1EOBRMSg
 nb9HfIDlWiAPKkyUxe+k6XDb+BMPN3FUSYmBAKD3utsQkD1JWBLY8Id4e234y8Fo
 sb3a6rLJbvIEXANrMeU7zO4/y1bVxQAeQPQbVPwlid5s76RKYH6JdGXoo6FKK0uE
 Z3I8uQjqjmJ5U4vpjjWl0w+Qa7hIm/x05GpirtNxN6ztxjR+98c/4uRIry8oOX+I
 KqRXDOnJ1l/rCwhp+pGLwPfCoDds+V3bknyOwYoxK3hqVVUAd8H0qd1JQ8XClwyJ
 jnE0+EQpTt9brOO1Oq2XC+EDjpiuyYm3u91TFwE2VFmP98daBZsX6qY7bm03/GQq
 ZLRthWPILNX9glGj4nbHQgdAKmRvYDO3SzWjFZNA75Mr2hbRKLJoWNvfgupDgjsF
 giawxV/OcWXvEX92fzkwoUszpfWwoDhGsbimG2SCKYB87vNniG7wrgdjp5aWHhOL
 qCUpUhCvE9/dO7kPRinqk5tnpAUGY2jMZ0QgVbpToF6FiHJJSyDjWHR9n0Bl1QTX
 uAEZB/Hoav9frZ+MQC/1Yzhq5ejDbEm1ByjolJgbjl6YHBlQceL6NQpFmyEkrn7c
 Tx+Q/PvG7/gfxFGMirf1
 =bhCS
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull modules updates from Jessica Yu:
 "Summary of modules changes for the 4.10 merge window:

   - The rodata= cmdline parameter has been extended to additionally
     apply to module mappings

   - Fix a hard to hit race between module loader error/clean up
     handling and ftrace registration

   - Some code cleanups, notably panic.c and modules code use a unified
     taint_flags table now. This is much cleaner than duplicating the
     taint flag code in modules.c"

* tag 'modules-for-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  module: fix DEBUG_SET_MODULE_RONX typo
  module: extend 'rodata=off' boot cmdline parameter to module mappings
  module: Fix a comment above strong_try_module_get()
  module: When modifying a module's text ignore modules which are going away too
  module: Ensure a module's state is set accordingly during module coming cleanup code
  module: remove trailing whitespace
  taint/module: Clean up global and module taint flags handling
  modpost: free allocated memory
2016-12-14 20:12:43 -08:00
Linus Torvalds
a57cb1c1d7 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - a few misc things

 - kexec updates

 - DMA-mapping updates to better support networking DMA operations

 - IPC updates

 - various MM changes to improve DAX fault handling

 - lots of radix-tree changes, mainly to the test suite. All leading up
   to reimplementing the IDA/IDR code to be a wrapper layer over the
   radix-tree. However the final trigger-pulling patch is held off for
   4.11.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits)
  radix tree test suite: delete unused rcupdate.c
  radix tree test suite: add new tag check
  radix-tree: ensure counts are initialised
  radix tree test suite: cache recently freed objects
  radix tree test suite: add some more functionality
  idr: reduce the number of bits per level from 8 to 6
  rxrpc: abstract away knowledge of IDR internals
  tpm: use idr_find(), not idr_find_slowpath()
  idr: add ida_is_empty
  radix tree test suite: check multiorder iteration
  radix-tree: fix replacement for multiorder entries
  radix-tree: add radix_tree_split_preload()
  radix-tree: add radix_tree_split
  radix-tree: add radix_tree_join
  radix-tree: delete radix_tree_range_tag_if_tagged()
  radix-tree: delete radix_tree_locate_item()
  radix-tree: improve multiorder iterators
  btrfs: fix race in btrfs_free_dummy_fs_info()
  radix-tree: improve dump output
  radix-tree: make radix_tree_find_next_bit more useful
  ...
2016-12-14 17:25:18 -08:00
Lorenzo Stoakes
5b56d49fc3 mm: add locked parameter to get_user_pages_remote()
Patch series "mm: unexport __get_user_pages_unlocked()".

This patch series continues the cleanup of get_user_pages*() functions
taking advantage of the fact we can now pass gup_flags as we please.

It firstly adds an additional 'locked' parameter to
get_user_pages_remote() to allow for its callers to utilise
VM_FAULT_RETRY functionality.  This is necessary as the invocation of
__get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
this and no other existing higher level function would allow it to do
so.

Secondly existing callers of __get_user_pages_unlocked() are replaced
with the appropriate higher-level replacement -
get_user_pages_unlocked() if the current task and memory descriptor are
referenced, or get_user_pages_remote() if other task/memory descriptors
are referenced (having acquiring mmap_sem.)

This patch (of 2):

Add a int *locked parameter to get_user_pages_remote() to allow
VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().

Taking into account the previous adjustments to get_user_pages*()
functions allowing for the passing of gup_flags, we are now in a
position where __get_user_pages_unlocked() need only be exported for his
ability to allow VM_FAULT_RETRY behaviour, this adjustment allows us to
subsequently unexport __get_user_pages_unlocked() as well as allowing
for future flexibility in the use of get_user_pages_remote().

[sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change]
  Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au
Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:08 -08:00
Babu Moger
73ce0511c4 kernel/watchdog.c: move hardlockup detector to separate file
Separate hardlockup code from watchdog.c and move it to watchdog_hld.c.
It is mostly straight forward.  Remove everything inside
CONFIG_HARDLOCKUP_DETECTORS.  This code will go to file watchdog_hld.c.
Also update the makefile accordigly.

Link: http://lkml.kernel.org/r/1478034826-43888-3-git-send-email-babu.moger@oracle.com
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Josh Hunt <johunt@akamai.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:08 -08:00
Babu Moger
249e52e355 kernel/watchdog.c: move shared definitions to nmi.h
Patch series "Clean up watchdog handlers", v2.

This is an attempt to cleanup watchdog handlers.  Right now,
kernel/watchdog.c implements both softlockup and hardlockup detectors.
Softlockup code is generic.  Hardlockup code is arch specific.  Some
architectures don't use hardlockup detectors.  They use their own
watchdog detectors.  To make both these combination work, we have
numerous #ifdefs in kernel/watchdog.c.

We are trying here to make these handlers independent of each other.
Also provide an interface for architectures to implement their own
handlers.  watchdog_nmi_enable and watchdog_nmi_disable will be defined
as weak such that architectures can override its definitions.

Thanks to Don Zickus for his suggestions.
Here are our previous discussions
http://www.spinics.net/lists/sparclinux/msg16543.html
http://www.spinics.net/lists/sparclinux/msg16441.html

This patch (of 3):

Move shared macros and definitions to nmi.h so that watchdog.c, new file
watchdog_hld.c or any other architecture specific handler can use those
definitions.

Link: http://lkml.kernel.org/r/1478034826-43888-2-git-send-email-babu.moger@oracle.com
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Josh Hunt <johunt@akamai.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:08 -08:00
Nicolas Pitre
b6f8a92c9c posix-timers: give lazy compilers some help optimizing code away
The OpenRISC compiler (so far) fails to optimize away a large portion of
code containing a reference to posix_timer_event in alarmtimer.c when
CONFIG_POSIX_TIMERS is unset.  Let's give it a direct clue to let the
build succeed.

This fixes
[linux-next:master 6682/7183] alarmtimer.c:undefined reference to `posix_timer_event'
reported by kbuild test robot.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:08 -08:00
Petr Mladek
34aaff40b4 kdb: call vkdb_printf() from vprintk_default() only when wanted
kdb_trap_printk allows to pass normal printk() messages to kdb via
vkdb_printk().  For example, it is used to get backtrace using the
classic show_stack(), see kdb_show_stack().

vkdb_printf() tries to avoid a potential infinite loop by disabling the
trap.  But this approach is racy, for example:

CPU1					CPU2

vkdb_printf()
  // assume that kdb_trap_printk == 0
  saved_trap_printk = kdb_trap_printk;
  kdb_trap_printk = 0;

					kdb_show_stack()
					  kdb_trap_printk++;

Problem1: Now, a nested printk() on CPU0 calls vkdb_printf()
	  even when it should have been disabled. It will not
	  cause a deadlock but...

   // using the outdated saved value: 0
   kdb_trap_printk = saved_trap_printk;

					  kdb_trap_printk--;

Problem2: Now, kdb_trap_printk == -1 and will stay like this.
   It means that all messages will get passed to kdb from
   now on.

This patch removes the racy saved_trap_printk handling.  Instead, the
recursion is prevented by a check for the locked CPU.

The solution is still kind of racy.  A non-related printk(), from
another process, might get trapped by vkdb_printf().  And the wanted
printk() might not get trapped because kdb_printf_cpu is assigned.  But
this problem existed even with the original code.

A proper solution would be to get_cpu() before setting kdb_trap_printk
and trap messages only from this CPU.  I am not sure if it is worth the
effort, though.

In fact, the race is very theoretical.  When kdb is running any of the
commands that use kdb_trap_printk there is a single active CPU and the
other CPUs should be in a holding pen inside kgdb_cpu_enter().

The only time this is violated is when there is a timeout waiting for
the other CPUs to report to the holding pen.

Finally, note that the situation is a bit schizophrenic.  vkdb_printf()
explicitly allows recursion but only from KDB code that calls
kdb_printf() directly.  On the other hand, the generic printk()
recursion is not allowed because it might cause an infinite loop.  This
is why we could not hide the decision inside vkdb_printf() easily.

Link: http://lkml.kernel.org/r/1480412276-16690-4-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:08 -08:00
Petr Mladek
d5d8d3d0d4 kdb: properly synchronize vkdb_printf() calls with other CPUs
kdb_printf_lock does not prevent other CPUs from entering the critical
section because it is ignored when KDB_STATE_PRINTF_LOCK is set.

The problematic situation might look like:

CPU0					CPU1

vkdb_printf()
  if (!KDB_STATE(PRINTF_LOCK))
    KDB_STATE_SET(PRINTF_LOCK);
    spin_lock_irqsave(&kdb_printf_lock, flags);

					vkdb_printf()
					  if (!KDB_STATE(PRINTF_LOCK))

BANG: The PRINTF_LOCK state is set and CPU1 is entering the critical
section without spinning on the lock.

The problem is that the code tries to implement locking using two state
variables that are not handled atomically.  Well, we need a custom
locking because we want to allow reentering the critical section on the
very same CPU.

Let's use solution from Petr Zijlstra that was proposed for a similar
scenario, see
https://lkml.kernel.org/r/20161018171513.734367391@infradead.org

This patch uses the same trick with cmpxchg().  The only difference is
that we want to handle only recursion from the same context and
therefore we disable interrupts.

In addition, KDB_STATE_PRINTF_LOCK is removed.  In fact, we are not able
to set it a non-racy way.

Link: http://lkml.kernel.org/r/1480412276-16690-3-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:08 -08:00
Petr Mladek
d1bd8ead12 kdb: remove unused kdb_event handling
kdb_event state variable is only set but never checked in the kernel
code.

http://www.spinics.net/lists/kdb/msg01733.html suggests that this
variable affected WARN_CONSOLE_UNLOCKED() in the original
implementation.  But this check never went upstream.

The semantic is unclear and racy.  The value is updated after the
kdb_printf_lock is acquired and after it is released.  It should be
symmetric at minimum.  The value should be manipulated either inside or
outside the locked area.

Fortunately, it seems that the original function is gone and we could
simply remove the state variable.

Link: http://lkml.kernel.org/r/1480412276-16690-2-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:08 -08:00
Douglas Anderson
2d13bb6494 kernel/debug/debug_core.c: more properly delay for secondary CPUs
We've got a delay loop waiting for secondary CPUs.  That loop uses
loops_per_jiffy.  However, loops_per_jiffy doesn't actually mean how
many tight loops make up a jiffy on all architectures.  It is quite
common to see things like this in the boot log:

  Calibrating delay loop (skipped), value calculated using timer
  frequency.. 48.00 BogoMIPS (lpj=24000)

In my case I was seeing lots of cases where other CPUs timed out
entering the debugger only to print their stack crawls shortly after the
kdb> prompt was written.

Elsewhere in kgdb we already use udelay(), so that should be safe enough
to use to implement our timeout.  We'll delay 1 ms for 1000 times, which
should give us a full second of delay (just like the old code wanted)
but allow us to notice that we're done every 1 ms.

[akpm@linux-foundation.org: simplifications, per Daniel]
Link: http://lkml.kernel.org/r/1477091361-2039-1-git-send-email-dianders@chromium.org
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Brian Norris <briannorris@chromium.org>
Cc: <stable@vger.kernel.org>	[4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:08 -08:00
Kefeng Wang
db862358a4 kcov: add more missing includes
It is fragile that some definitions acquired via transitive
dependencies, as shown in below:

atomic_*        (<linux/atomic.h>)
ENOMEM/EN*      (<linux/errno.h>)
EXPORT_SYMBOL   (<linux/export.h>)
device_initcall (<linux/init.h>)
preempt_*       (<linux/preempt.h>)

Include them to prevent possible issues.

Link: http://lkml.kernel.org/r/1481163221-40170-1-git-send-email-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:08 -08:00
Dan Carpenter
9a29d0fbc2 relay: check array offset before using it
Smatch complains that we started using the array offset before we
checked that it was valid.

Fixes: 017c59c042 ('relay: Use per CPU constructs for the relay channel buffer pointers')
Link: http://lkml.kernel.org/r/20161013084947.GC16198@mwanda
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:08 -08:00
Tetsuo Handa
7560ef39dc sysctl: add KERN_CONT to deprecated_sysctl_warning()
Do not break lines while printk()ing values.

  kernel: warning: process `tomoyo_file_tes' used the deprecated sysctl system call with
  kernel: 3.
  kernel: 5.
  kernel: 56.
  kernel:

Link: http://lkml.kernel.org/r/1480814833-4976-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:07 -08:00
zhong jiang
8e53c073a4 kexec: add cond_resched into kimage_alloc_crash_control_pages
A soft lookup will occur when I run trinity in syscall kexec_load.  the
corresponding stack information is as follows.

  BUG: soft lockup - CPU#6 stuck for 22s! [trinity-c6:13859]
  Kernel panic - not syncing: softlockup: hung tasks
  CPU: 6 PID: 13859 Comm: trinity-c6 Tainted: G           O L ----V-------   3.10.0-327.28.3.35.zhongjiang.x86_64 #1
  Hardware name: Huawei Technologies Co., Ltd. Tecal BH622 V2/BC01SRSA0, BIOS RMIBV386 06/30/2014
  Call Trace:
   <IRQ>  dump_stack+0x19/0x1b
   panic+0xd8/0x214
   watchdog_timer_fn+0x1cc/0x1e0
   __hrtimer_run_queues+0xd2/0x260
   hrtimer_interrupt+0xb0/0x1e0
   ? call_softirq+0x1c/0x30
   local_apic_timer_interrupt+0x37/0x60
   smp_apic_timer_interrupt+0x3f/0x60
   apic_timer_interrupt+0x6d/0x80
   <EOI>  ? kimage_alloc_control_pages+0x80/0x270
   ? kmem_cache_alloc_trace+0x1ce/0x1f0
   ? do_kimage_alloc_init+0x1f/0x90
   kimage_alloc_init+0x12a/0x180
   SyS_kexec_load+0x20a/0x260
   system_call_fastpath+0x16/0x1b

the first time allocation of control pages may take too much time
because crash_res.end can be set to a higher value.  we need to add
cond_resched to avoid the issue.

The patch have been tested and above issue is not appear.

Link: http://lkml.kernel.org/r/1481164674-42775-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Xunlei Pang <xpang@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:07 -08:00
Baoquan He
401721ecd1 kexec: export the value of phys_base instead of symbol address
Currently in x86_64, the symbol address of phys_base is exported to
vmcoreinfo.  Dave Anderson complained this is really useless for his
Crash implementation.  Because in user-space utility Crash and
Makedumpfile which exported vmcore information is mainly used for, value
of phys_base is needed to covert virtual address of exported kernel
symbol to physical address.  Especially init_level4_pgt, if we want to
access and go over the page table to look up a PA corresponding to VA,
firstly we need calculate

  page_dir = SYMBOL(init_level4_pgt) - __START_KERNEL_map + phys_base;

Now in Crash and Makedumpfile, we have to analyze the vmcore elf program
header to get value of phys_base.  As Dave said, it would be preferable
if it were readily availabl in vmcoreinfo rather than depending upon the
PT_LOAD semantics.

Hence in this patch change to export the value of phys_base instead of
its virtual address.

And people also complained that KERNEL_IMAGE_SIZE exporting is x86_64
only, should be moved into arch dependent function
arch_crash_save_vmcoreinfo.  Do the moving in this patch.

Link: http://lkml.kernel.org/r/1478568596-30060-2-git-send-email-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Xunlei Pang <xlpang@redhat.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Eugene Surovegin <surovegin@google.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Atsushi Kumagai <ats-kumagai@wm.jp.nec.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Pratyush Anand <panand@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:07 -08:00
Alexey Dobriyan
760c6a9139 coredump: clarify "unsafe core_pattern" warning
I was amused to find "unsafe core_pattern" warning having these lines in
/etc/sysctl.conf:

	fs.suid_dumpable=2
	kernel.core_pattern=/core/core-%e-%p-%E
	kernel.core_uses_pid=0

Turns out kernel is formally right.  Default core_pattern is just "core",
which doesn't qualify for secure path while setting suid.dumpable.

Hint admins about solution, clarify sysctl names, delete unnecessary '\'
characters (string literals are concatenated regardless) and reformat for
easier grepping.

Link: http://lkml.kernel.org/r/20161029152124.GA1258@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:07 -08:00
Waiman Long
c7be96af89 signals: avoid unnecessary taking of sighand->siglock
When running certain database workload on a high-end system with many
CPUs, it was found that spinlock contention in the sigprocmask syscalls
became a significant portion of the overall CPU cycles as shown below.

  9.30%  9.30%  905387  dataserver  /proc/kcore 0x7fff8163f4d2
  [k] _raw_spin_lock_irq
            |
            ---_raw_spin_lock_irq
               |
               |--99.34%-- __set_current_blocked
               |          sigprocmask
               |          sys_rt_sigprocmask
               |          system_call_fastpath
               |          |
               |          |--50.63%-- __swapcontext
               |          |          |
               |          |          |--99.91%-- upsleepgeneric
               |          |
               |          |--49.36%-- __setcontext
               |          |          ktskRun

Looking further into the swapcontext function in glibc, it was found that
the function always call sigprocmask() without checking if there are
changes in the signal mask.

A check was added to the __set_current_blocked() function to avoid taking
the sighand->siglock spinlock if there is no change in the signal mask.
This will prevent unneeded spinlock contention when many threads are
trying to call sigprocmask().

With this patch applied, the spinlock contention in sigprocmask() was
gone.

Link: http://lkml.kernel.org/r/1474979209-11867-1-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stas Sergeev <stsp@list.ru>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:07 -08:00
Konstantin Khlebnikov
4d1f0fb096 kernel/watchdog: use nmi registers snapshot in hardlockup handler
NMI handler doesn't call set_irq_regs(), it's set only by normal IRQ.
Thus get_irq_regs() returns NULL or stale registers snapshot with IP/SP
pointing to the code interrupted by IRQ which was interrupted by NMI.
NULL isn't a problem: in this case watchdog calls dump_stack() and
prints full stack trace including NMI.  But if we're stuck in IRQ
handler then NMI watchlog will print stack trace without IRQ part at
all.

This patch uses registers snapshot passed into NMI handler as arguments:
these registers point exactly to the instruction interrupted by NMI.

Fixes: 55537871ef ("kernel/watchdog.c: perform all-CPU backtrace in case of hard lockup")
Link: http://lkml.kernel.org/r/146771764784.86724.6006627197118544150.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: <stable@vger.kernel.org>	[4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:07 -08:00
Linus Torvalds
412ac77a9d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "After a lot of discussion and work we have finally reachanged a basic
  understanding of what is necessary to make unprivileged mounts safe in
  the presence of EVM and IMA xattrs which the last commit in this
  series reflects. While technically it is a revert the comments it adds
  are important for people not getting confused in the future. Clearing
  up that confusion allows us to seriously work on unprivileged mounts
  of fuse in the next development cycle.

  The rest of the fixes in this set are in the intersection of user
  namespaces, ptrace, and exec. I started with the first fix which
  started a feedback cycle of finding additional issues during review
  and fixing them. Culiminating in a fix for a bug that has been present
  since at least Linux v1.0.

  Potentially these fixes were candidates for being merged during the rc
  cycle, and are certainly backport candidates but enough little things
  turned up during review and testing that I decided they should be
  handled as part of the normal development process just to be certain
  there were not any great surprises when it came time to backport some
  of these fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  Revert "evm: Translate user/group ids relative to s_user_ns when computing HMAC"
  exec: Ensure mm->user_ns contains the execed files
  ptrace: Don't allow accessing an undumpable mm
  ptrace: Capture the ptracer's creds not PT_PTRACE_CAP
  mm: Add a user_ns owner to mm_struct and fix ptrace permission checks
2016-12-14 14:09:48 -08:00
Linus Torvalds
dcdaa2f948 Merge branch 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit
Pull audit updates from Paul Moore:
 "After the small number of patches for v4.9, we've got a much bigger
  pile for v4.10.

  The bulk of these patches involve a rework of the audit backlog queue
  to enable us to move the netlink multicasting out of the task/thread
  that generates the audit record and into the kernel thread that emits
  the record (just like we do for the audit unicast to auditd).

  While we were playing with the backlog queue(s) we fixed a number of
  other little problems with the code, and from all the testing so far
  things look to be in much better shape now. Doing this also allowed us
  to re-enable disabling IRQs for some netns operations ("netns: avoid
  disabling irq for netns id").

  The remaining patches fix some small problems that are well documented
  in the commit descriptions, as well as adding session ID filtering
  support"

* 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit:
  audit: use proper refcount locking on audit_sock
  netns: avoid disabling irq for netns id
  audit: don't ever sleep on a command record/message
  audit: handle a clean auditd shutdown with grace
  audit: wake up kauditd_thread after auditd registers
  audit: rework audit_log_start()
  audit: rework the audit queue handling
  audit: rename the queues and kauditd related functions
  audit: queue netlink multicast sends just like we do for unicast sends
  audit: fixup audit_init()
  audit: move kaudit thread start from auditd registration to kaudit init (#2)
  audit: add support for session ID user filter
  audit: fix formatting of AUDIT_CONFIG_CHANGE events
  audit: skip sessionid sentinel value when auto-incrementing
  audit: tame initialization warning len_abuf in audit_log_execve_info
  audit: less stack usage for /proc/*/loginuid
2016-12-14 14:06:40 -08:00
Linus Torvalds
683b96f4d1 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "Generally pretty quiet for this release. Highlights:

  Yama:
   - allow ptrace access for original parent after re-parenting

  TPM:
   - add documentation
   - many bugfixes & cleanups
   - define a generic open() method for ascii & bios measurements

  Integrity:
   - Harden against malformed xattrs

  SELinux:
   - bugfixes & cleanups

  Smack:
   - Remove unnecessary smack_known_invalid label
   - Do not apply star label in smack_setprocattr hook
   - parse mnt opts after privileges check (fixes unpriv DoS vuln)"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (56 commits)
  Yama: allow access for the current ptrace parent
  tpm: adjust return value of tpm_read_log
  tpm: vtpm_proxy: conditionally call tpm_chip_unregister
  tpm: Fix handling of missing event log
  tpm: Check the bios_dir entry for NULL before accessing it
  tpm: return -ENODEV if np is not set
  tpm: cleanup of printk error messages
  tpm: replace of_find_node_by_name() with dev of_node property
  tpm: redefine read_log() to handle ACPI/OF at runtime
  tpm: fix the missing .owner in tpm_bios_measurements_ops
  tpm: have event log use the tpm_chip
  tpm: drop tpm1_chip_register(/unregister)
  tpm: replace dynamically allocated bios_dir with a static array
  tpm: replace symbolic permission with octal for securityfs files
  char: tpm: fix kerneldoc tpm2_unseal_trusted name typo
  tpm_tis: Allow tpm_tis to be bound using DT
  tpm, tpm_vtpm_proxy: add kdoc comments for VTPM_PROXY_IOC_NEW_DEV
  tpm: Only call pm_runtime_get_sync if device has a parent
  tpm: define a generic open() method for ascii & bios measurements
  Documentation: tpm: add the Physical TPM device tree binding documentation
  ...
2016-12-14 13:57:44 -08:00
Linus Torvalds
0f1d6dfe03 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "Here is the crypto update for 4.10:

  API:
   - add skcipher walk interface
   - add asynchronous compression (acomp) interface
   - fix algif_aed AIO handling of zero buffer

  Algorithms:
   - fix unaligned access in poly1305
   - fix DRBG output to large buffers

  Drivers:
   - add support for iMX6UL to caam
   - fix givenc descriptors (used by IPsec) in caam
   - accelerated SHA256/SHA512 for ARM64 from OpenSSL
   - add SSE CRCT10DIF and CRC32 to ARM/ARM64
   - add AEAD support to Chelsio chcr
   - add Armada 8K support to omap-rng"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (148 commits)
  crypto: testmgr - fix overlap in chunked tests again
  crypto: arm/crc32 - accelerated support based on x86 SSE implementation
  crypto: arm64/crc32 - accelerated support based on x86 SSE implementation
  crypto: arm/crct10dif - port x86 SSE implementation to ARM
  crypto: arm64/crct10dif - port x86 SSE implementation to arm64
  crypto: testmgr - add/enhance test cases for CRC-T10DIF
  crypto: testmgr - avoid overlap in chunked tests
  crypto: chcr - checking for IS_ERR() instead of NULL
  crypto: caam - check caam_emi_slow instead of re-lookup platform
  crypto: algif_aead - fix AIO handling of zero buffer
  crypto: aes-ce - Make aes_simd_algs static
  crypto: algif_skcipher - set error code when kcalloc fails
  crypto: caam - make aamalg_desc a proper module
  crypto: caam - pass key buffers with typesafe pointers
  crypto: arm64/aes-ce-ccm - Fix AEAD decryption length
  MAINTAINERS: add crypto headers to crypto entry
  crypt: doc - remove misleading mention of async API
  crypto: doc - fix header file name
  crypto: api - fix comment typo
  crypto: skcipher - Add separate walker for AEAD decryption
  ..
2016-12-14 13:31:29 -08:00
Steve Grubb
89670affa2 audit: Make AUDIT_ANOM_ABEND event normalized
The audit event specification asks for certain fields to exist in
all events. Running 'ausearch -m anom_abend -sv yes' returns no
events. This patch adds the result field so that the
AUDIT_ANOM_ABEND event conforms to the rules.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-12-14 16:00:13 -05:00
Steve Grubb
7c397d01e4 audit: Make AUDIT_KERNEL event conform to the specification
The AUDIT_KERNEL event is not following name=value format. This causes
some information to get lost. The event has been reformatted to follow
the convention. Additionally the audit_enabled value was added for
troubleshooting purposes. The following is an example of the new event:

  type=KERNEL audit(1480621249.833:1): state=initialized
              audit_enabled=0 res=1

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
[PM: commit tweaks to make checkpatch.pl happy]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-12-14 15:59:46 -05:00
Linus Torvalds
a9042defa2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  NTB: correct ntb_spad_count comment typo
  misc: ibmasm: fix typo in error message
  Remove references to dead make variable LINUX_INCLUDE
  Remove last traces of ikconfig.h
  treewide: Fix printk() message errors
  Documentation/device-mapper: s/getsize/getsz/
2016-12-14 11:12:25 -08:00
Richard Guy Briggs
533c7b69c7 audit: use proper refcount locking on audit_sock
Resetting audit_sock appears to be racy.

audit_sock was being copied and dereferenced without using a refcount on
the source sock.

Bump the refcount on the underlying sock when we store a refrence in
audit_sock and release it when we reset audit_sock.  audit_sock
modification needs the audit_cmd_mutex.

See: https://lkml.org/lkml/2016/11/26/232

Thanks to Eric Dumazet <edumazet@google.com> and Cong Wang
<xiyou.wangcong@gmail.com> on ideas how to fix it.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
[PM: fixed the comment block text formatting for auditd_reset()]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-12-14 13:06:04 -05:00
Paul Moore
a09cfa4708 audit: don't ever sleep on a command record/message
Sleeping on a command record/message in audit_log_start() could slow
something, e.g. auditd, from doing something important, e.g. clean
shutdown, which could present problems on a heavily loaded system.
This patch allows tasks to bypass any queue restrictions if they are
logging a command record/message.

Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-12-14 13:06:04 -05:00
Paul Moore
6c54e78996 audit: handle a clean auditd shutdown with grace
When auditd stops cleanly it sets 'auditd_pid' to 0 with an
AUDIT_SET message, in this case we should reset our backlog
queues via the auditd_reset() function.  This patch also adds
a 'auditd_pid' check to the top of kauditd_send_unicast_skb()
so we can fail quicker.

Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-12-14 13:06:04 -05:00
Paul Moore
e1d1662128 audit: wake up kauditd_thread after auditd registers
This patch was suggested by Richard Briggs back in 2015, see the link
to the mail archive below.  Unfortunately, that patch is no longer
even remotely valid due to other changes to the code.

* https://www.redhat.com/archives/linux-audit/2015-October/msg00075.html

Suggested-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-12-14 13:06:04 -05:00
Paul Moore
3197542482 audit: rework audit_log_start()
The backlog queue handling in audit_log_start() is a little odd with
some questionable design decisions, this patch attempts to rectify
this with the following changes:

* Never make auditd wait, ignore any backlog limits as we need auditd
awake so it can drain the backlog queue.

* When we hit a backlog limit and start dropping records, don't wake
all the tasks sleeping on the backlog, that's silly.  Instead, let
kauditd_thread() take care of waking everyone once it has had a chance
to drain the backlog queue.

* Don't keep a global backlog timeout countdown, make it per-task.  A
per-task timer means we won't have all the sleeping tasks waking at
the same time and hammering on an already stressed backlog queue.

Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-12-14 13:06:04 -05:00
Paul Moore
c6480207fd audit: rework the audit queue handling
The audit record backlog queue has always been a bit of a mess, and
the moving the multicast send into kauditd_thread() from
audit_log_end() only makes things worse.  This patch attempts to fix
the backlog queue with a better design that should hold up better
under load and have less of a performance impact at syscall
invocation time.

While it looks like there is a log going on in this patch, the main
change is the move from a single backlog queue to three queues:

* A queue for holding records generated from audit_log_end() that
haven't been consumed by kauditd_thread() (audit_queue).

* A queue for holding records that have been sent via multicast but
had a temporary failure when sending via unicast and need a resend
(audit_retry_queue).

* A queue for holding records that haven't been sent via unicast
because no one is listening (audit_hold_queue).

Special care is taken in this patch to ensure that the proper
record ordering is preserved, e.g. we send everything in the hold
queue first, then the retry queue, and finally the main queue.

Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-12-14 13:06:04 -05:00
Paul Moore
af8b824f28 audit: rename the queues and kauditd related functions
The audit queue names can be shortened and the record sending
helpers associated with the kauditd task could be named better, do
these small cleanups now to make life easier once we start reworking
the queues and kauditd code.

Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-12-14 13:06:04 -05:00
Paul Moore
4aa83872d3 audit: queue netlink multicast sends just like we do for unicast sends
Sending audit netlink multicast messages is bad for all the same
reasons that sending audit netlink unicast messages is bad, so this
patch reworks things so that we don't do the multicast send in
audit_log_end(), we do it from the dedicated kauditd_thread thread just
as we do for unicast messages.

See the GitHub issues below for more information/history:

 * https://github.com/linux-audit/audit-kernel/issues/23
 * https://github.com/linux-audit/audit-kernel/issues/22

Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-12-14 13:06:04 -05:00
Paul Moore
6c92556453 audit: fixup audit_init()
Make sure everything is initialized before we start the kauditd_thread
and don't emit the "initialized" record until everything is finished.
We also panic with a descriptive message if we can't start the
kauditd_thread.

Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-12-14 13:06:04 -05:00
Richard Guy Briggs
55a6f170a4 audit: move kaudit thread start from auditd registration to kaudit init (#2)
Richard made this change some time ago but Eric backed it out because
the rest of the supporting code wasn't ready.  In order to move the
netlink multicast send to kauditd_thread we need to ensure the
kauditd_thread is always running, so restore commit 6ff5e459 ("audit:
move kaudit thread start from auditd registration to kaudit init").

Signed-off-by: Richard Guy Briggs <rbriggs@redhat.com>
[PM: brought forward and merged based on Richard's old patch]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-12-14 13:06:04 -05:00
Paul Bolle
d06505b2a9 Remove last traces of ikconfig.h
The build system stopped generating ikconfig.h in v2.6.8. Remove an entry
for it in dontdiff. There's also a reference to it in a small comment.
Remove that comment too, as it is of little help in any case.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-12-14 10:54:28 +01:00
Linus Torvalds
c11a6cfb01 Merge branch 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Mostly patches to initialize workqueue subsystem earlier and get rid
  of keventd_up().

  The patches were headed for the last merge cycle but got delayed due
  to a bug found late minute, which is fixed now.

  Also, to help debugging, destroy_workqueue() is more chatty now on a
  sanity check failure."

* 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: move wq_numa_init() to workqueue_init()
  workqueue: remove keventd_up()
  debugobj, workqueue: remove keventd_up() usage
  slab, workqueue: remove keventd_up() usage
  power, workqueue: remove keventd_up() usage
  tty, workqueue: remove keventd_up() usage
  mce, workqueue: remove keventd_up() usage
  workqueue: make workqueue available early during boot
  workqueue: dump workqueue state on sanity check failures in destroy_workqueue()
2016-12-13 12:59:57 -08:00
Linus Torvalds
7b9dc3f75f Power management material for v4.10-rc1
- New cpufreq driver for Broadcom STB SoCs and a Device Tree binding
    for it (Markus Mayer).
 
  - Support for ARM Integrator/AP and Integrator/CP in the generic
    DT cpufreq driver and elimination of the old Integrator cpufreq
    driver (Linus Walleij).
 
  - Support for the zx296718, r8a7743 and r8a7745, Socionext UniPhier,
    and PXA SoCs in the the generic DT cpufreq driver (Baoyou Xie,
    Geert Uytterhoeven, Masahiro Yamada, Robert Jarzmik).
 
  - cpufreq core fix to eliminate races that may lead to using
    inactive policy objects and related cleanups (Rafael Wysocki).
 
  - cpufreq schedutil governor update to make it use SCHED_FIFO
    kernel threads (instead of regular workqueues) for doing delayed
    work (to reduce the response latency in some cases) and related
    cleanups (Viresh Kumar).
 
  - New cpufreq sysfs attribute for resetting statistics (Markus
    Mayer).
 
  - cpufreq governors fixes and cleanups (Chen Yu, Stratos Karafotis,
    Viresh Kumar).
 
  - Support for using generic cpufreq governors in the intel_pstate
    driver (Rafael Wysocki).
 
  - Support for per-logical-CPU P-state limits and the EPP/EPB
    (Energy Performance Preference/Energy Performance Bias) knobs
    in the intel_pstate driver (Srinivas Pandruvada).
 
  - New CPU ID for Knights Mill in intel_pstate (Piotr Luc).
 
  - intel_pstate driver modification to use the P-state selection
    algorithm based on CPU load on platforms with the system profile
    in the ACPI tables set to "mobile" (Srinivas Pandruvada).
 
  - intel_pstate driver cleanups (Arnd Bergmann, Rafael Wysocki,
    Srinivas Pandruvada).
 
  - cpufreq powernv driver updates including fast switching support
    (for the schedutil governor), fixes and cleanus (Akshay Adiga,
    Andrew Donnellan, Denis Kirjanov).
 
  - acpi-cpufreq driver rework to switch it over to the new CPU
    offline/online state machine (Sebastian Andrzej Siewior).
 
  - Assorted cleanups in cpufreq drivers (Wei Yongjun, Prashanth
    Prakash).
 
  - Idle injection rework (to make it use the regular idle path
    instead of a home-grown custom one) and related powerclamp
    thermal driver updates (Peter Zijlstra, Jacob Pan, Petr Mladek,
    Sebastian Andrzej Siewior).
 
  - New CPU IDs for Atom Z34xx and Knights Mill in intel_idle (Andy
    Shevchenko, Piotr Luc).
 
  - intel_idle driver cleanups and switch over to using the new CPU
    offline/online state machine (Anna-Maria Gleixner, Sebastian
    Andrzej Siewior).
 
  - cpuidle DT driver update to support suspend-to-idle properly
    (Sudeep Holla).
 
  - cpuidle core cleanups and misc updates (Daniel Lezcano, Pan Bian,
    Rafael Wysocki).
 
  - Preliminary support for power domains including CPUs in the
    generic power domains (genpd) framework and related DT bindings
    (Lina Iyer).
 
  - Assorted fixes and cleanups in the generic power domains (genpd)
    framework (Colin Ian King, Dan Carpenter, Geert Uytterhoeven).
 
  - Preliminary support for devices with multiple voltage regulators
    and related fixes and cleanups in the Operating Performance Points
    (OPP) library (Viresh Kumar, Masahiro Yamada, Stephen Boyd).
 
  - System sleep state selection interface rework to make it easier
    to support suspend-to-idle as the default system suspend method
    (Rafael Wysocki).
 
  - PM core fixes and cleanups, mostly related to the interactions
    between the system suspend and runtime PM frameworks (Ulf Hansson,
    Sahitya Tummala, Tony Lindgren).
 
  - Latency tolerance PM QoS framework imorovements (Andrew
    Lutomirski).
 
  - New Knights Mill CPU ID for the Intel RAPL power capping driver
    (Piotr Luc).
 
  - Intel RAPL power capping driver fixes, cleanups and switch over
    to using the new CPU offline/online state machine (Jacob Pan,
    Thomas Gleixner, Sebastian Andrzej Siewior).
 
  - Fixes and cleanups in the exynos-ppmu, exynos-nocp, rk3399_dmc,
    rockchip-dfi devfreq drivers and the devfreq core (Axel Lin,
    Chanwoo Choi, Javier Martinez Canillas, MyungJoo Ham, Viresh
    Kumar).
 
  - Fix for false-positive KASAN warnings during resume from ACPI S3
    (suspend-to-RAM) on x86 (Josh Poimboeuf).
 
  - Memory map verification during resume from hibernation on x86 to
    ensure a consistent address space layout (Chen Yu).
 
  - Wakeup sources debugging enhancement (Xing Wei).
 
  - rockchip-io AVS driver cleanup (Shawn Lin).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJYTx4+AAoJEILEb/54YlRx9f8P/2SlNHUENW5qh6FtCw00oC2u
 UqJerQJ2L38UgbgxbE/0VYblma9rFABDWC1eO2xN2XdcdW5UPBKPVvNcOgNe1Clh
 gjy3RxZXVpmjfzt2kGfsTLEuGnHqwvx51hTUkeA2LwvkOal45xb8ZESmy8opCtiv
 iG4LwmPHoxdX5Za5nA9ItFKzxyO1EoyNSnBYAVwALDHxmNOfxEcRevfurASt/0M9
 brCCZJA0/sZxeL0lBdy8fNQPIBTUfCoTJG/MtmzGrObJ9wMFvEDfXrVEyZiWs/zA
 AAZ4kQL77enrIKgrLN8e0G6LzTLHoVcvn38Xjf24dKUqhd7ACBhYcnW+jK3+7EAd
 gjZ8efObQsiuyK/EDLUNw35tt96CHOqfrQCj2tIwRVvk9EekLqAGXdIndTCr2kYW
 RpefmP5kMljnm/nQFOVLwMEUQMuVkvUE7EgxADy7DoDmepBFC4ICRDWPye70R2kC
 0O1Tn2PAQq4Fd1tyI9TYYz0YQQkRoaRb5rfYUSzbRbeCdsphUopp4Vhsiyn6IcnF
 XnLbg6pRAat82MoS9n4pfO/VCo8vkErKA8tut9G7TDakkrJoEE7l31PdKW0hP3f6
 sBo6xXy6WTeivU/o/i8TbM6K4mA37pBaj78ooIkWLgg5fzRaS2+0xSPVy2H9x1m5
 LymHcobCK9rSZ1l208Fe
 =vhxI
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "Again, cpufreq gets more changes than the other parts this time (one
  new driver, one old driver less, a bunch of enhancements of the
  existing code, new CPU IDs, fixes, cleanups)

  There also are some changes in cpuidle (idle injection rework, a
  couple of new CPU IDs, online/offline rework in intel_idle, fixes and
  cleanups), in the generic power domains framework (mostly related to
  supporting power domains containing CPUs), and in the Operating
  Performance Points (OPP) library (mostly related to supporting devices
  with multiple voltage regulators)

  In addition to that, the system sleep state selection interface is
  modified to make it easier for distributions with unchanged user space
  to support suspend-to-idle as the default system suspend method, some
  issues are fixed in the PM core, the latency tolerance PM QoS
  framework is improved a bit, the Intel RAPL power capping driver is
  cleaned up and there are some fixes and cleanups in the devfreq
  subsystem

  Specifics:

   - New cpufreq driver for Broadcom STB SoCs and a Device Tree binding
     for it (Markus Mayer)

   - Support for ARM Integrator/AP and Integrator/CP in the generic DT
     cpufreq driver and elimination of the old Integrator cpufreq driver
     (Linus Walleij)

   - Support for the zx296718, r8a7743 and r8a7745, Socionext UniPhier,
     and PXA SoCs in the the generic DT cpufreq driver (Baoyou Xie,
     Geert Uytterhoeven, Masahiro Yamada, Robert Jarzmik)

   - cpufreq core fix to eliminate races that may lead to using inactive
     policy objects and related cleanups (Rafael Wysocki)

   - cpufreq schedutil governor update to make it use SCHED_FIFO kernel
     threads (instead of regular workqueues) for doing delayed work (to
     reduce the response latency in some cases) and related cleanups
     (Viresh Kumar)

   - New cpufreq sysfs attribute for resetting statistics (Markus Mayer)

   - cpufreq governors fixes and cleanups (Chen Yu, Stratos Karafotis,
     Viresh Kumar)

   - Support for using generic cpufreq governors in the intel_pstate
     driver (Rafael Wysocki)

   - Support for per-logical-CPU P-state limits and the EPP/EPB (Energy
     Performance Preference/Energy Performance Bias) knobs in the
     intel_pstate driver (Srinivas Pandruvada)

   - New CPU ID for Knights Mill in intel_pstate (Piotr Luc)

   - intel_pstate driver modification to use the P-state selection
     algorithm based on CPU load on platforms with the system profile in
     the ACPI tables set to "mobile" (Srinivas Pandruvada)

   - intel_pstate driver cleanups (Arnd Bergmann, Rafael Wysocki,
     Srinivas Pandruvada)

   - cpufreq powernv driver updates including fast switching support
     (for the schedutil governor), fixes and cleanus (Akshay Adiga,
     Andrew Donnellan, Denis Kirjanov)

   - acpi-cpufreq driver rework to switch it over to the new CPU
     offline/online state machine (Sebastian Andrzej Siewior)

   - Assorted cleanups in cpufreq drivers (Wei Yongjun, Prashanth
     Prakash)

   - Idle injection rework (to make it use the regular idle path instead
     of a home-grown custom one) and related powerclamp thermal driver
     updates (Peter Zijlstra, Jacob Pan, Petr Mladek, Sebastian Andrzej
     Siewior)

   - New CPU IDs for Atom Z34xx and Knights Mill in intel_idle (Andy
     Shevchenko, Piotr Luc)

   - intel_idle driver cleanups and switch over to using the new CPU
     offline/online state machine (Anna-Maria Gleixner, Sebastian
     Andrzej Siewior)

   - cpuidle DT driver update to support suspend-to-idle properly
     (Sudeep Holla)

   - cpuidle core cleanups and misc updates (Daniel Lezcano, Pan Bian,
     Rafael Wysocki)

   - Preliminary support for power domains including CPUs in the generic
     power domains (genpd) framework and related DT bindings (Lina Iyer)

   - Assorted fixes and cleanups in the generic power domains (genpd)
     framework (Colin Ian King, Dan Carpenter, Geert Uytterhoeven)

   - Preliminary support for devices with multiple voltage regulators
     and related fixes and cleanups in the Operating Performance Points
     (OPP) library (Viresh Kumar, Masahiro Yamada, Stephen Boyd)

   - System sleep state selection interface rework to make it easier to
     support suspend-to-idle as the default system suspend method
     (Rafael Wysocki)

   - PM core fixes and cleanups, mostly related to the interactions
     between the system suspend and runtime PM frameworks (Ulf Hansson,
     Sahitya Tummala, Tony Lindgren)

   - Latency tolerance PM QoS framework imorovements (Andrew Lutomirski)

   - New Knights Mill CPU ID for the Intel RAPL power capping driver
     (Piotr Luc)

   - Intel RAPL power capping driver fixes, cleanups and switch over to
     using the new CPU offline/online state machine (Jacob Pan, Thomas
     Gleixner, Sebastian Andrzej Siewior)

   - Fixes and cleanups in the exynos-ppmu, exynos-nocp, rk3399_dmc,
     rockchip-dfi devfreq drivers and the devfreq core (Axel Lin,
     Chanwoo Choi, Javier Martinez Canillas, MyungJoo Ham, Viresh Kumar)

   - Fix for false-positive KASAN warnings during resume from ACPI S3
     (suspend-to-RAM) on x86 (Josh Poimboeuf)

   - Memory map verification during resume from hibernation on x86 to
     ensure a consistent address space layout (Chen Yu)

   - Wakeup sources debugging enhancement (Xing Wei)

   - rockchip-io AVS driver cleanup (Shawn Lin)"

* tag 'pm-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (127 commits)
  devfreq: rk3399_dmc: Don't use OPP structures outside of RCU locks
  devfreq: rk3399_dmc: Remove dangling rcu_read_unlock()
  devfreq: exynos: Don't use OPP structures outside of RCU locks
  Documentation: intel_pstate: Document HWP energy/performance hints
  cpufreq: intel_pstate: Support for energy performance hints with HWP
  cpufreq: intel_pstate: Add locking around HWP requests
  PM / sleep: Print active wakeup sources when blocking on wakeup_count reads
  PM / core: Fix bug in the error handling of async suspend
  PM / wakeirq: Fix dedicated wakeirq for drivers not using autosuspend
  PM / Domains: Fix compatible for domain idle state
  PM / OPP: Don't WARN on multiple calls to dev_pm_opp_set_regulators()
  PM / OPP: Allow platform specific custom set_opp() callbacks
  PM / OPP: Separate out _generic_set_opp()
  PM / OPP: Add infrastructure to manage multiple regulators
  PM / OPP: Pass struct dev_pm_opp_supply to _set_opp_voltage()
  PM / OPP: Manage supply's voltage/current in a separate structure
  PM / OPP: Don't use OPP structure outside of rcu protected section
  PM / OPP: Reword binding supporting multiple regulators per device
  PM / OPP: Fix incorrect cpu-supply property in binding
  cpuidle: Add a kerneldoc comment to cpuidle_use_deepest_state()
  ..
2016-12-13 10:41:53 -08:00
Linus Torvalds
36869cb93d Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main block pull request this series. Contrary to previous
  release, I've kept the core and driver changes in the same branch. We
  always ended up having dependencies between the two for obvious
  reasons, so makes more sense to keep them together. That said, I'll
  probably try and keep more topical branches going forward, especially
  for cycles that end up being as busy as this one.

  The major parts of this pull request is:

   - Improved support for O_DIRECT on block devices, with a small
     private implementation instead of using the pig that is
     fs/direct-io.c. From Christoph.

   - Request completion tracking in a scalable fashion. This is utilized
     by two components in this pull, the new hybrid polling and the
     writeback queue throttling code.

   - Improved support for polling with O_DIRECT, adding a hybrid mode
     that combines pure polling with an initial sleep. From me.

   - Support for automatic throttling of writeback queues on the block
     side. This uses feedback from the device completion latencies to
     scale the queue on the block side up or down. From me.

   - Support from SMR drives in the block layer and for SD. From Hannes
     and Shaun.

   - Multi-connection support for nbd. From Josef.

   - Cleanup of request and bio flags, so we have a clear split between
     which are bio (or rq) private, and which ones are shared. From
     Christoph.

   - A set of patches from Bart, that improve how we handle queue
     stopping and starting in blk-mq.

   - Support for WRITE_ZEROES from Chaitanya.

   - Lightnvm updates from Javier/Matias.

   - Supoort for FC for the nvme-over-fabrics code. From James Smart.

   - A bunch of fixes from a whole slew of people, too many to name
     here"

* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
  blk-stat: fix a few cases of missing batch flushing
  blk-flush: run the queue when inserting blk-mq flush
  elevator: make the rqhash helpers exported
  blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  blk-mq: add blk_mq_start_stopped_hw_queue()
  block: improve handling of the magic discard payload
  blk-wbt: don't throttle discard or write zeroes
  nbd: use dev_err_ratelimited in io path
  nbd: reset the setup task for NBD_CLEAR_SOCK
  nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
  nvme-fabrics: Add target support for FC transport
  nvme-fabrics: Add host support for FC transport
  nvme-fabrics: Add FC transport LLDD api definitions
  nvme-fabrics: Add FC transport FC-NVME definitions
  nvme-fabrics: Add FC transport error codes to nvme.h
  Add type 0x28 NVME type code to scsi fc headers
  nvme-fabrics: patch target code in prep for FC transport support
  nvme-fabrics: set sqe.command_id in core not transports
  parser: add u64 number parser
  nvme-rdma: align to generic ib_event logging helper
  ...
2016-12-13 10:19:16 -08:00
Linus Torvalds
52281b38bc Improvements and fixes to pstore subsystem:
- Add additional checks for bad platform data
 
 - Remove bounce buffer in console writer
 
 - Protect read/unlink race with a mutex
 
 - Correctly give up during dump locking failures
 
 - Increase ftrace bandwidth by splitting ftrace buffers per CPU
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJYSJxYAAoJEIly9N/cbcAmYBsQAIAmHDgk3ootLQhyatZ9H2X0
 Nyl24xA7UCPaz13ddF1tUaItI4mYBWfY4gde+3fIVXDitgmFxZZqb8YV68CvFgUt
 Hb8tlTiM0F2z/muGBIgJ5TN5XiB4dO0WgvcKvnQdzyNGPVlAXvowHPkaM9X+iEA1
 y4U2Le7iK9+9fvkH7RM4O3hMiTmpKeUITYTWo1Y8n9LaZo3w5+pqhS+TPu75uyD0
 pLb53EOzZmg1nu9hcac5t4G5W1Lr4ji2EekDXemi/571HAzQnMXxJWc6ZVYLDNfP
 W4D0UGcHAERDzrYwWcGn8HIThYlpbnVw9atSTTodJTiIubtsRt4haycUH1hqMS5o
 4R2myhbAoM0A3zYBqrhwtQHg8apNes2hOR2WycAqgvylZZl1o6zaEs9zc7aafYuy
 N/M0x5tlya3fOgkvkJsmERT5jtqDVMhtBZ2xa8NYfJCHgULaUmjEx25eTr1kF3nW
 ERIX/3IayMvqHwYptP9dOzy2owLpXC8yZlM34AeM+ub93hHj1ELLfG7aN0bklD/+
 wfmIX8HpOA2XGWflOk5fiHLHro6pwRU9zOIIHFJ4Tf60PMoN+rjRfej1fjz+KOhO
 gxUYaCb+/4BlCqLqdFvF54qhQO2qmVuOAg/1BLu+hnZtXSyhVJxePthSs5shyoE8
 owL8rVXDGapjF1xO6WCR
 =UmFL
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore updates from Kees Cook:
 "Improvements and fixes to pstore subsystem:

   - add additional checks for bad platform data

   - remove bounce buffer in console writer

   - protect read/unlink race with a mutex

   - correctly give up during dump locking failures

   - increase ftrace bandwidth by splitting ftrace buffers per CPU"

* tag 'pstore-v4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  ramoops: add pdata NULL check to ramoops_probe
  pstore: Convert console write to use ->write_buf
  pstore: Protect unlink with read_mutex
  pstore: Use global ftrace filters for function trace filtering
  ftrace: Provide API to use global filtering for ftrace ops
  pstore: Clarify context field przs as dprzs
  pstore: improve error report for failed setup
  pstore: Merge per-CPU ftrace records into one
  pstore: Add ftrace timestamp counter
  ramoops: Split ftrace buffer space into per-CPU zones
  pstore: Make ramoops_init_przs generic for other prz arrays
  pstore: Allow prz to control need for locking
  pstore: Warn on PSTORE_TYPE_PMSG using deprecated function
  pstore: Make spinlock per zone instead of global
  pstore: Actually give up during locking failure
2016-12-13 09:16:11 -08:00
Linus Torvalds
e34bac726d Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - various misc bits

 - most of MM (quite a lot of MM material is awaiting the merge of
   linux-next dependencies)

 - kasan

 - printk updates

 - procfs updates

 - MAINTAINERS

 - /lib updates

 - checkpatch updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (123 commits)
  init: reduce rootwait polling interval time to 5ms
  binfmt_elf: use vmalloc() for allocation of vma_filesz
  checkpatch: don't emit unified-diff error for rename-only patches
  checkpatch: don't check c99 types like uint8_t under tools
  checkpatch: avoid multiple line dereferences
  checkpatch: don't check .pl files, improve absolute path commit log test
  scripts/checkpatch.pl: fix spelling
  checkpatch: don't try to get maintained status when --no-tree is given
  lib/ida: document locking requirements a bit better
  lib/rbtree.c: fix typo in comment of ____rb_erase_color
  lib/Kconfig.debug: make CONFIG_STRICT_DEVMEM depend on CONFIG_DEVMEM
  MAINTAINERS: add drm and drm/i915 irc channels
  MAINTAINERS: add "C:" for URI for chat where developers hang out
  MAINTAINERS: add drm and drm/i915 bug filing info
  MAINTAINERS: add "B:" for URI where to file bugs
  get_maintainer: look for arbitrary letter prefixes in sections
  printk: add Kconfig option to set default console loglevel
  printk/sound: handle more message headers
  printk/btrfs: handle more message headers
  printk/kdb: handle more message headers
  ...
2016-12-12 20:50:02 -08:00
Linus Torvalds
f082f02c47 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "The irq department provides:

   - a major update to the auto affinity management code, which is used
     by multi-queue devices

   - move of the microblaze irq chip driver into the common driver code
     so it can be shared between microblaze, powerpc and MIPS

   - a series of updates to the ARM GICV3 interrupt controller

   - the usual pile of fixes and small improvements all over the place"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  powerpc/virtex: Use generic xilinx irqchip driver
  irqchip/xilinx: Try to fall back if xlnx,kind-of-intr not provided
  irqchip/xilinx: Add support for parent intc
  irqchip/xilinx: Rename get_irq to xintc_get_irq
  irqchip/xilinx: Restructure and use jump label api
  irqchip/xilinx: Clean up print messages
  microblaze/irqchip: Move intc driver to irqchip
  ARM: virt: Select ARM_GIC_V3_ITS
  ARM: gic-v3-its: Add 32bit support to GICv3 ITS
  irqchip/gic-v3-its: Specialise readq and writeq accesses
  irqchip/gic-v3-its: Specialise flush_dcache operation
  irqchip/gic-v3-its: Narrow down Entry Size when used as a divider
  irqchip/gic-v3-its: Change unsigned types for AArch32 compatibility
  irqchip/gic-v3: Use nops macro for Cavium ThunderX erratum 23154
  irqchip/gic-v3: Convert arm64 GIC accessors to {read,write}_sysreg_s
  genirq/msi: Drop artificial PCI dependency
  irqchip/bcm7038-l1: Implement irq_cpu_offline() callback
  genirq/affinity: Use default affinity mask for reserved vectors
  genirq/affinity: Take reserved vectors into account when spreading irqs
  PCI: Remove the irq_affinity mask from struct pci_dev
  ...
2016-12-12 20:23:11 -08:00
Linus Torvalds
9465d9cc31 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "The time/timekeeping/timer folks deliver with this update:

   - Fix a reintroduced signed/unsigned issue and cleanup the whole
     signed/unsigned mess in the timekeeping core so this wont happen
     accidentaly again.

   - Add a new trace clock based on boot time

   - Prevent injection of random sleep times when PM tracing abuses the
     RTC for storage

   - Make posix timers configurable for real tiny systems

   - Add tracepoints for the alarm timer subsystem so timer based
     suspend wakeups can be instrumented

   - The usual pile of fixes and updates to core and drivers"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  timekeeping: Use mul_u64_u32_shr() instead of open coding it
  timekeeping: Get rid of pointless typecasts
  timekeeping: Make the conversion call chain consistently unsigned
  timekeeping_Force_unsigned_clocksource_to_nanoseconds_conversion
  alarmtimer: Add tracepoints for alarm timers
  trace: Update documentation for mono, mono_raw and boot clock
  trace: Add an option for boot clock as trace clock
  timekeeping: Add a fast and NMI safe boot clock
  timekeeping/clocksource_cyc2ns: Document intended range limitation
  timekeeping: Ignore the bogus sleep time if pm_trace is enabled
  selftests/timers: Fix spelling mistake "Asyncrhonous" -> "Asynchronous"
  clocksource/drivers/bcm2835_timer: Unmap region obtained by of_iomap
  clocksource/drivers/arm_arch_timer: Map frame with of_io_request_and_map()
  arm64: dts: rockchip: Arch counter doesn't tick in system suspend
  clocksource/drivers/arm_arch_timer: Don't assume clock runs in suspend
  posix-timers: Make them configurable
  posix_cpu_timers: Move the add_device_randomness() call to a proper place
  timer: Move sys_alarm from timer.c to itimer.c
  ptp_clock: Allow for it to be optional
  Kconfig: Regenerate *.c_shipped files after previous changes
  ...
2016-12-12 19:56:15 -08:00
Linus Torvalds
e71c3978d6 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp hotplug updates from Thomas Gleixner:
 "This is the final round of converting the notifier mess to the state
  machine. The removal of the notifiers and the related infrastructure
  will happen around rc1, as there are conversions outstanding in other
  trees.

  The whole exercise removed about 2000 lines of code in total and in
  course of the conversion several dozen bugs got fixed. The new
  mechanism allows to test almost every hotplug step standalone, so
  usage sites can exercise all transitions extensively.

  There is more room for improvement, like integrating all the
  pointlessly different architecture mechanisms of synchronizing,
  setting cpus online etc into the core code"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
  tracing/rb: Init the CPU mask on allocation
  soc/fsl/qbman: Convert to hotplug state machine
  soc/fsl/qbman: Convert to hotplug state machine
  zram: Convert to hotplug state machine
  KVM/PPC/Book3S HV: Convert to hotplug state machine
  arm64/cpuinfo: Convert to hotplug state machine
  arm64/cpuinfo: Make hotplug notifier symmetric
  mm/compaction: Convert to hotplug state machine
  iommu/vt-d: Convert to hotplug state machine
  mm/zswap: Convert pool to hotplug state machine
  mm/zswap: Convert dst-mem to hotplug state machine
  mm/zsmalloc: Convert to hotplug state machine
  mm/vmstat: Convert to hotplug state machine
  mm/vmstat: Avoid on each online CPU loops
  mm/vmstat: Drop get_online_cpus() from init_cpu_node_state/vmstat_cpu_dead()
  tracing/rb: Convert to hotplug state machine
  oprofile/nmi timer: Convert to hotplug state machine
  net/iucv: Use explicit clean up labels in iucv_init()
  x86/pci/amd-bus: Convert to hotplug state machine
  x86/oprofile/nmi: Convert to hotplug state machine
  ...
2016-12-12 19:25:04 -08:00
Petr Mladek
497957576c printk/kdb: handle more message headers
Commit 4bcc595ccd ("printk: reinstate KERN_CONT for printing
continuation lines") allows to define more message headers for a single
message.  The motivation is that continuous lines might get mixed.
Therefore it make sense to define the right log level for every piece of
a cont line.

This patch introduces printk_skip_headers() that will skip all headers
and uses it in the kdb code instead of printk_skip_level().

This approach helps to fix other printk_skip_level() users
independently.

Link: http://lkml.kernel.org/r/1478695291-12169-3-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Joe Perches <joe@perches.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Petr Mladek
22c2c7b2ef printk/NMI: handle continuous lines and missing newline
Commit 4bcc595ccd ("printk: reinstate KERN_CONT for printing
continuation lines") added back KERN_CONT message header.  As a result
it might appear in the middle of the line when the parts are squashed
via the temporary NMI buffer.

A reasonable solution seems to be to split the text in the NNI temporary
not only by newlines but also by the message headers.

Another solution would be to filter out KERN_CONT when writing to the
temporary buffer.  But this would complicate the lockless handling.
Also it would not solve problems with a missing newline that was there
even before the KERN_CONT stuff.

This patch moves the temporary buffer handling into separate function.
I played with it and it seems that using the char pointers make the code
easier to read.

Also it prints the final newline as a continuous line.

Finally, it moves handling of the s->len overflow into the paranoid
check.  And allows to recover from the disaster.

Link: http://lkml.kernel.org/r/1478695291-12169-2-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Petr Mladek
4a998e322a printk/NMI: fix up handling of the full nmi log buffer
vsnprintf() adds the trailing '\0' but it does not count it into the
number of printed characters.  The result is that there is one byte less
space for the real characters in the buffer.

The broken check for the free space might cause that we will repeatedly
try to print 1 character into the buffer, never reach the full buffer,
and do not count the messages as missed.

Also vsnprintf() returns the number of characters that would be printed
if the buffer was big enough.  As a result, s->len might be bigger than
the size of the buffer[*].  And the printk() function might return
bigger len than it really printed.  Both problems are fixed by using
vscnprintf() instead.

Note that I though about increasing the number of missed messages even
when the message was shrunken.  But it made the code even more
complicated.  I think that it is not worth it.  Shrunken messages are
usually easy to recognize.  And it should be a corner case.

[*] The overflown s->len value is crazy and unexpected.  I "made a
mistake" and reported this situation as an internal error when fixed
handling of PR_CONT headers in some other patch.

Link: http://lkml.kernel.org/r/20161208174912.GA17042@linux.suse
Signed-off-by: Petr Mladek <pmladek@suse.com>
CcL Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Takashi Iwai <tiwai@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Tetsuo Handa
4ca5ede07c hung_task: decrement sysctl_hung_task_warnings only if it is positive
Since sysctl_hung_task_warnings == -1 is allowed (infinite warnings),
commit 48a6d64eda ("hung_task: allow hung_task_panic when
hung_task_warnings is 0") should decrement it only when it is not -1.

This prevents the kernel from ceasing warnings after the first
4294967295 ;)

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: John Siddle <jsiddle@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Andrey Ryabinin
0f110a9b95 kernel/fork: use vfree_atomic() to free thread stack
vfree() is going to use sleeping lock.  Thread stack freed in atomic
context, therefore we must use vfree_atomic() here.

Link: http://lkml.kernel.org/r/1479474236-4139-6-git-send-email-hch@lst.de
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Jisheng Zhang <jszhang@marvell.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: John Dias <joaodias@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:08 -08:00
Stanislav Kinsburskiy
3fb4afd9a5 prctl: remove one-shot limitation for changing exe link
This limitation came with the reason to remove "another way for
malicious code to obscure a compromised program and masquerade as a
benign process" by allowing "security-concious program can use this
prctl once during its early initialization to ensure the prctl cannot
later be abused for this purpose":

    http://marc.info/?l=linux-kernel&m=133160684517468&w=2

This explanation doesn't look sufficient.  The only thing "exe" link is
indicating is the file, used to execve, which is basically nothing and
not reliable immediately after process has returned from execve system
call.

Moreover, to use this feture, all the mappings to previous exe file have
to be unmapped and all the new exe file permissions must be satisfied.

Which means, that changing exe link is very similar to calling execve on
the binary.

The need to remove this limitations comes from migration of NFS mount
point, which is not accessible during restore and replaced by other file
system.  Because of this exe link has to be changed twice.

[akpm@linux-foundation.org: fix up comment]
Link: http://lkml.kernel.org/r/20160927153755.9337.69650.stgit@localhost.localdomain
Signed-off-by: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
Nicolas Iooss
c0b942a763 kthread: add __printf attributes
When commit fbae2d44aa ("kthread: add kthread_create_worker*()")
introduced some kthread_create_...() functions which were taking
printf-like parametter, it introduced __printf attributes to some
functions (e.g.  kthread_create_worker()).  Nevertheless some new
functions were forgotten (they have been detected thanks to
-Wmissing-format-attribute warning flag).

Add the missing __printf attributes to the newly-introduced functions in
order to detect formatting issues at build-time with -Wformat flag.

Link: http://lkml.kernel.org/r/20161126193543.22672-1-nicolas.iooss_linux@m4x.org
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
Marcin Nowakowski
d4d7ccc834 kprobes/trace: Fix kprobe selftest for newer gcc
Commit 265a5b7ee3 ("kprobes/trace: Fix kprobe selftest for gcc 4.6")
has added __used attribute to kprobe_trace_selftest_target to ensure
that the method is listed in kallsyms table.

However, even though the method remains in the kernel image, the actual
call is optimized away as there are no side effects and the return value
is never checked.

Add a return value check and a 'noinline' attribute to ensure that an
inlined copy of the method is not used by the caller. Also add checks
that verify that the kprobe was really hit, as at the moment the tests
show positive results despite the test method being optimized away.

Finally, add __init annotations to find_trace_probe_file() and
kprobe_trace_selftest_target() as they are only called from within an
__init method.

Link: http://lkml.kernel.org/r/1481293178-3128-2-git-send-email-marcin.nowakowski@imgtec.com

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-12-12 21:21:43 -05:00
Marcin Nowakowski
f18f97ac43 tracing/kprobes: Add a helper method to return number of probe hits
The number of probe hits is stored in a percpu variable and therefore
can't be read directly. Add a helper method trace_kprobe_nhit() that
performs the required calculation.

It will be used in a follow-up commit that changes kprobe selftests to
verify the number of probe hits.

Link: http://lkml.kernel.org/r/1481293178-3128-1-git-send-email-marcin.nowakowski@imgtec.com

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-12-12 21:17:44 -05:00
Sebastian Andrzej Siewior
99e6f6e813 tracing/rb: Init the CPU mask on allocation
Before commit b32614c034 ("tracing/rb: Convert to hotplug state
machine") the allocated cpumask was initialized to the mask of ONLINE or
POSSIBLE CPUs. After the CPU hotplug changes the buffer initialisation
moved to trace_rb_cpu_prepare() but I forgot to initially set the
cpumask to zero. This is done now.

Link: http://lkml.kernel.org/r/20161207133133.hzkcqfllxcdi3joz@linutronix.de

Fixes: b32614c034 ("tracing/rb: Convert to hotplug state machine")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-12-12 17:57:26 -05:00
Linus Torvalds
5645688f9d Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm updates from Ingo Molnar:
 "The main changes in this development cycle were:

   - a large number of call stack dumping/printing improvements: higher
     robustness, better cross-context dumping, improved output, etc.
     (Josh Poimboeuf)

   - vDSO getcpu() performance improvement for future Intel CPUs with
     the RDPID instruction (Andy Lutomirski)

   - add two new Intel AVX512 features and the CPUID support
     infrastructure for it: AVX512IFMA and AVX512VBMI. (Gayatri Kammela,
     He Chen)

   - more copy-user unification (Borislav Petkov)

   - entry code assembly macro simplifications (Alexander Kuleshov)

   - vDSO C/R support improvements (Dmitry Safonov)

   - misc fixes and cleanups (Borislav Petkov, Paul Bolle)"

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  scripts/decode_stacktrace.sh: Fix address line detection on x86
  x86/boot/64: Use defines for page size
  x86/dumpstack: Make stack name tags more comprehensible
  selftests/x86: Add test_vdso to test getcpu()
  x86/vdso: Use RDPID in preference to LSL when available
  x86/dumpstack: Handle NULL stack pointer in show_trace_log_lvl()
  x86/cpufeatures: Enable new AVX512 cpu features
  x86/cpuid: Provide get_scattered_cpuid_leaf()
  x86/cpuid: Cleanup cpuid_regs definitions
  x86/copy_user: Unify the code by removing the 64-bit asm _copy_*_user() variants
  x86/unwind: Ensure stack grows down
  x86/vdso: Set vDSO pointer only after success
  x86/prctl/uapi: Remove #ifdef for CHECKPOINT_RESTORE
  x86/unwind: Detect bad stack return address
  x86/dumpstack: Warn on stack recursion
  x86/unwind: Warn on bad frame pointer
  x86/decoder: Use stderr if insn sanity test fails
  x86/decoder: Use stdout if insn decoder test is successful
  mm/page_alloc: Remove kernel address exposure in free_reserved_area()
  x86/dumpstack: Remove raw stack dump
  ...
2016-12-12 13:49:57 -08:00
Linus Torvalds
cbaa1576c4 Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull hotplug API fix from Ingo Molnar:
 "Late breaking fix from the v4.9 cycle: fix a hotplug register/
  unregister notifier API asymmetry bug that can cause kernel warnings
  (and worse) with certain Kconfig combinations"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hotplug: Make register and unregister notifier API symmetric
2016-12-12 12:53:54 -08:00
Linus Torvalds
92c020d08d Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main scheduler changes in this cycle were:

   - support Intel Turbo Boost Max Technology 3.0 (TBM3) by introducig a
     notion of 'better cores', which the scheduler will prefer to
     schedule single threaded workloads on. (Tim Chen, Srinivas
     Pandruvada)

   - enhance the handling of asymmetric capacity CPUs further (Morten
     Rasmussen)

   - improve/fix load handling when moving tasks between task groups
     (Vincent Guittot)

   - simplify and clean up the cputime code (Stanislaw Gruszka)

   - improve mass fork()ed task spread a.k.a. hackbench speedup (Vincent
     Guittot)

   - make struct kthread kmalloc()ed and related fixes (Oleg Nesterov)

   - add uaccess atomicity debugging (when using access_ok() in the
     wrong context), under CONFIG_DEBUG_ATOMIC_SLEEP=y (Peter Zijlstra)

   - implement various fixes, cleanups and other enhancements (Daniel
     Bristot de Oliveira, Martin Schwidefsky, Rafael J. Wysocki)"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
  sched/core: Use load_avg for selecting idlest group
  sched/core: Fix find_idlest_group() for fork
  kthread: Don't abuse kthread_create_on_cpu() in __kthread_create_worker()
  kthread: Don't use to_live_kthread() in kthread_[un]park()
  kthread: Don't use to_live_kthread() in kthread_stop()
  Revert "kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function"
  kthread: Make struct kthread kmalloc'ed
  x86/uaccess, sched/preempt: Verify access_ok() context
  sched/x86: Make CONFIG_SCHED_MC_PRIO=y easier to enable
  sched/x86: Change CONFIG_SCHED_ITMT to CONFIG_SCHED_MC_PRIO
  x86/sched: Use #include <linux/mutex.h> instead of #include <asm/mutex.h>
  cpufreq/intel_pstate: Use CPPC to get max performance
  acpi/bus: Set _OSC for diverse core support
  acpi/bus: Enable HWP CPPC objects
  x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU
  x86/sysctl: Add sysctl for ITMT scheduling feature
  x86: Enable Intel Turbo Boost Max Technology 3.0
  x86/topology: Define x86's arch_update_cpu_topology
  sched: Extend scheduler's asym packing
  sched/fair: Clean up the tunable parameter definitions
  ...
2016-12-12 12:15:10 -08:00
Rafael J. Wysocki
631ddaba59 Merge branches 'pm-sleep' and 'powercap'
* pm-sleep:
  PM / sleep: Print active wakeup sources when blocking on wakeup_count reads
  x86/suspend: fix false positive KASAN warning on suspend/resume
  PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flag
  PM / sleep: System sleep state selection interface rework
  PM / hibernate: Verify the consistent of e820 memory map by md5 digest

* powercap:
  powercap / RAPL: Add Knights Mill CPUID
  powercap/intel_rapl: fix and tidy up error handling
  powercap/intel_rapl: Track active CPUs internally
  powercap/intel_rapl: Cleanup duplicated init code
  powercap/intel rapl: Convert to hotplug state machine
  powercap/intel_rapl: Propagate error code when registration fails
  powercap/intel_rapl: Add missing domain data update on hotplug
2016-12-12 20:46:35 +01:00
Rafael J. Wysocki
b19ad3b9f1 Merge branch 'pm-cpuidle'
* pm-cpuidle:
  cpuidle: Add a kerneldoc comment to cpuidle_use_deepest_state()
  cpuidle: fix improper return value on error
  intel_idle: Convert to hotplug state machine
  intel_idle: Remove superfluous SMP fuction call
  MAINTAINERS: Add Jacob Pan as a new intel_idle maintainer
  MAINTAINERS: Add bug tracking system location entries for cpuidle
  x86/intel_idle: Add Knights Mill CPUID
  x86/intel_idle: Add CPU model 0x4a (Atom Z34xx series)
  thermal/intel_powerclamp: stop sched tick in forced idle
  thermal/intel_powerclamp: Convert to CPU hotplug state
  thermal/intel_powerclamp: Convert the kthread to kthread worker API
  thermal/intel_powerclamp: Remove duplicated code that starts the kthread
  sched/idle: Add support for tasks that inject idle
  cpuidle: Allow enforcing deepest idle state selection
  cpuidle/powernv: staticise powernv_idle_driver
  cpuidle: dt: assign ->enter_freeze to same as ->enter callback function
  cpuidle: governors: Remove remaining old module code
2016-12-12 20:46:15 +01:00
Rafael J. Wysocki
fecc8c0ebd Merge branch 'pm-cpufreq'
* pm-cpufreq: (51 commits)
  Documentation: intel_pstate: Document HWP energy/performance hints
  cpufreq: intel_pstate: Support for energy performance hints with HWP
  cpufreq: intel_pstate: Add locking around HWP requests
  cpufreq: ondemand: Set MIN_FREQUENCY_UP_THRESHOLD to 1
  cpufreq: intel_pstate: Add Knights Mill CPUID
  MAINTAINERS: Add bug tracking system location entry for cpufreq
  cpufreq: dt: Add support for zx296718
  cpufreq: acpi-cpufreq: drop rdmsr_on_cpus() usage
  cpufreq: acpi-cpufreq: Convert to hotplug state machine
  cpufreq: intel_pstate: fix intel_pstate_exit_perf_limits() prototype
  cpufreq: intel_pstate: Set EPP/EPB to 0 in performance mode
  cpufreq: schedutil: Rectify comment in sugov_irq_work() function
  cpufreq: intel_pstate: increase precision of performance limits
  cpufreq: intel_pstate: round up min_perf limits
  cpufreq: Make cpufreq_update_policy() void
  ACPI / processor: Make acpi_processor_ppc_has_changed() void
  cpufreq: Avoid using inactive policies
  cpufreq: intel_pstate: Generic governors support
  cpufreq: intel_pstate: Request P-states control from SMM if needed
  cpufreq: dt: Add support for r8a7743 and r8a7745
  ...
2016-12-12 20:45:01 +01:00
Pavankumar Kondeti
c59f29cb14 tracing: Use SOFTIRQ_OFFSET for softirq dectection for more accurate results
The 's' flag is supposed to indicate that a softirq is running. This
can be detected by testing the preempt_count with SOFTIRQ_OFFSET.

The current code tests the preempt_count with SOFTIRQ_MASK, which
would be true even when softirqs are disabled but not serving a
softirq.

Link: http://lkml.kernel.org/r/1481300417-3564-1-git-send-email-pkondeti@codeaurora.org

Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-12-12 13:51:02 -05:00
Linus Torvalds
6cdf89b1ca Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The tree got pretty big in this development cycle, but the net effect
  is pretty good:

    115 files changed, 673 insertions(+), 1522 deletions(-)

  The main changes were:

   - Rework and generalize the mutex code to remove per arch mutex
     primitives. (Peter Zijlstra)

   - Add vCPU preemption support: add an interface to query the
     preemption status of vCPUs and use it in locking primitives - this
     optimizes paravirt performance. (Pan Xinhui, Juergen Gross,
     Christian Borntraeger)

   - Introduce cpu_relax_yield() and remov cpu_relax_lowlatency() to
     clean up and improve the s390 lock yielding machinery and its core
     kernel impact. (Christian Borntraeger)

   - Micro-optimize mutexes some more. (Waiman Long)

   - Reluctantly add the to-be-deprecated mutex_trylock_recursive()
     interface on a temporary basis, to give the DRM code more time to
     get rid of its locking hacks. Any other users will be NAK-ed on
     sight. (We turned off the deprecation warning for the time being to
     not pollute the build log.) (Peter Zijlstra)

   - Improve the rtmutex code a bit, in light of recent long lived
     bugs/races. (Thomas Gleixner)

   - Misc fixes, cleanups"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  x86/paravirt: Fix bool return type for PVOP_CALL()
  x86/paravirt: Fix native_patch()
  locking/ww_mutex: Use relaxed atomics
  locking/rtmutex: Explain locking rules for rt_mutex_proxy_unlock()/init_proxy_locked()
  locking/rtmutex: Get rid of RT_MUTEX_OWNER_MASKALL
  x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()
  locking/mutex: Break out of expensive busy-loop on {mutex,rwsem}_spin_on_owner() when owner vCPU is preempted
  locking/osq: Break out of spin-wait busy waiting loop for a preempted vCPU in osq_lock()
  Documentation/virtual/kvm: Support the vCPU preemption check
  x86/xen: Support the vCPU preemption check
  x86/kvm: Support the vCPU preemption check
  x86/kvm: Support the vCPU preemption check
  kvm: Introduce kvm_write_guest_offset_cached()
  locking/core, x86/paravirt: Implement vcpu_is_preempted(cpu) for KVM and Xen guests
  locking/spinlocks, s390: Implement vcpu_is_preempted(cpu)
  locking/core, powerpc: Implement vcpu_is_preempted(cpu)
  sched/core: Introduce the vcpu_is_preempted(cpu) interface
  sched/wake_q: Rename WAKE_Q to DEFINE_WAKE_Q
  locking/core: Provide common cpu_relax_yield() definition
  locking/mutex: Don't mark mutex_trylock_recursive() as deprecated, temporarily
  ...
2016-12-12 10:48:02 -08:00
Linus Torvalds
9ad1aeecdb Merge branch 'core-smp-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP bootup updates from Ingo Molnar:
 "Three changes to unify/standardize some of the bootup message printing
  in kernel/smp.c between architectures"

* 'core-smp-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kernel/smp: Tell the user we're bringing up secondary CPUs
  kernel/smp: Make the SMP boot message common on all arches
  kernel/smp: Define pr_fmt() for smp.c
2016-12-12 10:02:01 -08:00
Linus Torvalds
718c0ddd6a Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main RCU changes in this development cycle were:

   - Miscellaneous fixes, including a change to call_rcu()'s rcu_head
     alignment check.

   - Security-motivated list consistency checks, which are disabled by
     default behind DEBUG_LIST.

   - Torture-test updates.

   - Documentation updates, yet again just simple changes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  torture: Prevent jitter from delaying build-only runs
  torture: Remove obsolete files from rcutorture .gitignore
  rcu: Don't kick unless grace period or request
  rcu: Make expedited grace periods recheck dyntick idle state
  torture: Trace long read-side delays
  rcu: RCU_TRACE enables event tracing as well as debugfs
  rcu: Remove obsolete comment from __call_rcu()
  rcu: Remove obsolete rcu_check_callbacks() header comment
  rcu: Tighten up __call_rcu() rcu_head alignment check
  Documentation/RCU: Fix minor typo
  documentation: Present updated RCU guarantee
  bug: Avoid Kconfig warning for BUG_ON_DATA_CORRUPTION
  lib/Kconfig.debug: Fix typo in select statement
  lkdtm: Add tests for struct list corruption
  bug: Provide toggle for BUG on data corruption
  list: Split list_del() debug checking into separate function
  rculist: Consolidate DEBUG_LIST for list_add_rcu()
  list: Split list_add() debug checking into separate function
2016-12-12 09:09:54 -08:00
Vincent Guittot
6b94780e45 sched/core: Use load_avg for selecting idlest group
find_idlest_group() only compares the runnable_load_avg when looking
for the least loaded group. But on fork intensive use case like
hackbench where tasks blocked quickly after the fork, this can lead to
selecting the same CPU instead of other CPUs, which have similar
runnable load but a lower load_avg.

When the runnable_load_avg of 2 CPUs are close, we now take into
account the amount of blocked load as a 2nd selection factor. There is
now 3 zones for the runnable_load of the rq:

 - [0 .. (runnable_load - imbalance)]:
	Select the new rq which has significantly less runnable_load

 - [(runnable_load - imbalance) .. (runnable_load + imbalance)]:
	The runnable loads are close so we use load_avg to chose
	between the 2 rq

 - [(runnable_load + imbalance) .. ULONG_MAX]:
	Keep the current rq which has significantly less runnable_load

The scale factor that is currently used for comparing runnable_load,
doesn't work well with small value. As an example, the use of a
scaling factor fails as soon as this_runnable_load == 0 because we
always select local rq even if min_runnable_load is only 1, which
doesn't really make sense because they are just the same. So instead
of scaling factor, we use an absolute margin for runnable_load to
detect CPUs with similar runnable_load and we keep using scaling
factor for blocked load.

For use case like hackbench, this enable the scheduler to select
different CPUs during the fork sequence and to spread tasks across the
system.

Tests have been done on a Hikey board (ARM based octo cores) for
several kernel. The result below gives min, max, avg and stdev values
of 18 runs with each configuration.

The patches depend on the "no missing update_rq_clock()" work.

hackbench -P -g 1

         ea86cb4b76  7dc603c902  v4.8        v4.8+patches
  min    0.049         0.050         0.051       0,048
  avg    0.057         0.057(0%)     0.057(0%)   0,055(+5%)
  max    0.066         0.068         0.070       0,063
  stdev  +/-9%         +/-9%         +/-8%       +/-9%

More performance numbers here:

  https://lkml.kernel.org/r/20161203214707.GI20785@codeblueprint.co.uk

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: kernellwp@gmail.com
Cc: umgwanakikbuti@gmail.com
Cc: yuyang.du@intel.comc
Link: http://lkml.kernel.org/r/1481216215-24651-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-11 13:10:57 +01:00
Vincent Guittot
f519a3f1c6 sched/core: Fix find_idlest_group() for fork
During fork, the utilization of a task is init once the rq has been
selected because the current utilization level of the rq is used to
set the utilization of the fork task. As the task's utilization is
still 0 at this step of the fork sequence, it doesn't make sense to
look for some spare capacity that can fit the task's utilization.
Furthermore, I can see perf regressions for the test:

   hackbench -P -g 1

because the least loaded policy is always bypassed and tasks are not
spread during fork.

With this patch and the fix below, we are back to same performances as
for v4.8. The fix below is only a temporary one used for the test
until a smarter solution is found because we can't simply remove the
test which is useful for others benchmarks

| @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
|
|	avg_cost = this_sd->avg_scan_cost;
|
| -	/*
| -	 * Due to large variance we need a large fuzz factor; hackbench in
| -	 * particularly is sensitive here.
| -	 */
| -	if ((avg_idle / 512) < avg_cost)
| -		return -1;
| -
|	time = local_clock();
|
|	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: kernellwp@gmail.com
Cc: umgwanakikbuti@gmail.com
Cc: yuyang.du@intel.comc
Link: http://lkml.kernel.org/r/1481216215-24651-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-11 13:10:56 +01:00
Ingo Molnar
6643aab30f Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-11 13:10:40 +01:00
Ingo Molnar
6f38751510 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-11 13:07:13 +01:00
David S. Miller
821781a9f4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-12-10 16:21:55 -05:00
Steven Rostedt (Red Hat)
1a41442864 tracing/fgraph: Have wakeup and irqsoff tracers ignore graph functions too
Currently both the wakeup and irqsoff traces do not handle set_graph_notrace
well. The ftrace infrastructure will ignore the return paths of all
functions leaving them hanging without an end:

  # echo '*spin*' > set_graph_notrace
  # cat trace
  [...]
          _raw_spin_lock() {
            preempt_count_add() {
            do_raw_spin_lock() {
          update_rq_clock();

Where the '*spin*' functions should have looked like this:

          _raw_spin_lock() {
            preempt_count_add();
            do_raw_spin_lock();
          }
          update_rq_clock();

Instead, have the wakeup and irqsoff tracers ignore the functions that are
set by the set_graph_notrace like the function_graph tracer does. Move
the logic in the function_graph tracer into a header to allow wakeup and
irqsoff tracers to use it as well.

Cc: Namhyung Kim <namhyung.kim@lge.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-12-09 09:21:35 -05:00
Steven Rostedt (Red Hat)
794de08a16 fgraph: Handle a case where a tracer ignores set_graph_notrace
Both the wakeup and irqsoff tracers can use the function graph tracer when
the display-graph option is set. The problem is that they ignore the notrace
file, and record the entry of functions that would be ignored by the
function_graph tracer. This causes the trace->depth to be recorded into the
ring buffer. The set_graph_notrace uses a trick by adding a large negative
number to the trace->depth when a graph function is to be ignored.

On trace output, the graph function uses the depth to record a stack of
functions. But since the depth is negative, it accesses the array with a
negative number and causes an out of bounds access that can cause a kernel
oops or corrupt data.

Have the print functions handle cases where a tracer still records functions
even when they are in set_graph_notrace.

Also add warnings if the depth is below zero before accessing the array.

Note, the function graph logic will still prevent the return of these
functions from being recorded, which means that they will be left hanging
without a return. For example:

   # echo '*spin*' > set_graph_notrace
   # echo 1 > options/display-graph
   # echo wakeup > current_tracer
   # cat trace
   [...]
      _raw_spin_lock() {
        preempt_count_add() {
        do_raw_spin_lock() {
      update_rq_clock();

Where it should look like:

      _raw_spin_lock() {
        preempt_count_add();
        do_raw_spin_lock();
      }
      update_rq_clock();

Cc: stable@vger.kernel.org
Cc: Namhyung Kim <namhyung.kim@lge.com>
Fixes: 29ad23b004 ("ftrace: Add set_graph_notrace filter")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-12-09 09:19:28 -05:00
Steven Rostedt (Red Hat)
656c7f0d2d tracing: Replace kmap with copy_from_user() in trace_marker writing
Instead of using get_user_pages_fast() and kmap_atomic() when writing
to the trace_marker file, just allocate enough space on the ring buffer
directly, and write into it via copy_from_user().

Writing into the trace_marker file use to allocate a temporary buffer
to perform the copy_from_user(), as we didn't want to write into the
ring buffer if the copy failed. But as a trace_marker write is suppose
to be extremely fast, and allocating memory causes other tracepoints to
trigger, Peter Zijlstra suggested using get_user_pages_fast() and
kmap_atomic() to keep the user space pages in memory and reading it
directly. But Henrik Austad had issues with this because it required taking
the mm->mmap_sem and causing long delays with the write.

Instead, just allocate the space in the ring buffer and use
copy_from_user() directly. If it faults, return -EFAULT and write
"<faulted>" into the ring buffer.

Link: http://lkml.kernel.org/r/20161208124018.72dd0f86@gandalf.local.home

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Henrik Austad <henrik@austad.us>
Cc: Peter Zijlstra <peterz@infradead.org>
Updates: d696b58ca2 "tracing: Do not allocate buffer for trace_marker"
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-12-09 09:18:14 -05:00
Steven Rostedt (Red Hat)
9c1f6bb8c8 tracing: Allow benchmark to be enabled at early_initcall()
The trace event start up selftests fails when the trace benchmark is
enabled, because it is disabled during boot. It really only needs to be
disabled before scheduling is set up, as it creates a thread.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-12-09 09:16:15 -05:00
Steven Rostedt (Red Hat)
989a0a3d24 tracing: Have system enable return error if one of the events fail
If one of the events within a system fails to enable when "1" is written
to the system "enable" file, it should return an error. Note, some events
may still be enabled, but the user should know that something did go wrong.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-12-09 09:15:41 -05:00
Steven Rostedt (Red Hat)
1dd349ab74 tracing: Do not start benchmark on boot up
Trace events are enabled very early on boot up via the boot command line
parameter. The benchmark tool creates a new thread to perform the trace
event benchmarking. But at start up, it is called before scheduling is set
up and because it creates a new thread before the init thread is created,
this crashes the kernel.

Have the benchmark fail to register when started via the kernel command
line.

Also, since the registering of a tracepoint now can handle failure cases,
return -ENOMEM instead of warning if the thread cannot be created.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-12-09 09:14:00 -05:00
Steven Rostedt (Red Hat)
8cf868affd tracing: Have the reg function allow to fail
Some tracepoints have a registration function that gets enabled when the
tracepoint is enabled. There may be cases that the registraction function
must fail (for example, can't allocate enough memory). In this case, the
tracepoint should also fail to register, otherwise the user would not know
why the tracepoint is not working.

Cc: David Howells <dhowells@redhat.com>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-12-09 09:13:30 -05:00
Thomas Gleixner
c029a2bec6 timekeeping: Use mul_u64_u32_shr() instead of open coding it
The resume code must deal with a clocksource delta which is potentially big
enough to overflow the 64bit mult.

Replace the open coded handling with the proper function.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Parit Bhargava <prarit@redhat.com>
Cc: Laurent Vivier <lvivier@redhat.com>
Cc: "Christopher S. Hall" <christopher.s.hall@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Liav Rehana <liavr@mellanox.com>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20161208204228.921674404@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-09 12:06:42 +01:00
Thomas Gleixner
cbd99e3b28 timekeeping: Get rid of pointless typecasts
cycle_t is defined as u64, so casting it to u64 is a pointless and
confusing exercise. cycle_t should simply go away and be replaced with a
plain u64 to avoid further confusion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Parit Bhargava <prarit@redhat.com>
Cc: Laurent Vivier <lvivier@redhat.com>
Cc: "Christopher S. Hall" <christopher.s.hall@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Liav Rehana <liavr@mellanox.com>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20161208204228.844699737@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-09 12:06:42 +01:00
Thomas Gleixner
acc89612a7 timekeeping: Make the conversion call chain consistently unsigned
Propagating a unsigned value through signed variables and functions makes
absolutely no sense and is just prone to (re)introduce subtle signed
vs. unsigned issues as happened recently.

Clean it up.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Parit Bhargava <prarit@redhat.com>
Cc: Laurent Vivier <lvivier@redhat.com>
Cc: "Christopher S. Hall" <christopher.s.hall@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Liav Rehana <liavr@mellanox.com>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20161208204228.765843099@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-09 12:06:41 +01:00
Thomas Gleixner
9c1645727b timekeeping_Force_unsigned_clocksource_to_nanoseconds_conversion
The clocksource delta to nanoseconds conversion is using signed math, but
the delta is unsigned. This makes the conversion space smaller than
necessary and in case of a multiplication overflow the conversion can
become negative. The conversion is done with scaled math:

    s64 nsec_delta = ((s64)clkdelta * clk->mult) >> clk->shift;

Shifting a signed integer right obvioulsy preserves the sign, which has
interesting consequences:
 
 - Time jumps backwards
 
 - __iter_div_u64_rem() which is used in one of the calling code pathes
   will take forever to piecewise calculate the seconds/nanoseconds part.

This has been reported by several people with different scenarios:

David observed that when stopping a VM with a debugger:

 "It was essentially the stopped by debugger case.  I forget exactly why,
  but the guest was being explicitly stopped from outside, it wasn't just
  scheduling lag.  I think it was something in the vicinity of 10 minutes
  stopped."

 When lifting the stop the machine went dead.

The stopped by debugger case is not really interesting, but nevertheless it
would be a good thing not to die completely.

But this was also observed on a live system by Liav:

 "When the OS is too overloaded, delta will get a high enough value for the
  msb of the sum delta * tkr->mult + tkr->xtime_nsec to be set, and so
  after the shift the nsec variable will gain a value similar to
  0xffffffffff000000."

Unfortunately this has been reintroduced recently with commit 6bd58f09e1
("time: Add cycles to nanoseconds translation"). It had been fixed a year
ago already in commit 35a4933a89 ("time: Avoid signed overflow in
timekeeping_get_ns()").

Though it's not surprising that the issue has been reintroduced because the
function itself and the whole call chain uses s64 for the result and the
propagation of it. The change in this recent commit is subtle:

   s64 nsec;

-  nsec = (d * m + n) >> s:
+  nsec = d * m + n;
+  nsec >>= s;

d being type of cycle_t adds another level of obfuscation.

This wouldn't have happened if the previous change to unsigned computation
would have made the 'nsec' variable u64 right away and a follow up patch
had cleaned up the whole call chain.

There have been patches submitted which basically did a revert of the above
patch leaving everything else unchanged as signed. Back to square one. This
spawned a admittedly pointless discussion about potential users which rely
on the unsigned behaviour until someone pointed out that it had been fixed
before. The changelogs of said patches added further confusion as they made
finally false claims about the consequences for eventual users which expect
signed results.

Despite delta being cycle_t, aka. u64, it's very well possible to hand in
a signed negative value and the signed computation will happily return the
correct result. But nobody actually sat down and analyzed the code which
was added as user after the propably unintended signed conversion.

Though in sensitive code like this it's better to analyze it proper and
make sure that nothing relies on this than hunting the subtle wreckage half
a year later. After analyzing all call chains it stands that no caller can
hand in a negative value (which actually would work due to the s64 cast)
and rely on the signed math to do the right thing.

Change the conversion function to unsigned math. The conversion of all call
chains is done in a follow up patch.

This solves the starvation issue, which was caused by the negative result,
but it does not solve the underlying problem. It merily procrastinates
it. When the timekeeper update is deferred long enough that the unsigned
multiplication overflows, then time going backwards is observable again.

It does neither solve the issue of clocksources with a small counter width
which will wrap around possibly several times and cause random time stamps
to be generated. But those are usually not found on systems used for
virtualization, so this is likely a non issue.

I took the liberty to claim authorship for this simply because
analyzing all callsites and writing the changelog took substantially
more time than just making the simple s/s64/u64/ change and ignore the
rest.

Fixes: 6bd58f09e1 ("time: Add cycles to nanoseconds translation")
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Reported-by: Liav Rehana <liavr@mellanox.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Parit Bhargava <prarit@redhat.com>
Cc: Laurent Vivier <lvivier@redhat.com>
Cc: "Christopher S. Hall" <christopher.s.hall@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20161208204228.688545601@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-09 12:06:41 +01:00
Martin KaFai Lau
17bedab272 bpf: xdp: Allow head adjustment in XDP prog
This patch allows XDP prog to extend/remove the packet
data at the head (like adding or removing header).  It is
done by adding a new XDP helper bpf_xdp_adjust_head().

It also renames bpf_helper_changes_skb_data() to
bpf_helper_changes_pkt_data() to better reflect
that XDP prog does not work on skb.

This patch adds one "xdp_adjust_head" bit to bpf_prog for the
XDP-capable driver to check if the XDP prog requires
bpf_xdp_adjust_head() support.  The driver can then decide
to error out during XDP_SETUP_PROG.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 14:25:13 -05:00
Alexei Starovoitov
d2a4dd37f6 bpf: fix state equivalence
Commmits 57a09bf0a4 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
and 484611357c ("bpf: allow access into map value arrays") by themselves
are correct, but in combination they make state equivalence ignore 'id' field
of the register state which can lead to accepting invalid program.

Fixes: 57a09bf0a4 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
Fixes: 484611357c ("bpf: allow access into map value arrays")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 13:31:11 -05:00
Oleg Nesterov
8fb9dcbdc3 kthread: Don't abuse kthread_create_on_cpu() in __kthread_create_worker()
kthread_create_on_cpu() sets KTHREAD_IS_PER_CPU and kthread->cpu, this
only makes sense if this kthread can be parked/unparked by cpuhp code.
kthread workers never call kthread_parkme() so this has no effect.

Change __kthread_create_worker() to simply call kthread_bind(task, cpu).
The very fact that kthread_create_on_cpu() doesn't accept a generic fmt
shows that it should not be used outside of smpboot.c.

Now, the only reason we can not unexport this helper and move it into
smpboot.c is that it sets kthread->cpu and struct kthread is not exported.
And the only reason we can not kill kthread->cpu is that kthread_unpark()
is used by drivers/gpu/drm/amd/scheduler/gpu_scheduler.c and thus we can
not turn _unpark into kthread_unpark(struct smp_hotplug_thread *, cpu).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Chunming Zhou <David1.Zhou@amd.com>
Cc: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20161129175110.GA5342@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-08 14:36:20 +01:00
Oleg Nesterov
cf380a4a96 kthread: Don't use to_live_kthread() in kthread_[un]park()
Now that to_kthread() is always validm change kthread_park() and
kthread_unpark() to use it and kill to_live_kthread().

The conversion of kthread_unpark() is trivial. If KTHREAD_IS_PARKED is set
then the task has called complete(&self->parked) and there the function
cannot race against a concurrent kthread_stop() and exit.

kthread_park() is more tricky, because its semantics are not well
defined. It returns -ENOSYS if the thread exited but this can never happen
and as Roman pointed out kthread_park() can obviously block forever if it
would race with the exiting kthread.

The usage of kthread_park() in cpuhp code (cpu.c, smpboot.c, stop_machine.c)
is fine. It can never see an exiting/exited kthread, smpboot_destroy_threads()
clears *ht->store, smpboot_park_thread() checks it is not NULL under the same
smpboot_threads_lock. cpuhp_threads and cpu_stop_threads never exit, so other
callers are fine too.

But it has two more users:

- watchdog_park_threads():

  The code is actually correct, get_online_cpus() ensures that
  kthread_park() can't race with itself (note that kthread_park() can't
  handle this race correctly), but it should not use kthread_park()
  directly.

- drivers/gpu/drm/amd/scheduler/gpu_scheduler.c should not use
  kthread_park() either.

  kthread_park() must not be called after amd_sched_fini() which does
  kthread_stop(), otherwise even to_live_kthread() is not safe because
  task_struct can be already freed and sched->thread can point to nowhere.

The usage of kthread_park/unpark should either be restricted to core code
which is properly protected against the exit race or made more robust so it
is safe to use it in drivers.

To catch eventual exit issues, add a WARN_ON(PF_EXITING) for now.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chunming Zhou <David1.Zhou@amd.com>
Cc: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20161129175107.GA5339@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-08 14:36:19 +01:00
Oleg Nesterov
efb29fbfa5 kthread: Don't use to_live_kthread() in kthread_stop()
kthread_stop() had to use to_live_kthread() simply because it was not
possible to access kthread->exited after the exiting task clears
task_struct->vfork_done. Now that to_kthread() is always valid,
wake_up_process() + wait_for_completion() can be done
ununconditionally. It's not an issue anymore if the task has already issued
complete_vfork_done() or died.

The exiting task can get the spurious wakeup after mm_release() but this is
possible without this change too and is fine; do_task_dead() ensures that
this can't make any harm.

As a further enhancement this could be converted to task_work_add() later,
so ->vfork_done can be avoided completely.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chunming Zhou <David1.Zhou@amd.com>
Cc: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20161129175103.GA5336@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-08 14:36:19 +01:00
Oleg Nesterov
eff9662547 Revert "kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function"
This reverts commit 23196f2e5f.

Now that struct kthread is kmalloc'ed and not longer on the task stack
there is no need anymore to pin the stack.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chunming Zhou <David1.Zhou@amd.com>
Cc: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20161129175100.GA5333@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-08 14:36:18 +01:00
Oleg Nesterov
1da5c46fa9 kthread: Make struct kthread kmalloc'ed
commit 23196f2e5f "kthread: Pin the stack via try_get_task_stack() /
put_task_stack() in to_live_kthread() function" is a workaround for the
fragile design of struct kthread being allocated on the task stack.

struct kthread in its current form should be removed, but this needs
cleanups outside of kthread.c.

As a first step move struct kthread away from the task stack by making it
kmalloc'ed. This allows to access kthread.exited without the magic of
trying to pin task stack and the try logic in to_live_kthread().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chunming Zhou <David1.Zhou@amd.com>
Cc: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20161129175057.GA5330@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-08 14:36:18 +01:00
Michal Hocko
777c6e0dae hotplug: Make register and unregister notifier API symmetric
Yu Zhao has noticed that __unregister_cpu_notifier only unregisters its
notifiers when HOTPLUG_CPU=y while the registration might succeed even
when HOTPLUG_CPU=n if MODULE is enabled. This means that e.g. zswap
might keep a stale notifier on the list on the manual clean up during
the pool tear down and thus corrupt the list. Resulting in the following

[  144.964346] BUG: unable to handle kernel paging request at ffff880658a2be78
[  144.971337] IP: [<ffffffffa290b00b>] raw_notifier_chain_register+0x1b/0x40
<snipped>
[  145.122628] Call Trace:
[  145.125086]  [<ffffffffa28e5cf8>] __register_cpu_notifier+0x18/0x20
[  145.131350]  [<ffffffffa2a5dd73>] zswap_pool_create+0x273/0x400
[  145.137268]  [<ffffffffa2a5e0fc>] __zswap_param_set+0x1fc/0x300
[  145.143188]  [<ffffffffa2944c1d>] ? trace_hardirqs_on+0xd/0x10
[  145.149018]  [<ffffffffa2908798>] ? kernel_param_lock+0x28/0x30
[  145.154940]  [<ffffffffa2a3e8cf>] ? __might_fault+0x4f/0xa0
[  145.160511]  [<ffffffffa2a5e237>] zswap_compressor_param_set+0x17/0x20
[  145.167035]  [<ffffffffa2908d3c>] param_attr_store+0x5c/0xb0
[  145.172694]  [<ffffffffa290848d>] module_attr_store+0x1d/0x30
[  145.178443]  [<ffffffffa2b2b41f>] sysfs_kf_write+0x4f/0x70
[  145.183925]  [<ffffffffa2b2a5b9>] kernfs_fop_write+0x149/0x180
[  145.189761]  [<ffffffffa2a99248>] __vfs_write+0x18/0x40
[  145.194982]  [<ffffffffa2a9a412>] vfs_write+0xb2/0x1a0
[  145.200122]  [<ffffffffa2a9a732>] SyS_write+0x52/0xa0
[  145.205177]  [<ffffffffa2ff4d97>] entry_SYSCALL_64_fastpath+0x12/0x17

This can be even triggered manually by changing
/sys/module/zswap/parameters/compressor multiple times.

Fix this issue by making unregister APIs symmetric to the register so
there are no surprises.

Fixes: 47e627bc8c ("[PATCH] hotplug: Allow modules to use the cpu hotplug notifiers even if !CONFIG_HOTPLUG_CPU")
Reported-and-tested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Link: http://lkml.kernel.org/r/20161207135438.4310-1-mhocko@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-08 10:08:41 +01:00
Kefeng Wang
166ad0e1e2 kcov: add missing #include <linux/sched.h>
In __sanitizer_cov_trace_pc we use task_struct and fields within it, but
as we haven't included <linux/sched.h>, it is not guaranteed to be
defined.  While we usually happen to acquire the definition through a
transitive include, this is fragile (and hasn't been true in the past,
causing issues with backports).

Include <linux/sched.h> to avoid any fragility.

[mark.rutland@arm.com: rewrote changelog]
Link: http://lkml.kernel.org/r/1481007384-27529-1-git-send-email-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-07 17:10:00 -08:00
Linus Torvalds
68f5503bdc Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
 "An autogroup nice level adjustment bug fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/autogroup: Fix 64-bit kernel nice level adjustment
2016-12-07 11:35:55 -08:00
Linus Torvalds
bf7f1c7e2f Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "A bogus warning fix, a counter width handling fix affecting certain
  machines, plus a oneliner hw-enablement patch for Knights Mill CPUs"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Remove invalid warning from list_update_cgroup_even()t
  perf/x86: Fix full width counter, counter overflow
  perf/x86/intel: Enable C-state residency events for Knights Mill
2016-12-07 11:32:19 -08:00
Linus Torvalds
5b43f97f3f Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
 "Two rtmutex race fixes (which miraculously never triggered, that we
  know of), plus two lockdep printk formatting regression fixes"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Fix report formatting
  locking/rtmutex: Use READ_ONCE() in rt_mutex_owner()
  locking/rtmutex: Prevent dequeue vs. unlock race
  locking/selftest: Fix output since KERN_CONT changes
2016-12-07 11:27:33 -08:00
Daniel Borkmann
ef0915cacd bpf: fix loading of BPF_MAXINSNS sized programs
General assumption is that single program can hold up to BPF_MAXINSNS,
that is, 4096 number of instructions. It is the case with cBPF and
that limit was carried over to eBPF. When recently testing digest, I
noticed that it's actually not possible to feed 4096 instructions
via bpf(2).

The check for > BPF_MAXINSNS was added back then to bpf_check() in
cbd3570086 ("bpf: verifier (add ability to receive verification log)").
However, 09756af468 ("bpf: expand BPF syscall with program load/unload")
added yet another check that comes before that into bpf_prog_load(),
but this time bails out already in case of >= BPF_MAXINSNS.

Fix it up and perform the check early in bpf_prog_load(), so we can drop
the second one in bpf_check(). It makes sense, because also a 0 insn
program is useless and we don't want to waste any resources doing work
up to bpf_check() point. The existing bpf(2) man page documents E2BIG
as the official error for such cases, so just stick with it as well.

Fixes: 09756af468 ("bpf: expand BPF syscall with program load/unload")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-07 13:16:04 -05:00
Murali Karicheri
5304121ada clocksource: export the clocks_calc_mult_shift to use by timestamp code
The CPSW CPTS driver is capable of doing timestamping on tx/rx packets and
requires to know mult and shift factors for timestamp conversion from raw
value to nanoseconds (ptp clock). Now these mult and shift factors are
calculated manually and provided through DT, which makes very hard to
support of a lot number of platforms, especially if CPTS refclk is not the
same for some kind of boards and depends on efuse settings (Keystone 2
platforms). Hence, export clocks_calc_mult_shift() to allow drivers like
CPSW CPTS (and other ptp drivesr) to benefit from automaitc calculation of
mult and shift factors.

Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-07 11:13:48 -05:00
Sebastian Andrzej Siewior
b18cc3de00 tracing/rb: Init the CPU mask on allocation
Before commit b32614c034 ("tracing/rb: Convert to hotplug state machine")
the allocated cpumask was initialized to the mask of online or possible
CPUs. After the CPU hotplug changes the buffer initialization moved to
trace_rb_cpu_prepare() but the cpumask is allocated with alloc_cpumask()
and therefor has random content. As a consequence the cpu buffers are not
initialized and a later access dereferences a NULL pointer.

Use zalloc_cpumask() instead so trace_rb_cpu_prepare() initializes the
buffers properly.

Fixes: b32614c034 ("tracing/rb: Convert to hotplug state machine")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20161207133133.hzkcqfllxcdi3joz@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-07 14:36:21 +01:00
Dmitry Vyukov
f943fe0faf lockdep: Fix report formatting
Since commit:

  4bcc595ccd ("printk: reinstate KERN_CONT for printing continuation lines")

printk() requires KERN_CONT to continue log messages. Lots of printk()
in lockdep.c and print_ip_sym() don't have it. As the result lockdep
reports are completely messed up.

Add missing KERN_CONT and inline print_ip_sym() where necessary.

Example of a messed up report:

  0-rc5+ #41 Not tainted
  -------------------------------------------------------
  syz-executor0/5036 is trying to acquire lock:
   (
  rtnl_mutex
  ){+.+.+.}
  , at:
  [<ffffffff86b3d6ac>] rtnl_lock+0x1c/0x20
  but task is already holding lock:
   (
  &net->packet.sklist_lock
  ){+.+...}
  , at:
  [<ffffffff873541a6>] packet_diag_dump+0x1a6/0x1920
  which lock already depends on the new lock.
  the existing dependency chain (in reverse order) is:
  -> #3
   (
  &net->packet.sklist_lock
  +.+...}
  ...

Without this patch all scripts that parse kernel bug reports are broken.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: andreyknvl@google.com
Cc: aryabinin@virtuozzo.com
Cc: joe@perches.com
Cc: syzkaller@googlegroups.com
Link: http://lkml.kernel.org/r/1480343083-48731-1-git-send-email-dvyukov@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-06 10:40:08 +01:00
David Carrillo-Cisneros
8fc31ce889 perf/core: Remove invalid warning from list_update_cgroup_even()t
The warning introduced in commit:

  864c2357ca ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups")

assumed that a cgroup switch always precedes list_del_event. This is
not the case. Remove warning.

Make sure that cpuctx->cgrp is NULL until a cgroup event is sched in
or ctx->nr_cgroups == 0.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi V Shankar <ravi.v.shankar@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1480841177-27299-1-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-06 09:44:29 +01:00
Al Viro
8bd107633b audit_log_{name,link_denied}: constify struct path
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 19:00:38 -05:00
Al Viro
3cd5eca8d7 fsnotify: constify 'data' passed to ->handle_event()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 18:58:31 -05:00
Daniel Borkmann
7bd509e311 bpf: add prog_digest and expose it via fdinfo/netlink
When loading a BPF program via bpf(2), calculate the digest over
the program's instruction stream and store it in struct bpf_prog's
digest member. This is done at a point in time before any instructions
are rewritten by the verifier. Any unstable map file descriptor
number part of the imm field will be zeroed for the hash.

fdinfo example output for progs:

  # cat /proc/1590/fdinfo/5
  pos:          0
  flags:        02000002
  mnt_id:       11
  prog_type:    1
  prog_jited:   1
  prog_digest:  b27e8b06da22707513aa97363dfb11c7c3675d28
  memlock:      4096

When programs are pinned and retrieved by an ELF loader, the loader
can check the program's digest through fdinfo and compare it against
one that was generated over the ELF file's program section to see
if the program needs to be reloaded. Furthermore, this can also be
exposed through other means such as netlink in case of a tc cls/act
dump (or xdp in future), but also through tracepoints or other
facilities to identify the program. Other than that, the digest can
also serve as a base name for the work in progress kallsyms support
of programs. The digest doesn't depend/select the crypto layer, since
we need to keep dependencies to a minimum. iproute2 will get support
for this facility.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 15:33:11 -05:00
Al Viro
cbbd26b8b1 [iov_iter] new primitives - copy_from_iter_full() and friends
copy_from_iter_full(), copy_from_iter_full_nocache() and
csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
et.al., advancing iterator only in case of successful full copy
and returning whether it had been successful or not.

Convert some obvious users.  *NOTE* - do not blindly assume that
something is a good candidate for those unless you are sure that
not advancing iov_iter in failure case is the right thing in
this case.  Anything that does short read/short write kind of
stuff (or is in a loop, etc.) is unlikely to be a good one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 14:33:36 -05:00
Gianluca Borello
3c839744b3 bpf: Preserve const register type on const OR alu ops
Occasionally, clang (e.g. version 3.8.1) translates a sum between two
constant operands using a BPF_OR instead of a BPF_ADD. The verifier is
currently not handling this scenario, and the destination register type
becomes UNKNOWN_VALUE even if it's still storing a constant. As a result,
the destination register cannot be used as argument to a helper function
expecting a ARG_CONST_STACK_*, limiting some use cases.

Modify the verifier to handle this case, and add a few tests to make sure
all combinations are supported, and stack boundaries are still verified
even with BPF_OR.

Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:40:05 -05:00
Al Viro
450630975d don't open-code file_inode()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-04 18:29:28 -05:00
David S. Miller
2745529ac7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Couple conflicts resolved here:

1) In the MACB driver, a bug fix to properly initialize the
   RX tail pointer properly overlapped with some changes
   to support variable sized rings.

2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
   overlapping with a reorganization of the driver to support
   ACPI, OF, as well as PCI variants of the chip.

3) In 'net' we had several probe error path bug fixes to the
   stmmac driver, meanwhile a lot of this code was cleaned up
   and reorganized in 'net-next'.

4) The cls_flower classifier obtained a helper function in
   'net-next' called __fl_delete() and this overlapped with
   Daniel Borkamann's bug fix to use RCU for object destruction
   in 'net'.  It also overlapped with Jiri's change to guard
   the rhashtable_remove_fast() call with a check against
   tc_skip_sw().

5) In mlx4, a revert bug fix in 'net' overlapped with some
   unrelated changes in 'net-next'.

6) In geneve, a stale header pointer after pskb_expand_head()
   bug fix in 'net' overlapped with a large reorganization of
   the same code in 'net-next'.  Since the 'net-next' code no
   longer had the bug in question, there was nothing to do
   other than to simply take the 'net-next' hunks.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 12:29:53 -05:00
Linus Torvalds
8bca927f13 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Lots more phydev and probe error path leaks in various drivers by
    Johan Hovold.

 2) Fix race in packet_set_ring(), from Philip Pettersson.

 3) Use after free in dccp_invalid_packet(), from Eric Dumazet.

 4) Signnedness overflow in SO_{SND,RCV}BUFFORCE, also from Eric
    Dumazet.

 5) When tunneling between ipv4 and ipv6 we can be left with the wrong
    skb->protocol value as we enter the IPSEC engine and this causes all
    kinds of problems. Set it before the output path does any
    dst_output() calls, from Eli Cooper.

 6) bcmgenet uses wrong device struct pointer in DMA API calls, fix from
    Florian Fainelli.

 7) Various netfilter nat bug fixes from FLorian Westphal.

 8) Fix memory leak in ipvlan_link_new(), from Gao Feng.

 9) Locking fixes, particularly wrt. socket lookups, in l2tp from
    Guillaume Nault.

10) Avoid invoking rhash teardowns in atomic context by moving netlink
    cb->done() dump completion from a worker thread. Fix from Herbert
    Xu.

11) Buffer refcount problems in tun and macvtap on errors, from Jason
    Wang.

12) We don't set Kconfig symbol DEFAULT_TCP_CONG properly when the user
    selects BBR. Fix from Julian Wollrath.

13) Fix deadlock in transmit path on altera TSE driver, from Lino
    Sanfilippo.

14) Fix unbalanced reference counting in dsa_switch_tree, from Nikita
    Yushchenko.

15) tc_tunnel_key needs to be properly exported to userspace via uapi,
    fix from Roi Dayan.

16) rds_tcp_init_net() doesn't unregister notifier in error path, fix
    from Sowmini Varadhan.

17) Stale packet header pointer access after pskb_expand_head() in
    genenve driver, fix from Sabrina Dubroca.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (103 commits)
  net: avoid signed overflows for SO_{SND|RCV}BUFFORCE
  geneve: avoid use-after-free of skb->data
  tipc: check minimum bearer MTU
  net: renesas: ravb: unintialized return value
  sh_eth: remove unchecked interrupts for RZ/A1
  net: bcmgenet: Utilize correct struct device for all DMA operations
  NET: usb: qmi_wwan: add support for Telit LE922A PID 0x1040
  cdc_ether: Fix handling connection notification
  ip6_offload: check segs for NULL in ipv6_gso_segment.
  RDS: TCP: unregister_netdevice_notifier() in error path of rds_tcp_init_net
  Revert: "ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"
  ipv6: Set skb->protocol properly for local output
  ipv4: Set skb->protocol properly for local output
  packet: fix race condition in packet_set_ring
  net: ethernet: altera: TSE: do not use tx queue lock in tx completion handler
  net: ethernet: altera: TSE: Remove unneeded dma sync for tx buffers
  net: ethernet: stmmac: fix of-node and fixed-link-phydev leaks
  net: ethernet: stmmac: platform: fix outdated function header
  net: ethernet: stmmac: dwmac-meson8b: fix probe error path
  net: ethernet: stmmac: dwmac-generic: fix probe error path
  ...
2016-12-02 11:45:27 -08:00
David Ahern
6102365876 bpf: Add new cgroup attach type to enable sock modifications
Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run
any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
Currently only sk_bound_dev_if is exported to userspace for modification
by a bpf program.

This allows a cgroup to be configured such that AF_INET{6} sockets opened
by processes are automatically bound to a specific device. In turn, this
enables the running of programs that do not support SO_BINDTODEVICE in a
specific VRF context / L3 domain.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:46:08 -05:00
David Ahern
b2cd12574a bpf: Refactor cgroups code in prep for new type
Code move and rename only; no functional change intended.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:44:56 -05:00
Thomas Graf
3a0af8fd61 bpf: BPF for lightweight tunnel infrastructure
Registers new BPF program types which correspond to the LWT hooks:
  - BPF_PROG_TYPE_LWT_IN   => dst_input()
  - BPF_PROG_TYPE_LWT_OUT  => dst_output()
  - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()

The separate program types are required to differentiate between the
capabilities each LWT hook allows:

 * Programs attached to dst_input() or dst_output() are restricted and
   may only read the data of an skb. This prevent modification and
   possible invalidation of already validated packet headers on receive
   and the construction of illegal headers while the IP headers are
   still being assembled.

 * Programs attached to lwtunnel_xmit() are allowed to modify packet
   content as well as prepending an L2 header via a newly introduced
   helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is
   invoked after the IP header has been assembled completely.

All BPF programs receive an skb with L3 headers attached and may return
one of the following error codes:

 BPF_OK - Continue routing as per nexthop
 BPF_DROP - Drop skb and return EPERM
 BPF_REDIRECT - Redirect skb to device as per redirect() helper.
                (Only valid in lwtunnel_xmit() context)

The return codes are binary compatible with their TC_ACT_
relatives to ease compatibility.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 10:51:49 -05:00
Thomas Gleixner
84d82ec5b9 locking/rtmutex: Explain locking rules for rt_mutex_proxy_unlock()/init_proxy_locked()
While debugging the unlock vs. dequeue race which resulted in state
corruption of futexes the lockless nature of rt_mutex_proxy_unlock()
caused some confusion.

Add commentry to explain why it is safe to do this lockless. Add matching
comments to rt_mutex_init_proxy_locked() for completeness sake.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Daney <ddaney@caviumnetworks.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20161130210030.591941927@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-02 11:13:57 +01:00
Thomas Gleixner
b5016e8203 locking/rtmutex: Get rid of RT_MUTEX_OWNER_MASKALL
This is a left over from the original rtmutex implementation which used
both bit0 and bit1 in the owner pointer. Commit:

  8161239a8b ("rtmutex: Simplify PI algorithm and make highest prio task get lock")

... removed the usage of bit1, but kept the extra mask around. This is
confusing at best.

Remove it and just use RT_MUTEX_HAS_WAITERS for the masking.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Daney <ddaney@caviumnetworks.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20161130210030.509567906@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-02 11:13:57 +01:00
Ingo Molnar
1b95b1a06c Merge branch 'locking/urgent' into locking/core, to pick up dependent fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-02 11:13:44 +01:00
Thomas Gleixner
1be5d4fa0a locking/rtmutex: Use READ_ONCE() in rt_mutex_owner()
While debugging the rtmutex unlock vs. dequeue race Will suggested to use
READ_ONCE() in rt_mutex_owner() as it might race against the
cmpxchg_release() in unlock_rt_mutex_safe().

Will: "It's a minor thing which will most likely not matter in practice"

Careful search did not unearth an actual problem in todays code, but it's
better to be safe than surprised.

Suggested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Daney <ddaney@caviumnetworks.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20161130210030.431379999@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-02 11:13:26 +01:00
Thomas Gleixner
dbb26055de locking/rtmutex: Prevent dequeue vs. unlock race
David reported a futex/rtmutex state corruption. It's caused by the
following problem:

CPU0		CPU1		CPU2

l->owner=T1
		rt_mutex_lock(l)
		lock(l->wait_lock)
		l->owner = T1 | HAS_WAITERS;
		enqueue(T2)
		boost()
		  unlock(l->wait_lock)
		schedule()

				rt_mutex_lock(l)
				lock(l->wait_lock)
				l->owner = T1 | HAS_WAITERS;
				enqueue(T3)
				boost()
				  unlock(l->wait_lock)
				schedule()
		signal(->T2)	signal(->T3)
		lock(l->wait_lock)
		dequeue(T2)
		deboost()
		  unlock(l->wait_lock)
				lock(l->wait_lock)
				dequeue(T3)
				  ===> wait list is now empty
				deboost()
				 unlock(l->wait_lock)
		lock(l->wait_lock)
		fixup_rt_mutex_waiters()
		  if (wait_list_empty(l)) {
		    owner = l->owner & ~HAS_WAITERS;
		    l->owner = owner
		     ==> l->owner = T1
		  }

				lock(l->wait_lock)
rt_mutex_unlock(l)		fixup_rt_mutex_waiters()
				  if (wait_list_empty(l)) {
				    owner = l->owner & ~HAS_WAITERS;
cmpxchg(l->owner, T1, NULL)
 ===> Success (l->owner = NULL)
				    l->owner = owner
				     ==> l->owner = T1
				  }

That means the problem is caused by fixup_rt_mutex_waiters() which does the
RMW to clear the waiters bit unconditionally when there are no waiters in
the rtmutexes rbtree.

This can be fatal: A concurrent unlock can release the rtmutex in the
fastpath because the waiters bit is not set. If the cmpxchg() gets in the
middle of the RMW operation then the previous owner, which just unlocked
the rtmutex is set as the owner again when the write takes place after the
successfull cmpxchg().

The solution is rather trivial: verify that the owner member of the rtmutex
has the waiters bit set before clearing it. This does not require a
cmpxchg() or other atomic operations because the waiters bit can only be
set and cleared with the rtmutex wait_lock held. It's also safe against the
fast path unlock attempt. The unlock attempt via cmpxchg() will either see
the bit set and take the slowpath or see the bit cleared and release it
atomically in the fastpath.

It's remarkable that the test program provided by David triggers on ARM64
and MIPS64 really quick, but it refuses to reproduce on x86-64, while the
problem exists there as well. That refusal might explain that this got not
discovered earlier despite the bug existing from day one of the rtmutex
implementation more than 10 years ago.

Thanks to David for meticulously instrumenting the code and providing the
information which allowed to decode this subtle problem.

Reported-by: David Daney <ddaney@caviumnetworks.com>
Tested-by: David Daney <david.daney@cavium.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: stable@vger.kernel.org
Fixes: 23f78d4a03 ("[PATCH] pi-futex: rt mutex core")
Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-02 11:13:26 +01:00
Sebastian Andrzej Siewior
b32614c034 tracing/rb: Convert to hotplug state machine
Install the callbacks via the state machine. The notifier in struct
ring_buffer is replaced by the multi instance interface.  Upon
__ring_buffer_alloc() invocation, cpuhp_state_add_instance() will invoke
the trace_rb_cpu_prepare() on each CPU.

This callback may now fail. This means __ring_buffer_alloc() will fail and
cleanup (like previously) and during a CPU up event this failure will not
allow the CPU to come up.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161126231350.10321-7-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-02 00:52:34 +01:00
WANG Cong
6060298272 audit: remove useless synchronize_net()
netlink kernel socket is protected by refcount, not RCU.
Its rcv path is neither protected by RCU. So the synchronize_net()
is just pointless.

Cc: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-01 11:29:02 -05:00
Baolin Wang
4a057549d6 alarmtimer: Add tracepoints for alarm timers
Alarm timers are one of the mechanisms to wake up a system from suspend,
but there exist no tracepoints to analyse which process/thread armed an
alarmtimer.

Add tracepoints for start/cancel/expire of individual alarm timers and one
for tracing the suspend time decision when to resume the system.

The following trace excerpt illustrates the new mechanism:

Binder:3292_2-3304  [000] d..2   149.981123: alarmtimer_cancel:
alarmtimer:ffffffc1319a7800 type:REALTIME
expires:1325463120000000000 now:1325376810370370245

Binder:3292_2-3304  [000] d..2   149.981136: alarmtimer_start:
alarmtimer:ffffffc1319a7800 type:REALTIME
expires:1325376840000000000 now:1325376810370384591

Binder:3292_9-3953  [000] d..2   150.212991: alarmtimer_cancel:
alarmtimer:ffffffc1319a5a00 type:BOOTTIME
expires:179552000000 now:150154008122

Binder:3292_9-3953  [000] d..2   150.213006: alarmtimer_start:
alarmtimer:ffffffc1319a5a00 type:BOOTTIME
expires:179551000000 now:150154025622

system_server-3000  [002] ...1  162.701940: alarmtimer_suspend:
alarmtimer type:REALTIME expires:1325376840000000000

The wakeup time which is selected at suspend time allows to map it back to
the task arming the timer: Binder:3292_2.

[ tglx: Store alarm timer expiry time instead of some useless RTC relative
  	information, add proper type information for wakeups which are
  	handled via the clock_nanosleep/freezer and massage the changelog. ]

Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Link: http://lkml.kernel.org/r/1480372524-15181-5-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-01 14:45:08 +01:00
Rafael J. Wysocki
4e28ec3d5f Merge back earlier cpuidle material for v4.10. 2016-12-01 14:39:51 +01:00
Josef Bacik
e2d2afe15e bpf: fix states equal logic for varlen access
If we have a branch that looks something like this

int foo = map->value;
if (condition) {
  foo += blah;
} else {
  foo = bar;
}
map->array[foo] = baz;

We will incorrectly assume that the !condition branch is equal to the condition
branch as the register for foo will be UNKNOWN_VALUE in both cases.  We need to
adjust this logic to only do this if we didn't do a varlen access after we
processed the !condition branch, otherwise we have different ranges and need to
check the other branch as well.

Fixes: 484611357c ("bpf: allow access into map value arrays")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 14:50:52 -05:00
Thiago Jung Bauermann
e2e806f9e4 kexec_file: Factor out kexec_locate_mem_hole from kexec_add_buffer.
kexec_locate_mem_hole will be used by the PowerPC kexec_file_load
implementation to find free memory for the purgatory stack.

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-30 23:15:01 +11:00
Thiago Jung Bauermann
ec2b9bfaac kexec_file: Change kexec_add_buffer to take kexec_buf as argument.
This is done to simplify the kexec_add_buffer argument list.
Adapt all callers to set up a kexec_buf to pass to kexec_add_buffer.

In addition, change the type of kexec_buf.buffer from char * to void *.
There is no particular reason for it to be a char *, and the change
allows us to get rid of 3 existing casts to char * in the code.

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-30 23:14:59 +11:00
Thiago Jung Bauermann
60fe3910bb kexec_file: Allow arch-specific memory walking for kexec_add_buffer
Allow architectures to specify a different memory walking function for
kexec_add_buffer. x86 uses iomem to track reserved memory ranges, but
PowerPC uses the memblock subsystem.

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-30 23:14:57 +11:00
Peter Zijlstra
f8319483f5 locking/lockdep: Provide a type check for lock_is_held
Christoph requested lockdep_assert_held() variants that distinguish
between held-for-read or held-for-write.

Provide:

  int lock_is_held_type(struct lockdep_map *lock, int read)

which takes the same argument as lock_acquire(.read) and matches it to
the held_lock instance.

Use of this function should be gated by the debug_locks variable. When
that is 0 the return value of the lock_is_held_type() function is
undefined. This is done to allow both negative and positive tests for
holding locks.

By default we provide (positive) lockdep_assert_held{,_exclusive,_read}()
macros.

Requested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-30 14:32:25 +11:00
Daniel Mack
01ae87eab5 bpf: cgroup: fix documentation of __cgroup_bpf_update()
There's a 'not' missing in one paragraph. Add it.

Fixes: 3007098494 ("cgroup: add support for eBPF programs")
Signed-off-by: Daniel Mack <daniel@zonque.org>
Reported-by: Rami Rosen <roszenrami@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-29 19:50:59 -05:00
Linus Torvalds
faaae2a581 Re-enable CONFIG_MODVERSIONS in a slightly weaker form
This enables CONFIG_MODVERSIONS again, but allows for missing symbol CRC
information in order to work around the issue that newer binutils
versions seem to occasionally drop the CRC on the floor.  binutils 2.26
seems to work fine, while binutils 2.27 seems to break MODVERSIONS of
symbols that have been defined in assembler files.

[ We've had random missing CRC's before - it may be an old problem that
  just is now reliably triggered with the weak asm symbols and a new
  version of binutils ]

Some day I really do want to remove MODVERSIONS entirely.  Sadly, today
does not appear to be that day: Debian people apparently do want the
option to enable MODVERSIONS to make it easier to have external modules
across kernel versions, and this seems to be a fairly minimal fix for
the annoying problem.

Cc: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Michal Marek <mmarek@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-29 16:01:30 -08:00
Richard Guy Briggs
8fae477056 audit: add support for session ID user filter
Define AUDIT_SESSIONID in the uapi and add support for specifying user
filters based on the session ID.  Also add the new session ID filter
to the feature bitmap so userspace knows it is available.

https://github.com/linux-audit/audit-kernel/issues/4
RFE: add a session ID filter to the kernel's user filter

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: combine multiple patches from Richard into this one]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-11-29 15:10:12 -05:00
Joel Fernandes
80ec355210 trace: Add an option for boot clock as trace clock
Unlike monotonic clock, boot clock as a trace clock will account for
time spent in suspend useful for tracing suspend/resume. This uses
earlier introduced infrastructure for using the fast boot clock.

Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Link: http://lkml.kernel.org/r/1480372524-15181-7-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-29 18:02:59 +01:00
Joel Fernandes
948a5312f4 timekeeping: Add a fast and NMI safe boot clock
This boot clock can be used as a tracing clock and will account for
suspend time.

To keep it NMI safe since we're accessing from tracing, we're not using a
separate timekeeper with updates to monotonic clock and boot offset
protected with seqlocks. This has the following minor side effects:

(1) Its possible that a timestamp be taken after the boot offset is updated
but before the timekeeper is updated. If this happens, the new boot offset
is added to the old timekeeping making the clock appear to update slightly
earlier:
   CPU 0                                        CPU 1
   timekeeping_inject_sleeptime64()
   __timekeeping_inject_sleeptime(tk, delta);
                                                timestamp();
   timekeeping_update(tk, TK_CLEAR_NTP...);

(2) On 32-bit systems, the 64-bit boot offset (tk->offs_boot) may be
partially updated.  Since the tk->offs_boot update is a rare event, this
should be a rare occurrence which postprocessing should be able to handle.

Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1480372524-15181-6-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-29 18:02:59 +01:00
Peter Zijlstra
c1de45ca83 sched/idle: Add support for tasks that inject idle
Idle injection drivers such as Intel powerclamp and ACPI PAD drivers use
realtime tasks to take control of CPU then inject idle. There are two
issues with this approach:

 1. Low efficiency: injected idle task is treated as busy so sched ticks
    do not stop during injected idle period, the result of these
    unwanted wakeups can be ~20% loss in power savings.

 2. Idle accounting: injected idle time is presented to user as busy.

This patch addresses the issues by introducing a new PF_IDLE flag which
allows any given task to be treated as idle task while the flag is set.
Therefore, idle injection tasks can run through the normal flow of NOHZ
idle enter/exit to get the correct accounting as well as tick stop when
possible.

The implication is that idle task is then no longer limited to PID == 0.

Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-11-29 14:02:21 +01:00
Jacob Pan
bb8313b603 cpuidle: Allow enforcing deepest idle state selection
When idle injection is used to cap power, we need to override the
governor's choice of idle states.

For this reason, make it possible the deepest idle state selection to
be enforced by setting a flag on a given CPU to achieve the maximum
potential power draw reduction.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-11-29 14:02:21 +01:00
Daniel Borkmann
a3af5f8001 bpf: allow for mount options to specify permissions
Since we recently converted the BPF filesystem over to use mount_nodev(),
we now have the possibility to also hold mount options in sb's s_fs_info.
This work implements mount options support for specifying permissions on
the sb's inode, which will be used by tc when it manually needs to mount
the fs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 20:38:47 -05:00
Daniel Borkmann
21116b7068 bpf: add owner_prog_type and accounted mem to array map's fdinfo
Allow for checking the owner_prog_type of a program array map. In some
cases bpf(2) can return -EINVAL /after/ the verifier passed and did all
the rewrites of the bpf program.

The reason that lets us fail at this late stage is that program array
maps are incompatible. Allow users to inspect this earlier after they
got the map fd through BPF_OBJ_GET command. tc will get support for this.

Also, display how much we charged the map with regards to RLIMIT_MEMLOCK.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 20:38:47 -05:00
Daniel Borkmann
88575199cc bpf: drop unnecessary context cast from BPF_PROG_RUN
Since long already bpf_func is not only about struct sk_buff * as
input anymore. Make it generic as void *, so that callers don't
need to cast for it each time they call BPF_PROG_RUN().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 20:38:47 -05:00
AKASHI Takahiro
39290b389e module: extend 'rodata=off' boot cmdline parameter to module mappings
The current "rodata=off" parameter disables read-only kernel mappings
under CONFIG_DEBUG_RODATA:
    commit d2aa1acad2 ("mm/init: Add 'rodata=off' boot cmdline parameter
    to disable read-only kernel mappings")

This patch is a logical extension to module mappings ie. read-only mappings
at module loading can be disabled even if CONFIG_DEBUG_SET_MODULE_RONX
(mainly for debug use). Please note, however, that it only affects RO/RW
permissions, keeping NX set.

This is the first step to make CONFIG_DEBUG_SET_MODULE_RONX mandatory
(always-on) in the future as CONFIG_DEBUG_RODATA on x86 and arm64.

Suggested-by: and Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Link: http://lkml.kernel.org/r/20161114061505.15238-1-takahiro.akashi@linaro.org
Signed-off-by: Jessica Yu <jeyu@redhat.com>
2016-11-27 16:15:33 -08:00
David S. Miller
0b42f25d2f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
udplite conflict is resolved by taking what 'net-next' did
which removed the backlog receive method assignment, since
it is no longer necessary.

Two entries were added to the non-priv ethtool operations
switch statement, one in 'net' and one in 'net-next, so
simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-26 23:42:21 -05:00
Miroslav Benes
71d9f50793 module: Fix a comment above strong_try_module_get()
The comment above strong_try_module_get() function is not true anymore.
Return values changed with commit c9a3ba55bb ("module: wait for
dependent modules doing init.").

Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1611161635330.12580@pobox.suse.cz
[jeyu@redhat.com: style fixes to make checkpatch.pl happy]
Signed-off-by: Jessica Yu <jeyu@redhat.com>
2016-11-26 11:18:03 -08:00
Aaron Tomlin
905dd707fc module: When modifying a module's text ignore modules which are going away too
By default, during the access permission modification of a module's core
and init pages, we only ignore modules that are malformed. Albeit for a
module which is going away, it does not make sense to change its text to
RO since the module should be RW, before deallocation.

This patch makes set_all_modules_text_ro() skip modules which are going
away too.

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Link: http://lkml.kernel.org/r/1477560966-781-1-git-send-email-atomlin@redhat.com
[jeyu@redhat.com: add comment as suggested by Steven Rostedt]
Signed-off-by: Jessica Yu <jeyu@redhat.com>
2016-11-26 11:18:03 -08:00
Aaron Tomlin
885a78d4a5 module: Ensure a module's state is set accordingly during module coming cleanup code
In load_module() in the event of an error, for e.g. unknown module
parameter(s) specified we go to perform some module coming clean up
operations. At this point the module is still in a "formed" state
when it is actually going away.

This patch updates the module's state accordingly to ensure anyone on the
module_notify_list waiting for a module going away notification will be
notified accordingly.

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Link: http://lkml.kernel.org/r/1476980293-19062-2-git-send-email-atomlin@redhat.com
Signed-off-by: Jessica Yu <jeyu@redhat.com>
2016-11-26 11:18:02 -08:00
Petr Mladek
7fd8329ba5 taint/module: Clean up global and module taint flags handling
The commit 66cc69e34e ("Fix: module signature vs tracepoints:
add new TAINT_UNSIGNED_MODULE") updated module_taint_flags() to
potentially print one more character. But it did not increase the
size of the corresponding buffers in m_show() and print_modules().

We have recently done the same mistake when adding a taint flag
for livepatching, see
https://lkml.kernel.org/r/cfba2c823bb984690b73572aaae1db596b54a082.1472137475.git.jpoimboe@redhat.com

Also struct module uses an incompatible type for mod-taints flags.
It survived from the commit 2bc2d61a96 ("[PATCH] list module
taint flags in Oops/panic"). There was used "int" for the global taint
flags at these times. But only the global tain flags was later changed
to "unsigned long" by the commit 25ddbb18aa ("Make the taint
flags reliable").

This patch defines TAINT_FLAGS_COUNT that can be used to create
arrays and buffers of the right size. Note that we could not use
enum because the taint flag indexes are used also in assembly code.

Then it reworks the table that describes the taint flags. The TAINT_*
numbers can be used as the index. Instead, we add information
if the taint flag is also shown per-module.

Finally, it uses "unsigned long", bit operations, and the updated
taint_flags table also for mod->taints.

It is not optimal because only few taint flags can be printed by
module_taint_flags(). But better be on the safe side. IMHO, it is
not worth the optimization and this is a good compromise.

Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: http://lkml.kernel.org/r/1474458442-21581-1-git-send-email-pmladek@suse.com
[jeyu@redhat.com: fix broken lkml link in changelog]
Signed-off-by: Jessica Yu <jeyu@redhat.com>
2016-11-26 11:18:01 -08:00
Daniel Mack
f432455148 bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH which allow attaching and detaching eBPF programs
to a target.

On the API level, the target could be anything that has an fd in
userspace, hence the name of the field in union bpf_attr is called
'target_fd'.

When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory which
has the bpf controller enabled. These are the only use-cases
implemented by this patch at this point, but more can be added.

If a program of the given type already exists in the given cgroup,
the program is swapped automically, so userspace does not have to drop
an existing program first before installing a new one, which would
otherwise leave a gap in which no program is attached.

For more information on the propagation logic to subcgroups, please
refer to the bpf cgroup controller implementation.

The API is guarded by CAP_NET_ADMIN.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 16:26:04 -05:00
Daniel Mack
3007098494 cgroup: add support for eBPF programs
This patch adds two sets of eBPF program pointers to struct cgroup.
One for such that are directly pinned to a cgroup, and one for such
that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
        \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 16:25:52 -05:00
Viresh Kumar
d06e622d3d cpufreq: schedutil: Rectify comment in sugov_irq_work() function
This patch rectifies a comment present in sugov_irq_work() function to
follow proper grammar.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-11-24 21:50:59 +01:00
Tim Chen
afe06efdf0 sched: Extend scheduler's asym packing
We generalize the scheduler's asym packing to provide an ordering
of the cpu beyond just the cpu number.  This allows the use of the
ASYM_PACKING scheduler machinery to move loads to preferred CPU in a
sched domain. The preference is defined with the cpu priority
given by arch_asym_cpu_priority(cpu).

We also record the most preferred cpu in a sched group when
we build the cpu's capacity for fast lookup of preferred cpu
during load balancing.

Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-pm@vger.kernel.org
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Cc: linux-acpi@vger.kernel.org
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/0e73ae12737dfaafa46c07066cc7c5d3f1675e46.1479844244.git.tim.c.chen@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-24 14:09:46 +01:00
Mike Galbraith
83929cce95 sched/autogroup: Fix 64-bit kernel nice level adjustment
Michael Kerrisk reported:

> Regarding the previous paragraph...  My tests indicate
> that writing *any* value to the autogroup [nice priority level]
> file causes the task group to get a lower priority.

Because autogroup didn't call the then meaningless scale_load()...

Autogroup nice level adjustment has been broken ever since load
resolution was increased for 64-bit kernels.  Use scale_load() to
scale group weight.

Michael Kerrisk tested this patch to fix the problem:

> Applied and tested against 4.9-rc6 on an Intel u7 (4 cores).
> Test setup:
>
> Terminal window 1: running 40 CPU burner jobs
> Terminal window 2: running 40 CPU burner jobs
> Terminal window 1: running  1 CPU burner job
>
> Demonstrated that:
> * Writing "0" to the autogroup file for TW1 now causes no change
>   to the rate at which the process on the terminal consume CPU.
> * Writing -20 to the autogroup file for TW1 caused those processes
>   to get the lion's share of CPU while TW2 TW3 get a tiny amount.
> * Writing -20 to the autogroup files for TW1 and TW3 allowed the
>   process on TW3 to get as much CPU as it was getting as when
>   the autogroup nice values for both terminals were 0.

Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-man <linux-man@vger.kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1479897217.4306.6.camel@gmx.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-24 05:45:02 +01:00
Steven Rostedt (Red Hat)
38e11df134 ring-buffer: Force rb_end_commit() and rb_set_commit_to_write() inline
Both rb_end_commit() and rb_set_commit_to_write() are in the fast path of
the ring buffer recording. Make sure they are always inlined.

Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org

Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23 20:42:31 -05:00
Steven Rostedt (Red Hat)
babe3fce95 ring-buffer: Froce rb_update_write_stamp() to be inlined
The function rb_update_write_stamp() is in the hotpath of the ring buffer
recording. Make sure that it is inlined as well. There's not many places
that call it.

Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org

Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23 20:38:39 -05:00
Steven Rostedt (Red Hat)
2289d5672f ring-buffer: Force inline of hotpath helper functions
There's several small helper functions in ring_buffer.c that are used in the
hot path. For some reason, even though they are marked inline, gcc tends not
to enforce it. Make sure these functions are always inlined.

Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org

Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23 20:35:32 -05:00
Steven Rostedt (Red Hat)
52ffabe384 tracing: Make __buffer_unlock_commit() always_inline
The function __buffer_unlock_commit() is called in a few places outside of
trace.c. But for the most part, it should really be inlined, as it is in the
hot path of the trace_events. For the callers outside of trace.c, create a
new function trace_buffer_unlock_commit_nostack(), as the reason it was used
was to avoid the stack tracing that trace_buffer_unlock_commit() could do.

Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org

Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23 20:30:51 -05:00
Steven Rostedt (Red Hat)
4239174570 tracing: Make tracepoint_printk a static_key
Currently, when tracepoint_printk is set (enabled by the "tp_printk" kernel
command line), it causes trace events to print via printk(). This is a very
dangerous operation, but is useful for debugging.

The issue is, it's seldom used, but it is always checked even if it's not
enabled by the kernel command line. Instead of having this feature called by
a branch against a variable, turn that variable into a static key, and this
will remove the test and jump.

To simplify things, the functions output_printk() and
trace_event_buffer_commit() were moved from trace_events.c to trace.c.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23 15:52:45 -05:00
Steven Rostedt (Red Hat)
929ddbf3ef ring-buffer: Always inline rb_event_data()
The rb_event_data() is the fast path of getting the ring buffer data from an
event. Externally, ring_buffer_event_data() is used to access this function.
But unfortunately, rb_event_data() is not inlined, and calling
ring_buffer_event_data() causes that function to be called again. Force
rb_event_data() to be inlined to lower the number of operations needed when
calling ring_buffer_event_data().

Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org

Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23 11:40:34 -05:00
Steven Rostedt (Red Hat)
fa7ffb39ef ring-buffer: Make rb_reserve_next_event() always inlined
The function rb_reserved_next_event() is called by two functions:
ring_buffer_lock_reserve() and ring_buffer_write(). This is in a very hot
path of the tracing code, and it is best that they are not functions. The
two callers are basically wrapers for rb_reserver_next_event(). Removing the
function calls can save execution time in the hotpath of tracing.

Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org

Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23 11:36:30 -05:00
Steven Rostedt (Red Hat)
3e9a8aadca tracing: Create a always_inlined __trace_buffer_lock_reserve()
As Andi Kleen pointed out in the Link below, the trace events has quite a
bit of code execution. A lot of that happens to be calling functions, where
some of them should simply be inlined. One of these functions happens to be
trace_buffer_lock_reserve() which is also a global, but it is used
throughout the file it is defined in. Create a __trace_buffer_lock_reserve()
that is always inlined that the file can benefit from.

Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org

Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23 11:29:58 -05:00
Linus Torvalds
ded9b5dd20 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Six fixes for bugs that were found via fuzzing, and a trivial
  hw-enablement patch for AMD Family-17h CPU PMUs"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Allow only a single PMU/box within an events group
  perf/x86/intel: Cure bogus unwind from PEBS entries
  perf/x86: Restore TASK_SIZE check on frame pointer
  perf/core: Fix address filter parser
  perf/x86: Add perf support for AMD family-17h processors
  perf/x86/uncore: Fix crash by removing bogus event_list[] handling for SNB client uncore IMC
  perf/core: Do not set cpuctx->cgrp for unscheduled cgroups
2016-11-23 08:09:21 -08:00
Ingo Molnar
2b4d5b2582 sched/fair: Clean up the tunable parameter definitions
No change in functionality:

 - align the default values vertically to make them easier to scan
 - standardize the 'default:' lines
 - fix minor whitespace typos

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-23 10:38:55 +01:00
T.Zhou
176cedc4ed sched/dl: Fix comment in pick_next_task_dl()
Fix cut & paste oversight:

  s/pull_rt_task/pull_dl_task

Signed-off-by: T.Zhou <t1zhou@163.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@gmail.com
Link: http://lkml.kernel.org/r/20161123004832.GA2983@geo
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-23 10:23:21 +01:00
Ingo Molnar
ec84f00567 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-23 10:23:09 +01:00
Ingo Molnar
af91a81131 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 - Documentation updates, yet again just simple changes.

 - Miscellaneous fixes, including a change to call_rcu()'s
   rcu_head alignment check.

 - Security-motivated list consistency checks, which are
   disabled by default behind DEBUG_LIST.

 - Torture-test updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-23 10:04:28 +01:00
Steven Rostedt (Red Hat)
7d43640022 tracing: Add error checks to creation of event files
The creation of the set_event_pid file was assigned to a variable "entry"
but that variable was never used. Ideally, it should be used to check if the
file was created and warn if it was not.

The files header_page, header_event should also be checked and a warning if
they fail to be created.

The "enable" file was moved up, as it is a more crucial file to have and a
hard failure (return -ENOMEM) should be returned if it is not created.

Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-22 18:32:03 -05:00
Chunyan Zhang
478409dd68 tracing: Add hook to function tracing for other subsystems to use
Currently Function traces can be only exported to the ring buffer. This
adds a trace_export concept which can process traces and export
them to a registered destination as an addition to the current
one that outputs to Ftrace - i.e. ring buffer.

In this way, if we want function traces to be sent to other destinations
rather than only to the ring buffer, we just need to register a new
trace_export and implement its own .write() function for writing traces to
storage.

With this patch, only function tracing (trace type is TRACE_FN)
is supported.

Link: http://lkml.kernel.org/r/1479715043-6534-2-git-send-email-zhang.chunyan@linaro.org

Signed-off-by: Chunyan Zhang <zhang.chunyan@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-22 17:40:00 -05:00
Sebastian Andrzej Siewior
31eff2434d sched/nohz: Convert to hotplug state machine
Install the callbacks via the state machine.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: rt@linuxtronix.de
Link: http://lkml.kernel.org/r/20161117183541.8588-14-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-22 23:34:41 +01:00
Eric W. Biederman
f84df2a6f2 exec: Ensure mm->user_ns contains the execed files
When the user namespace support was merged the need to prevent
ptrace from revealing the contents of an unreadable executable
was overlooked.

Correct this oversight by ensuring that the executed file
or files are in mm->user_ns, by adjusting mm->user_ns.

Use the new function privileged_wrt_inode_uidgid to see if
the executable is a member of the user namespace, and as such
if having CAP_SYS_PTRACE in the user namespace should allow
tracing the executable.  If not update mm->user_ns to
the parent user namespace until an appropriate parent is found.

Cc: stable@vger.kernel.org
Reported-by: Jann Horn <jann@thejh.net>
Fixes: 9e4a36ece6 ("userns: Fail exec for suid and sgid binaries with ids outside our user namespace.")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-11-22 13:21:00 -06:00
Eric W. Biederman
84d77d3f06 ptrace: Don't allow accessing an undumpable mm
It is the reasonable expectation that if an executable file is not
readable there will be no way for a user without special privileges to
read the file.  This is enforced in ptrace_attach but if ptrace
is already attached before exec there is no enforcement for read-only
executables.

As the only way to read such an mm is through access_process_vm
spin a variant called ptrace_access_vm that will fail if the
target process is not being ptraced by the current process, or
the current process did not have sufficient privileges when ptracing
began to read the target processes mm.

In the ptrace implementations replace access_process_vm by
ptrace_access_vm.  There remain several ptrace sites that still use
access_process_vm as they are reading the target executables
instructions (for kernel consumption) or register stacks.  As such it
does not appear necessary to add a permission check to those calls.

This bug has always existed in Linux.

Fixes: v1.0
Cc: stable@vger.kernel.org
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-11-22 12:57:38 -06:00
David S. Miller
f9aa9dc7d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All conflicts were simple overlapping changes except perhaps
for the Thunder driver.

That driver has a change_mtu method explicitly for sending
a message to the hardware.  If that fails it returns an
error.

Normally a driver doesn't need an ndo_change_mtu method becuase those
are usually just range changes, which are now handled generically.
But since this extra operation is needed in the Thunder driver, it has
to stay.

However, if the message send fails we have to restore the original
MTU before the change because the entire call chain expects that if
an error is thrown by ndo_change_mtu then the MTU did not change.
Therefore code is added to nicvf_change_mtu to remember the original
MTU, and to restore it upon nicvf_update_hw_max_frs() failue.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-22 13:27:16 -05:00
Eric W. Biederman
64b875f7ac ptrace: Capture the ptracer's creds not PT_PTRACE_CAP
When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was
overlooked.  This can result in incorrect behavior when an application
like strace traces an exec of a setuid executable.

Further PT_PTRACE_CAP does not have enough information for making good
security decisions as it does not report which user namespace the
capability is in.  This has already allowed one mistake through
insufficient granulariy.

I found this issue when I was testing another corner case of exec and
discovered that I could not get strace to set PT_PTRACE_CAP even when
running strace as root with a full set of caps.

This change fixes the above issue with strace allowing stracing as
root a setuid executable without disabling setuid.  More fundamentaly
this change allows what is allowable at all times, by using the correct
information in it's decision.

Cc: stable@vger.kernel.org
Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-11-22 11:49:49 -06:00
Eric W. Biederman
bfedb58925 mm: Add a user_ns owner to mm_struct and fix ptrace permission checks
During exec dumpable is cleared if the file that is being executed is
not readable by the user executing the file.  A bug in
ptrace_may_access allows reading the file if the executable happens to
enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).

This problem is fixed with only necessary userspace breakage by adding
a user namespace owner to mm_struct, captured at the time of exec, so
it is clear in which user namespace CAP_SYS_PTRACE must be present in
to be able to safely give read permission to the executable.

The function ptrace_may_access is modified to verify that the ptracer
has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
This ensures that if the task changes it's cred into a subordinate
user namespace it does not become ptraceable.

The function ptrace_attach is modified to only set PT_PTRACE_CAP when
CAP_SYS_PTRACE is held over task->mm->user_ns.  The intent of
PT_PTRACE_CAP is to be a flag to note that whatever permission changes
the task might go through the tracer has sufficient permissions for
it not to be an issue.  task->cred->user_ns is always the same
as or descendent of mm->user_ns.  Which guarantees that having
CAP_SYS_PTRACE over mm->user_ns is the worst case for the tasks
credentials.

To prevent regressions mm->dumpable and mm->user_ns are not considered
when a task has no mm.  As simply failing ptrace_may_attach causes
regressions in privileged applications attempting to read things
such as /proc/<pid>/stat

Cc: stable@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
Fixes: 8409cca705 ("userns: allow ptrace from non-init user namespaces")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-11-22 11:49:48 -06:00
Pan Xinhui
05ffc95139 locking/mutex: Break out of expensive busy-loop on {mutex,rwsem}_spin_on_owner() when owner vCPU is preempted
An over-committed guest with more vCPUs than pCPUs has a heavy overload
in the two spin_on_owner. This blames on the lock holder preemption
issue.

Break out of the loop if the vCPU is preempted: if vcpu_is_preempted(cpu)
is true.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

after patch:
 9.99%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 5.28%  sched-messaging  [unknown]         [H] 0xc0000000000768e0
 4.27%  sched-messaging  [kernel.vmlinux]  [k] __copy_tofrom_user_power7
 3.77%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 3.24%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.02%  sched-messaging  [kernel.vmlinux]  [k] system_call
 2.69%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task

Tested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: David.Laight@ACULAB.COM
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: benh@kernel.crashing.org
Cc: boqun.feng@gmail.com
Cc: bsingharora@gmail.com
Cc: dave@stgolabs.net
Cc: kernellwp@gmail.com
Cc: konrad.wilk@oracle.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: mpe@ellerman.id.au
Cc: paulmck@linux.vnet.ibm.com
Cc: paulus@samba.org
Cc: rkrcmar@redhat.com
Cc: virtualization@lists.linux-foundation.org
Cc: will.deacon@arm.com
Cc: xen-devel-request@lists.xenproject.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1478077718-37424-4-git-send-email-xinhui.pan@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22 12:48:10 +01:00
Pan Xinhui
5aff60a191 locking/osq: Break out of spin-wait busy waiting loop for a preempted vCPU in osq_lock()
An over-committed guest with more vCPUs than pCPUs has a heavy overload
in osq_lock().

This is because if vCPU-A holds the osq lock and yields out, vCPU-B ends
up waiting for per_cpu node->locked to be set. IOW, vCPU-B waits for
vCPU-A to run and unlock the osq lock.

Use the new vcpu_is_preempted(cpu) interface to detect if a vCPU is
currently running or not, and break out of the spin-loop if so.

test case:

 $ perf record -a perf bench sched messaging -g 400 -p && perf report

 before patch:
 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

 after patch:
 20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
  8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
  4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
  3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
  2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
  2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
  2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Tested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: David.Laight@ACULAB.COM
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: benh@kernel.crashing.org
Cc: bsingharora@gmail.com
Cc: dave@stgolabs.net
Cc: kernellwp@gmail.com
Cc: konrad.wilk@oracle.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: mpe@ellerman.id.au
Cc: paulmck@linux.vnet.ibm.com
Cc: paulus@samba.org
Cc: rkrcmar@redhat.com
Cc: virtualization@lists.linux-foundation.org
Cc: will.deacon@arm.com
Cc: xen-devel-request@lists.xenproject.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1478077718-37424-3-git-send-email-xinhui.pan@linux.vnet.ibm.com
[ Translated to English. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22 12:48:10 +01:00
Ingo Molnar
02cb689b2c Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22 12:37:38 +01:00
Oleg Nesterov
8e5bfa8c1f sched/autogroup: Do not use autogroup->tg in zombie threads
Exactly because for_each_thread() in autogroup_move_group() can't see it
and update its ->sched_task_group before _put() and possibly free().

So the exiting task needs another sched_move_task() before exit_notify()
and we need to re-introduce the PF_EXITING (or similar) check removed by
the previous change for another reason.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hartsjc@redhat.com
Cc: vbendel@redhat.com
Cc: vlovejoy@redhat.com
Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22 12:33:43 +01:00
Oleg Nesterov
18f649ef34 sched/autogroup: Fix autogroup_move_group() to never skip sched_move_task()
The PF_EXITING check in task_wants_autogroup() is no longer needed. Remove
it, but see the next patch.

However the comment is correct in that autogroup_move_group() must always
change task_group() for every thread so the sysctl_ check is very wrong;
we can race with cgroups and even sys_setsid() is not safe because a task
running with task_group() == ag->tg must participate in refcounting:

	int main(void)
	{
		int sctl = open("/proc/sys/kernel/sched_autogroup_enabled", O_WRONLY);

		assert(sctl > 0);
		if (fork()) {
			wait(NULL); // destroy the child's ag/tg
			pause();
		}

		assert(pwrite(sctl, "1\n", 2, 0) == 2);
		assert(setsid() > 0);
		if (fork())
			pause();

		kill(getppid(), SIGKILL);
		sleep(1);

		// The child has gone, the grandchild runs with kref == 1
		assert(pwrite(sctl, "0\n", 2, 0) == 2);
		assert(setsid() > 0);

		// runs with the freed ag/tg
		for (;;)
			sleep(1);

		return 0;
	}

crashes the kernel. It doesn't really need sleep(1), it doesn't matter if
autogroup_move_group() actually frees the task_group or this happens later.

Reported-by: Vern Lovejoy <vlovejoy@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hartsjc@redhat.com
Cc: vbendel@redhat.com
Link: http://lkml.kernel.org/r/20161114184609.GA15965@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22 12:33:42 +01:00
Marc Zyngier
4e20156640 genirq/msi: Drop artificial PCI dependency
The generic MSI layer doesn't have any PCI ties anymore, and the
build hack should have been removed some time ago.

Fixes: d9109698be ("genirq: Introduce msi_domain_alloc/free_irqs()")
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/1479806476-20801-1-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-22 11:00:19 +01:00
Linus Torvalds
8d1a2408ef Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
Pull sparc fixes from David Miller:

 1) With modern networking cards we can run out of 32-bit DMA space, so
    support 64-bit DMA addressing when possible on sparc64. From Dave
    Tushar.

 2) Some signal frame validation checks are inverted on sparc32, fix
    from Andreas Larsson.

 3) Lockdep tables can get too large in some circumstances on sparc64,
    add a way to adjust the size a bit. From Babu Moger.

 4) Fix NUMA node probing on some sun4v systems, from Thomas Tai.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc: drop duplicate header scatterlist.h
  lockdep: Limit static allocations if PROVE_LOCKING_SMALL is defined
  config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc
  sunbmac: Fix compiler warning
  sunqe: Fix compiler warnings
  sparc64: Enable 64-bit DMA
  sparc64: Enable sun4v dma ops to use IOMMU v2 APIs
  sparc64: Bind PCIe devices to use IOMMU v2 service
  sparc64: Initialize iommu_map_table and iommu_pool
  sparc64: Add ATU (new IOMMU) support
  sparc64: Add FORCE_MAX_ZONEORDER and default to 13
  sparc64: fix compile warning section mismatch in find_node()
  sparc32: Fix inverted invalid_frame_pointer checks on sigreturns
  sparc64: Fix find_node warning if numa node cannot be found
2016-11-21 13:56:17 -08:00
Rafael J. Wysocki
08b98d3291 PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flag
Modify the ACPI system sleep support setup code to select
suspend-to-idle as the default system sleep state if the
ACPI_FADT_LOW_POWER_S0 flag is set in the FADT and the
default sleep state was not selected from the kernel command
line.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Mario Limonciello <mario.limonciello@dell.com>
2016-11-21 22:48:10 +01:00
Rafael J. Wysocki
406e79385f PM / sleep: System sleep state selection interface rework
There are systems in which the platform doesn't support any special
sleep states, so suspend-to-idle (PM_SUSPEND_FREEZE) is the only
available system sleep state.  However, some user space frameworks
only use the "mem" and (sometimes) "standby" sleep state labels, so
the users of those systems need to modify user space in order to be
able to use system suspend at all and that may be a pain in practice.

Commit 0399d4db3e (PM / sleep: Introduce command line argument for
sleep state enumeration) attempted to address this problem by adding
a command line argument to change the meaning of the "mem" string in
/sys/power/state to make it trigger suspend-to-idle (instead of
suspend-to-RAM).

However, there also are systems in which the platform does support
special sleep states, but suspend-to-idle is the preferred one anyway
(it even may save more energy than the platform-provided sleep states
in some cases) and the above commit doesn't help in those cases.

For this reason, rework the system sleep state selection interface
again (but preserve backwards compatibiliby).  Namely, add a new
sysfs file, /sys/power/mem_sleep, that will control the system
suspend mode triggered by writing "mem" to /sys/power/state (in
analogy with what /sys/power/disk does for hibernation).  Make it
select suspend-to-RAM ("deep" sleep) by default (if supported) and
fall back to suspend-to-idle ("s2idle") otherwise and add a new
command line argument, mem_sleep_default, allowing that default to
be overridden if need be.

At the same time, drop the relative_sleep_states command line
argument that doesn't make sense any more.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Mario Limonciello <mario.limonciello@dell.com>
2016-11-21 22:45:40 +01:00
Linus Torvalds
27e7ab99db Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Clear congestion control state when changing algorithms on an
    existing socket, from Florian Westphal.

 2) Fix register bit values in altr_tse_pcs portion of stmmac driver,
    from Jia Jie Ho.

 3) Fix PTP handling in stammc driver for GMAC4, from Giuseppe
    CAVALLARO.

 4) Fix udplite multicast delivery handling, it ignores the udp_table
    parameter passed into the lookups, from Pablo Neira Ayuso.

 5) Synchronize the space estimated by rtnl_vfinfo_size and the space
    actually used by rtnl_fill_vfinfo. From Sabrina Dubroca.

 6) Fix memory leak in fib_info when splitting nodes, from Alexander
    Duyck.

 7) If a driver does a napi_hash_del() explicitily and not via
    netif_napi_del(), it must perform RCU synchronization as needed. Fix
    this in virtio-net and bnxt drivers, from Eric Dumazet.

 8) Likewise, it is not necessary to invoke napi_hash_del() is we are
    also doing neif_napi_del() in the same code path. Remove such calls
    from be2net and cxgb4 drivers, also from Eric Dumazet.

 9) Don't allocate an ID in peernet2id_alloc() if the netns is dead,
    from WANG Cong.

10) Fix OF node and device struct leaks in of_mdio, from Johan Hovold.

11) We cannot cache routes in ip6_tunnel when using inherited traffic
    classes, from Paolo Abeni.

12) Fix several crashes and leaks in cpsw driver, from Johan Hovold.

13) Splice operations cannot use freezable blocking calls in AF_UNIX,
    from WANG Cong.

14) Link dump filtering by master device and kind support added an error
    in loop index updates during the dump if we actually do filter, fix
    from Zhang Shengju.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits)
  tcp: zero ca_priv area when switching cc algorithms
  net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmit
  ethernet: stmmac: make DWMAC_STM32 depend on it's associated SoC
  tipc: eliminate obsolete socket locking policy description
  rtnl: fix the loop index update error in rtnl_dump_ifinfo()
  l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()
  net: macb: add check for dma mapping error in start_xmit()
  rtnetlink: fix FDB size computation
  netns: fix get_net_ns_by_fd(int pid) typo
  af_unix: conditionally use freezable blocking calls in read
  net: ethernet: ti: cpsw: fix fixed-link phy probe deferral
  net: ethernet: ti: cpsw: add missing sanity check
  net: ethernet: ti: cpsw: fix secondary-emac probe error path
  net: ethernet: ti: cpsw: fix of_node and phydev leaks
  net: ethernet: ti: cpsw: fix deferred probe
  net: ethernet: ti: cpsw: fix mdio device reference leak
  net: ethernet: ti: cpsw: fix bad register access in probe error path
  net: sky2: Fix shutdown crash
  cfg80211: limit scan results cache size
  net sched filters: pass netlink message flags in event notification
  ...
2016-11-21 13:26:28 -08:00
Daniel Borkmann
97bc402db7 bpf, mlx5: fix mlx5e_create_rq taking reference on prog
In mlx5e_create_rq(), when creating a new queue, we call bpf_prog_add() but
without checking the return value. bpf_prog_add() can fail since 92117d8443
("bpf: fix refcnt overflow"), so we really must check it. Take the reference
right when we assign it to the rq from priv->xdp_prog, and just drop the
reference on error path. Destruction in mlx5e_destroy_rq() looks good, though.

Fixes: 86994156c7 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-21 11:25:57 -05:00
Alexander Shishkin
e96271f3ed perf/core: Fix address filter parser
The token table passed into match_token() must be null-terminated, which
it currently is not in the perf's address filter string parser, as caught
by Vince's perf_fuzzer and KASAN.

It doesn't blow up otherwise because of the alignment padding of the table
to the next element in the .rodata, which is luck.

Fixing by adding a null-terminator to the token table.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: stable@vger.kernel.org # v4.7+
Fixes: 375637bc52 ("perf/core: Introduce address range filtering")
Link: http://lkml.kernel.org/r/877f81f264.fsf@ashishki-desk.ger.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-21 11:28:36 +01:00
Waiman Long
194a6b5b9c sched/wake_q: Rename WAKE_Q to DEFINE_WAKE_Q
Currently the wake_q data structure is defined by the WAKE_Q() macro.
This macro, however, looks like a function doing something as "wake" is
a verb. Even checkpatch.pl was confused as it reported warnings like

  WARNING: Missing a blank line after declarations
  #548: FILE: kernel/futex.c:3665:
  +	int ret;
  +	WAKE_Q(wake_q);

This patch renames the WAKE_Q() macro to DEFINE_WAKE_Q() which clarifies
what the macro is doing and eliminates the checkpatch.pl warnings.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1479401198-1765-1-git-send-email-longman@redhat.com
[ Resolved conflict and added missing rename. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-21 10:29:01 +01:00
Steve Grubb
c1e8f06d7a audit: fix formatting of AUDIT_CONFIG_CHANGE events
The AUDIT_CONFIG_CHANGE events sometimes use a op= field. The current
code logs the value of the field with quotes. This field is documented
to not be encoded, so it should not have quotes.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
[PM: reformatted commit description to make checkpatch.pl happy]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-11-20 15:38:00 -05:00
Richard Guy Briggs
833fc48d18 audit: skip sessionid sentinel value when auto-incrementing
The value (unsigned int)-1 is used as a sentinel to indicate the
sessionID is unset.  Skip this value when the session_id value wraps.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-11-20 15:28:22 -05:00
Babu Moger
e245d99e6c lockdep: Limit static allocations if PROVE_LOCKING_SMALL is defined
Reduce the size of data structure for lockdep entries by half if
PROVE_LOCKING_SMALL if defined. This is used only for sparc.

Signed-off-by: Babu Moger <babu.moger@oracle.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 11:33:19 -08:00
Alexey Dobriyan
c7d03a00b5 netns: make struct pernet_operations::id unsigned int
Make struct pernet_operations::id unsigned.

There are 2 reasons to do so:

1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.

2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.

"int" being used as an array index needs to be sign-extended
to 64-bit before being used.

	void f(long *p, int i)
	{
		g(p[i]);
	}

  roughly translates to

	movsx	rsi, esi
	mov	rdi, [rsi+...]
	call 	g

MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.

Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:

	static inline void *net_generic(const struct net *net, int id)
	{
		...
		ptr = ng->ptr[id - 1];
		...
	}

And this function is used a lot, so those sign extensions add up.

Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]

However, overall balance is in negative direction:

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
	function                                     old     new   delta
	nfsd4_lock                                  3886    3959     +73
	tipc_link_build_proto_msg                   1096    1140     +44
	mac80211_hwsim_new_radio                    2776    2808     +32
	tipc_mon_rcv                                1032    1058     +26
	svcauth_gss_legacy_init                     1413    1429     +16
	tipc_bcbase_select_primary                   379     392     +13
	nfsd4_exchange_id                           1247    1260     +13
	nfsd4_setclientid_confirm                    782     793     +11
		...
	put_client_renew_locked                      494     480     -14
	ip_set_sockfn_get                            730     716     -14
	geneve_sock_add                              829     813     -16
	nfsd4_sequence_done                          721     703     -18
	nlmclnt_lookup_host                          708     686     -22
	nfsd4_lockt                                 1085    1063     -22
	nfs_get_client                              1077    1050     -27
	tcf_bpf_init                                1106    1076     -30
	nfsd4_encode_fattr                          5997    5930     -67
	Total: Before=154856051, After=154854321, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 10:59:15 -05:00
Thomas Gleixner
f97960fbdd kernel/printk: Block cpuhotplug callback when tasks are frozen
The recent conversion of the console hotplug notifier to the state machine
missed the fact, that the notifier only operated on the non frozen
transitions. As a consequence the console_lock/unlock() pair is also
invoked during suspend, which results in a lockdep warning.

Restore the previous state by making the lock/unlock conditional on
!tasks_frozen.

Fixes: 90b14889d2 ("kernel/printk: Convert to hotplug state machine")
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1611171729320.3645@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-11-17 19:44:58 +01:00
Ingo Molnar
89a01c51cb Merge branch 'x86/cpufeature' into x86/asm, to pick up dependency
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-17 08:30:54 +01:00
Viresh Kumar
21ef57297b cpufreq: schedutil: irq-work and mutex are only used in slow path
Execute the irq-work specific initialization/exit code only when the
fast path isn't available.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-11-16 23:21:08 +01:00
Viresh Kumar
02a7b1ee3b cpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task
If slow path frequency changes are conducted in a SCHED_OTHER context
then they may be delayed for some amount of time, including
indefinitely, when real time or deadline activity is taking place.

Move the slow path to a real time kernel thread. In the future the
thread should be made SCHED_DEADLINE. The RT priority is arbitrarily set
to 50 for now.

Hackbench results on ARM Exynos, dual core A15 platform for 10
iterations:

$ hackbench -s 100 -l 100 -g 10 -f 20

Before			After
---------------------------------
1.808			1.603
1.847			1.251
2.229			1.590
1.952			1.600
1.947			1.257
1.925			1.627
2.694			1.620
1.258			1.621
1.919			1.632
1.250			1.240

Average:

1.8829			1.5041

Based on initial work by Steve Muckle.

Signed-off-by: Steve Muckle <smuckle.linux@gmail.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-11-16 23:21:08 +01:00
Viresh Kumar
4a71ce4348 cpufreq: schedutil: enable fast switch earlier
The fast_switch_enabled flag will be used by both sugov_policy_alloc()
and sugov_policy_free() with a later patch.

Prepare for that by moving the calls to enable and disable it to the
beginning of sugov_init() and end of sugov_exit().

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-11-16 23:21:08 +01:00
Viresh Kumar
8e2ddb0364 cpufreq: schedutil: Avoid indented labels
Switch to the more common practice of writing labels.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-11-16 23:21:07 +01:00
Josef Bacik
f23cc643f9 bpf: fix range arithmetic for bpf map access
I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
invalid accesses to bpf map entries.  Fix this up by doing a few things

1) Kill BPF_MOD support.  This doesn't actually get used by the compiler in real
life and just adds extra complexity.

2) Fix the logic for BPF_AND, don't allow AND of negative numbers and set the
minimum value to 0 for positive AND's.

3) Don't do operations on the ranges if they are set to the limits, as they are
by definition undefined, and allowing arithmetic operations on those values
could make them appear valid when they really aren't.

This fixes the testcase provided by Jann as well as a few other theoretical
problems.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 13:21:45 -05:00
Thomas Gleixner
b6e5d5b947 genirq/affinity: Use default affinity mask for reserved vectors
The reserved vectors at the beginning and the end of the vector space get
cpu_possible_mask assigned as their affinity mask.

All other non-auto affine interrupts get the default irq affinity mask
assigned. Using cpu_possible_mask breaks that rule.

Treat them like any other interrupt and use irq_default_affinity as target
mask.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
2016-11-16 18:44:01 +01:00
Christoph Hellwig
bfe1307738 genirq/affinity: Take reserved vectors into account when spreading irqs
The recent addition of reserved vectors at the beginning or the end of the
vector space did not take the reserved vectors at the beginning into
account for the various loop exit conditions. As a consequence the last
vectors of the spread area are not included into the spread algorithm and
are treated like the reserved vectors at the end of the vector space and
get the default affinity mask assigned.

Sum up the affinity vectors and the reserved vectors at the beginning and
use the sum as exit condition.

[ tglx: Fixed all conditions instead of only one and massaged changelog ]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/1479201178-29604-2-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-16 18:44:01 +01:00
Martin KaFai Lau
2874aa2e46 bpf: Fix compilation warning in __bpf_lru_list_rotate_inactive
gcc-6.2.1 gives the following warning:
kernel/bpf/bpf_lru_list.c: In function ‘__bpf_lru_list_rotate_inactive.isra.3’:
kernel/bpf/bpf_lru_list.c:201:28: warning: ‘next’ may be used uninitialized in this function [-Wmaybe-uninitialized]

The "next" is currently initialized in the while() loop which must have >=1
iterations.

This patch initializes next to get rid of the compiler warning.

Fixes: 3a08c2fd76 ("bpf: LRU List")
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16 11:30:56 -05:00
Vincent Guittot
d03266910a sched/fair: Fix task group initialization
The moves of tasks are now propagated down to root and the utilization
of cfs_rq reflects reality so it doesn't need to be estimated at init.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-7-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:11 +01:00
Vincent Guittot
4e5160766f sched/fair: Propagate asynchrous detach
A task can be asynchronously detached from cfs_rq when migrating
between CPUs. The load of the migrated task is then removed from
source cfs_rq during its next update. We use this event to set
propagation flag.

During the load balance, we take advantage of the update of blocked
load to propagate any pending changes.

The propagation relies on patch:

  "sched: Fix hierarchical order in rq->leaf_cfs_rq_list"

... which orders children and parents, to ensure that it's done in one pass.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:10 +01:00
Vincent Guittot
09a43ace1f sched/fair: Propagate load during synchronous attach/detach
When a task moves from/to a cfs_rq, we set a flag which is then used to
propagate the change at parent level (sched_entity and cfs_rq) during
next update. If the cfs_rq is throttled, the flag will stay pending until
the cfs_rq is unthrottled.

For propagating the utilization, we copy the utilization of group cfs_rq to
the sched_entity.

For propagating the load, we have to take into account the load of the
whole task group in order to evaluate the load of the sched_entity.
Similarly to what was done before the rewrite of PELT, we add a correction
factor in case the task group's load is greater than its share so it will
contribute the same load of a task of equal weight.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:10 +01:00
Vincent Guittot
d31b1a66cb sched/fair: Factorize PELT update
Every time we modify load/utilization of sched_entity, we start to
sync it with its cfs_rq. This update is done in different ways:

 - when attaching/detaching a sched_entity, we update cfs_rq and then
   we sync the entity with the cfs_rq.

 - when enqueueing/dequeuing the sched_entity, we update both
   sched_entity and cfs_rq metrics to now.

Use update_load_avg() everytime we have to update and sync cfs_rq and
sched_entity before changing the state of a sched_enity.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:09 +01:00
Vincent Guittot
9c2791f936 sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list
Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that a
child will always be called before its parent.

The hierarchical order in shares update list has been introduced by
commit:

  67e86250f8 ("sched: Introduce hierarchal order on shares update list")

With the current implementation a child can be still put after its
parent.

Lets take the example of:

       root
        \
         b
         /\
         c d*
           |
           e*

with root -> b -> c already enqueued but not d -> e so the
leaf_cfs_rq_list looks like: head -> c -> b -> root -> tail

The branch d -> e will be added the first time that they are enqueued,
starting with e then d.

When e is added, its parents is not already on the list so e is put at
the tail : head -> c -> b -> root -> e -> tail

Then, d is added at the head because its parent is already on the
list: head -> d -> c -> b -> root -> e -> tail

e is not placed at the right position and will be called the last
whereas it should be called at the beginning.

Because it follows the bottom-up enqueue sequence, we are sure that we
will finished to add either a cfs_rq without parent or a cfs_rq with a
parent that is already on the list. We can use this event to detect
when we have finished to add a new branch. For the others, whose
parents are not already added, we have to ensure that they will be
added after their children that have just been inserted the steps
before, and after any potential parents that are already in the list.
The easiest way is to put the cfs_rq just after the last inserted one
and to keep track of it untl the branch is fully added.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:08 +01:00
Vincent Guittot
df217913e7 sched/fair: Factorize attach/detach entity
Factorize post_init_entity_util_avg() and part of attach_task_cfs_rq()
in one function attach_entity_cfs_rq().

Create symmetric detach_entity_cfs_rq() function.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:08 +01:00
Morten Rasmussen
893c5d2279 sched/fair: Fix incorrect comment for capacity_margin
The comment for capacity_margin introduced in:

  3273163c67 ("sched/fair: Let asymmetric CPU configurations balance at wake-up")

... got its usage the wrong way round - fix it.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-7-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:07 +01:00
Morten Rasmussen
9e0994c0a1 sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
For asymmetric CPU capacity systems it is counter-productive for
throughput if low capacity CPUs are pulling tasks from non-overloaded
CPUs with higher capacity. The assumption is that higher CPU capacity is
preferred over running alone in a group with lower CPU capacity.

This patch rejects higher CPU capacity groups with one or less task per
CPU as potential busiest group which could otherwise lead to a series of
failing load-balancing attempts leading to a force-migration.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-5-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:06 +01:00
Morten Rasmussen
bf475ce0a3 sched/fair: Add per-CPU min capacity to sched_group_capacity
struct sched_group_capacity currently represents the compute capacity
sum of all CPUs in the sched_group.

Unless it is divided by the group_weight to get the average capacity
per CPU, it hides differences in CPU capacity for mixed capacity systems
(e.g. high RT/IRQ utilization or ARM big.LITTLE).

But even the average may not be sufficient if the group covers CPUs of
different capacities.

Instead, by extending struct sched_group_capacity to indicate min per-CPU
capacity in the group a suitable group for a given task utilization can
more easily be found such that CPUs with reduced capacity can be avoided
for tasks with high utilization (not implemented by this patch).

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-4-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:06 +01:00
Morten Rasmussen
6a0b19c0f3 sched/fair: Consider spare capacity in find_idlest_group()
In low-utilization scenarios comparing relative loads in
find_idlest_group() doesn't always lead to the most optimum choice.
Systems with groups containing different numbers of cpus and/or cpus of
different compute capacity are significantly better off when considering
spare capacity rather than relative load in those scenarios.

In addition to existing load based search an alternative spare capacity
based candidate sched_group is found and selected instead if sufficient
spare capacity exists. If not, existing behaviour is preserved.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:05 +01:00
Morten Rasmussen
104cb16d9e sched/fair: Compute task/cpu utilization at wake-up correctly
At task wake-up load-tracking isn't updated until the task is enqueued.
The task's own view of its utilization contribution may therefore not be
aligned with its contribution to the cfs_rq load-tracking which may have
been updated in the meantime. Basically, the task's own utilization
hasn't yet accounted for the sleep decay, while the cfs_rq may have
(partially). Estimating the cfs_rq utilization in case the task is
migrated at wake-up as task_rq(p)->cfs.avg.util_avg - p->se.avg.util_avg
is therefore incorrect as the two load-tracking signals aren't time
synchronized (different last update).

To solve this problem, this patch synchronizes the task utilization with
its previous rq before the task utilization is used in the wake-up path.
Currently the update/synchronization is done _after_ the task has been
placed by select_task_rq_fair(). The synchronization is done without
having to take the rq lock using the existing mechanism used in
remove_entity_load_avg().

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-2-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:05 +01:00
Martin Schwidefsky
527b0a76f4 sched/cpuacct: Avoid %lld seq_printf warning
For s390 kernel builds I keep getting this warning:

 kernel/sched/cpuacct.c: In function 'cpuacct_stats_show':
 kernel/sched/cpuacct.c:298:25: warning: format '%lld' expects argument of type 'long long int', but argument 4 has type 'clock_t {aka long int}' [-Wformat=]
   seq_printf(sf, "%s %lld\n",

Silence the warning by adding an explicit cast.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161111142749.6545-1-schwidefsky@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:29:03 +01:00
Ingo Molnar
0acbc7aa47 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:16:28 +01:00
Christian Borntraeger
f2f09a4cee locking/core: Remove cpu_relax_lowlatency() users
With the s390 special case of a yielding cpu_relax() implementation gone,
we can now remove all users of cpu_relax_lowlatency() and replace them
with cpu_relax().

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Noam Camus <noamc@ezchip.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1477386195-32736-5-git-send-email-borntraeger@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:15:10 +01:00
Christian Borntraeger
bf0d31c054 locking/core, stop_machine: Yield the CPU during stop machine()
Some time ago the following commit:

  57f2ffe14f ("s390: remove diag 44 calls from cpu_relax()")

... stopped cpu_relax() on s390 yielding to the hypervisor.

As it turns out this made stop_machine() run really slow on virtualized
overcommited systems. For example the kprobes test during bootup took
several seconds instead of just running unnoticed with large guests.

Therefore, yielding was reintroduced with commit:

  4d92f50249 ("s390: reintroduce diag 44 calls for cpu_relax()")

... but in fact the stop machine code seems to be the only place where
this yielding was really necessary. This place is probably the most
important one as it makes all but one guest CPUs wait for one guest CPU.

As we now have cpu_relax_yield(), we can use this in multi_cpu_stop().
For now lets only add it here. We can add it later in other places
when necessary.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Noam Camus <noamc@ezchip.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1477386195-32736-3-git-send-email-borntraeger@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-16 10:15:09 +01:00
Nicolas Pitre
baa73d9e47 posix-timers: Make them configurable
Some embedded systems have no use for them.  This removes about
25KB from the kernel binary size when configured out.

Corresponding syscalls are routed to a stub logging the attempt to
use those syscalls which should be enough of a clue if they were
disabled without proper consideration. They are: timer_create,
timer_gettime: timer_getoverrun, timer_settime, timer_delete,
clock_adjtime, setitimer, getitimer, alarm.

The clock_settime, clock_gettime, clock_getres and clock_nanosleep
syscalls are replaced by simple wrappers compatible with CLOCK_REALTIME,
CLOCK_MONOTONIC and CLOCK_BOOTTIME only which should cover the vast
majority of use cases with very little code.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: linux-kbuild@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: Michal Marek <mmarek@suse.com>
Cc: Edward Cree <ecree@solarflare.com>
Link: http://lkml.kernel.org/r/1478841010-28605-7-git-send-email-nicolas.pitre@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-16 09:26:35 +01:00
Nicolas Pitre
53d3eaa315 posix_cpu_timers: Move the add_device_randomness() call to a proper place
There is no logical relation between add_device_randomness() and
posix_cpu_timers_exit(). Let's move the former to where the later
is called. This way, when posix-cpu-timers.c is compiled out, there
is no need to worry about not losing a call to add_device_randomness().

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: linux-kbuild@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Michal Marek <mmarek@suse.com>
Cc: Edward Cree <ecree@solarflare.com>
Link: http://lkml.kernel.org/r/1478841010-28605-6-git-send-email-nicolas.pitre@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-16 09:26:34 +01:00
Nicolas Pitre
74ba181e61 timer: Move sys_alarm from timer.c to itimer.c
Move the only user of alarm_setitimer to itimer.c where it is defined.
This allows for making alarm_setitimer static, and dropping it from the
build when __ARCH_WANT_SYS_ALARM is not defined.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: linux-kbuild@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Michal Marek <mmarek@suse.com>
Cc: Edward Cree <ecree@solarflare.com>
Link: http://lkml.kernel.org/r/1478841010-28605-5-git-send-email-nicolas.pitre@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-16 09:26:34 +01:00
Joel Fernandes
d032ae8921 ftrace: Provide API to use global filtering for ftrace ops
Currently the global_ops filtering hash is not available to outside users
registering for function tracing. Provide an API for those users to be
able to choose global filtering.

This is in preparation for pstore's ftrace feature to be able to
use the global filters.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:30 -08:00
Steven Rostedt
fa32e8557b tracing: Add new trace_marker_raw
A new file is created:

 /sys/kernel/debug/tracing/trace_marker_raw

This allows for appications to create data structures and write the binary
data directly into it, and then read the trace data out from trace_pipe_raw
into the same type of data structure. This saves on converting numbers into
ASCII that would be required by trace_marker.

Suggested-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-15 15:13:59 -05:00
Martin KaFai Lau
8f8449384e bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH
Provide a LRU version of the existing BPF_MAP_TYPE_PERCPU_HASH

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 11:50:20 -05:00
Martin KaFai Lau
29ba732acb bpf: Add BPF_MAP_TYPE_LRU_HASH
Provide a LRU version of the existing BPF_MAP_TYPE_HASH.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 11:50:20 -05:00
Martin KaFai Lau
fd91de7b3c bpf: Refactor codes handling percpu map
Refactor the codes that populate the value
of a htab_elem in a BPF_MAP_TYPE_PERCPU_HASH
typed bpf_map.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 11:50:20 -05:00
Martin KaFai Lau
961578b634 bpf: Add percpu LRU list
Instead of having a common LRU list, this patch allows a
percpu LRU list which can be selected by specifying a map
attribute.  The map attribute will be added in the later
patch.

While the common use case for LRU is #reads >> #updates,
percpu LRU list allows bpf prog to absorb unusual #updates
under pathological case (e.g. external traffic facing machine which
could be under attack).

Each percpu LRU is isolated from each other.  The LRU nodes (including
free nodes) cannot be moved across different LRU Lists.

Here are the update performance comparison between
common LRU list and percpu LRU list (the test code is
at the last patch):

[root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
./map_perf_test 16 $i | awk '{r += $3}END{print r " updates"}'; done
 1 cpus: 2934082 updates
 4 cpus: 7391434 updates
 8 cpus: 6500576 updates

[root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
./map_perf_test 32 $i | awk '{r += $3}END{printr " updates"}'; done
  1 cpus: 2896553 updates
  4 cpus: 9766395 updates
  8 cpus: 17460553 updates

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 11:50:20 -05:00
Martin KaFai Lau
3a08c2fd76 bpf: LRU List
Introduce bpf_lru_list which will provide LRU capability to
the bpf_htab in the later patch.

* General Thoughts:
1. Target use case.  Read is more often than update.
   (i.e. bpf_lookup_elem() is more often than bpf_update_elem()).
   If bpf_prog does a bpf_lookup_elem() first and then an in-place
   update, it still counts as a read operation to the LRU list concern.
2. It may be useful to think of it as a LRU cache
3. Optimize the read case
   3.1 No lock in read case
   3.2 The LRU maintenance is only done during bpf_update_elem()
4. If there is a percpu LRU list, it will lose the system-wise LRU
   property.  A completely isolated percpu LRU list has the best
   performance but the memory utilization is not ideal considering
   the work load may be imbalance.
5. Hence, this patch starts the LRU implementation with a global LRU
   list with batched operations before accessing the global LRU list.
   As a LRU cache, #read >> #update/#insert operations, it will work well.
6. There is a local list (for each cpu) which is named
   'struct bpf_lru_locallist'.  This local list is not used to sort
   the LRU property.  Instead, the local list is to batch enough
   operations before acquiring the lock of the global LRU list.  More
   details on this later.
7. In the later patch, it allows a percpu LRU list by specifying a
   map-attribute for scalability reason and for use cases that need to
   prepare for the worst (and pathological) case like DoS attack.
   The percpu LRU list is completely isolated from each other and the
   LRU nodes (including free nodes) cannot be moved across the list.  The
   following description is for the global LRU list but mostly applicable
   to the percpu LRU list also.

* Global LRU List:
1. It has three sub-lists: active-list, inactive-list and free-list.
2. The two list idea, active and inactive, is borrowed from the
   page cache.
3. All nodes are pre-allocated and all sit at the free-list (of the
   global LRU list) at the beginning.  The pre-allocation reasoning
   is similar to the existing BPF_MAP_TYPE_HASH.  However,
   opting-out prealloc (BPF_F_NO_PREALLOC) is not supported in
   the LRU map.

* Active/Inactive List (of the global LRU list):
1. The active list, as its name says it, maintains the active set of
   the nodes.  We can think of it as the working set or more frequently
   accessed nodes.  The access frequency is approximated by a ref-bit.
   The ref-bit is set during the bpf_lookup_elem().
2. The inactive list, as its name also says it, maintains a less
   active set of nodes.  They are the candidates to be removed
   from the bpf_htab when we are running out of free nodes.
3. The ordering of these two lists is acting as a rough clock.
   The tail of the inactive list is the older nodes and
   should be released first if the bpf_htab needs free element.

* Rotating the Active/Inactive List (of the global LRU list):
1. It is the basic operation to maintain the LRU property of
   the global list.
2. The active list is only rotated when the inactive list is running
   low.  This idea is similar to the current page cache.
   Inactive running low is currently defined as
   "# of inactive < # of active".
3. The active list rotation always starts from the tail.  It moves
   node without ref-bit set to the head of the inactive list.
   It moves node with ref-bit set back to the head of the active
   list and then clears its ref-bit.
4. The inactive rotation is pretty simply.
   It walks the inactive list and moves the nodes back to the head of
   active list if its ref-bit is set. The ref-bit is cleared after moving
   to the active list.
   If the node does not have ref-bit set, it just leave it as it is
   because it is already in the inactive list.

* Shrinking the Inactive List (of the global LRU list):
1. Shrinking is the operation to get free nodes when the bpf_htab is
   full.
2. It usually only shrinks the inactive list to get free nodes.
3. During shrinking, it will walk the inactive list from the tail,
   delete the nodes without ref-bit set from bpf_htab.
4. If no free node found after step (3), it will forcefully get
   one node from the tail of inactive or active list.  Forcefully is
   in the sense that it ignores the ref-bit.

* Local List:
1. Each CPU has a 'struct bpf_lru_locallist'.  The purpose is to
   batch enough operations before acquiring the lock of the
   global LRU.
2. A local list has two sub-lists, free-list and pending-list.
3. During bpf_update_elem(), it will try to get from the free-list
   of (the current CPU local list).
4. If the local free-list is empty, it will acquire from the
   global LRU list.  The global LRU list can either satisfy it
   by its global free-list or by shrinking the global inactive
   list.  Since we have acquired the global LRU list lock,
   it will try to get at most LOCAL_FREE_TARGET elements
   to the local free list.
5. When a new element is added to the bpf_htab, it will
   first sit at the pending-list (of the local list) first.
   The pending-list will be flushed to the global LRU list
   when it needs to acquire free nodes from the global list
   next time.

* Lock Consideration:
The LRU list has a lock (lru_lock).  Each bucket of htab has a
lock (buck_lock).  If both locks need to be acquired together,
the lock order is always lru_lock -> buck_lock and this only
happens in the bpf_lru_list.c logic.

In hashtab.c, both locks are not acquired together (i.e. one
lock is always released first before acquiring another lock).

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 11:50:20 -05:00
Linus Torvalds
81bcfe5e48 Alexei discovered a race condition in modules failing to load that
can cause a ftrace check to trigger and disable ftrace. This is because
 of the way modules are registered to ftrace. Their functions are
 loaded in the ftrace function tables but set to "disabled" since
 they are still in the process of being loaded by the module. After
 the module is finished, it calls back into the ftrace infrastructure
 to enable it. Looking deeper into the locations that access all the
 functions in the table, I found more locations that should ignore
 the disabled ones.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJYKxq1FBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 k24IAIvZoDRiFvwdF8ZNrngD8ZT1IfEg771yzouKgTXT+mHhty8IpO6YRIYxJ5vS
 bLmmg1wGhSjBUGdAsm1e16r8bbxuZo7SbYx0CUz/XEQimbKJZU56M6oBrp22qtEv
 f4TBN0RPHhghG5IpHTm1Cvh4FZ9NgDkoWVuSKf7/vhxJ1GNpu/cpaTS60x+X6Fdj
 mFVIdwDRcupqXJLtqoB4tDC+iekqD0Zwj9yxpBL12Em/PvcgtXsBO4oaP0vEMOES
 ylGwEY0jCSpRJ2EsX1QnZDMP3DM0m9JLGmkzGuTwekGsHxe9+ODyQeYZyZ105rbD
 hYPfo3xS0diXnOyosxetgjefUwQ=
 =VoP8
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Alexei discovered a race condition in modules failing to load that can
  cause a ftrace check to trigger and disable ftrace.

  This is because of the way modules are registered to ftrace. Their
  functions are loaded in the ftrace function tables but set to
  "disabled" since they are still in the process of being loaded by the
  module. After the module is finished, it calls back into the ftrace
  infrastructure to enable it.

  Looking deeper into the locations that access all the functions in the
  table, I found more locations that should ignore the disabled ones"

* tag 'trace-v4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Add more checks for FTRACE_FL_DISABLED in processing ip records
  ftrace: Ignore FTRACE_FL_DISABLED while walking dyn_ftrace records
2016-11-15 08:49:13 -08:00
David S. Miller
bb598c1b8c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of bug fixes in 'net' overlapping other changes in
'net-next-.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 10:54:36 -05:00
David Carrillo-Cisneros
864c2357ca perf/core: Do not set cpuctx->cgrp for unscheduled cgroups
Commit:

  db4a835601 ("perf/core: Set cgroup in CPU contexts for new cgroup events")

failed to verify that event->cgrp is actually the scheduled cgroup
in a CPU before setting cpuctx->cgrp. This patch fixes that.

Now that there is a different path for scheduled and unscheduled
cgroup, add a warning to catch when cpuctx->cgrp is still set after
the last cgroup event has been unsheduled.

To verify the bug:

  # Create 2 cgroups.
  mkdir /dev/cgroups/devices/g1
  mkdir /dev/cgroups/devices/g2

  # launch a task, bind it to a cpu and move it to g1
  CPU=2
  while :; do : ; done &
  P=$!

  taskset -pc $CPU $P
  echo $P > /dev/cgroups/devices/g1/tasks

  # monitor g2 (it runs no tasks) and observe output
  perf stat -e cycles -I 1000 -C $CPU -G g2

  #           time             counts unit events
     1.000091408          7,579,527      cycles                    g2
     2.000350111      <not counted>      cycles                    g2
     3.000589181      <not counted>      cycles                    g2
     4.000771428      <not counted>      cycles                    g2

  # note first line that displays that a task run in g2, despite
  # g2 having no tasks. This is because cpuctx->cgrp was wrongly
  # set when context of new event was installed.
  # After applying the fix we obtain the right output:

  perf stat -e cycles -I 1000 -C $CPU -G g2
  #           time             counts unit events
     1.000119615      <not counted>      cycles                    g2
     2.000389430      <not counted>      cycles                    g2
     3.000590962      <not counted>      cycles                    g2

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Link: http://lkml.kernel.org/r/1478026378-86083-1-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-15 14:18:22 +01:00
Stanislaw Gruszka
353c50ebe3 sched/cputime: Simplify task_cputime()
Now since fetch_task_cputime() has no other users than task_cputime(),
its code could be used directly in task_cputime().

Moreover since only 2 task_cputime() calls of 17 use a NULL argument,
we can add dummy variables to those calls and remove NULL checks from
task_cputimes().

Also remove NULL checks from task_cputimes_scaled().

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1479175612-14718-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-15 09:51:05 +01:00
Stanislaw Gruszka
40565b5aed sched/cputime, powerpc, s390: Make scaled cputime arch specific
Only s390 and powerpc have hardware facilities allowing to measure
cputimes scaled by frequency. On all other architectures
utimescaled/stimescaled are equal to utime/stime (however they are
accounted separately).

Remove {u,s}timescaled accounting on all architectures except
powerpc and s390, where those values are explicitly accounted
in the proper places.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161031162143.GB12646@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-15 09:51:05 +01:00
Stanislaw Gruszka
981ee2d444 sched/cputime, powerpc: Remove cputime_to_scaled()
Currently cputime_to_scaled() just return it's argument on
all implementations, we don't need to call this function.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1479175612-14718-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-15 09:51:04 +01:00
Linus Torvalds
e76d21c40b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix off by one wrt. indexing when dumping /proc/net/route entries,
    from Alexander Duyck.

 2) Fix lockdep splats in iwlwifi, from Johannes Berg.

 3) Cure panic when inserting certain netfilter rules when NFT_SET_HASH
    is disabled, from Liping Zhang.

 4) Memory leak when nft_expr_clone() fails, also from Liping Zhang.

 5) Disable UFO when path will apply IPSEC tranformations, from Jakub
    Sitnicki.

 6) Don't bogusly double cwnd in dctcp module, from Florian Westphal.

 7) skb_checksum_help() should never actually use the value "0" for the
    resulting checksum, that has a special meaning, use CSUM_MANGLED_0
    instead. From Eric Dumazet.

 8) Per-tx/rx queue statistic strings are wrong in qed driver, fix from
    Yuval MIntz.

 9) Fix SCTP reference counting of associations and transports in
    sctp_diag. From Xin Long.

10) When we hit ip6tunnel_xmit() we could have come from an ipv4 path in
    a previous layer or similar, so explicitly clear the ipv6 control
    block in the skb. From Eli Cooper.

11) Fix bogus sleeping inside of inet_wait_for_connect(), from WANG
    Cong.

12) Correct deivce ID of T6 adapter in cxgb4 driver, from Hariprasad
    Shenai.

13) Fix potential access past the end of the skb page frag array in
    tcp_sendmsg(). From Eric Dumazet.

14) 'skb' can legitimately be NULL in inet{,6}_exact_dif_match(). Fix
    from David Ahern.

15) Don't return an error in tcp_sendmsg() if we wronte any bytes
    successfully, from Eric Dumazet.

16) Extraneous unlocks in netlink_diag_dump(), we removed the locking
    but forgot to purge these unlock calls. From Eric Dumazet.

17) Fix memory leak in error path of __genl_register_family(). We leak
    the attrbuf, from WANG Cong.

18) cgroupstats netlink policy table is mis-sized, from WANG Cong.

19) Several XDP bug fixes in mlx5, from Saeed Mahameed.

20) Fix several device refcount leaks in network drivers, from Johan
    Hovold.

21) icmp6_send() should use skb dst device not skb->dev to determine L3
    routing domain. From David Ahern.

22) ip_vs_genl_family sets maxattr incorrectly, from WANG Cong.

23) We leak new macvlan port in some cases of maclan_common_netlink()
    errors. Fix from Gao Feng.

24) Similar to the icmp6_send() fix, icmp_route_lookup() should
    determine L3 routing domain using skb_dst(skb)->dev not skb->dev.
    Also from David Ahern.

25) Several fixes for route offloading and FIB notification handling in
    mlxsw driver, from Jiri Pirko.

26) Properly cap __skb_flow_dissect()'s return value, from Eric Dumazet.

27) Fix long standing regression in ipv4 redirect handling, wrt.
    validating the new neighbour's reachability. From Stephen Suryaputra
    Lin.

28) If sk_filter() trims the packet excessively, handle it reasonably in
    tcp input instead of exploding. From Eric Dumazet.

29) Fix handling of napi hash state when copying channels in sfc driver,
    from Bert Kenward.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (121 commits)
  mlxsw: spectrum_router: Flush FIB tables during fini
  net: stmmac: Fix lack of link transition for fixed PHYs
  sctp: change sk state only when it has assocs in sctp_shutdown
  bnx2: Wait for in-flight DMA to complete at probe stage
  Revert "bnx2: Reset device during driver initialization"
  ps3_gelic: fix spelling mistake in debug message
  net: ethernet: ixp4xx_eth: fix spelling mistake in debug message
  ibmvnic: Fix size of debugfs name buffer
  ibmvnic: Unmap ibmvnic_statistics structure
  sfc: clear napi_hash state when copying channels
  mlxsw: spectrum_router: Correctly dump neighbour activity
  mlxsw: spectrum: Fix refcount bug on span entries
  bnxt_en: Fix VF virtual link state.
  bnxt_en: Fix ring arithmetic in bnxt_setup_tc().
  Revert "include/uapi/linux/atm_zatm.h: include linux/time.h"
  tcp: take care of truncations done by sk_filter()
  ipv4: use new_gw for redirect neigh lookup
  r8152: Fix error path in open function
  net: bpqether.h: remove if_ether.h guard
  net: __skb_flow_dissect() must cap its return value
  ...
2016-11-14 14:15:53 -08:00
Zhou Chengming
8d414bd2f7 tracing: Allow wakeup_dl tracer to be used by instances
Allow wakeup_dl tracer to be used by instances, like wakeup tracer
and wakeup_rt tracer.

Link: http://lkml.kernel.org/r/1479093553-31264-1-git-send-email-zhouchengming1@huawei.com

Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-14 16:43:00 -05:00
Steven Rostedt (Red Hat)
3f303fbccf tracing/filter: Define op as the enum that it is
The trace_events_file.c filter logic can be a bit complex. I copy this into
a userspace program where I can debug it a bit easier. One issue is the op
is defined in most places as an int instead of as an enum, and gdb just
gives the value when debugging. Having the actual op name shown in gdb is
more useful.

This has no functionality change, but helps in debugging when the file is
debugged in user space.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-14 16:42:59 -05:00
Steven Rostedt (Red Hat)
fdf5b67986 tracing: Optimise comparison filters and fix binary and for 64 bit
Currently the filter logic for comparisons (like greater-than and less-than)
are used, they share the same function and a switch statement is used to
jump to the comparison type to perform. This is done in the extreme hot path
of the tracing code, and it does not take much more space to create a
unique comparison function to perform each type of comparison and remove the
switch statement.

Also, a bug was found where the binary and operation for 64 bits could fail
if the resulting bits were greater than 32 bits, because the result was
passed into a 32 bit variable. This was fixed when adding the separate
binary and function.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-14 16:42:58 -05:00
Masami Hiramatsu
60f1d5e3ba ftrace: Support full glob matching
Use glob_match() to support flexible glob wildcards (*,?)
and character classes ([) for ftrace.
Since the full glob matching is slower than the current
partial matching routines(*pat, pat*, *pat*), this leaves
those routines and just add MATCH_GLOB for complex glob
expression.

e.g.
----
[root@localhost tracing]# echo 'sched*group' > set_ftrace_filter
[root@localhost tracing]# cat set_ftrace_filter
sched_free_group
sched_change_group
sched_create_group
sched_online_group
sched_destroy_group
sched_offline_group
[root@localhost tracing]# echo '[Ss]y[Ss]_*' > set_ftrace_filter
[root@localhost tracing]# head set_ftrace_filter
sys_arch_prctl
sys_rt_sigreturn
sys_ioperm
SyS_iopl
sys_modify_ldt
SyS_mmap
SyS_set_thread_area
SyS_get_thread_area
SyS_set_tid_address
sys_fork
----

Link: http://lkml.kernel.org/r/147566869501.29136.6462645009894738056.stgit@devbox

Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-14 16:42:58 -05:00
Steven Rostedt (Red Hat)
546fece4ea ftrace: Add more checks for FTRACE_FL_DISABLED in processing ip records
When a module is first loaded and its function ip records are added to the
ftrace list of functions to modify, they are set to DISABLED, as their text
is still in a read only state. When the module is fully loaded, and can be
updated, the flag is cleared, and if their's any functions that should be
tracing them, it is updated at that moment.

But there's several locations that do record accounting and should ignore
records that are marked as disabled, or they can cause issues.

Alexei already fixed one location, but others need to be addressed.

Cc: stable@vger.kernel.org
Fixes: b7ffffbb46 "ftrace: Add infrastructure for delayed enabling of module functions"
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-14 16:31:49 -05:00
Alexei Starovoitov
977c1f9c8c ftrace: Ignore FTRACE_FL_DISABLED while walking dyn_ftrace records
ftrace_shutdown() checks for sanity of ftrace records
and if dyn_ftrace->flags is not zero, it will warn.
It can happen that 'flags' are set to FTRACE_FL_DISABLED at this point,
since some module was loaded, but before ftrace_module_enable()
cleared the flags for this module.

In other words the module.c is doing:
ftrace_module_init(mod); // calls ftrace_update_code() that sets flags=FTRACE_FL_DISABLED
... // here ftrace_shutdown() is called that warns, since
err = prepare_coming_module(mod); // didn't have a chance to clear FTRACE_FL_DISABLED

Fix it by ignoring disabled records.
It's similar to what __ftrace_hash_rec_update() is already doing.

Link: http://lkml.kernel.org/r/1478560460-3818619-1-git-send-email-ast@fb.com

Cc: stable@vger.kernel.org
Fixes: b7ffffbb46 "ftrace: Add infrastructure for delayed enabling of module functions"
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-14 16:31:41 -05:00
Mickaël Salaün
535e7b4b5e bpf: Use u64_to_user_ptr()
Replace the custom u64_to_ptr() function with the u64_to_user_ptr()
macro.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-14 16:23:08 -05:00
Richard Guy Briggs
8443075eac audit: tame initialization warning len_abuf in audit_log_execve_info
Tame initialization warning of len_abuf in audit_log_execve_info even
though there isn't presently a bug introduced by commit 43761473c2
("audit: fix a double fetch in audit_log_single_execve_arg()").  Using
UNINITIALIZED_VAR instead may mask future bugs.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-11-14 15:18:48 -05:00
Paul E. McKenney
aa3e0bf1aa rcu: Don't kick unless grace period or request
The current code can result in spurious kicks when there are no grace
periods in progress and no grace-period-related requests.  This is
sort of OK for a diagnostic aid, but the resulting ftrace-dump messages
in dmesg are annoying.  This commit therefore avoids spurious kicks
in the common case.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2016-11-14 10:46:31 -08:00
Paul E. McKenney
0742ac3e2f rcu: Make expedited grace periods recheck dyntick idle state
Expedited grace periods check dyntick-idle state, and avoid sending
IPIs to idle CPUs, including those running guest OSes, and, on NOHZ_FULL
kernels, nohz_full CPUs.  However, the kernel has been observed checking
a CPU while it was non-idle, but sending the IPI after it has gone
idle.  This commit therefore rechecks idle state immediately before
sending the IPI, refraining from IPIing CPUs that have since gone idle.

Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-11-14 10:46:31 -08:00
Paul E. McKenney
d0af39e89e torture: Trace long read-side delays
Although rcutorture will occasionally do a 50-millisecond grace-period
delay, these delays are quite rare.  And rightly so, because otherwise
the read rate would be quite low.  Thie means that it can be important
to identify whether or not a given run contained a long-delay read.
This commit therefore inserts a trace_rcu_torture_read() event to flag
runs containing long delays.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-11-14 10:46:30 -08:00
Paul E. McKenney
1f21b50b77 rcu: Remove obsolete comment from __call_rcu()
The __call_rcu() comment about opportunistically noting grace period
beginnings and endings is obsolete.  RCU still does such opportunistic
noting, but in __call_rcu_core() rather than __call_rcu(), and there
already is an appropriate comment in __call_rcu_core().  This commit
therefore removes the obsolete comment.

Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2016-11-14 10:46:19 -08:00
Paul E. McKenney
5403d367a7 rcu: Remove obsolete rcu_check_callbacks() header comment
In the deep past, rcu_check_callbacks() was only invoked if rcu_pending()
returned true.  Which was fine, but these days rcu_check_callbacks()
is invoked unconditionally.  This commit therefore removes the obsolete
sentence from the header comment.

Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2016-11-14 10:46:14 -08:00
Paul E. McKenney
b8f2ed5384 rcu: Tighten up __call_rcu() rcu_head alignment check
Commit 720abae3d6 ("rcu: force alignment on struct
callback_head/rcu_head") forced the rcu_head (AKA callback_head)
structure's alignment to pointer size, that is, to 4-byte boundaries on
32-bit systems and to 8-byte boundaries on 64-bit systems.  This
commit therefore checks for this same alignment in __call_rcu(),
which used to contain a looser check for two-byte alignment.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2016-11-14 10:46:08 -08:00
Linus Torvalds
f5c9f9c723 Revert "printk: make reading the kernel log flush pending lines"
This reverts commit bfd8d3f23b.

It turns out that this flushes things much too aggressiverly, and causes
lines to break up when the system logger races with new continuation
lines being printed.

There's a pending patch to make printk() flushing much more
straightforward, but it's too invasive for 4.9, so in the meantime let's
just not make the system message logging flush continuation lines.
They'll be flushed by the final newline anyway.

Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-14 09:31:52 -08:00
Linus Torvalds
5d69561b7b Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Ingo Molnar:
 "This fixes a genirq regression that resulted in the Intel/Broxton
  pinctrl/GPIO driver (and possibly others) spewing warnings"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Use irq type from irqdata instead of irqdesc
2016-11-14 08:34:56 -08:00
James Morris
185c0f26c0 Merge commit 'v4.9-rc5' into next 2016-11-14 09:34:01 +11:00
Daniel Borkmann
c540594f86 bpf, mlx4: fix prog refcount in mlx4_en_try_alloc_resources error path
Commit 67f8b1dcb9 ("net/mlx4_en: Refactor the XDP forwarding rings
scheme") added a bug in that the prog's reference count is not dropped
in the error path when mlx4_en_try_alloc_resources() is failing from
mlx4_xdp_set().

We previously took bpf_prog_add(prog, priv->rx_ring_num - 1), that we
need to release again. Earlier in the call path, dev_change_xdp_fd()
itself holds a reference to the prog as well (hence the '- 1' in the
bpf_prog_add()), so a simple atomic_sub() is safe to use here. When
an error is propagated, then bpf_prog_put() is called eventually from
dev_change_xdp_fd()

Fixes: 67f8b1dcb9 ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-12 23:31:57 -05:00
Linus Torvalds
b9f659b810 Power management fixes for v4.9-rc5
- Prevent the PM core from attempting to suspend parent devices
    if any of their children, whose suspend callbacks were invoked
    asynchronously, have failed to suspend during the "late" and
    "noirq" phases of system-wide suspend of devices (Brian Norris).
 
  - Prevent the boot-time system suspend test code from leaking a
    reference to the RTC device used by it (Johan Hovold).
 
  - Fix cpupower to use the return value of one of its library
    functions correctly and restore the correct behavior of it
    when used for setting cpufreq tunables broken during the 4.7
    development cycle (Laura Abbott).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJYJloUAAoJEILEb/54YlRxkYgP/RvTRLO+6Hlk0Re/Lm5SS7O4
 YjOrcemWkGkA+YJ5Qe4vki28nMzGITIAWuHJK7Nex0A1zFMX3GrNPxxMPHBejfog
 7VjeEp77x0h9F8EzWX2YLZaTCSpvQjpHH5/vUQaTIE8a/WtZVbSzfCT4jqAYYHzV
 rHn52aNwiAKd4kjMTSBJdjTOdhqebRFYRHpBHovoTARcGH4DgEJciQzaEh1O6qAq
 barOGrDDkB32mULjsBNRX4Wyx1wuJiOpVjiD7sIM75ZomlIDC/RyYc9O37t6GDBC
 koTTlilV2/i2VoqcU5nQgAmgUOoDz1skGJW6Rsp/d/S2GnZiLBJwma31e187otBL
 NFl1egL1BJaz5J1sow7YPYBQILzlYGwNhXLiKHoPEshWp5WSn7mIhxxzAou3Llsp
 oPCXDwuRrMhqYrOdclxiL+Pj6dUapvut3SerZB5C9rGg2JL9Y2gDxrhf/jFqXxNb
 mBw3kRx94bDb6DNHJPuplRHWrL0SlM2ZXnUa7XwUSt+jIiwMK2la0cAL1P/7Qva9
 zJ6SG01BnGEd6OI+y9ZSjITdpTVK10JIuCrTmR9cvWNA+A1vOGCkg6Nj+qunmzFh
 j6W2O0KQ/moR0ICCPCg96Hjf888mohX7vhGQqDQwpXYlsvWcWvjNpURsfY47yBH9
 p70+ziBkqyyIAprJKjLB
 =ykEg
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix two bugs in error code paths in the PM core (system-wide
  suspend of devices), a device reference leak in the boot-time suspend
  test code and a cpupower utility regression from the 4.7 cycle.

  Specifics:

   - Prevent the PM core from attempting to suspend parent devices if
     any of their children, whose suspend callbacks were invoked
     asynchronously, have failed to suspend during the "late" and
     "noirq" phases of system-wide suspend of devices (Brian Norris).

   - Prevent the boot-time system suspend test code from leaking a
     reference to the RTC device used by it (Johan Hovold).

   - Fix cpupower to use the return value of one of its library
     functions correctly and restore the correct behavior of it when
     used for setting cpufreq tunables broken during the 4.7 development
     cycle (Laura Abbott)"

* tag 'pm-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / sleep: don't suspend parent when async child suspend_{noirq, late} fails
  PM / sleep: fix device reference leak in test_suspend
  cpupower: Correct return type of cpu_power_is_cpu_online() in cpufreq-set
2016-11-11 16:54:23 -08:00
Rafael J. Wysocki
cd16f3dcdb Merge branches 'pm-tools-fixes' and 'pm-sleep-fixes'
* pm-tools-fixes:
  cpupower: Correct return type of cpu_power_is_cpu_online() in cpufreq-set

* pm-sleep-fixes:
  PM / sleep: don't suspend parent when async child suspend_{noirq, late} fails
  PM / sleep: fix device reference leak in test_suspend
2016-11-11 23:24:58 +01:00
Hans de Goede
c6c7d83b9c Revert "console: don't prefer first registered if DT specifies stdout-path"
This reverts commit 05fd007e46 ("console: don't prefer first
registered if DT specifies stdout-path").

The reverted commit changes existing behavior on which many ARM boards
rely.  Many ARM small-board-computers, like e.g.  the Raspberry Pi have
both a video output and a serial console.  Depending on whether the user
is using the device as a more regular computer; or as a headless device
we need to have the console on either one or the other.

Many users rely on the kernel behavior of the console being present on
both outputs, before the reverted commit the console setup with no
console= kernel arguments on an ARM board which sets stdout-path in dt
would look like this:

  [root@localhost ~]# cat /proc/consoles
  ttyS0                -W- (EC p a)    4:64
  tty0                 -WU (E  p  )    4:1

Where as after the reverted commit, it looks like this:

  [root@localhost ~]# cat /proc/consoles
  ttyS0                -W- (EC p a)    4:64

This commit reverts commit 05fd007e46 ("console: don't prefer first
registered if DT specifies stdout-path") restoring the original
behavior.

Fixes: 05fd007e46 ("console: don't prefer first registered if DT specifies stdout-path")
Link: http://lkml.kernel.org/r/20161104121135.4780-2-hdegoede@redhat.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-11 08:12:37 -08:00
Daniel Bristot de Oliveira
9846d50df3 sched/deadline: Fix typo in a comment
In the comment:

        /*
         * The task might have changed its scheduling policy to something
         * different than SCHED_DEADLINE (through switched_fromd_dl()).
         */

s/fromd/from/

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/5408b3b3f9ee197a7b7f10fb834341100a4f2c88.1478599881.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-11 08:28:52 +01:00
Ingo Molnar
bfdd5537dc Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-11 08:27:11 +01:00
Tahsin Erdogan
83f06168ef locking/lockdep: Remove unused parameter from the add_lock_to_list() function
The 'class' parameter is not used, remove it.
n
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1478592127-4376-1-git-send-email-tahsin@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-11 08:25:20 +01:00
Ingo Molnar
4c8ee71620 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-11 08:25:07 +01:00
Tobias Klauser
de464375da bpf: Remove unused but set variables
Remove the unused but set variables min_set and max_set in
adjust_reg_min_max_vals to fix the following warning when building with
'W=1':

  kernel/bpf/verifier.c:1483:7: warning: variable ‘min_set’ set but not used [-Wunused-but-set-variable]

There is no warning about max_set being unused, but since it is only
used in the assignment of min_set it can be removed as well.

They were introduced in commit 484611357c ("bpf: allow access into map
value arrays") but seem to have never been used.

Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 21:15:00 -05:00
Sebastian Andrzej Siewior
90b14889d2 kernel/printk: Convert to hotplug state machine
Install the callbacks via the state machine.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161103145021.28528-3-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-09 23:45:26 +01:00
Christoph Hellwig
67c93c218d genirq/affinity: Handle pre/post vectors in irq_create_affinity_masks()
Only calculate the affinity for the main I/O vectors, and skip the
pre or post vectors specified by struct irq_affinity.

Also remove the irq_affinity cpumask argument that has never been used.
If we ever need it in the future we can pass it through struct
irq_affinity.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Link: http://lkml.kernel.org/r/1478654107-7384-4-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-09 08:25:09 +01:00
Christoph Hellwig
212bd84622 genirq/affinity: Handle pre/post vectors in irq_calc_affinity_vectors()
Only calculate the affinity for the main I/O vectors, and skip the pre or
post vectors specified by struct irq_affinity.

Also remove the irq_affinity cpumask argument that has never been used.  If
we ever need it in the future we can pass it through struct irq_affinity.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Link: http://lkml.kernel.org/r/1478654107-7384-3-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-09 08:25:08 +01:00
Thomas Gleixner
7ee7e87dfb genirq: Use irq type from irqdata instead of irqdesc
The type flags in the irq descriptor are there for historical reasons and
only updated via irq_modify_status() or irq_set_type(). Both functions also
update the type flags in irqdata. __setup_irq() is the only left over user
of the type flags in the irq descriptor.

If __setup_irq() is called with empty irq type flags, then the type flags
are retrieved from irqdata. If an interrupt is shared, then the type flags
are compared with the type flags stored in the irq descriptor. 

On x86 the ioapic does not have a irq_set_type() callback because the type
is defined in the BIOS tables and cannot be changed. The type is stored in
irqdata at setup time without updating the type data in the irq
descriptor. As a result the comparison described above fails.

There is no point in updating the irq descriptor flags because the only
relevant storage is irqdata. Use the type flags from irqdata for both
retrieval and comparison in __setup_irq() instead.

Aside of that the print out in case of non matching type flags has the old
and new type flags arguments flipped. Fix that as well.

For correctness sake the flags stored in the irq descriptor should be
removed, but this is beyond the scope of this bugfix and will be done in a
later patch.

Fixes: 4b357daed6 ("genirq: Look-up trigger type if not specified by caller")
Reported-and-tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Jon Hunter <jonathanh@nvidia.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1611072020360.3501@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-08 15:15:19 +01:00
Daniel Borkmann
20b2b24f91 bpf: fix map not being uncharged during map creation failure
In map_create(), we first find and create the map, then once that
suceeded, we charge it to the user's RLIMIT_MEMLOCK, and then fetch
a new anon fd through anon_inode_getfd(). The problem is, once the
latter fails f.e. due to RLIMIT_NOFILE limit, then we only destruct
the map via map->ops->map_free(), but without uncharging the previously
locked memory first. That means that the user_struct allocation is
leaked as well as the accounted RLIMIT_MEMLOCK memory not released.
Make the label names in the fix consistent with bpf_prog_load().

Fixes: aaac3ba95e ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:22:26 -05:00
Daniel Borkmann
483bed2b0d bpf: fix htab map destruction when extra reserve is in use
Commit a6ed3ea65d ("bpf: restore behavior of bpf_map_update_elem")
added an extra per-cpu reserve to the hash table map to restore old
behaviour from pre prealloc times. When non-prealloc is in use for a
map, then problem is that once a hash table extra element has been
linked into the hash-table, and the hash table is destroyed due to
refcount dropping to zero, then htab_map_free() -> delete_all_elements()
will walk the whole hash table and drop all elements via htab_elem_free().
The problem is that the element from the extra reserve is first fed
to the wrong backend allocator and eventually freed twice.

Fixes: a6ed3ea65d ("bpf: restore behavior of bpf_map_update_elem")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:20:52 -05:00
Linus Torvalds
ffbcbfca84 Merge branches 'sched-urgent-for-linus' and 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull stack vmap fixups from Thomas Gleixner:
 "Two small patches related to sched_show_task():

   - make sure to hold a reference on the task stack while accessing it

   - remove the thread_saved_pc printout

  .. and add a sanity check into release_task_stack() to catch problems
  with task stack references"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Remove pointless printout in sched_show_task()
  sched/core: Fix oops in sched_show_task()

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  fork: Add task stack refcounting sanity check and prevent premature task stack freeing
2016-11-05 11:46:02 -07:00
WANG Cong
243d521261 taskstats: fix the length of cgroupstats_cmd_get_policy
cgroupstats_cmd_get_policy is [CGROUPSTATS_CMD_ATTR_MAX+1],
taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1],
but their family.maxattr is TASKSTATS_CMD_ATTR_MAX.
CGROUPSTATS_CMD_ATTR_MAX is less than TASKSTATS_CMD_ATTR_MAX,
so we could end up accessing out-of-bound.

Change cgroupstats_cmd_get_policy to TASKSTATS_CMD_ATTR_MAX+1,
this is safe because the rest are initialized to 0's.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:55:58 -04:00
Linus Torvalds
8243d55977 sched/core: Remove pointless printout in sched_show_task()
In sched_show_task() we print out a useless hex number, not even a
symbol, and there's a big question mark whether this even makes sense
anyway, I suspect we should just remove it all.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: brgerst@gmail.com
Cc: jann@thejh.net
Cc: keescook@chromium.org
Cc: linux-api@vger.kernel.org
Cc: tycho.andersen@canonical.com
Link: http://lkml.kernel.org/r/CA+55aFzphURPFzAvU4z6Moy7ZmimcwPuUdYU8bj9z0J+S8X1rw@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-03 07:31:34 +01:00
Tetsuo Handa
382005027f sched/core: Fix oops in sched_show_task()
When CONFIG_THREAD_INFO_IN_TASK=y, it is possible that an exited thread
remains in the task list after its stack pointer was already set to NULL.

Therefore, thread_saved_pc() and stack_not_used() in sched_show_task()
will trigger NULL pointer dereference if an attempt to dump such thread's
traces (e.g. SysRq-t, khungtaskd) is made.

Since show_stack() in sched_show_task() calls try_get_task_stack() and
sched_show_task() is called from interrupt context, calling
try_get_task_stack() from sched_show_task() will be safe as well.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: brgerst@gmail.com
Cc: jann@thejh.net
Cc: keescook@chromium.org
Cc: linux-api@vger.kernel.org
Cc: tycho.andersen@canonical.com
Link: http://lkml.kernel.org/r/201611021950.FEJ34368.HFFJOOMLtQOVSF@I-love.SAKURA.ne.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-03 07:27:34 +01:00
Johan Hovold
ceb75787bc PM / sleep: fix device reference leak in test_suspend
Make sure to drop the reference taken by class_find_device() after
opening the RTC device.

Fixes: 77437fd4e6 (pm: boot time suspend selftest)
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-11-02 05:10:04 +01:00
Mickaël Salaün
285fdfc5d9 seccomp: Fix documentation
Fix struct seccomp_filter and seccomp_run_filters() signatures.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: James Morris <jmorris@namei.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-01 08:54:26 -07:00
Christoph Hellwig
70fd76140a block,fs: use REQ_* flags directly
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly.  Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Ingo Molnar
05b93c19d5 Merge branch 'linus' into x86/asm, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-01 07:41:06 +01:00
Andy Lutomirski
405c075971 fork: Add task stack refcounting sanity check and prevent premature task stack freeing
If something goes wrong with task stack refcounting and a stack
refcount hits zero too early, warn and leak it rather than
potentially freeing it early (and silently).

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f29119c783a9680a4b4656e751b6123917ace94b.1477926663.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-01 07:39:17 +01:00
Daniel Borkmann
0f98621bef bpf, inode: add support for symlinks and fix mtime/ctime
While commit bb35a6ef7d ("bpf, inode: allow for rename and link ops")
added support for hard links that can be used for prog and map nodes,
this work adds simple symlink support, which can be used f.e. for
directories also when unpriviledged and works with cmdline tooling that
understands S_IFLNK anyway. Since the switch in e27f4a942a ("bpf: Use
mount_nodev not mount_ns to mount the bpf filesystem"), there can be
various mount instances with mount_nodev() and thus hierarchy can be
flattened to facilitate object sharing. Thus, we can keep bpf tooling
also working by repointing paths.

Most of the functionality can be used from vfs library operations. The
symlink is stored in the inode itself, that is in i_link, which is
sufficient in our case as opposed to storing it in the page cache.
While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update
the directories mtime and ctime, so add a common helper for it called
bpf_dentry_finalize() that takes care of it for all cases now.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 15:28:11 -04:00
David S. Miller
27058af401 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Mostly simple overlapping changes.

For example, David Ahern's adjacency list revamp in 'net-next'
conflicted with an adjacency list traversal bug fix in 'net'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30 12:42:58 -04:00
Thomas Graf
ebb676daa1 bpf: Print function name in addition to function id
The verifier currently prints raw function ids when printing CALL
instructions or when complaining:

	5: (85) call 23
	unknown func 23

print a meaningful function name instead:

	5: (85) call bpf_redirect#23
	unknown func bpf_redirect#23

Moves the function documentation to a single comment and renames all
helpers names in the list to conform to the bpf_ prefix notation so
they can be greped in the kernel source.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:55:13 -04:00
Linus Torvalds
b546e0c289 Power management fixes for v4.9-rc3
Specifics:
 
  - Fix a missing KERN_CONT in a system suspend message by converting
    the affected code to using pr_info() and pr_cont() instead of the
    "raw" printk() (Jon Hunter).
 
  - Make intel_pstate set the CPU P-state from its .set_policy()
    callback when the scaling_governor sysfs attribute is set to
    "performance" so that it interacts with NOHZ_FULL more
    predictably which was the case before 4.7 (Rafael Wysocki).
 
  - Make intel_pstate always request the maximum allowed P-state when
    the scaling_governor sysfs attribute is set to "performance" to
    prevent it from effectively ingoring that setting is some
    situations (Rafael Wysocki).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJYE+vbAAoJEILEb/54YlRxsqQQAIZLfm5Isp4L0JBt2svtLYuO
 5mLgGOfcpsHIpqq5+saKtK+JOcCpm/MJ0PMQdv3WhKkFGXG9++Dz6F2gNNFlE+Sy
 era/B9NFry1BsQ14xsX8XonYRkQ2tS+CgOw+akSZl9BTVeX57VASqPIzlKI4Auz/
 Ru5M8n4QWAD7dPWnqHTFHYbmTnMtr18wn0mNjiC/cLz7TPxmGM9Pe730/bpGKMDC
 DQpflnpT/ktqV1imDuaiMeuP2re2DSZWMrirZo326UsDtcKN1A4sCO+PPUUt/TTb
 k6pKsZ9OB9FgMmaRqx5G6wCD3NvuhKa1JOqg3RvAVZ0SmMcl2AFhCaoEJtX+TKvz
 3365o2tSQeGc3bmXTb7YqP2SaYm107rOG/M0I2JuA7qNUKxk8MFfbNdaYxTY2ezT
 D5NF0SCycOMeEXIDTTuZ2oN1NHR6osO7bNFOIacKlcIL8PmEl8fDqX2KdYQze5xh
 AGsa1Oee7FPrDd2kQIL6HNGgVjd6gtHU9cVjyzQeUv9SU6zwOlu4/BbZI+wIterN
 L2OfW7w6/KSSyScDZ3TbifeJwpP918bUv578dEP6U/2VKthPBUPY/MYMMQzSuEIn
 RAQoDShqacIjdyLZtStii28KctMOk0L/edEDbdFmyu2ttNwpB+hk2ieCYWfI8wAa
 2SAaYyRGpC2nafsSTv+E
 =RtkB
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix two intel_pstate issues related to the way it works when the
  scaling_governor sysfs attribute is set to "performance" and fix up
  messages in the system suspend core code.

  Specifics:

   - Fix a missing KERN_CONT in a system suspend message by converting
     the affected code to using pr_info() and pr_cont() instead of the
     "raw" printk() (Jon Hunter).

   - Make intel_pstate set the CPU P-state from its .set_policy()
     callback when the scaling_governor sysfs attribute is set to
     "performance" so that it interacts with NOHZ_FULL more predictably
     which was the case before 4.7 (Rafael Wysocki).

   - Make intel_pstate always request the maximum allowed P-state when
     the scaling_governor sysfs attribute is set to "performance" to
     prevent it from effectively ingoring that setting is some
     situations (Rafael Wysocki)"

* tag 'pm-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: intel_pstate: Always set max P-state in performance mode
  PM / suspend: Fix missing KERN_CONT for suspend message
  cpufreq: intel_pstate: Set P-state upfront in performance mode
2016-10-28 18:29:13 -07:00
Linus Torvalds
b49c3170bf Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc kernel fixes: a virtualization environment related fix, an uncore
  PMU driver removal handling fix, a PowerPC fix and new events for
  Knights Landing"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Honour the CPUID for number of fixed counters in hypervisors
  perf/powerpc: Don't call perf_event_disable() from atomic context
  perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic
  perf/x86/intel/cstate: Add C-state residency events for Knights Landing
2016-10-28 16:27:16 -07:00
Linus Torvalds
a8006bd915 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
 "Fix four timer locking races: two were noticed by Linus while
  reviewing the code while chasing for a corruption bug, and two
  from fixing spurious USB timeouts"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers: Prevent base clock corruption when forwarding
  timers: Prevent base clock rewind when forwarding clock
  timers: Lock base for same bucket optimization
  timers: Plug locking race vs. timer migration
2016-10-28 11:26:01 -07:00
Linus Torvalds
965c4b7e1a Merge branches 'core-urgent-for-linus', 'irq-urgent-for-linus' and 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool, irq and scheduler fixes from Ingo Molnar:
 "One more objtool fixlet for GCC6 code generation patterns, an irq
  DocBook fix and an unused variable warning fix in the scheduler"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Fix rare switch jump table pattern detection

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  doc: Add missing parameter for msi_setup

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Remove unused but set variable 'rq'
2016-10-28 10:12:27 -07:00
Christoph Hellwig
ef295ecf09 block: better op and flags encoding
Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields.  This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.

In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.

Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits.  Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28 08:48:16 -06:00
Jiri Olsa
5aab90ce1e perf/powerpc: Don't call perf_event_disable() from atomic context
The trinity syscall fuzzer triggered following WARN() on powerpc:

  WARNING: CPU: 9 PID: 2998 at arch/powerpc/kernel/hw_breakpoint.c:278
  ...
  NIP [c00000000093aedc] .hw_breakpoint_handler+0x28c/0x2b0
  LR [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0
  Call Trace:
  [c0000002f7933580] [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 (unreliable)
  [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
  [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
  [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
  [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
  [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48

Followed by a lockdep warning:

  ===============================
  [ INFO: suspicious RCU usage. ]
  4.8.0-rc5+ #7 Tainted: G        W
  -------------------------------
  ./include/linux/rcupdate.h:556 Illegal context switch in RCU read-side critical section!

  other info that might help us debug this:

  rcu_scheduler_active = 1, debug_locks = 0
  2 locks held by ls/2998:
   #0:  (rcu_read_lock){......}, at: [<c0000000000f6a00>] .__atomic_notifier_call_chain+0x0/0x1c0
   #1:  (rcu_read_lock){......}, at: [<c00000000093ac50>] .hw_breakpoint_handler+0x0/0x2b0

  stack backtrace:
  CPU: 9 PID: 2998 Comm: ls Tainted: G        W       4.8.0-rc5+ #7
  Call Trace:
  [c0000002f7933150] [c00000000094b1f8] .dump_stack+0xe0/0x14c (unreliable)
  [c0000002f79331e0] [c00000000013c468] .lockdep_rcu_suspicious+0x138/0x180
  [c0000002f7933270] [c0000000001005d8] .___might_sleep+0x278/0x2e0
  [c0000002f7933300] [c000000000935584] .mutex_lock_nested+0x64/0x5a0
  [c0000002f7933410] [c00000000023084c] .perf_event_ctx_lock_nested+0x16c/0x380
  [c0000002f7933500] [c000000000230a80] .perf_event_disable+0x20/0x60
  [c0000002f7933580] [c00000000093aeec] .hw_breakpoint_handler+0x29c/0x2b0
  [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
  [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
  [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
  [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
  [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48

While it looks like the first WARN() is probably valid, the other one is
triggered by disabling event via perf_event_disable() from atomic context.

The event is disabled here in case we were not able to emulate
the instruction that hit the breakpoint. By disabling the event
we unschedule the event and make sure it's not scheduled back.

But we can't call perf_event_disable() from atomic context, instead
we need to use the event's pending_disable irq_work method to disable it.

Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161026094824.GA21397@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-28 11:06:25 +02:00
Jiri Olsa
0933840acf perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic
CAI Qian reported a crash in the PMU uncore device removal code,
enabled by the CONFIG_DEBUG_TEST_DRIVER_REMOVE=y option:

  https://marc.info/?l=linux-kernel&m=147688837328451

The reason for the crash is that perf_pmu_unregister() tries to remove
a PMU device which is not added at this point. We add PMU devices
only after pmu_bus is registered, which happens in the
perf_event_sysfs_init() call and sets the 'pmu_bus_running' flag.

The fix is to get the 'pmu_bus_running' flag state at the point
the PMU is taken out of the PMU list and remove the device
later only if it's set.

Reported-by: CAI Qian <caiqian@redhat.com>
Tested-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161020111011.GA13361@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-28 11:06:25 +02:00
Andrey Konovalov
b274c0bb39 kcov: properly check if we are in an interrupt
in_interrupt() returns a nonzero value when we are either in an
interrupt or have bh disabled via local_bh_disable().  Since we are
interested in only ignoring coverage from actual interrupts, do a proper
check instead of just calling in_interrupt().

As a result of this change, kcov will start to collect coverage from
within local_bh_disable()/local_bh_enable() sections.

Link: http://lkml.kernel.org/r/1476115803-20712-1-git-send-email-andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Nicolai Stange <nicstange@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: James Morse <james.morse@arm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-27 18:43:42 -07:00
Johannes Berg
56989f6d85 genetlink: mark families as __ro_after_init
Now genl_register_family() is the only thing (other than the
users themselves, perhaps, but I didn't find any doing that)
writing to the family struct.

In all families that I found, genl_register_family() is only
called from __init functions (some indirectly, in which case
I've add __init annotations to clarifly things), so all can
actually be marked __ro_after_init.

This protects the data structure from accidental corruption.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg
489111e5c2 genetlink: statically initialize families
Instead of providing macros/inline functions to initialize
the families, make all users initialize them statically and
get rid of the macros.

This reduces the kernel code size by about 1.6k on x86-64
(with allyesconfig).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg
a07ea4d994 genetlink: no longer support using static family IDs
Static family IDs have never really been used, the only
use case was the workaround I introduced for those users
that assumed their family ID was also their multicast
group ID.

Additionally, because static family IDs would never be
reserved by the generic netlink code, using a relatively
low ID would only work for built-in families that can be
registered immediately after generic netlink is started,
which is basically only the control family (apart from
the workaround code, which I also had to add code for so
it would reserve those IDs)

Thus, anything other than GENL_ID_GENERATE is flawed and
luckily not used except in the cases I mentioned. Move
those workarounds into a few lines of code, and then get
rid of GENL_ID_GENERATE entirely, making it more robust.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Linus Torvalds
9c953d639c Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A set of fixes for this series, most notably the fix for the blk-mq
  software queue regression in from this merge window.

  Apart from that, a fix for an unlikely hang if a queue is flooded with
  FUA requests from Ming, and a few small fixes for nbd and badblocks.
  Lastly, a rename update for the proc softirq output, since the block
  polling code was made generic"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: update hardware and software queues for sleeping alloc
  block: flush: fix IO hang in case of flood fua req
  nbd: fix incorrect unlock of nbd->sock_lock in sock_shutdown
  badblocks: badblocks_set/clear update unacked_exist
  softirq: Display IRQ_POLL for irq-poll statistics
2016-10-27 10:05:31 -07:00
Linus Torvalds
9dcb8b685f mm: remove per-zone hashtable of bitlock waitqueues
The per-zone waitqueues exist because of a scalability issue with the
page waitqueues on some NUMA machines, but it turns out that they hurt
normal loads, and now with the vmalloced stacks they also end up
breaking gfs2 that uses a bit_wait on a stack object:

     wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

where 'gh' can be a reference to the local variable 'mount_gh' on the
stack of fill_super().

The reason the per-zone hash table breaks for this case is that there is
no "zone" for virtual allocations, and trying to look up the physical
page to get at it will fail (with a BUG_ON()).

It turns out that I actually complained to the mm people about the
per-zone hash table for another reason just a month ago: the zone lookup
also hurts the regular use of "unlock_page()" a lot, because the zone
lookup ends up forcing several unnecessary cache misses and generates
horrible code.

As part of that earlier discussion, we had a much better solution for
the NUMA scalability issue - by just making the page lock have a
separate contention bit, the waitqueue doesn't even have to be looked at
for the normal case.

Peter Zijlstra already has a patch for that, but let's see if anybody
even notices.  In the meantime, let's fix the actual gfs2 breakage by
simplifying the bitlock waitqueues and removing the per-zone issue.

Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-27 09:27:57 -07:00
Tobias Klauser
f5d6d2da0d sched/fair: Remove unused but set variable 'rq'
Since commit:

  8663e24d56 ("sched/fair: Reorder cgroup creation code")

... the variable 'rq' in alloc_fair_sched_group() is set but no longer used.
Remove it to fix the following GCC warning when building with 'W=1':

  kernel/sched/fair.c:8842:13: warning: variable ‘rq’ set but not used [-Wunused-but-set-variable]

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161026113704.8981-1-tklauser@distanz.ch
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-27 08:33:52 +02:00
Douglas Anderson
4b7e9cf9c8 timers: Fix documentation for schedule_timeout() and similar
The documentation for schedule_timeout(), schedule_hrtimeout(), and
schedule_hrtimeout_range() all claim that the routines couldn't possibly
return early if the task state was TASK_UNINTERRUPTIBLE. This is simply
not true since wake_up_process() will cause those routines to exit early.

We cannot make schedule_[hr]timeout() loop until the timeout expires if the
task state is uninterruptible because we have users which rely on the
existing and designed behaviour.

Make the documentation match the (correct) implementation.

schedule_hrtimeout() returns -EINTR even when a uninterruptible task was
woken up. This might look strange, but making the return code depend on the
state is too much of an effort as it would affect all the call sites. There
is no value in doing so, but we spell it out clearly in the documentation.

Suggested-by: Daniel Kurtz <djkurtz@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: huangtao@rock-chips.com
Cc: heiko@sntech.de
Cc: broonie@kernel.org
Cc: briannorris@chromium.org
Cc: Andreas Mohr <andi@lisas.de>
Cc: linux-rockchip@lists.infradead.org
Cc: tony.xie@rock-chips.com
Cc: John Stultz <john.stultz@linaro.org>
Cc: linux@roeck-us.net
Cc: tskd08@gmail.com
Link: http://lkml.kernel.org/r/1477065531-30342-2-git-send-email-dianders@chromium.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-26 13:14:47 +02:00
Douglas Anderson
6c5e905969 timers: Fix usleep_range() in the context of wake_up_process()
Users of usleep_range() expect that it will _never_ return in less time
than the minimum passed parameter. However, nothing in the code ensures
this, when the sleeping task is woken by wake_up_process() or any other
mechanism which can wake a task from uninterruptible state.

Neither usleep_range() nor schedule_hrtimeout_range*() have any protection
against wakeups. schedule_hrtimeout_range*() is designed this way despite
the fact that the API documentation does not mention it.

msleep() already has code to handle this case since it will loop as long
as there was still time left.  usleep_range() has no such loop, add it.

Presumably this problem was not detected before because usleep_range() is
only used in a few places and the function is mostly used in contexts which
are not exposed to wakeups of any form.

An effort was made to look for users relying on the old behavior by
looking for usleep_range() in the same file as wake_up_process().
No problems were found by this search, though it is conceivable that
someone could have put the sleep and wakeup in two different files.

An effort was made to ask several upstream maintainers if they were aware
of people relying on wake_up_process() to wake up usleep_range(). No
maintainers were aware of that but they were aware of many people relying
on usleep_range() never returning before the minimum.

Reported-by: Tao Huang <huangtao@rock-chips.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: heiko@sntech.de
Cc: broonie@kernel.org
Cc: briannorris@chromium.org
Cc: Andreas Mohr <andi@lisas.de>
Cc: linux-rockchip@lists.infradead.org
Cc: tony.xie@rock-chips.com
Cc: John Stultz <john.stultz@linaro.org>
Cc: djkurtz@chromium.org
Cc: linux@roeck-us.net
Cc: tskd08@gmail.com
Link: http://lkml.kernel.org/r/1477065531-30342-1-git-send-email-dianders@chromium.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-26 13:14:46 +02:00
Michael Ellerman
51111dce25 kernel/smp: Tell the user we're bringing up secondary CPUs
Currently we don't print anything before starting to bring up secondary
CPUs. This can be confusing if it takes a long time to bring up the
secondaries, or if the kernel crashes while doing so and produces no
further output.

On x86 they work around this by detecting when the first secondary CPU
comes up and printing a message (see announce_cpu()). But doing it in
smp_init() is simpler and works for all arches.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: akpm@osdl.org
Cc: jgross@suse.com
Cc: ak@linux.intel.com
Cc: tim.c.chen@linux.intel.com
Cc: len.brown@intel.com
Cc: peterz@infradead.org
Cc: richard@nod.at
Cc: jolsa@redhat.com
Cc: boris.ostrovsky@oracle.com
Cc: mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1477460275-8266-3-git-send-email-mpe@ellerman.id.au
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-26 12:02:35 +02:00
Michael Ellerman
92b2327829 kernel/smp: Make the SMP boot message common on all arches
Currently after bringing up secondary CPUs all arches print "Brought up
%d CPUs". On x86 they also print the number of nodes that were brought
online.

It would be nice to also print the number of nodes on other arches.
Although we could override smp_announce() on the other ~10 NUMA aware
arches, it seems simpler to just always print the number of nodes. On
non-NUMA arches there is just always 1 node.

Having done that, smp_announce() is no longer weak, and seems small
enough to just pull directly into smp_init().

Also update the printing of "%d CPUs" to be smart when an SMP kernel is
booted on a single CPU system, or when only one CPU is available, eg:

   smp: Brought up 2 nodes, 1 CPU

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: akpm@osdl.org
Cc: jgross@suse.com
Cc: ak@linux.intel.com
Cc: tim.c.chen@linux.intel.com
Cc: len.brown@intel.com
Cc: peterz@infradead.org
Cc: richard@nod.at
Cc: jolsa@redhat.com
Cc: boris.ostrovsky@oracle.com
Cc: mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1477460275-8266-2-git-send-email-mpe@ellerman.id.au
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-26 12:02:35 +02:00
Michael Ellerman
ca7dfdbb33 kernel/smp: Define pr_fmt() for smp.c
This makes all our pr_xxx()'s start with "smp: ", which helps pin down
where they come from and generally looks nice. There is actually only
one pr_xxx() use in smp.c at the moment, but we will add some more in
the next commit.

Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: akpm@osdl.org
Cc: jgross@suse.com
Cc: ak@linux.intel.com
Cc: tim.c.chen@linux.intel.com
Cc: len.brown@intel.com
Cc: peterz@infradead.org
Cc: richard@nod.at
Cc: jolsa@redhat.com
Cc: boris.ostrovsky@oracle.com
Cc: mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1477460275-8266-1-git-send-email-mpe@ellerman.id.au
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-26 12:02:35 +02:00
Josh Poimboeuf
0ee1dd9f5e x86/dumpstack: Remove raw stack dump
For mostly historical reasons, the x86 oops dump shows the raw stack
values:

  ...
  [registers]
  Stack:
   ffff880079af7350 ffff880079905400 0000000000000000 ffffc900008f3ae0
   ffffffffa0196610 0000000000000001 00010000ffffffff 0000000087654321
   0000000000000002 0000000000000000 0000000000000000 0000000000000000
  Call Trace:
  ...

This seems to be an artifact from long ago, and probably isn't needed
anymore.  It generally just adds noise to the dump, and it can be
actively harmful because it leaks kernel addresses.

Linus says:

  "The stack dump actually goes back to forever, and it used to be
   useful back in 1992 or so. But it used to be useful mainly because
   stacks were simpler and we didn't have very good call traces anyway. I
   definitely remember having used them - I just do not remember having
   used them in the last ten+ years.

   Of course, it's still true that if you can trigger an oops, you've
   likely already lost the security game, but since the stack dump is so
   useless, let's aim to just remove it and make games like the above
   harder."

This also removes the related 'kstack=' cmdline option and the
'kstack_depth_to_print' sysctl.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e83bd50df52d8fe88e94d2566426ae40d813bf8f.1477405374.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-25 18:40:37 +02:00
Thomas Gleixner
6bad6bccf2 timers: Prevent base clock corruption when forwarding
When a timer is enqueued we try to forward the timer base clock. This
mechanism has two issues:

1) Forwarding a remote base unlocked

The forwarding function is called from get_target_base() with the current
timer base lock held. But if the new target base is a different base than
the current base (can happen with NOHZ, sigh!) then the forwarding is done
on an unlocked base. This can lead to corruption of base->clk.

Solution is simple: Invoke the forwarding after the target base is locked.

2) Possible corruption due to jiffies advancing

This is similar to the issue in get_net_timer_interrupt() which was fixed
in the previous patch. jiffies can advance between check and assignement
and therefore advancing base->clk beyond the next expiry value.

So we need to read jiffies into a local variable once and do the checks and
assignment with the local copy.

Fixes: a683f390b93f("timers: Forward the wheel clock whenever possible")
Reported-by: Ashton Holmes <scoopta@gmail.com>
Reported-by: Michael Thayer <michael.thayer@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michal Necasek <michal.necasek@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: knut.osmundsen@oracle.com
Cc: stable@vger.kernel.org
Cc: stern@rowland.harvard.edu
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161022110552.253640125@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-25 16:32:50 +02:00
Thomas Gleixner
041ad7bc75 timers: Prevent base clock rewind when forwarding clock
Ashton and Michael reported, that kernel versions 4.8 and later suffer from
USB timeouts which are caused by the timer wheel rework.

This is caused by a bug in the base clock forwarding mechanism, which leads
to timers expiring early. The scenario which leads to this is:

run_timers()
  while (jiffies >= base->clk) {
    collect_expired_timers();
    base->clk++;
    expire_timers();
  }          

So base->clk = jiffies + 1. Now the cpu goes idle:

idle()
  get_next_timer_interrupt()
    nextevt = __next_time_interrupt();
    if (time_after(nextevt, base->clk))
       	base->clk = jiffies;

jiffies has not advanced since run_timers(), so this assignment effectively
decrements base->clk by one.

base->clk is the index into the timer wheel arrays. So let's assume the
following state after the base->clk increment in run_timers():

 jiffies = 0
 base->clk = 1

A timer gets enqueued with an expiry delta of 63 ticks (which is the case
with the USB timeout and HZ=250) so the resulting bucket index is:

  base->clk + delta = 1 + 63 = 64

The timer goes into the first wheel level. The array size is 64 so it ends
up in bucket 0, which is correct as it takes 63 ticks to advance base->clk
to index into bucket 0 again.

If the cpu goes idle before jiffies advance, then the bug in the forwarding
mechanism sets base->clk back to 0, so the next invocation of run_timers()
at the next tick will index into bucket 0 and therefore expire the timer 62
ticks too early.

Instead of blindly setting base->clk to jiffies we must make the forwarding
conditional on jiffies > base->clk, but we cannot use jiffies for this as
we might run into the following issue:

  if (time_after(jiffies, base->clk) {
    if (time_after(nextevt, base->clk))
       base->clk = jiffies;

jiffies can increment between the check and the assigment far enough to
advance beyond nextevt. So we need to use a stable value for checking.

get_next_timer_interrupt() has the basej argument which is the jiffies
value snapshot taken in the calling code. So we can just that.

Thanks to Ashton for bisecting and providing trace data!

Fixes: a683f390b9 ("timers: Forward the wheel clock whenever possible")
Reported-by: Ashton Holmes <scoopta@gmail.com>
Reported-by: Michael Thayer <michael.thayer@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michal Necasek <michal.necasek@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: knut.osmundsen@oracle.com
Cc: stable@vger.kernel.org
Cc: stern@rowland.harvard.edu
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161022110552.175308322@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-25 16:32:50 +02:00
Thomas Gleixner
4da9152a43 timers: Lock base for same bucket optimization
Linus stumbled over the unlocked modification of the timer expiry value in
mod_timer() which is an optimization for timers which stay in the same
bucket - due to the bucket granularity - despite their expiry time getting
updated.

The optimization itself still makes sense even if we take the lock, because
in case that the bucket stays the same, we avoid the pointless
queue/enqueue dance.

Make the check and the modification of timer->expires protected by the base
lock and shuffle the remaining code around so we can keep the lock held
when we actually have to requeue the timer to a different bucket.

Fixes: f00c0afdfa ("timers: Implement optimization for same expiry time in mod_timer()")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-10-25 16:27:39 +02:00
Thomas Gleixner
b831275a35 timers: Plug locking race vs. timer migration
Linus noticed that lock_timer_base() lacks a READ_ONCE() for accessing the
timer flags. As a consequence the compiler is allowed to reload the flags
between the initial check for TIMER_MIGRATION and the following timer base
computation and the spin lock of the base.

While this has not been observed (yet), we need to make sure that it never
happens.

Fixes: 0eeda71bc3 ("timer: Replace timer base by a cpu index")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-10-25 16:27:39 +02:00
Waiman Long
b341afb325 locking/mutex: Enable optimistic spinning of woken waiter
This patch makes the waiter that sets the HANDOFF flag start spinning
instead of sleeping until the handoff is complete or the owner
sleeps. Otherwise, the handoff will cause the optimistic spinners to
abort spinning as the handed-off owner may not be running.

Tested-by: Jason Low <jason.low2@hpe.com>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: Imre Deak <imre.deak@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <Will.Deacon@arm.com>
Link: http://lkml.kernel.org/r/1472254509-27508-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-25 11:31:54 +02:00
Waiman Long
a40ca56577 locking/mutex: Simplify some ww_mutex code in __mutex_lock_common()
This patch removes some of the redundant ww_mutex code in
__mutex_lock_common().

Tested-by: Jason Low <jason.low2@hpe.com>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: Imre Deak <imre.deak@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <Will.Deacon@arm.com>
Link: http://lkml.kernel.org/r/1472254509-27508-1-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-25 11:31:53 +02:00
Peter Zijlstra
5bbd7e6443 locking/mutex: Restructure wait loop
Doesn't really matter yet, but pull the HANDOFF and trylock out from
under the wait_lock.

The intention is to add an optimistic spin loop here, which requires
we do not hold the wait_lock, so shuffle code around in preparation.

Also clarify the purpose of taking the wait_lock in the wait loop, its
tempting to want to avoid it altogether, but the cancellation cases
need to to avoid losing wakeups.

Suggested-by: Waiman Long <waiman.long@hpe.com>
Tested-by: Jason Low <jason.low2@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-25 11:31:53 +02:00
Peter Zijlstra
9d659ae14b locking/mutex: Add lock handoff to avoid starvation
Implement lock handoff to avoid lock starvation.

Lock starvation is possible because mutex_lock() allows lock stealing,
where a running (or optimistic spinning) task beats the woken waiter
to the acquire.

Lock stealing is an important performance optimization because waiting
for a waiter to wake up and get runtime can take a significant time,
during which everyboy would stall on the lock.

The down-side is of course that it allows for starvation.

This patch has the waiter requesting a handoff if it fails to acquire
the lock upon waking. This re-introduces some of the wait time,
because once we do a handoff we have to wait for the waiter to wake up
again.

A future patch will add a round of optimistic spinning to attempt to
alleviate this penalty, but if that turns out to not be enough, we can
add a counter and only request handoff after multiple failed wakeups.

There are a few tricky implementation details:

 - accepting a handoff must only be done in the wait-loop. Since the
   handoff condition is owner == current, it can easily cause
   recursive locking trouble.

 - accepting the handoff must be careful to provide the ACQUIRE
   semantics.

 - having the HANDOFF bit set on unlock requires care, we must not
   clear the owner.

 - we must be careful to not leave HANDOFF set after we've acquired
   the lock. The tricky scenario is setting the HANDOFF bit on an
   unlocked mutex.

Tested-by: Jason Low <jason.low2@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Waiman Long <Waiman.Long@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-25 11:31:52 +02:00
Peter Zijlstra
a3ea3d9b86 locking/mutex: Allow MUTEX_SPIN_ON_OWNER when DEBUG_MUTEXES
Now that mutex::count and mutex::owner are the same field, we can
allow SPIN_ON_OWNER while DEBUG_MUTEX.

Tested-by: Jason Low <jason.low2@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-25 11:31:51 +02:00
Peter Zijlstra
3ca0ff571b locking/mutex: Rework mutex::owner
The current mutex implementation has an atomic lock word and a
non-atomic owner field.

This disparity leads to a number of issues with the current mutex code
as it means that we can have a locked mutex without an explicit owner
(because the owner field has not been set, or already cleared).

This leads to a number of weird corner cases, esp. between the
optimistic spinning and debug code. Where the optimistic spinning
code needs the owner field updated inside the lock region, the debug
code is more relaxed because the whole lock is serialized by the
wait_lock.

Also, the spinning code itself has a few corner cases where we need to
deal with a held lock without an owner field.

Furthermore, it becomes even more of a problem when trying to fix
starvation cases in the current code. We end up stacking special case
on special case.

To solve this rework the basic mutex implementation to be a single
atomic word that contains the owner and uses the low bits for extra
state.

This matches how PI futexes and rt_mutex already work. By having the
owner an integral part of the lock state a lot of the problems
dissapear and we get a better option to deal with starvation cases,
direct owner handoff.

Changing the basic mutex does however invalidate all the arch specific
mutex code; this patch leaves that unused in-place, a later patch will
remove that.

Tested-by: Jason Low <jason.low2@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-25 11:31:50 +02:00
Peter Zijlstra
a225023828 sched/core: Explain sleep/wakeup in a better way
There were a few questions wrt. how sleep-wakeup works. Try and explain
it more.

Requested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-25 11:27:56 +02:00
Tobias Klauser
119a0798dc padata: Remove unused but set variables
Remove the unused but set variable pinst in padata_parallel_worker to
fix the following warning when building with 'W=1':

  kernel/padata.c: In function ‘padata_parallel_worker’:
  kernel/padata.c:68:26: warning: variable ‘pinst’ set but not used [-Wunused-but-set-variable]

Also remove the now unused variable pd which is only used to set pinst.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-10-25 11:08:10 +08:00
Jon Hunter
1adb469b9b PM / suspend: Fix missing KERN_CONT for suspend message
Commit 4bcc595ccd (printk: reinstate KERN_CONT for printing
continuation lines) exposed a missing KERN_CONT from one of the
messages shown on entering suspend. With v4.9-rc1, the 'done.' shown
after syncing the filesystems no longer appears as a continuation but
a new message with its own timestamp.

[    9.259566] PM: Syncing filesystems ... [    9.264119] done.

Fix this by adding the KERN_CONT log level for the 'done.' part of the
message seen after syncing filesystems. While we are at it, convert
these suspend printks to pr_info and pr_cont, respectively.

Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-10-24 14:38:02 +02:00
Daniel Borkmann
2d0e30c30f bpf: add helper for retrieving current numa node id
Use case is mainly for soreuseport to select sockets for the local
numa node, but since generic, lets also add this for other networking
and tracing program types.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22 17:05:52 -04:00
Linus Torvalds
bfb7bfef6f Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Ingo Molnar:
 "Mostly irqchip driver fixes, plus a symbol export"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kernel/irq: Export irq_set_parent()
  irqchip/gic: Add missing \n to CPU IF adjustment message
  irqchip/jcore: Don't show Kconfig menu item for driver
  irqchip/eznps: Drop pointless static qualifier in nps400_of_init()
  irqchip/gic-v3-its: Fix entry size mask for GITS_BASER
  irqchip/gic-v3-its: Fix 64bit GIC{R,ITS}_TYPER accesses
2016-10-22 09:33:51 -07:00
Sagi Grimberg
f660f60667 softirq: Display IRQ_POLL for irq-poll statistics
This library was moved to the generic area and was
renamed to irq-poll. Hence, update proc/softirqs output accordingly.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-21 15:45:47 -06:00
Sudip Mukherjee
3118dac501 kernel/irq: Export irq_set_parent()
The TPS65217 driver grew interrupt support which uses
irq_set_parent(). While it's not yet clear why this is used in the first
place, building the driver as a module fails with:

 ERROR: ".irq_set_parent" [drivers/mfd/tps65217.ko] undefined!

The correctness of the driver change is still investigated, but for now
it's less trouble to export irq_set_parent() than dealing with the build
wreckage.

[ tglx: Rewrote changelog and made the export GPL ]

Fixes: 6556bdacf6 ("mfd: tps65217: Add support for IRQs")
Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Cc: Marcin Niestroj <m.niestroj@grinn-global.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Lee Jones <lee.jones@linaro.org>
Link: http://lkml.kernel.org/r/1475775403-27207-1-git-send-email-sudipm.mukherjee@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-21 10:21:38 +02:00
Matt Fleming
3c3fcb45d5 sched/fair: Kill the unused 'sched_shares_window_ns' tunable
The last user of this tunable was removed in 2012 in commit:

  82958366cf ("sched: Replace update_shares weight distribution with per-entity computation")

Delete it since its very existence confuses people.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161019141059.26408-1-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-20 08:44:57 +02:00
Linus Torvalds
893e2c5c9f Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
 "This fixes a group scheduling related performance/interactivity
  regression introduced in v4.8, which affects certain hardware
  environments where cpu_possible_mask != cpu_present_mask"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix incorrect task group ->load_avg
2016-10-19 10:03:55 -07:00
Tejun Heo
8bc4a04455 Merge branch 'for-4.9' into for-4.10 2016-10-19 12:12:40 -04:00
Tejun Heo
2186d9f940 workqueue: move wq_numa_init() to workqueue_init()
While splitting up workqueue initialization into two parts,
ac8f73400782 ("workqueue: make workqueue available early during boot")
put wq_numa_init() into workqueue_init_early().  Unfortunately, on
some archs including power and arm64, cpu to node mapping isn't yet
established by the time the early init is called leading to incorrect
NUMA initialization and subsequently the following oops due to zero
cpumask on node-specific unbound pools.

  Unable to handle kernel paging request for data at address 0x00000038
  Faulting instruction address: 0xc0000000000fc0cc
  Oops: Kernel access of bad area, sig: 11 [#1]
  SMP NR_CPUS=2048 NUMA PowerNV
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-compiler_gcc-6.2.0-next-20161005 #94
  task: c0000007f5400000 task.stack: c000001ffc084000
  NIP: c0000000000fc0cc LR: c0000000000ed928 CTR: c0000000000fbfd0
  REGS: c000001ffc087780 TRAP: 0300   Not tainted  (4.8.0-compiler_gcc-6.2.0-next-20161005)
  MSR: 9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 48000424  XER: 00000000
  CFAR: c0000000000089dc DAR: 0000000000000038 DSISR: 40000000 SOFTE: 0
  GPR00: c0000000000ed928 c000001ffc087a00 c000000000e63200 c000000010d6d600
  GPR04: c0000007f5409200 0000000000000021 000000000748e08c 000000000000001f
  GPR08: 0000000000000000 0000000000000021 000000000748f1f8 0000000000000000
  GPR12: 0000000028000422 c00000000fb80000 c00000000000e0c8 0000000000000000
  GPR16: 0000000000000000 0000000000000000 0000000000000021 0000000000000001
  GPR20: ffffffffafb50401 0000000000000000 c000000010d6d600 000000000000ba7e
  GPR24: 000000000000ba7e c000000000d8bc58 afb504000afb5041 0000000000000001
  GPR28: 0000000000000000 0000000000000004 c0000007f5409280 0000000000000000
  NIP [c0000000000fc0cc] enqueue_task_fair+0xfc/0x18b0
  LR [c0000000000ed928] activate_task+0x78/0xe0
  Call Trace:
  [c000001ffc087a00] [c0000007f5409200] 0xc0000007f5409200 (unreliable)
  [c000001ffc087b10] [c0000000000ed928] activate_task+0x78/0xe0
  [c000001ffc087b50] [c0000000000ede58] ttwu_do_activate+0x68/0xc0
  [c000001ffc087b90] [c0000000000ef1b8] try_to_wake_up+0x208/0x4f0
  [c000001ffc087c10] [c0000000000d3484] create_worker+0x144/0x250
  [c000001ffc087cb0] [c000000000cd72d0] workqueue_init+0x124/0x150
  [c000001ffc087d00] [c000000000cc0e74] kernel_init_freeable+0x158/0x360
  [c000001ffc087dc0] [c00000000000e0e4] kernel_init+0x24/0x160
  [c000001ffc087e30] [c00000000000bfa0] ret_from_kernel_thread+0x5c/0xbc
  Instruction dump:
  62940401 3b800000 3aa00000 7f17c378 3a600001 3b600001 60000000 60000000
  60420000 72490021 ebfe0150 2f890001 <ebbf0038> 419e0de0 7fbee840 419e0e58
  ---[ end trace 0000000000000000 ]---

Fix it by moving wq_numa_init() to workqueue_init().  As this means
that the early intialization may not have full NUMA info for per-cpu
pools and ignores NUMA affinity for unbound pools, fix them up from
workqueue_init() after wq_numa_init().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Link: http://lkml.kernel.org/r/87twck5wqo.fsf@concordia.ellerman.id.au
Fixes: ac8f73400782 ("workqueue: make workqueue available early during boot")
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-10-19 12:12:26 -04:00
Linus Torvalds
8835ca59da printk: suppress empty continuation lines
We have a fairly common pattern where you print several things as
continuations on one single line in a loop, and then at the end you do

	printk(KERN_CONT "\n");

to flush the buffered output.

But if the output was flushed by something else (concurrent printk
activity, or just system logging), we don't want that final flushing to
just print an empty line.

So just suppress empty continuation lines when they couldn't be merged
into the line they are a continuation of.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-19 09:11:24 -07:00
Linus Torvalds
63ae602cea Merge branch 'gup_flag-cleanups'
Merge the gup_flags cleanups from Lorenzo Stoakes:
 "This patch series adjusts functions in the get_user_pages* family such
  that desired FOLL_* flags are passed as an argument rather than
  implied by flags.

  The purpose of this change is to make the use of FOLL_FORCE explicit
  so it is easier to grep for and clearer to callers that this flag is
  being used.  The use of FOLL_FORCE is an issue as it overrides missing
  VM_READ/VM_WRITE flags for the VMA whose pages we are reading
  from/writing to, which can result in surprising behaviour.

  The patch series came out of the discussion around commit 38e0885465
  ("mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing"),
  which addressed a BUG_ON() being triggered when a page was faulted in
  with PROT_NONE set but having been overridden by FOLL_FORCE.
  do_numa_page() was run on the assumption the page _must_ be one marked
  for NUMA node migration as an actual PROT_NONE page would have been
  dealt with prior to this code path, however FOLL_FORCE introduced a
  situation where this assumption did not hold.

  See

      https://marc.info/?l=linux-mm&m=147585445805166

  for the patch proposal"

Additionally, there's a fix for an ancient bug related to FOLL_FORCE and
FOLL_WRITE by me.

[ This branch was rebased recently to add a few more acked-by's and
  reviewed-by's ]

* gup_flag-cleanups:
  mm: replace access_process_vm() write parameter with gup_flags
  mm: replace access_remote_vm() write parameter with gup_flags
  mm: replace __access_remote_vm() write parameter with gup_flags
  mm: replace get_user_pages_remote() write/force parameters with gup_flags
  mm: replace get_user_pages() write/force parameters with gup_flags
  mm: replace get_vaddr_frames() write/force parameters with gup_flags
  mm: replace get_user_pages_locked() write/force parameters with gup_flags
  mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
  mm: remove write/force parameters from __get_user_pages_unlocked()
  mm: remove write/force parameters from __get_user_pages_locked()
  mm: remove gup_flags FOLL_WRITE games from __get_user_pages()
2016-10-19 08:39:47 -07:00
Lorenzo Stoakes
f307ab6dce mm: replace access_process_vm() write parameter with gup_flags
This removes the 'write' argument from access_process_vm() and replaces
it with 'gup_flags' as use of this function previously silently implied
FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

We make this explicit as use of FOLL_FORCE can result in surprising
behaviour (and hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-19 08:31:25 -07:00
Lorenzo Stoakes
9beae1ea89 mm: replace get_user_pages_remote() write/force parameters with gup_flags
This removes the 'write' and 'force' from get_user_pages_remote() and
replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in
callers as use of this flag can result in surprising behaviour (and
hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-19 08:12:02 -07:00
Thomas Graf
57a09bf0a4 bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers
A BPF program is required to check the return register of a
map_elem_lookup() call before accessing memory. The verifier keeps
track of this by converting the type of the result register from
PTR_TO_MAP_VALUE_OR_NULL to PTR_TO_MAP_VALUE after a conditional
jump ensures safety. This check is currently exclusively performed
for the result register 0.

In the event the compiler reorders instructions, BPF_MOV64_REG
instructions may be moved before the conditional jump which causes
them to keep their type PTR_TO_MAP_VALUE_OR_NULL to which the
verifier objects when the register is accessed:

0: (b7) r1 = 10
1: (7b) *(u64 *)(r10 -8) = r1
2: (bf) r2 = r10
3: (07) r2 += -8
4: (18) r1 = 0x59c00000
6: (85) call 1
7: (bf) r4 = r0
8: (15) if r0 == 0x0 goto pc+1
 R0=map_value(ks=8,vs=8) R4=map_value_or_null(ks=8,vs=8) R10=fp
9: (7a) *(u64 *)(r4 +0) = 0
R4 invalid mem access 'map_value_or_null'

This commit extends the verifier to keep track of all identical
PTR_TO_MAP_VALUE_OR_NULL registers after a map_elem_lookup() by
assigning them an ID and then marking them all when the conditional
jump is observed.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-19 11:09:28 -04:00
Vincent Guittot
b5a9b34078 sched/fair: Fix incorrect task group ->load_avg
A scheduler performance regression has been reported by Joseph Salisbury,
which he bisected back to:

  3d30544f02 ("sched/fair: Apply more PELT fixes)

The regression triggers when several levels of task groups are involved
(read: SystemD) and cpu_possible_mask != cpu_present_mask.

The root cause is that group entity's load (tg_child->se[i]->avg.load_avg)
is initialized to scale_load_down(se->load.weight). During the creation of
a child task group, its group entities on possible CPUs are attached to
parent's cfs_rq (tg_parent) and their loads are added to the parent's load
(tg_parent->load_avg) with update_tg_load_avg().

But only the load on online CPUs will then be updated to reflect real load,
whereas load on other CPUs will stay at the initial value.

The result is a tg_parent->load_avg that is higher than the real load, the
weight of group entities (tg_parent->se[i]->load.weight) on online CPUs is
smaller than it should be, and the task group gets a less running time than
what it could expect.

( This situation can be detected with /proc/sched_debug. The ".tg_load_avg"
  of the task group will be much higher than sum of ".tg_load_avg_contrib"
  of online cfs_rqs of the task group. )

The load of group entities don't have to be intialized to something else
than 0 because their load will increase when an entity is attached.

Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org> # 4.8.x
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: joonwoop@codeaurora.org
Fixes: 3d30544f02 ("sched/fair: Apply more PELT fixes)
Link: http://lkml.kernel.org/r/1476881123-10159-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-19 15:04:47 +02:00
Linus Torvalds
9d71bdfbcb Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixlet from Ingo Molnar:
 "Remove an unused variable"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  alarmtimer: Remove unused but set variable
2016-10-18 09:57:12 -07:00
Linus Torvalds
2c11fc87ca Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
 "Fix a crash that can trigger when racing with CPU hotplug: we didn't
  use sched-domains data structures carefully enough in select_idle_cpu()"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix sched domains NULL dereference in select_idle_sibling()
2016-10-18 09:53:59 -07:00
Linus Torvalds
351267d941 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc fixes from Ingo Molnar:
 "A CPU hotplug debuggability fix and three objtool false positive
  warnings fixes for new GCC6 code generation patterns"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Use distinct name for cpu_hotplug.dep_map
  objtool: Skip all "unreachable instruction" warnings for gcov kernels
  objtool: Improve rare switch jump table pattern detection
  objtool: Support '-mtune=atom' stack frame setup instruction
2016-10-18 08:35:07 -07:00
Tobias Klauser
54e23845e9 alarmtimer: Remove unused but set variable
Remove the set but unused variable base in alarm_clock_get to fix the
following warning when building with 'W=1':

  kernel/time/alarmtimer.c: In function ‘alarm_timer_create’:
  kernel/time/alarmtimer.c:545:21: warning: variable ‘base’ set but not used [-Wunused-but-set-variable]

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20161017094702.10873-1-tklauser@distanz.ch
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-17 11:59:37 +02:00
Joonas Lahtinen
a705e07b9c cpu/hotplug: Use distinct name for cpu_hotplug.dep_map
Use distinctive name for cpu_hotplug.dep_map to avoid the actual
cpu_hotplug.lock appearing as cpu_hotplug.lock#2 in lockdep splats.

Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Gautham R . Shenoy <ego@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: intel-gfx@lists.freedesktop.org
Cc: trivial@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-16 11:09:32 +02:00
Linus Torvalds
9ffc66941d This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot time as
 possible, hoping to capitalize on any possible variation in CPU operation
 (due to runtime data differences, hardware differences, SMP ordering,
 thermal timing variation, cache behavior, etc).
 
 At the very least, this plugin is a much more comprehensive example for
 how to manipulate kernel code using the gcc plugin internals.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJX/BAFAAoJEIly9N/cbcAmzW8QALFbCs7EFFkML+M/M/9d8zEk
 1QbUs/z8covJTTT1PjSdw7JUrAMulI3S00owpcQVd/PcWjRPU80QwfsXBgIB0tvC
 Kub2qxn6Oaf+kTB646zwjFgjdCecw/USJP+90nfcu2+LCnE8ReclKd1aUee+Bnhm
 iDEUyH2ONIoWq6ta2Z9sA7+E4y2ZgOlmW0iga3Mnf+OcPtLE70fWPoe5E4g9DpYk
 B+kiPDrD9ql5zsHaEnKG1ldjiAZ1L6Grk8rGgLEXmbOWtTOFmnUhR+raK5NA/RCw
 MXNuyPay5aYPpqDHFm+OuaWQAiPWfPNWM3Ett4k0d9ZWLixTcD1z68AciExwk7aW
 SEA8b1Jwbg05ZNYM7NJB6t6suKC4dGPxWzKFOhmBicsh2Ni5f+Az0BQL6q8/V8/4
 8UEqDLuFlPJBB50A3z5ngCVeYJKZe8Bg/Swb4zXl6mIzZ9darLzXDEV6ystfPXxJ
 e1AdBb41WC+O2SAI4l64yyeswkGo3Iw2oMbXG5jmFl6wY/xGp7dWxw7gfnhC6oOh
 afOT54p2OUDfSAbJaO0IHliWoIdmE5ZYdVYVU9Ek+uWyaIwcXhNmqRg+Uqmo32jf
 cP5J9x2kF3RdOcbSHXmFp++fU+wkhBtEcjkNpvkjpi4xyA47IWS7lrVBBebrCq9R
 pa/A7CNQwibIV6YD8+/p
 =1dUK
 -----END PGP SIGNATURE-----

Merge tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull gcc plugins update from Kees Cook:
 "This adds a new gcc plugin named "latent_entropy". It is designed to
  extract as much possible uncertainty from a running system at boot
  time as possible, hoping to capitalize on any possible variation in
  CPU operation (due to runtime data differences, hardware differences,
  SMP ordering, thermal timing variation, cache behavior, etc).

  At the very least, this plugin is a much more comprehensive example
  for how to manipulate kernel code using the gcc plugin internals"

* tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  latent_entropy: Mark functions with __latent_entropy
  gcc-plugins: Add latent_entropy plugin
2016-10-15 10:03:15 -07:00
Linus Torvalds
f34d3606f7 Merge branch 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - tracepoints for basic cgroup management operations added

 - kernfs and cgroup path formatting functions updated to behave in the
   style of strlcpy()

 - non-critical bug fixes

* 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
  cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
  cpuset: fix error handling regression in proc_cpuset_show()
  cgroup: add tracepoints for basic operations
  cgroup: make cgroup_path() and friends behave in the style of strlcpy()
  kernfs: remove kernfs_path_len()
  kernfs: make kernfs_path*() behave in the style of strlcpy()
  kernfs: add dummy implementation of kernfs_path_from_node()
2016-10-14 12:18:50 -07:00
Linus Torvalds
ef6000b4c6 Disable the __builtin_return_address() warning globally after all
This affectively reverts commit 377ccbb483 ("Makefile: Mute warning
for __builtin_return_address(>0) for tracing only") because it turns out
that it really isn't tracing only - it's all over the tree.

We already also had the warning disabled separately for mm/usercopy.c
(which this commit also removes), and it turns out that we will also
want to disable it for get_lock_parent_ip(), that is used for at least
TRACE_IRQFLAGS.  Which (when enabled) ends up being all over the tree.

Steven Rostedt had a patch that tried to limit it to just the config
options that actually triggered this, but quite frankly, the extra
complexity and abstraction just isn't worth it.  We have never actually
had a case where the warning is actually useful, so let's just disable
it globally and not worry about it.

Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-12 10:23:41 -07:00
John Siddle
48a6d64eda hung_task: allow hung_task_panic when hung_task_warnings is 0
Previously hung_task_panic would not be respected if enabled after
hung_task_warnings had already been decremented to 0.

Permit the kernel to panic if hung_task_panic is enabled after
hung_task_warnings has already been decremented to 0 and another task
hangs for hung_task_timeout_secs seconds.

Check if hung_task_panic is enabled so we don't return prematurely, and
check if hung_task_warnings is non-zero so we don't print the warning
unnecessarily.

[akpm@linux-foundation.org: fix off-by-one]
Link: http://lkml.kernel.org/r/1473450214-4049-1-git-send-email-jsiddle@redhat.com
Signed-off-by: John Siddle <jsiddle@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Petr Mladek
dbf52682cb kthread: better support freezable kthread workers
This patch allows to make kthread worker freezable via a new @flags
parameter. It will allow to avoid an init work in some kthreads.

It currently does not affect the function of kthread_worker_fn()
but it might help to do some optimization or fixes eventually.

I currently do not know about any other use for the @flags
parameter but I believe that we will want more flags
in the future.

Finally, I hope that it will not cause confusion with @flags member
in struct kthread. Well, I guess that we will want to rework the
basic kthreads implementation once all kthreads are converted into
kthread workers or workqueues. It is possible that we will merge
the two structures.

Link: http://lkml.kernel.org/r/1470754545-17632-12-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Petr Mladek
9a6b06c8d9 kthread: allow to modify delayed kthread work
There are situations when we need to modify the delay of a delayed kthread
work. For example, when the work depends on an event and the initial delay
means a timeout. Then we want to queue the work immediately when the event
happens.

This patch implements kthread_mod_delayed_work() as inspired workqueues.
It cancels the timer, removes the work from any worker list and queues it
again with the given timeout.

A very special case is when the work is being canceled at the same time.
It might happen because of the regular kthread_cancel_delayed_work_sync()
or by another kthread_mod_delayed_work(). In this case, we do nothing and
let the other operation win. This should not normally happen as the caller
is supposed to synchronize these operations a reasonable way.

Link: http://lkml.kernel.org/r/1470754545-17632-11-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Petr Mladek
37be45d49d kthread: allow to cancel kthread work
We are going to use kthread workers more widely and sometimes we will need
to make sure that the work is neither pending nor running.

This patch implements cancel_*_sync() operations as inspired by
workqueues.  Well, we are synchronized against the other operations via
the worker lock, we use del_timer_sync() and a counter to count parallel
cancel operations.  Therefore the implementation might be easier.

First, we check if a worker is assigned.  If not, the work has newer been
queued after it was initialized.

Second, we take the worker lock.  It must be the right one.  The work must
not be assigned to another worker unless it is initialized in between.

Third, we try to cancel the timer when it exists.  The timer is deleted
synchronously to make sure that the timer call back is not running.  We
need to temporary release the worker->lock to avoid a possible deadlock
with the callback.  In the meantime, we set work->canceling counter to
avoid any queuing.

Fourth, we try to remove the work from a worker list. It might be
the list of either normal or delayed works.

Fifth, if the work is running, we call kthread_flush_work().  It might
take an arbitrary time.  We need to release the worker-lock again.  In the
meantime, we again block any queuing by the canceling counter.

As already mentioned, the check for a pending kthread work is done under a
lock.  In compare with workqueues, we do not need to fight for a single
PENDING bit to block other operations.  Therefore we do not suffer from
the thundering storm problem and all parallel canceling jobs might use
kthread_flush_work().  Any queuing is blocked until the counter gets zero.

Link: http://lkml.kernel.org/r/1470754545-17632-10-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Petr Mladek
22597dc3d9 kthread: initial support for delayed kthread work
We are going to use kthread_worker more widely and delayed works
will be pretty useful.

The implementation is inspired by workqueues.  It uses a timer to queue
the work after the requested delay.  If the delay is zero, the work is
queued immediately.

In compare with workqueues, each work is associated with a single worker
(kthread).  Therefore the implementation could be much easier.  In
particular, we use the worker->lock to synchronize all the operations with
the work.  We do not need any atomic operation with a flags variable.

In fact, we do not need any state variable at all.  Instead, we add a list
of delayed works into the worker.  Then the pending work is listed either
in the list of queued or delayed works.  And the existing check of pending
works is the same even for the delayed ones.

A work must not be assigned to another worker unless reinitialized.
Therefore the timer handler might expect that dwork->work->worker is valid
and it could simply take the lock.  We just add some sanity checks to help
with debugging a potential misuse.

Link: http://lkml.kernel.org/r/1470754545-17632-9-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Petr Mladek
8197b3d43b kthread: detect when a kthread work is used by more workers
Nothing currently prevents a work from queuing for a kthread worker when
it is already running on another one.  This means that the work might run
in parallel on more than one worker.  Also some operations are not
reliable, e.g.  flush.

This problem will be even more visible after we add kthread_cancel_work()
function.  It will only have "work" as the parameter and will use
worker->lock to synchronize with others.

Well, normally this is not a problem because the API users are sane.
But bugs might happen and users also might be crazy.

This patch adds a warning when we try to insert the work for another
worker.  It does not fully prevent the misuse because it would make the
code much more complicated without a big benefit.

It adds the same warning also into kthread_flush_work() instead of the
repeated attempts to get the right lock.

A side effect is that one needs to explicitly reinitialize the work if it
must be queued into another worker.  This is needed, for example, when the
worker is stopped and started again.  It is a bit inconvenient.  But it
looks like a good compromise between the stability and complexity.

I have double checked all existing users of the kthread worker API and
they all seems to initialize the work after the worker gets started.

Just for completeness, the patch adds a check that the work is not already
in a queue.

The patch also puts all the checks into a separate function.  It will be
reused when implementing delayed works.

Link: http://lkml.kernel.org/r/1470754545-17632-8-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Petr Mladek
35033fe9cb kthread: add kthread_destroy_worker()
The current kthread worker users call flush() and stop() explicitly.
This function does the same plus it frees the kthread_worker struct
in one call.

It is supposed to be used together with kthread_create_worker*() that
allocates struct kthread_worker.

Link: http://lkml.kernel.org/r/1470754545-17632-7-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Petr Mladek
fbae2d44aa kthread: add kthread_create_worker*()
Kthread workers are currently created using the classic kthread API,
namely kthread_run().  kthread_worker_fn() is passed as the @threadfn
parameter.

This patch defines kthread_create_worker() and
kthread_create_worker_on_cpu() functions that hide implementation details.

They enforce using kthread_worker_fn() for the main thread.  But I doubt
that there are any plans to create any alternative.  In fact, I think that
we do not want any alternative main thread because it would be hard to
support consistency with the rest of the kthread worker API.

The naming and function of kthread_create_worker() is inspired by the
workqueues API like the rest of the kthread worker API.

The kthread_create_worker_on_cpu() variant is motivated by the original
kthread_create_on_cpu().  Note that we need to bind per-CPU kthread
workers already when they are created.  It makes the life easier.
kthread_bind() could not be used later for an already running worker.

This patch does _not_ convert existing kthread workers.  The kthread
worker API need more improvements first, e.g.  a function to destroy the
worker.

IMPORTANT:

kthread_create_worker_on_cpu() allows to use any format of the worker
name, in compare with kthread_create_on_cpu().  The good thing is that it
is more generic.  The bad thing is that most users will need to pass the
cpu number in two parameters, e.g.  kthread_create_worker_on_cpu(cpu,
"helper/%d", cpu).

To be honest, the main motivation was to avoid the need for an empty
va_list.  The only legal way was to create a helper function that would be
called with an empty list.  Other attempts caused compilation warnings or
even errors on different architectures.

There were also other alternatives, for example, using #define or
splitting __kthread_create_worker().  The used solution looked like the
least ugly.

Link: http://lkml.kernel.org/r/1470754545-17632-6-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Petr Mladek
255451e453 kthread: allow to call __kthread_create_on_node() with va_list args
kthread_create_on_node() implements a bunch of logic to create the
kthread.  It is already called by kthread_create_on_cpu().

We are going to extend the kthread worker API and will need to call
kthread_create_on_node() with va_list args there.

This patch does only a refactoring and does not modify the existing
behavior.

Link: http://lkml.kernel.org/r/1470754545-17632-5-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Petr Mladek
a65d40961d kthread/smpboot: do not park in kthread_create_on_cpu()
kthread_create_on_cpu() was added by the commit 2a1d446019
("kthread: Implement park/unpark facility").  It is currently used only
when enabling new CPU.  For this purpose, the newly created kthread has to
be parked.

The CPU binding is a bit tricky.  The kthread is parked when the CPU has
not been allowed yet.  And the CPU is bound when the kthread is unparked.

The function would be useful for more per-CPU kthreads, e.g.
bnx2fc_thread, fcoethread.  For this purpose, the newly created kthread
should stay in the uninterruptible state.

This patch moves the parking into smpboot.  It binds the thread already
when created.  Then the function might be used universally.  Also the
behavior is consistent with kthread_create() and kthread_create_on_node().

Link: http://lkml.kernel.org/r/1470754545-17632-4-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Petr Mladek
3989144f86 kthread: kthread worker API cleanup
A good practice is to prefix the names of functions by the name
of the subsystem.

The kthread worker API is a mix of classic kthreads and workqueues.  Each
worker has a dedicated kthread.  It runs a generic function that process
queued works.  It is implemented as part of the kthread subsystem.

This patch renames the existing kthread worker API to use
the corresponding name from the workqueues API prefixed by
kthread_:

__init_kthread_worker()		-> __kthread_init_worker()
init_kthread_worker()		-> kthread_init_worker()
init_kthread_work()		-> kthread_init_work()
insert_kthread_work()		-> kthread_insert_work()
queue_kthread_work()		-> kthread_queue_work()
flush_kthread_work()		-> kthread_flush_work()
flush_kthread_worker()		-> kthread_flush_worker()

Note that the names of DEFINE_KTHREAD_WORK*() macros stay
as they are. It is common that the "DEFINE_" prefix has
precedence over the subsystem names.

Note that INIT() macros and init() functions use different
naming scheme. There is no good solution. There are several
reasons for this solution:

  + "init" in the function names stands for the verb "initialize"
    aka "initialize worker". While "INIT" in the macro names
    stands for the noun "INITIALIZER" aka "worker initializer".

  + INIT() macros are used only in DEFINE() macros

  + init() functions are used close to the other kthread()
    functions. It looks much better if all the functions
    use the same scheme.

  + There will be also kthread_destroy_worker() that will
    be used close to kthread_cancel_work(). It is related
    to the init() function. Again it looks better if all
    functions use the same naming scheme.

  + there are several precedents for such init() function
    names, e.g. amd_iommu_init_device(), free_area_init_node(),
    jump_label_init_type(),  regmap_init_mmio_clk(),

  + It is not an argument but it was inconsistent even before.

[arnd@arndb.de: fix linux-next merge conflict]
 Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Petr Mladek
e700591ae0 kthread: rename probe_kthread_data() to kthread_probe_data()
Patch series "kthread: Kthread worker API improvements"

The intention of this patchset is to make it easier to manipulate and
maintain kthreads.  Especially, I want to replace all the custom main
cycles with a generic one.  Also I want to make the kthreads sleep in a
consistent state in a common place when there is no work.

This patch (of 11):

A good practice is to prefix the names of functions by the name of the
subsystem.

This patch fixes the name of probe_kthread_data().  The other wrong
functions names are part of the kthread worker API and will be fixed
separately.

Link: http://lkml.kernel.org/r/1470754545-17632-2-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Rob Herring
2489a1771a config: android: enable CONFIG_SECCOMP
As of Android N, SECCOMP is required. Without it, we will get
mediaextractor error:

E /system/bin/mediaextractor: libminijail: prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER): Invalid argument

Link: http://lkml.kernel.org/r/20160908185934.18098-3-robh@kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Amit Pundir <amit.pundir@linaro.org>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:32 -07:00
Rob Herring
d90ae51a3e config: android: set SELinux as default security mode
Android won't boot without SELinux enabled, so make it the default.

Link: http://lkml.kernel.org/r/20160908185934.18098-2-robh@kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:32 -07:00
Rob Herring
f023a3956f config: android: move device mapper options to recommended
CONFIG_MD is in recommended, but other dependent options like DM_CRYPT and
DM_VERITY options are in base.  The result is the options in base don't
get enabled when applying both base and recommended fragments.  Move all
the options to recommended.

Link: http://lkml.kernel.org/r/20160908185934.18098-1-robh@kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Amit Pundir <amit.pundir@linaro.org>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:32 -07:00
Borislav Petkov
a2c6a235db config/android: Remove CONFIG_IPV6_PRIVACY
Option is long gone, see commit 5d9efa7ee9 ("ipv6: Remove privacy
config option.")

Link: http://lkml.kernel.org/r/20160811170340.9859-1-bp@alien8.de
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Rob Herring <robh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:32 -07:00
Peter Zijlstra
26b5679e43 relay: Use irq_work instead of plain timer for deferred wakeup
Relay avoids calling wake_up_interruptible() for doing the wakeup of
readers/consumers, waiting for the generation of new data, from the
context of a process which produced the data.  This is apparently done to
prevent the possibility of a deadlock in case Scheduler itself is is
generating data for the relay, after acquiring rq->lock.

The following patch used a timer (to be scheduled at next jiffy), for
delegating the wakeup to another context.
	commit 7c9cb38302
	Author: Tom Zanussi <zanussi@comcast.net>
	Date:   Wed May 9 02:34:01 2007 -0700

	relay: use plain timer instead of delayed work

	relay doesn't need to use schedule_delayed_work() for waking readers
	when a simple timer will do.

Scheduling a plain timer, at next jiffies boundary, to do the wakeup
causes a significant wakeup latency for the Userspace client, which makes
relay less suitable for the high-frequency low-payload use cases where the
data gets generated at a very high rate, like multiple sub buffers getting
filled within a milli second.  Moreover the timer is re-scheduled on every
newly produced sub buffer so the timer keeps getting pushed out if sub
buffers are filled in a very quick succession (less than a jiffy gap
between filling of 2 sub buffers).  As a result relay runs out of sub
buffers to store the new data.

By using irq_work it is ensured that wakeup of userspace client, blocked
in the poll call, is done at earliest (through self IPI or next timer
tick) enabling it to always consume the data in time.  Also this makes
relay consistent with printk & ring buffers (trace), as they too use
irq_work for deferred wake up of readers.

[arnd@arndb.de: select CONFIG_IRQ_WORK]
 Link: http://lkml.kernel.org/r/20160912154035.3222156-1-arnd@arndb.de
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1472906487-1559-1-git-send-email-akash.goel@intel.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Akash Goel <akash.goel@intel.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:32 -07:00
Hidehiro Kawai
0ee59413c9 x86/panic: replace smp_send_stop() with kdump friendly version in panic path
Daniel Walker reported problems which happens when
crash_kexec_post_notifiers kernel option is enabled
(https://lkml.org/lkml/2015/6/24/44).

In that case, smp_send_stop() is called before entering kdump routines
which assume other CPUs are still online.  As the result, for x86, kdump
routines fail to save other CPUs' registers and disable virtualization
extensions.

To fix this problem, call a new kdump friendly function,
crash_smp_send_stop(), instead of the smp_send_stop() when
crash_kexec_post_notifiers is enabled.  crash_smp_send_stop() is a weak
function, and it just call smp_send_stop().  Architecture codes should
override it so that kdump can work appropriately.  This patch only
provides x86-specific version.

For Xen's PV kernel, just keep the current behavior.

NOTES:

- Right solution would be to place crash_smp_send_stop() before
  __crash_kexec() invocation in all cases and remove smp_send_stop(), but
  we can't do that until all architectures implement own
  crash_smp_send_stop()

- crash_smp_send_stop()-like work is still needed by
  machine_crash_shutdown() because crash_kexec() can be called without
  entering panic()

Fixes: f06e5153f4 (kernel/panic.c: add "crash_kexec_post_notifiers" option)
Link: http://lkml.kernel.org/r/20160810080948.11028.15344.stgit@sysi4-13.yrl.intra.hitachi.co.jp
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Reported-by: Daniel Walker <dwalker@fifo99.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Daniel Walker <dwalker@fifo99.com>
Cc: Xunlei Pang <xpang@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Daney <david.daney@cavium.com>
Cc: Aaro Koskinen <aaro.koskinen@iki.fi>
Cc: "Steven J. Hill" <steven.hill@cavium.com>
Cc: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:32 -07:00
Ales Novak
0a5bf409d3 ptrace: clear TIF_SYSCALL_TRACE on ptrace detach
On __ptrace_detach(), called from do_exit()->exit_notify()->
forget_original_parent()->exit_ptrace(), the TIF_SYSCALL_TRACE in
thread->flags of the tracee is not cleared up.  This results in the
tracehook_report_syscall_* being called (though there's no longer a tracer
listening to that) upon its further syscalls.

Example scenario - attach "strace" to a running process and kill it (the
strace) with SIGKILL.  You'll see that the syscall trace hooks are still
being called.

The clearing of this flag should be moved from ptrace_detach() to
__ptrace_detach().

Link: http://lkml.kernel.org/r/1472759493-20554-1-git-send-email-alnovak@suse.cz
Signed-off-by: Ales Novak <alnovak@suse.cz>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:32 -07:00
Christoph Hellwig
bfd45be0b8 kprobes: include <asm/sections.h> instead of <asm-generic/sections.h>
asm-generic headers are generic implementations for architecture specific
code and should not be included by common code.  Thus use the asm/ version
of sections.h to get at the linker sections.

Link: http://lkml.kernel.org/r/1473602302-6208-1-git-send-email-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:31 -07:00
Wanpeng Li
9cfb38a7ba sched/fair: Fix sched domains NULL dereference in select_idle_sibling()
Commit:

  10e2f1acd0 ("sched/core: Rewrite and improve select_idle_siblings()")

... improved select_idle_sibling(), but also triggered a regression (crash)
during CPU-hotplug:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
  IP: [<ffffffffb10cd332>] select_idle_sibling+0x1c2/0x4f0
  Call Trace:
   <IRQ>
    select_task_rq_fair+0x749/0x930
    ? select_task_rq_fair+0xb4/0x930
    ? __lock_is_held+0x54/0x70
    try_to_wake_up+0x19a/0x5b0
    default_wake_function+0x12/0x20
    autoremove_wake_function+0x12/0x40
    __wake_up_common+0x55/0x90
    __wake_up+0x39/0x50
    wake_up_klogd_work_func+0x40/0x60
    irq_work_run_list+0x57/0x80
    irq_work_run+0x2c/0x30
    smp_irq_work_interrupt+0x2e/0x40
    irq_work_interrupt+0x96/0xa0
   <EOI>
    ? _raw_spin_unlock_irqrestore+0x45/0x80
    try_to_wake_up+0x4a/0x5b0
    wake_up_state+0x10/0x20
    __kthread_unpark+0x67/0x70
    kthread_unpark+0x22/0x30
    cpuhp_online_idle+0x3e/0x70
    cpu_startup_entry+0x6a/0x450
    start_secondary+0x154/0x180

This can be reproduced by running the ftrace test case of kselftest, the
test case will hot-unplug the CPU and the CPU will attach to the NULL
sched-domain during scheduler teardown.

The step 2 for the rewrite select_idle_siblings():

  | Step 2) tracks the average cost of the scan and compares this to the
  | average idle time guestimate for the CPU doing the wakeup.

If the CPU which doing the wakeup is the going hot-unplug CPU, then NULL
sched domain will be dereferenced to acquire the average cost of the scan.

This patch fix it by failing the search of an idle CPU in the LLC process
if this sched domain is NULL.

Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1475971443-3187-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-11 10:40:06 +02:00
Linus Torvalds
101105b171 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 ">rename2() work from Miklos + current_time() from Deepa"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Replace current_fs_time() with current_time()
  fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
  fs: Replace CURRENT_TIME with current_time() for inode timestamps
  fs: proc: Delete inode time initializations in proc_alloc_inode()
  vfs: Add current_time() api
  vfs: add note about i_op->rename changes to porting
  fs: rename "rename2" i_op to "rename"
  vfs: remove unused i_op->rename
  fs: make remaining filesystems use .rename2
  libfs: support RENAME_NOREPLACE in simple_rename()
  fs: support RENAME_NOREPLACE for local filesystems
  ncpfs: fix unused variable warning
2016-10-10 20:16:43 -07:00
Al Viro
3873691e5a Merge remote-tracking branch 'ovl/rename2' into for-linus 2016-10-10 23:02:51 -04:00
Emese Revfy
0766f788eb latent_entropy: Mark functions with __latent_entropy
The __latent_entropy gcc attribute can be used only on functions and
variables.  If it is on a function then the plugin will instrument it for
gathering control-flow entropy. If the attribute is on a variable then
the plugin will initialize it with random contents.  The variable must
be an integer, an integer array type or a structure with integer fields.

These specific functions have been selected because they are init
functions (to help gather boot-time entropy), are called at unpredictable
times, or they have variable loops, each of which provide some level of
latent entropy.

Signed-off-by: Emese Revfy <re.emese@gmail.com>
[kees: expanded commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-10-10 14:51:45 -07:00
Emese Revfy
38addce8b6 gcc-plugins: Add latent_entropy plugin
This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot time as
possible, hoping to capitalize on any possible variation in CPU operation
(due to runtime data differences, hardware differences, SMP ordering,
thermal timing variation, cache behavior, etc).

At the very least, this plugin is a much more comprehensive example for
how to manipulate kernel code using the gcc plugin internals.

The need for very-early boot entropy tends to be very architecture or
system design specific, so this plugin is more suited for those sorts
of special cases. The existing kernel RNG already attempts to extract
entropy from reliable runtime variation, but this plugin takes the idea to
a logical extreme by permuting a global variable based on any variation
in code execution (e.g. a different value (and permutation function)
is used to permute the global based on loop count, case statement,
if/then/else branching, etc).

To do this, the plugin starts by inserting a local variable in every
marked function. The plugin then adds logic so that the value of this
variable is modified by randomly chosen operations (add, xor and rol) and
random values (gcc generates separate static values for each location at
compile time and also injects the stack pointer at runtime). The resulting
value depends on the control flow path (e.g., loops and branches taken).

Before the function returns, the plugin mixes this local variable into
the latent_entropy global variable. The value of this global variable
is added to the kernel entropy pool in do_one_initcall() and _do_fork(),
though it does not credit any bytes of entropy to the pool; the contents
of the global are just used to mix the pool.

Additionally, the plugin can pre-initialize arrays with build-time
random contents, so that two different kernel builds running on identical
hardware will not have the same starting values.

Signed-off-by: Emese Revfy <re.emese@gmail.com>
[kees: expanded commit message and code comments]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-10-10 14:51:44 -07:00
Linus Torvalds
abb5a14fa2 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted misc bits and pieces.

  There are several single-topic branches left after this (rename2
  series from Miklos, current_time series from Deepa Dinamani, xattr
  series from Andreas, uaccess stuff from from me) and I'd prefer to
  send those separately"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
  proc: switch auxv to use of __mem_open()
  hpfs: support FIEMAP
  cifs: get rid of unused arguments of CIFSSMBWrite()
  posix_acl: uapi header split
  posix_acl: xattr representation cleanups
  fs/aio.c: eliminate redundant loads in put_aio_ring_file
  fs/internal.h: add const to ns_dentry_operations declaration
  compat: remove compat_printk()
  fs/buffer.c: make __getblk_slow() static
  proc: unsigned file descriptors
  fs/file: more unsigned file descriptors
  fs: compat: remove redundant check of nr_segs
  cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
  cifs: don't use memcpy() to copy struct iov_iter
  get rid of separate multipage fault-in primitives
  fs: Avoid premature clearing of capabilities
  fs: Give dentry to inode_change_ok() instead of inode
  fuse: Propagate dentry down to inode_change_ok()
  ceph: Propagate dentry down to inode_change_ok()
  xfs: Propagate dentry down to inode_change_ok()
  ...
2016-10-10 13:04:49 -07:00
Linus Torvalds
93c26d7dc0 Merge branch 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull protection keys syscall interface from Thomas Gleixner:
 "This is the final step of Protection Keys support which adds the
  syscalls so user space can actually allocate keys and protect memory
  areas with them. Details and usage examples can be found in the
  documentation.

  The mm side of this has been acked by Mel"

* 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/pkeys: Update documentation
  x86/mm/pkeys: Do not skip PKRU register if debug registers are not used
  x86/pkeys: Fix pkeys build breakage for some non-x86 arches
  x86/pkeys: Add self-tests
  x86/pkeys: Allow configuration of init_pkru
  x86/pkeys: Default to a restrictive init PKRU
  pkeys: Add details of system call use to Documentation/
  generic syscalls: Wire up memory protection keys syscalls
  x86: Wire up protection keys system calls
  x86/pkeys: Allocation/free syscalls
  x86/pkeys: Make mprotect_key() mask off additional vm_flags
  mm: Implement new pkey_mprotect() system call
  x86/pkeys: Add fault handling for PF_PK page fault bit
2016-10-10 11:01:51 -07:00
Linus Torvalds
84ed2da02f Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "A revert of a commit which pointelessly widened a preempt disabled
  section which in turn caused might_sleep() to trigger.

  The patch intended to prevent usage of smp_processor_id() in
  preemptible context, but the usage in that case is fine because the
  thread is pinned on a single cpu and therefore cannot be migrated off"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "sched/core: Do not use smp_processor_id() with preempt enabled in smpboot_thread_fn()"
2016-10-10 10:29:59 -07:00
Linus Torvalds
604a830d4f Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
 "A single fix for a regression introduced in 4.8 which causes the
  trace/perf clock to return random nonsense if CONFIG_DEBUG_TIMEKEEPING
  is set"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Fix __ktime_get_fast_ns() regression
2016-10-10 10:23:18 -07:00
Linus Torvalds
563873318d Merge branch 'printk-cleanups'
Merge my system logging cleanups, triggered by the broken '\n' patches.

The line continuation handling has been broken basically forever, and
the code to handle the system log records was both confusing and
dubious.  And it would do entirely the wrong thing unless you always had
a terminating newline, partly because it couldn't actually see whether a
message was marked KERN_CONT or not (but partly because the LOG_CONT
handling in the recording code was rather confusing too).

This re-introduces a real semantically meaningful KERN_CONT, and fixes
the few places I noticed where it was missing.  There are probably more
missing cases, since KERN_CONT hasn't actually had any semantic meaning
for at least four years (other than the checkpatch meaning of "no log
level necessary, this is a continuation line").

This also allows the combination of KERN_CONT and a log level.  In that
case the log level will be ignored if the merging with a previous line
is successful, but if a new record is needed, that new record will now
get the right log level.

That also means that you can at least in theory combine KERN_CONT with
the "pr_info()" style helpers, although any use of pr_fmt() prefixing
would make that just result in a mess, of course (the prefix would end
up in the middle of a continuing line).

* printk-cleanups:
  printk: make reading the kernel log flush pending lines
  printk: re-organize log_output() to be more legible
  printk: split out core logging code into helper function
  printk: reinstate KERN_CONT for printing continuation lines
2016-10-10 09:29:50 -07:00
Linus Torvalds
bfd8d3f23b printk: make reading the kernel log flush pending lines
That will mean that any possible subsequent continuation will now be
broken up onto a line of its own (since reading the log has finalized
the beginning og the line), but if user space has activated system
logging (or if there's a kernel message dump going on) that is the right
thing to do.

And now that we actually get the continuation flags _right_ for this
all, the user space logger that is reading the kernel messages can
actually see the continuation marker.  Not that anybody seems to really
bother with it (or care), but in theory user space can do its own
message stitching.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-09 12:23:40 -07:00
Linus Torvalds
5e467652ff printk: re-organize log_output() to be more legible
Avoid some duplicate logic now that we can return early, and update the
comments for the new LOG_CONT world order.

This also stops the continuation flushing from just using random record
flags for the flushing action, instead taking the flags from the proper
original line and updating them as we add continuations to it.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-09 12:23:40 -07:00
Linus Torvalds
c362c7ff84 printk: split out core logging code into helper function
The code that actually decides how to log the message (whether to put it
directly into the record log, whether to append it to an existing
buffered log, or whether to start a new buffered log) is fairly
non-obvious code in the middle of the vprintk_emit() function.

Splitting that code up into a helper function makes it easier to
understand, but perhaps more importantly also allows for the code to
just return early out of the helper function once it has made the
decision about where the new log content goes.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-09 12:23:39 -07:00
Linus Torvalds
4bcc595ccd printk: reinstate KERN_CONT for printing continuation lines
Long long ago the kernel log buffer was a buffered stream of bytes, very
much like stdio in user space.  It supported log levels by scanning the
stream and noticing the log level markers at the beginning of each line,
but if you wanted to print a partial line in multiple chunks, you just
did multiple printk() calls, and it just automatically worked.

Except when it didn't, and you had very confusing output when different
lines got all mixed up with each other.  Then you got fragment lines
mixing with each other, or with non-fragment lines, because it was
traditionally impossible to tell whether a printk() call was a
continuation or not.

To at least help clarify the issue of continuation lines, we added a
KERN_CONT marker back in 2007 to mark continuation lines:

  4749252776 ("printk: add KERN_CONT annotation").

That continuation marker was initially an empty string, and didn't
actuall make any semantic difference.  But it at least made it possible
to annotate the source code, and have check-patch notice that a printk()
didn't need or want a log level marker, because it was a continuation of
a previous line.

To avoid the ambiguity between a continuation line that had that
KERN_CONT marker, and a printk with no level information at all, we then
in 2009 made KERN_CONT be a real log level marker which meant that we
could now reliably tell the difference between the two cases.

  5fd29d6ccb ("printk: clean up handling of log-levels and newlines")

and we could take advantage of that to make sure we didn't mix up
continuation lines with lines that just didn't have any loglevel at all.

Then, in 2012, the kernel log buffer was changed to be a "record" based
log, where each line was a record that has a loglevel and a timestamp.

You can see the beginning of that conversion in commits

  e11fea92e1 ("kmsg: export printk records to the /dev/kmsg interface")
  7ff9554bb5 ("printk: convert byte-buffer to variable-length record buffer")

with a number of follow-up commits to fix some painful fallout from that
conversion.  Over all, it took a couple of months to sort out most of
it.  But the upside was that you could have concurrent readers (and
writers) of the kernel log and not have lines with mixed output in them.

And one particular pain-point for the record-based kernel logging was
exactly the fragmentary lines that are generated in smaller chunks.  In
order to still log them as one recrod, the continuation lines need to be
attached to the previous record properly.

However the explicit continuation record marker that is actually useful
for this exact case was actually removed in aroundm the same time by commit

  61e99ab8e3 ("printk: remove the now unnecessary "C" annotation for KERN_CONT")

due to the incorrect belief that KERN_CONT wasn't meaningful.  The
ambiguity between "is this a continuation line" or "is this a plain
printk with no log level information" was reintroduced, and in fact
became an even bigger pain point because there was now the whole
record-level merging of kernel messages going on.

This patch reinstates the KERN_CONT as a real non-empty string marker,
so that the ambiguity is fixed once again.

But it's not a plain revert of that original removal: in the four years
since we made KERN_CONT an empty string again, not only has the format
of the log level markers changed, we've also had some usage changes in
this area.

For example, some ACPI code seems to use KERN_CONT _together_ with a log
level, and now uses both the KERN_CONT marker and (for example) a
KERN_INFO marker to show that it's an informational continuation of a
line.

Which is actually not a bad idea - if the continuation line cannot be
attached to its predecessor, without the log level information we don't
know what log level to assign to it (and we traditionally just assigned
it the default loglevel).  So having both a log level and the KERN_CONT
marker is not necessarily a bad idea, but it does mean that we need to
actually iterate over potentially multiple markers, rather than just a
single one.

Also, since KERN_CONT was still conceptually needed, and encouraged, but
didn't actually _do_ anything, we've also had the reverse problem:
rather than having too many annotations it has too few, and there is bit
rot with code that no longer marks the continuation lines with the
KERN_CONT marker.

So this patch not only re-instates the non-empty KERN_CONT marker, it
also fixes up the cases of bit-rot I noticed in my own logs.

There are probably other cases where KERN_CONT will be needed to be
added, either because it is new code that never dealt with the need for
KERN_CONT, or old code that has bitrotted without anybody noticing.

That said, we should strive to avoid the need for KERN_CONT.  It does
result in real problems for logging, and should generally not be seen as
a good feature.  If we some day can get rid of the feature entirely,
because nobody does any fragmented printk calls, that would be lovely.

But until that point, let's at mark the code that relies on the hacky
multi-fragment kernel printk's.  Not only does it avoid the ambiguity,
it also annotates code as "maybe this would be good to fix some day".

(That said, particularly during single-threaded bootup, the downsides of
KERN_CONT are very limited.  Things get much hairier when you have
multiple threads going on and user level reading and writing logs too).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-09 12:23:38 -07:00
Linus Torvalds
b66484cd74 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - fsnotify updates

 - ocfs2 updates

 - all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits)
  console: don't prefer first registered if DT specifies stdout-path
  cred: simpler, 1D supplementary groups
  CREDITS: update Pavel's information, add GPG key, remove snail mail address
  mailmap: add Johan Hovold
  .gitattributes: set git diff driver for C source code files
  uprobes: remove function declarations from arch/{mips,s390}
  spelling.txt: "modeled" is spelt correctly
  nmi_backtrace: generate one-line reports for idle cpus
  arch/tile: adopt the new nmi_backtrace framework
  nmi_backtrace: do a local dump_stack() instead of a self-NMI
  nmi_backtrace: add more trigger_*_cpu_backtrace() methods
  min/max: remove sparse warnings when they're nested
  Documentation/filesystems/proc.txt: add more description for maps/smaps
  mm, proc: fix region lost in /proc/self/smaps
  proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
  proc: add LSM hook checks to /proc/<tid>/timerslack_ns
  proc: relax /proc/<tid>/timerslack_ns capability requirements
  meminfo: break apart a very long seq_printf with #ifdefs
  seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
  proc: faster /proc/*/status
  ...
2016-10-07 21:38:00 -07:00
Paul Burton
05fd007e46 console: don't prefer first registered if DT specifies stdout-path
If a device tree specifies a preferred device for kernel console output
via the stdout-path or linux,stdout-path chosen node properties or the
stdout alias then the kernel ought to honor it & output the kernel
console to that device.  As it stands, this isn't the case.  Whilst we
parse the stdout-path properties & set an of_stdout variable from
of_alias_scan(), and use that from of_console_check() to determine
whether to add a console device as a preferred console whilst
registering it, we also prefer the first registered console if no other
has been selected at the time of its registration.

This means that if a console other than the one the device tree selects
via stdout-path is registered first, we will switch to using it & when
the stdout-path console is later registered the call to
add_preferred_console() via of_console_check() is too late to do
anything useful.  In practice this seems to mean that we switch to the
dummy console device fairly early & see no further console output:

    Console: colour dummy device 80x25
    console [tty0] enabled
    bootconsole [ns16550a0] disabled

Fix this by not automatically preferring the first registered console if
one is specified by the device tree.  This allows consoles to be
registered but not enabled, and once the driver for the console selected
by stdout-path calls of_console_check() the driver will be added to the
list of preferred consoles before any other console has been enabled.
When that console is then registered via register_console() it will be
enabled as expected.

Link: http://lkml.kernel.org/r/20160809151937.26118-1-paul.burton@imgtec.com
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Ivan Delalande <colona@arista.com>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jan Kara <jack@suse.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Joe Perches <joe@perches.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
Alexey Dobriyan
81243eacfa cred: simpler, 1D supplementary groups
Current supplementary groups code can massively overallocate memory and
is implemented in a way so that access to individual gid is done via 2D
array.

If number of gids is <= 32, memory allocation is more or less tolerable
(140/148 bytes).  But if it is not, code allocates full page (!)
regardless and, what's even more fun, doesn't reuse small 32-entry
array.

2D array means dependent shifts, loads and LEAs without possibility to
optimize them (gid is never known at compile time).

All of the above is unnecessary.  Switch to the usual
trailing-zero-len-array scheme.  Memory is allocated with
kmalloc/vmalloc() and only as much as needed.  Accesses become simpler
(LEA 8(gi,idx,4) or even without displacement).

Maximum number of gids is 65536 which translates to 256KB+8 bytes.  I
think kernel can handle such allocation.

On my usual desktop system with whole 9 (nine) aux groups, struct
group_info shrinks from 148 bytes to 44 bytes, yay!

Nice side effects:

 - "gi->gid[i]" is shorter than "GROUP_AT(gi, i)", less typing,

 - fix little mess in net/ipv4/ping.c
   should have been using GROUP_AT macro but this point becomes moot,

 - aux group allocation is persistent and should be accounted as such.

Link: http://lkml.kernel.org/r/20160817201927.GA2096@p183.telecom.by
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
Chris Metcalf
6727ad9e20 nmi_backtrace: generate one-line reports for idle cpus
When doing an nmi backtrace of many cores, most of which are idle, the
output is a little overwhelming and very uninformative.  Suppress
messages for cpus that are idling when they are interrupted and just
emit one line, "NMI backtrace for N skipped: idling at pc 0xNNN".

We do this by grouping all the cpuidle code together into a new
.cpuidle.text section, and then checking the address of the interrupted
PC to see if it lies within that section.

This commit suitably tags x86 and tile idle routines, and only adds in
the minimal framework for other architectures.

Link: http://lkml.kernel.org/r/1472487169-14923-5-git-send-email-cmetcalf@mellanox.com
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org> [arm]
Tested-by: Petr Mladek <pmladek@suse.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
Aaron Lu
6fcb52a56f thp: reduce usage of huge zero page's atomic counter
The global zero page is used to satisfy an anonymous read fault.  If
THP(Transparent HugePage) is enabled then the global huge zero page is
used.  The global huge zero page uses an atomic counter for reference
counting and is allocated/freed dynamically according to its counter
value.

CPU time spent on that counter will greatly increase if there are a lot
of processes doing anonymous read faults.  This patch proposes a way to
reduce the access to the global counter so that the CPU load can be
reduced accordingly.

To do this, a new flag of the mm_struct is introduced:
MMF_USED_HUGE_ZERO_PAGE.  With this flag, the process only need to touch
the global counter in two cases:

 1 The first time it uses the global huge zero page;
 2 The time when mm_user of its mm_struct reaches zero.

Note that right now, the huge zero page is eligible to be freed as soon
as its last use goes away.  With this patch, the page will not be
eligible to be freed until the exit of the last process from which it
was ever used.

And with the use of mm_user, the kthread is not eligible to use huge
zero page either.  Since no kthread is using huge zero page today, there
is no difference after applying this patch.  But if that is not desired,
I can change it to when mm_count reaches zero.

Case used for test on Haswell EP:

  usemem -n 72 --readonly -j 0x200000 100G

Which spawns 72 processes and each will mmap 100G anonymous space and
then do read only access to that space sequentially with a step of 2MB.

  CPU cycles from perf report for base commit:
      54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
  CPU cycles from perf report for this commit:
       0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page

Performance(throughput) of the workload for base commit: 1784430792
Performance(throughput) of the workload for this commit: 4726928591
164% increase.

Runtime of the workload for base commit: 707592 us
Runtime of the workload for this commit: 303970 us
50% drop.

Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Tetsuo Handa
38531201c1 mm, oom: enforce exit_oom_victim on current task
There are no users of exit_oom_victim on !current task anymore so enforce
the API to always work on the current.

Link: http://lkml.kernel.org/r/1472119394-11342-8-git-send-email-mhocko@kernel.org
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Michal Hocko
7d2e7a22cf oom, suspend: fix oom_killer_disable vs. pm suspend properly
Commit 7407054209 ("oom, suspend: fix oom_reaper vs.
oom_killer_disable race") has workaround an existing race between
oom_killer_disable and oom_reaper by adding another round of
try_to_freeze_tasks after the oom killer was disabled.  This was the
easiest thing to do for a late 4.7 fix.  Let's fix it properly now.

After "oom: keep mm of the killed task available" we no longer have to
call exit_oom_victim from the oom reaper because we have stable mm
available and hide the oom_reaped mm by MMF_OOM_SKIP flag.  So let's
remove exit_oom_victim and the race described in the above commit
doesn't exist anymore if.

Unfortunately this alone is not sufficient for the oom_killer_disable
usecase because now we do not have any reliable way to reach
exit_oom_victim (the victim might get stuck on a way to exit for an
unbounded amount of time).  OOM killer can cope with that by checking mm
flags and move on to another victim but we cannot do the same for
oom_killer_disable as we would lose the guarantee of no further
interference of the victim with the rest of the system.  What we can do
instead is to cap the maximum time the oom_killer_disable waits for
victims.  The only current user of this function (pm suspend) already
has a concept of timeout for back off so we can reuse the same value
there.

Let's drop set_freezable for the oom_reaper kthread because it is no
longer needed as the reaper doesn't wake or thaw any processes.

Link: http://lkml.kernel.org/r/1472119394-11342-7-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Michal Hocko
862e3073b3 mm, oom: get rid of signal_struct::oom_victims
After "oom: keep mm of the killed task available" we can safely detect
an oom victim by checking task->signal->oom_mm so we do not need the
signal_struct counter anymore so let's get rid of it.

This alone wouldn't be sufficient for nommu archs because
exit_oom_victim doesn't hide the process from the oom killer anymore.
We can, however, mark the mm with a MMF flag in __mmput.  We can reuse
MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.

Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Michal Hocko
7283094ec3 kernel, oom: fix potential pgd_lock deadlock from __mmdrop
Lockdep complains that __mmdrop is not safe from the softirq context:

  =================================
  [ INFO: inconsistent lock state ]
  4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G        W
  ---------------------------------
  inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
  swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
   (pgd_lock){+.?...}, at: pgd_free+0x19/0x6b
  {SOFTIRQ-ON-W} state was registered at:
     __lock_acquire+0xa06/0x196e
     lock_acquire+0x139/0x1e1
     _raw_spin_lock+0x32/0x41
     __change_page_attr_set_clr+0x2a5/0xacd
     change_page_attr_set_clr+0x16f/0x32c
     set_memory_nx+0x37/0x3a
     free_init_pages+0x9e/0xc7
     alternative_instructions+0xa2/0xb3
     check_bugs+0xe/0x2d
     start_kernel+0x3ce/0x3ea
     x86_64_start_reservations+0x2a/0x2c
     x86_64_start_kernel+0x17a/0x18d
  irq event stamp: 105916
  hardirqs last  enabled at (105916): free_hot_cold_page+0x37e/0x390
  hardirqs last disabled at (105915): free_hot_cold_page+0x2c1/0x390
  softirqs last  enabled at (105878): _local_bh_enable+0x42/0x44
  softirqs last disabled at (105879): irq_exit+0x6f/0xd1

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(pgd_lock);
    <Interrupt>
      lock(pgd_lock);

   *** DEADLOCK ***

  1 lock held by swapper/1/0:
   #0:  (rcu_callback){......}, at: rcu_process_callbacks+0x390/0x800

  stack backtrace:
  CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
  Call Trace:
   <IRQ>
    print_usage_bug.part.25+0x259/0x268
    mark_lock+0x381/0x567
    __lock_acquire+0x993/0x196e
    lock_acquire+0x139/0x1e1
    _raw_spin_lock+0x32/0x41
    pgd_free+0x19/0x6b
    __mmdrop+0x25/0xb9
    __put_task_struct+0x103/0x11e
    delayed_put_task_struct+0x157/0x15e
    rcu_process_callbacks+0x660/0x800
    __do_softirq+0x1ec/0x4d5
    irq_exit+0x6f/0xd1
    smp_apic_timer_interrupt+0x42/0x4d
    apic_timer_interrupt+0x8e/0xa0
   <EOI>
    arch_cpu_idle+0xf/0x11
    default_idle_call+0x32/0x34
    cpu_startup_entry+0x20c/0x399
    start_secondary+0xfe/0x101

More over commit a79e53d856 ("x86/mm: Fix pgd_lock deadlock") was
explicit about pgd_lock not to be called from the irq context.  This
means that __mmdrop called from free_signal_struct has to be postponed
to a user context.  We already have a similar mechanism for mmput_async
so we can use it here as well.  This is safe because mm_count is pinned
by mm_users.

This fixes bug introduced by "oom: keep mm of the killed task available"

Link: http://lkml.kernel.org/r/1472119394-11342-5-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Michal Hocko
26db62f179 oom: keep mm of the killed task available
oom_reap_task has to call exit_oom_victim in order to make sure that the
oom vicim will not block the oom killer for ever.  This is, however,
opening new problems (e.g oom_killer_disable exclusion - see commit
7407054209 ("oom, suspend: fix oom_reaper vs.  oom_killer_disable
race")).  exit_oom_victim should be only called from the victim's
context ideally.

One way to achieve this would be to rely on per mm_struct flags.  We
already have MMF_OOM_REAPED to hide a task from the oom killer since
"mm, oom: hide mm which is shared with kthread or global init". The
problem is that the exit path:

  do_exit
    exit_mm
      tsk->mm = NULL;
      mmput
        __mmput
      exit_oom_victim

doesn't guarantee that exit_oom_victim will get called in a bounded
amount of time.  At least exit_aio depends on IO which might get blocked
due to lack of memory and who knows what else is lurking there.

This patch takes a different approach.  We remember tsk->mm into the
signal_struct and bind it to the signal struct life time for all oom
victims.  __oom_reap_task_mm as well as oom_scan_process_thread do not
have to rely on find_lock_task_mm anymore and they will have a reliable
reference to the mm struct.  As a result all the oom specific
communication inside the OOM killer can be done via tsk->signal->oom_mm.

Increasing the signal_struct for something as unlikely as the oom killer
is far from ideal but this approach will make the code much more
reasonable and long term we even might want to move task->mm into the
signal_struct anyway.  In the next step we might want to make the oom
killer exclusion and access to memory reserves completely independent
which would be also nice.

Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Linus Torvalds
d1f5323370 Merge branch 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS splice updates from Al Viro:
 "There's a bunch of branches this cycle, both mine and from other folks
  and I'd rather send pull requests separately.

  This one is the conversion of ->splice_read() to ITER_PIPE iov_iter
  (and introduction of such). Gets rid of a lot of code in fs/splice.c
  and elsewhere; there will be followups, but these are for the next
  cycle...  Some pipe/splice-related cleanups from Miklos in the same
  branch as well"

* 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  pipe: fix comment in pipe_buf_operations
  pipe: add pipe_buf_steal() helper
  pipe: add pipe_buf_confirm() helper
  pipe: add pipe_buf_release() helper
  pipe: add pipe_buf_get() helper
  relay: simplify relay_file_read()
  switch default_file_splice_read() to use of pipe-backed iov_iter
  switch generic_file_splice_read() to use of ->read_iter()
  new iov_iter flavour: pipe-backed
  fuse_dev_splice_read(): switch to add_to_pipe()
  skb_splice_bits(): get rid of callback
  new helper: add_to_pipe()
  splice: lift pipe_lock out of splice_to_pipe()
  splice: switch get_iovec_page_array() to iov_iter
  splice_to_pipe(): don't open-code wakeup_pipe_readers()
  consistent treatment of EFAULT on O_DIRECT read/write
2016-10-07 15:36:58 -07:00
Linus Torvalds
513a4befae Merge branch 'for-4.9/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main pull request for block layer changes in 4.9.

  As mentioned at the last merge window, I've changed things up and now
  do just one branch for core block layer changes, and driver changes.
  This avoids dependencies between the two branches. Outside of this
  main pull request, there are two topical branches coming as well.

  This pull request contains:

   - A set of fixes, and a conversion to blk-mq, of nbd. From Josef.

   - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
     Followup dependency fix from Geert.

   - General fixes from Bart, Baoyou, Guoqing, and Linus W.

   - CFQ async write starvation fix from Glauber.

   - Add supprot for delayed kick of the requeue list, from Mike.

   - Pull out the scalable bitmap code from blk-mq-tag.c and make it
     generally available under the name of sbitmap. Only blk-mq-tag uses
     it for now, but the blk-mq scheduling bits will use it as well.
     From Omar.

   - bdev thaw error progagation from Pierre.

   - Improve the blk polling statistics, and allow the user to clear
     them. From Stephen.

   - Set of minor cleanups from Christoph in block/blk-mq.

   - Set of cleanups and optimizations from me for block/blk-mq.

   - Various nvme/nvmet/nvmeof fixes from the various folks"

* 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
  fs/block_dev.c: return the right error in thaw_bdev()
  nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
  nvme/scsi: Remove power management support
  nvmet: Make dsm number of ranges zero based
  nvmet: Use direct IO for writes
  admin-cmd: Added smart-log command support.
  nvme-fabrics: Add host_traddr options field to host infrastructure
  nvme-fabrics: revise host transport option descriptions
  nvme-fabrics: rework nvmf_get_address() for variable options
  nbd: use BLK_MQ_F_BLOCKING
  blkcg: Annotate blkg_hint correctly
  cfq: fix starvation of asynchronous writes
  blk-mq: add flag for drivers wanting blocking ->queue_rq()
  blk-mq: remove non-blocking pass in blk_mq_map_request
  blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
  block: export bio_free_pages to other modules
  lightnvm: propagate device_add() error code
  lightnvm: expose device geometry through sysfs
  lightnvm: control life of nvm_dev in driver
  blk-mq: register device instead of disk
  ...
2016-10-07 14:42:05 -07:00
Linus Torvalds
2ab704a47e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial updates from Jiri Kosina:
 "The usual rocket science from the trivial tree"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  tracing/syscalls: fix multiline in error message text
  lib/Kconfig.debug: fix DEBUG_SECTION_MISMATCH description
  doc: vfs: fix fadvise() sycall name
  x86/entry: spell EBX register correctly in documentation
  securityfs: fix securityfs_create_dir comment
  irq: Fix typo in tracepoint.xml
2016-10-07 12:24:08 -07:00
Linus Torvalds
ddc4e6d232 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatching updates from Jiri Kosina:

 - fix for patching modules that contain .altinstructions or
   .parainstructions sections, from Jessica Yu

 - make TAINT_LIVEPATCH a per-module flag (so that it's immediately
   clear which module caused the taint), from Josh Poimboeuf

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
  livepatch/module: make TAINT_LIVEPATCH module-specific
  Documentation: livepatch: add section about arch-specific code
  livepatch/x86: apply alternatives and paravirt patches after relocations
  livepatch: use arch_klp_init_object_loaded() to finish arch-specific tasks
2016-10-07 12:02:24 -07:00
Linus Torvalds
95107b30be This release cycle is rather small. Just a few fixes to tracing.
The big change is the addition of the hwlat tracer. It not only detects
 SMIs, but also other latency that's caused by the hardware. I have detected
 some latency from large boxes having bus contention.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJX9a77AAoJEKKk/i67LK/8UPEH/jcqMmOMhQYVQsNaJViA5uJM
 SV96gaLCc9cxXY04Hf7vx8RkVIyIqTCCQZ+RVZt4RSeqpsB2IzZ1u0CNKs2Z0MTv
 MdvQJoazRoDgVuPzKAsdAlDd0ykqHEFA5ayF3XDK4P2J97La+B4rQIqEiJX/aDrz
 i0NQQFg2ZF46mXJXn4oXe6nmr6WnbiEduawVjd7JvgILJO2hojDicOTQlNG41Nys
 68fOV8mLk0OL7sFRjySLGcbdbKhP2YbNhxILXl8geLgS9+CFZXkE8oTRjjy9IMNA
 XrqbFLMWaRVv+Nig7bHIWKE8ZErC5WCYUw4LD2GTLMDx5AkAVLGFFp6TOiO4SG8=
 =ke23
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "This release cycle is rather small.  Just a few fixes to tracing.

  The big change is the addition of the hwlat tracer. It not only
  detects SMIs, but also other latency that's caused by the hardware. I
  have detected some latency from large boxes having bus contention"

* tag 'trace-v4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Call traceoff trigger after event is recorded
  ftrace/scripts: Add helper script to bisect function tracing problem functions
  tracing: Have max_latency be defined for HWLAT_TRACER as well
  tracing: Add NMI tracing in hwlat detector
  tracing: Have hwlat trace migrate across tracing_cpumask CPUs
  tracing: Add documentation for hwlat_detector tracer
  tracing: Added hardware latency tracer
  ftrace: Access ret_stack->subtime only in the function profiler
  function_graph: Handle TRACE_BPUTS in print_graph_comment
  tracing/uprobe: Drop isdigit() check in create_trace_uprobe
2016-10-06 11:48:41 -07:00
Linus Torvalds
6218590bcb KVM updates for v4.9-rc1
All architectures:
   Move `make kvmconfig` stubs from x86;  use 64 bits for debugfs stats.
 
 ARM:
   Important fixes for not using an in-kernel irqchip; handle SError
   exceptions and present them to guests if appropriate; proxying of GICV
   access at EL2 if guest mappings are unsafe; GICv3 on AArch32 on ARMv8;
   preparations for GICv3 save/restore, including ABI docs; cleanups and
   a bit of optimizations.
 
 MIPS:
   A couple of fixes in preparation for supporting MIPS EVA host kernels;
   MIPS SMP host & TLB invalidation fixes.
 
 PPC:
   Fix the bug which caused guests to falsely report lockups; other minor
   fixes; a small optimization.
 
 s390:
   Lazy enablement of runtime instrumentation; up to 255 CPUs for nested
   guests; rework of machine check deliver; cleanups and fixes.
 
 x86:
   IOMMU part of AMD's AVIC for vmexit-less interrupt delivery; Hyper-V
   TSC page; per-vcpu tsc_offset in debugfs; accelerated INS/OUTS in
   nVMX; cleanups and fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJX9iDrAAoJEED/6hsPKofoOPoIAIUlgojkb9l2l1XVDgsXdgQL
 sRVhYSVv7/c8sk9vFImrD5ElOPZd+CEAIqFOu45+NM3cNi7gxip9yftUVs7wI5aC
 eDZRWm1E4trDZLe54ZM9ThcqZzZZiELVGMfR1+ZndUycybwyWzafpXYsYyaXp3BW
 hyHM3qVkoWO3dxBWFwHIoO/AUJrWYkRHEByKyvlC6KPxSdBPSa5c1AQwMCoE0Mo4
 K/xUj4gBn9eMelNhg4Oqu/uh49/q+dtdoP2C+sVM8bSdquD+PmIeOhPFIcuGbGFI
 B+oRpUhIuntN39gz8wInJ4/GRSeTuR2faNPxMn4E1i1u4LiuJvipcsOjPfe0a18=
 =fZRB
 -----END PGP SIGNATURE-----

Merge tag 'kvm-4.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Radim Krčmář:
 "All architectures:
   - move `make kvmconfig` stubs from x86
   - use 64 bits for debugfs stats

  ARM:
   - Important fixes for not using an in-kernel irqchip
   - handle SError exceptions and present them to guests if appropriate
   - proxying of GICV access at EL2 if guest mappings are unsafe
   - GICv3 on AArch32 on ARMv8
   - preparations for GICv3 save/restore, including ABI docs
   - cleanups and a bit of optimizations

  MIPS:
   - A couple of fixes in preparation for supporting MIPS EVA host
     kernels
   - MIPS SMP host & TLB invalidation fixes

  PPC:
   - Fix the bug which caused guests to falsely report lockups
   - other minor fixes
   - a small optimization

  s390:
   - Lazy enablement of runtime instrumentation
   - up to 255 CPUs for nested guests
   - rework of machine check deliver
   - cleanups and fixes

  x86:
   - IOMMU part of AMD's AVIC for vmexit-less interrupt delivery
   - Hyper-V TSC page
   - per-vcpu tsc_offset in debugfs
   - accelerated INS/OUTS in nVMX
   - cleanups and fixes"

* tag 'kvm-4.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (140 commits)
  KVM: MIPS: Drop dubious EntryHi optimisation
  KVM: MIPS: Invalidate TLB by regenerating ASIDs
  KVM: MIPS: Split kernel/user ASID regeneration
  KVM: MIPS: Drop other CPU ASIDs on guest MMU changes
  KVM: arm/arm64: vgic: Don't flush/sync without a working vgic
  KVM: arm64: Require in-kernel irqchip for PMU support
  KVM: PPC: Book3s PR: Allow access to unprivileged MMCR2 register
  KVM: PPC: Book3S PR: Support 64kB page size on POWER8E and POWER8NVL
  KVM: PPC: Book3S: Remove duplicate setting of the B field in tlbie
  KVM: PPC: BookE: Fix a sanity check
  KVM: PPC: Book3S HV: Take out virtual core piggybacking code
  KVM: PPC: Book3S: Treat VTB as a per-subcore register, not per-thread
  ARM: gic-v3: Work around definition of gic_write_bpr1
  KVM: nVMX: Fix the NMI IDT-vectoring handling
  KVM: VMX: Enable MSR-BASED TPR shadow even if APICv is inactive
  KVM: nVMX: Fix reload apic access page warning
  kvmconfig: add virtio-gpu to config fragment
  config: move x86 kvm_guest.config to a common location
  arm64: KVM: Remove duplicating init code for setting VMID
  ARM: KVM: Support vgic-v3
  ...
2016-10-06 10:49:01 -07:00
Linus Torvalds
14986a34e1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "This set of changes is a number of smaller things that have been
  overlooked in other development cycles focused on more fundamental
  change. The devpts changes are small things that were a distraction
  until we managed to kill off DEVPTS_MULTPLE_INSTANCES. There is an
  trivial regression fix to autofs for the unprivileged mount changes
  that went in last cycle. A pair of ioctls has been added by Andrey
  Vagin making it is possible to discover the relationships between
  namespaces when referring to them through file descriptors.

  The big user visible change is starting to add simple resource limits
  to catch programs that misbehave. With namespaces in general and user
  namespaces in particular allowing users to use more kinds of
  resources, it has become important to have something to limit errant
  programs. Because the purpose of these limits is to catch errant
  programs the code needs to be inexpensive to use as it always on, and
  the default limits need to be high enough that well behaved programs
  on well behaved systems don't encounter them.

  To this end, after some review I have implemented per user per user
  namespace limits, and use them to limit the number of namespaces. The
  limits being per user mean that one user can not exhause the limits of
  another user. The limits being per user namespace allow contexts where
  the limit is 0 and security conscious folks can remove from their
  threat anlysis the code used to manage namespaces (as they have
  historically done as it root only). At the same time the limits being
  per user namespace allow other parts of the system to use namespaces.

  Namespaces are increasingly being used in application sand boxing
  scenarios so an all or nothing disable for the entire system for the
  security conscious folks makes increasing use of these sandboxes
  impossible.

  There is also added a limit on the maximum number of mounts present in
  a single mount namespace. It is nontrivial to guess what a reasonable
  system wide limit on the number of mount structure in the kernel would
  be, especially as it various based on how a system is using
  containers. A limit on the number of mounts in a mount namespace
  however is much easier to understand and set. In most cases in
  practice only about 1000 mounts are used. Given that some autofs
  scenarious have the potential to be 30,000 to 50,000 mounts I have set
  the default limit for the number of mounts at 100,000 which is well
  above every known set of users but low enough that the mount hash
  tables don't degrade unreaonsably.

  These limits are a start. I expect this estabilishes a pattern that
  other limits for resources that namespaces use will follow. There has
  been interest in making inotify event limits per user per user
  namespace as well as interest expressed in making details about what
  is going on in the kernel more visible"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
  autofs:  Fix automounts by using current_real_cred()->uid
  mnt: Add a per mount namespace limit on the number of mounts
  netns: move {inc,dec}_net_namespaces into #ifdef
  nsfs: Simplify __ns_get_path
  tools/testing: add a test to check nsfs ioctl-s
  nsfs: add ioctl to get a parent namespace
  nsfs: add ioctl to get an owning user namespace for ns file descriptor
  kernel: add a helper to get an owning user namespace for a namespace
  devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
  devpts: Remove sync_filesystems
  devpts: Make devpts_kill_sb safe if fsi is NULL
  devpts: Simplify devpts_mount by using mount_nodev
  devpts: Move the creation of /dev/pts/ptmx into fill_super
  devpts: Move parse_mount_options into fill_super
  userns: When the per user per user namespace limit is reached return ENOSPC
  userns; Document per user per user namespace limits.
  mntns: Add a limit on the number of mount namespaces.
  netns: Add a limit on the number of net namespaces
  cgroupns: Add a limit on the number of cgroup namespaces
  ipcns: Add a  limit on the number of ipc namespaces
  ...
2016-10-06 09:52:23 -07:00
Al Viro
a7c2242166 relay: simplify relay_file_read()
to hell with actors...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-05 18:23:57 -04:00
Linus Torvalds
687ee0ad4e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) BBR TCP congestion control, from Neal Cardwell, Yuchung Cheng and
    co. at Google. https://lwn.net/Articles/701165/

 2) Do TCP Small Queues for retransmits, from Eric Dumazet.

 3) Support collect_md mode for all IPV4 and IPV6 tunnels, from Alexei
    Starovoitov.

 4) Allow cls_flower to classify packets in ip tunnels, from Amir Vadai.

 5) Support DSA tagging in older mv88e6xxx switches, from Andrew Lunn.

 6) Support GMAC protocol in iwlwifi mwm, from Ayala Beker.

 7) Support ndo_poll_controller in mlx5, from Calvin Owens.

 8) Move VRF processing to an output hook and allow l3mdev to be
    loopback, from David Ahern.

 9) Support SOCK_DESTROY for UDP sockets. Also from David Ahern.

10) Congestion control in RXRPC, from David Howells.

11) Support geneve RX offload in ixgbe, from Emil Tantilov.

12) When hitting pressure for new incoming TCP data SKBs, perform a
    partial rathern than a full purge of the OFO queue (which could be
    huge). From Eric Dumazet.

13) Convert XFRM state and policy lookups to RCU, from Florian Westphal.

14) Support RX network flow classification to igb, from Gangfeng Huang.

15) Hardware offloading of eBPF in nfp driver, from Jakub Kicinski.

16) New skbmod packet action, from Jamal Hadi Salim.

17) Remove some inefficiencies in snmp proc output, from Jia He.

18) Add FIB notifications to properly propagate route changes to
    hardware which is doing forwarding offloading. From Jiri Pirko.

19) New dsa driver for qca8xxx chips, from John Crispin.

20) Implement RFC7559 ipv6 router solicitation backoff, from Maciej
    Żenczykowski.

21) Add L3 mode to ipvlan, from Mahesh Bandewar.

22) Support 802.1ad in mlx4, from Moshe Shemesh.

23) Support hardware LRO in mediatek driver, from Nelson Chang.

24) Add TC offloading to mlx5, from Or Gerlitz.

25) Convert various drivers to ethtool ksettings interfaces, from
    Philippe Reynes.

26) TX max rate limiting for cxgb4, from Rahul Lakkireddy.

27) NAPI support for ath10k, from Rajkumar Manoharan.

28) Support XDP in mlx5, from Rana Shahout and Saeed Mahameed.

29) UDP replicast support in TIPC, from Richard Alpe.

30) Per-queue statistics for qed driver, from Sudarsana Reddy Kalluru.

31) Support BQL in thunderx driver, from Sunil Goutham.

32) TSO support in alx driver, from Tobias Regnery.

33) Add stream parser engine and use it in kcm.

34) Support async DHCP replies in ipconfig module, from Uwe
    Kleine-König.

35) DSA port fast aging for mv88e6xxx driver, from Vivien Didelot.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1715 commits)
  mlxsw: switchx2: Fix misuse of hard_header_len
  mlxsw: spectrum: Fix misuse of hard_header_len
  net/faraday: Stop NCSI device on shutdown
  net/ncsi: Introduce ncsi_stop_dev()
  net/ncsi: Rework the channel monitoring
  net/ncsi: Allow to extend NCSI request properties
  net/ncsi: Rework request index allocation
  net/ncsi: Don't probe on the reserved channel ID (0x1f)
  net/ncsi: Introduce NCSI_RESERVED_CHANNEL
  net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
  net: Add netdev all_adj_list refcnt propagation to fix panic
  net: phy: Add Edge-rate driver for Microsemi PHYs.
  vmxnet3: Wake queue from reset work
  i40e: avoid NULL pointer dereference and recursive errors on early PCI error
  qed: Add RoCE ll2 & GSI support
  qed: Add support for memory registeration verbs
  qed: Add support for QP verbs
  qed: PD,PKEY and CQ verb support
  qed: Add support for RoCE hw init
  qede: Add qedr framework
  ...
2016-10-05 10:11:24 -07:00
John Stultz
58bfea9532 timekeeping: Fix __ktime_get_fast_ns() regression
In commit 27727df240 ("Avoid taking lock in NMI path with
CONFIG_DEBUG_TIMEKEEPING"), I changed the logic to open-code
the timekeeping_get_ns() function, but I forgot to include
the unit conversion from cycles to nanoseconds, breaking the
function's output, which impacts users like perf.

This results in bogus perf timestamps like:
 swapper     0 [000]   253.427536:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]   254.426573:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]   254.426687:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]   254.426800:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]   254.426905:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]   254.427022:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]   254.427127:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]   254.427239:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]   254.427346:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]   254.427463:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]   255.426572:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])

Instead of more reasonable expected timestamps like:
 swapper     0 [000]    39.953768:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]    40.064839:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]    40.175956:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]    40.287103:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]    40.398217:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]    40.509324:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]    40.620437:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]    40.731546:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]    40.842654:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]    40.953772:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
 swapper     0 [000]    41.064881:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])

Add the proper use of timekeeping_delta_to_ns() to convert
the cycle delta to nanoseconds as needed.

Thanks to Brendan and Alexei for finding this quickly after
the v4.8 release. Unfortunately the problematic commit has
landed in some -stable trees so they'll need this fix as
well.

Many apologies for this mistake. I'll be looking to add a
perf-clock sanity test to the kselftest timers tests soon.

Fixes: 27727df240 "timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING"
Reported-by: Brendan Gregg <bgregg@netflix.com>
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Tested-and-reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable <stable@vger.kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1475636148-26539-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-05 15:44:46 +02:00
Linus Torvalds
3cd013ab79 Merge branch 'stable-4.9' of git://git.infradead.org/users/pcmoore/audit
Pull audit updates from Paul Moore:
 "Another relatively small pull request for v4.9 with just two patches.

  The patch from Richard updates the list of features we support and
  report back to userspace; this should have been sent earlier with the
  rest of the v4.8 patches but it got lost in my inbox.

  The second patch fixes a problem reported by our Android friends where
  we weren't very consistent in recording PIDs"

* 'stable-4.9' of git://git.infradead.org/users/pcmoore/audit:
  audit: add exclude filter extension to feature bitmap
  audit: consistently record PIDs with task_tgid_nr()
2016-10-04 14:21:41 -07:00
Ingo Molnar
be6a2e4c46 Revert "sched/core: Do not use smp_processor_id() with preempt enabled in smpboot_thread_fn()"
This reverts commit 4fa5cd5245.

The original change widens a preempt-off section, to avoid a seemingly unsafe
smp_processor_id() use.

During review I overlooked two facts:

 - The code to calls a non-trivial function callback:

                                ht->park(td->cpu);

   ... which might (and does occasionally) sleep, triggering the warning.

 - More importantly, as pointed out by Peter Zijlstra, using
   smp_processor_id() in that context is safe, if it's done from
   a kernel thread that is pinned to a single CPU - which is the
   case here.

So revert to the original code that enables preemption sooner.

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Con Kolivas <kernel@kolivas.org>
Cc: Alfred Chen <cchalpha@gmail.com>
Link: http://lkml.kernel.org/r/20160930015102.GB20189@yexl-desktop
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-04 09:55:57 +02:00
Linus Torvalds
597f03f9d1 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug updates from Thomas Gleixner:
 "Yet another batch of cpu hotplug core updates and conversions:

   - Provide core infrastructure for multi instance drivers so the
     drivers do not have to keep custom lists.

   - Convert custom lists to the new infrastructure. The block-mq custom
     list conversion comes through the block tree and makes the diffstat
     tip over to more lines removed than added.

   - Handle unbalanced hotplug enable/disable calls more gracefully.

   - Remove the obsolete CPU_STARTING/DYING notifier support.

   - Convert another batch of notifier users.

   The relayfs changes which conflicted with the conversion have been
   shipped to me by Andrew.

   The remaining lot is targeted for 4.10 so that we finally can remove
   the rest of the notifiers"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
  cpufreq: Fix up conversion to hotplug state machine
  blk/mq: Reserve hotplug states for block multiqueue
  x86/apic/uv: Convert to hotplug state machine
  s390/mm/pfault: Convert to hotplug state machine
  mips/loongson/smp: Convert to hotplug state machine
  mips/octeon/smp: Convert to hotplug state machine
  fault-injection/cpu: Convert to hotplug state machine
  padata: Convert to hotplug state machine
  cpufreq: Convert to hotplug state machine
  ACPI/processor: Convert to hotplug state machine
  virtio scsi: Convert to hotplug state machine
  oprofile/timer: Convert to hotplug state machine
  block/softirq: Convert to hotplug state machine
  lib/irq_poll: Convert to hotplug state machine
  x86/microcode: Convert to hotplug state machine
  sh/SH-X3 SMP: Convert to hotplug state machine
  ia64/mca: Convert to hotplug state machine
  ARM/OMAP/wakeupgen: Convert to hotplug state machine
  ARM/shmobile: Convert to hotplug state machine
  arm64/FP/SIMD: Convert to hotplug state machine
  ...
2016-10-03 19:43:08 -07:00
Linus Torvalds
999dcbe241 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "The irq departement proudly presents:

   - A rework of the core infrastructure to optimally spread interrupt
     for multiqueue devices. The first version was a bit naive and
     failed to take thread siblings and other details into account.
     Developed in cooperation with Christoph and Keith.

   - Proper delegation of softirqs to ksoftirqd, so if ksoftirqd is
     active then no further softirq processsing on interrupt return
     happens. Otherwise we try to delegate and still run another batch
     of network packets in the irq return path, which then tries to
     delegate to ksoftirqd .....

   - A proper machine parseable sysfs based alternative for
     /proc/interrupts.

   - ACPI support for the GICV3-ITS and ARM interrupt remapping

   - Two new irq chips from the ARM SoC zoo: STM32-EXTI and MVEBU-PIC

   - A new irq chip for the JCore (SuperH)

   - The usual pile of small fixlets in core and irqchip drivers"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
  softirq: Let ksoftirqd do its job
  genirq: Make function __irq_do_set_handler() static
  ARM/dts: Add EXTI controller node to stm32f429
  ARM/STM32: Select external interrupts controller
  drivers/irqchip: Add STM32 external interrupts support
  Documentation/dt-bindings: Document STM32 EXTI controller bindings
  irqchip/mips-gic: Use for_each_set_bit to iterate over local IRQs
  pci/msi: Retrieve affinity for a vector
  genirq/affinity: Remove old irq spread infrastructure
  genirq/msi: Switch to new irq spreading infrastructure
  genirq/affinity: Provide smarter irq spreading infrastructure
  genirq/msi: Add cpumask allocation to alloc_msi_entry
  genirq: Expose interrupt information through sysfs
  irqchip/gicv3-its: Use MADT ITS subtable to do PCI/MSI domain initialization
  irqchip/gicv3-its: Factor out PCI-MSI part that might be reused for ACPI
  irqchip/gicv3-its: Probe ITS in the ACPI way
  irqchip/gicv3-its: Refactor ITS DT init code to prepare for ACPI
  irqchip/gicv3-its: Cleanup for ITS domain initialization
  PCI/MSI: Setup MSI domain on a per-device basis using IORT ACPI table
  ACPI: Add new IORT functions to support MSI domain handling
  ...
2016-10-03 19:10:15 -07:00
Linus Torvalds
5e1b834b27 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "A rather smalish set of updates for timers and timekeeping:

   - Two core fixes to prevent potential undefinded behaviour about
     which gcc is complaining rightfully.

   - A fix to prevent stopping the tick on an (soon) offline CPU so it
     can complete the shutdown procedure.

   - Wait for clocks to stabilize before making decisions, so a not yet
     validated clock is not rejected.

   - The usual pile of fixes to the various clocksource drivers.

   - Core code typo and include fixlets"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Include the correct header for errno definitions
  clocksource/drivers/ti-32k: Prevent ftrace recursion
  clocksource/mips-gic-timer: Stop checking cpu_has_counter
  clocksource/mips-gic-timer: Print an error if IRQ setup fails
  tick/nohz: Prevent stopping the tick on an offline CPU
  clocksource/drivers/oxnas: Add OX820 compatible
  clocksource/drivers/timer-atmel-pit: Simplify IRQ handler
  clocksource/drivers/timer-atmel-pit: Remove uselesss WARN_ON_ONCE
  clocksource/drivers/timer-atmel-pit: Drop at91sam926x_pit_common_init
  clocksource/drivers/moxart: Replace panic by pr_err
  clocksource/drivers/moxart: Replace setup_irq by request_irq
  clocksource/drivers/moxart: Add Aspeed support
  clocksource/drivers/moxart: Use struct to hold state
  clocksource/drivers/moxart: Refactor enable/disable
  time: Avoid undefined behaviour in ktime_add_safe()
  time: Avoid undefined behaviour in timespec64_add_safe()
  timekeeping: Prints the amounts of time spent during suspend
  clocksource: Defer override invalidation unless clock is unstable
  hrtimer: Spelling fixes
2016-10-03 18:09:13 -07:00
Linus Torvalds
8e4ef63867 Merge branch 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 vdso updates from Ingo Molnar:
 "The main changes in this cycle centered around adding support for
  32-bit compatible C/R of the vDSO on 64-bit kernels, by Dmitry
  Safonov"

* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vdso: Use CONFIG_X86_X32_ABI to enable vdso prctl
  x86/vdso: Only define map_vdso_randomized() if CONFIG_X86_64
  x86/vdso: Only define prctl_map_vdso() if CONFIG_CHECKPOINT_RESTORE
  x86/signal: Add SA_{X32,IA32}_ABI sa_flags
  x86/ptrace: Down with test_thread_flag(TIF_IA32)
  x86/coredump: Use pr_reg size, rather that TIF_IA32 flag
  x86/arch_prctl/vdso: Add ARCH_MAP_VDSO_*
  x86/vdso: Replace calculate_addr in map_vdso() with addr
  x86/vdso: Unmap vdso blob on vvar mapping failure
2016-10-03 17:29:01 -07:00
Linus Torvalds
1a4a2bc460 Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull low-level x86 updates from Ingo Molnar:
 "In this cycle this topic tree has become one of those 'super topics'
  that accumulated a lot of changes:

   - Add CONFIG_VMAP_STACK=y support to the core kernel and enable it on
     x86 - preceded by an array of changes. v4.8 saw preparatory changes
     in this area already - this is the rest of the work. Includes the
     thread stack caching performance optimization. (Andy Lutomirski)

   - switch_to() cleanups and all around enhancements. (Brian Gerst)

   - A large number of dumpstack infrastructure enhancements and an
     unwinder abstraction. The secret long term plan is safe(r) live
     patching plus maybe another attempt at debuginfo based unwinding -
     but all these current bits are standalone enhancements in a frame
     pointer based debug environment as well. (Josh Poimboeuf)

   - More __ro_after_init and const annotations. (Kees Cook)

   - Enable KASLR for the vmemmap memory region. (Thomas Garnier)"

[ The virtually mapped stack changes are pretty fundamental, and not
  x86-specific per se, even if they are only used on x86 right now. ]

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
  x86/asm: Get rid of __read_cr4_safe()
  thread_info: Use unsigned long for flags
  x86/alternatives: Add stack frame dependency to alternative_call_2()
  x86/dumpstack: Fix show_stack() task pointer regression
  x86/dumpstack: Remove dump_trace() and related callbacks
  x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder
  oprofile/x86: Convert x86_backtrace() to use the new unwinder
  x86/stacktrace: Convert save_stack_trace_*() to use the new unwinder
  perf/x86: Convert perf_callchain_kernel() to use the new unwinder
  x86/unwind: Add new unwind interface and implementations
  x86/dumpstack: Remove NULL task pointer convention
  fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y
  sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
  lib/syscall: Pin the task stack in collect_syscall()
  x86/process: Pin the target stack in get_wchan()
  x86/dumpstack: Pin the target stack when dumping it
  kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function
  sched/core: Add try_get_task_stack() and put_task_stack()
  x86/entry/64: Fix a minor comment rebase error
  iommu/amd: Don't put completion-wait semaphore on stack
  ...
2016-10-03 16:13:28 -07:00
Linus Torvalds
af79ad2b1f Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "The main changes are:

   - irqtime accounting cleanups and enhancements. (Frederic Weisbecker)

   - schedstat debugging enhancements, make it more broadly runtime
     available. (Josh Poimboeuf)

   - More work on asymmetric topology/capacity scheduling. (Morten
     Rasmussen)

   - sched/wait fixes and cleanups. (Oleg Nesterov)

   - PELT (per entity load tracking) improvements. (Peter Zijlstra)

   - Rewrite and enhance select_idle_siblings(). (Peter Zijlstra)

   - sched/numa enhancements/fixes (Rik van Riel)

   - sched/cputime scalability improvements (Stanislaw Gruszka)

   - Load calculation arithmetics fixes. (Dietmar Eggemann)

   - sched/deadline enhancements (Tommaso Cucinotta)

   - Fix utilization accounting when switching to the SCHED_NORMAL
     policy. (Vincent Guittot)

   - ... plus misc cleanups and enhancements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
  sched/irqtime: Consolidate irqtime flushing code
  sched/irqtime: Consolidate accounting synchronization with u64_stats API
  u64_stats: Introduce IRQs disabled helpers
  sched/irqtime: Remove needless IRQs disablement on kcpustat update
  sched/irqtime: No need for preempt-safe accessors
  sched/fair: Fix min_vruntime tracking
  sched/debug: Add SCHED_WARN_ON()
  sched/core: Fix set_user_nice()
  sched/fair: Introduce set_curr_task() helper
  sched/core, ia64: Rename set_curr_task()
  sched/core: Fix incorrect utilization accounting when switching to fair class
  sched/core: Optimize SCHED_SMT
  sched/core: Rewrite and improve select_idle_siblings()
  sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  sched/core: Introduce 'struct sched_domain_shared'
  sched/core: Restructure destroy_sched_domain()
  sched/core: Remove unused @cpu argument from destroy_sched_domain*()
  sched/wait: Introduce init_wait_entry()
  sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()
  sched/wait: Avoid abort_exclusive_wait() in ___wait_event()
  ...
2016-10-03 13:39:00 -07:00
Linus Torvalds
12b7bcb43e Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "The main kernel side changes were:

   - uprobes enhancements (Masami Hiramatsu)

   - Uncore group events enhancements (David Carrillo-Cisneros)

   - x86 Intel: Add support for Skylake server uncore PMUs (Kan Liang)

   - x86 Intel: LBR cleanups and enhancements, for better branch
     annotation tracking (Peter Zijlstra)

   - x86 Intel: Add support for PTWRITE and power event tracing
     (Alexander Shishkin)

   - ... various fixes, cleanups and smaller enhancements.

  Lots of tooling changes - a couple of highlights:

   - Support event group view with hierarchy mode in 'perf top' and
     'perf report' (Namhyung Kim)

     e.g.:

     $ perf record -e '{cycles,instructions}' make
     $ perf report --hierarchy --stdio
     ...
     #   Overhead  Command / Shared Object / Symbol
     # ......................  ..................................
     ...
     25.74%  27.18%sh
     19.96%  24.14%libc-2.24.so
      9.55%  14.64%[.] __strcmp_sse2
      1.54%   0.00%[.] __tfind
      1.07%   1.13%[.] _int_malloc
      0.95%   0.00%[.] __strchr_sse2
      0.89%   1.39%[.] __tsearch
      0.76%   0.00%[.] strlen

   - Add branch stack / basic block info to 'perf annotate --stdio',
     where for each branch, we add an asm comment after the instruction
     with information on how often it was taken and predicted. See
     example with color output at:

       http://vger.kernel.org/~acme/perf/annotate_basic_blocks.png

     (Peter Zijlstra)

   - Add support for using symbols in address filters with Intel PT and
     ARM CoreSight (hardware assisted tracing facilities) (Adrian
     Hunter, Mathieu Poirier)

   - Add support for interacting with Coresight PMU ETMs/PTMs, that are
     IP blocks to perform hardware assisted tracing on a ARM CPU core
     (Mathieu Poirier)

   - Support generating cross arch probes, i.e. if you specify a vmlinux
     file for different arch than the one in the host machine,

        $ perf probe --definition function_name args

     will generate the probe definition string needed to append to the
     target machine /sys/kernel/debug/tracing/kprobes_events file, using
     scripting (Masami Hiramatsu).

   - Allow configuring the default 'perf report -s' sort order in
     ~/.perfconfig, for instance, "sym,dso" may be more fitting for
     kernel developers. (Arnaldo Carvalho de Melo)

   - ... plus lots of other changes, refactorings, features and fixes"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (149 commits)
  perf tests: Add dwarf unwind test for powerpc
  perf probe: Match linkage name with mangled name
  perf probe: Fix to cut off incompatible chars from group name
  perf probe: Skip if the function address is 0
  perf probe: Ignore the error of finding inline instance
  perf intel-pt: Fix decoding when there are address filters
  perf intel-pt: Enable decoder to handle TIP.PGD with missing IP
  perf intel-pt: Read address filter from AUXTRACE_INFO event
  perf intel-pt: Record address filter in AUXTRACE_INFO event
  perf intel-pt: Add a helper function for processing AUXTRACE_INFO
  perf intel-pt: Fix missing error codes processing auxtrace_info
  perf intel-pt: Add support for recording the max non-turbo ratio
  perf intel-pt: Fix snapshot overlap detection decoder errors
  perf probe: Increase debug level of SDT debug messages
  perf record: Add support for using symbols in address filters
  perf symbols: Add dso__last_symbol()
  perf record: Fix error paths
  perf record: Rename label 'out_symbol_exit'
  perf script: Fix vanished idle symbols
  perf evsel: Add support for address filters
  ...
2016-10-03 12:47:28 -07:00
Linus Torvalds
00bcf5cdd6 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - rwsem micro-optimizations (Davidlohr Bueso)

   - Improve the implementation and optimize the performance of
     percpu-rwsems. (Peter Zijlstra.)

   - Convert all lglock users to better facilities such as percpu-rwsems
     or percpu-spinlocks and remove lglocks. (Peter Zijlstra)

   - Remove the ticket (spin)lock implementation. (Peter Zijlstra)

   - Korean translation of memory-barriers.txt and related fixes to the
     English document. (SeongJae Park)

   - misc fixes and cleanups"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  x86/cmpxchg, locking/atomics: Remove superfluous definitions
  x86, locking/spinlocks: Remove ticket (spin)lock implementation
  locking/lglock: Remove lglock implementation
  stop_machine: Remove stop_cpus_lock and lg_double_lock/unlock()
  fs/locks: Use percpu_down_read_preempt_disable()
  locking/percpu-rwsem: Add down_read_preempt_disable()
  fs/locks: Replace lg_local with a per-cpu spinlock
  fs/locks: Replace lg_global with a percpu-rwsem
  locking/percpu-rwsem: Add DEFINE_STATIC_PERCPU_RWSEMand percpu_rwsem_assert_held()
  locking/pv-qspinlock: Use cmpxchg_release() in __pv_queued_spin_unlock()
  locking/rwsem, x86: Drop a bogus cc clobber
  futex: Add some more function commentry
  locking/hung_task: Show all locks
  locking/rwsem: Scan the wait_list for readers only once
  locking/rwsem: Remove a few useless comments
  locking/rwsem: Return void in __rwsem_mark_wake()
  locking, rcu, cgroup: Avoid synchronize_sched() in __cgroup_procs_write()
  locking/Documentation: Add Korean translation
  locking/Documentation: Fix a typo of example result
  locking/Documentation: Fix wrong section reference
  ...
2016-10-03 12:15:00 -07:00
Linus Torvalds
d7a0dab82f Merge branch 'core-smp-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core SMP updates from Ingo Molnar:
 "Two main change is generic vCPU pinning and physical CPU SMP-call
  support, for Xen to be able to perform certain calls on specific
  physical CPUs - by Juergen Gross"

* 'core-smp-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Allocate smp_call_on_cpu() workqueue on stack too
  hwmon: Use smp_call_on_cpu() for dell-smm i8k
  dcdbas: Make use of smp_call_on_cpu()
  xen: Add xen_pin_vcpu() to support calling functions on a dedicated pCPU
  smp: Add function to execute a function synchronously on a CPU
  virt, sched: Add generic vCPU pinning support
  xen: Sync xen header
2016-10-03 11:02:39 -07:00
Linus Torvalds
4b978934a4 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Expedited grace-period changes, most notably avoiding having user
     threads drive expedited grace periods, using a workqueue instead.

   - Miscellaneous fixes, including a performance fix for lists that was
     sent with the lists modifications.

   - CPU hotplug updates, most notably providing exact CPU-online
     tracking for RCU. This will in turn allow removal of the checks
     supporting RCU's prior heuristic that was based on the assumption
     that CPUs would take no longer than one jiffy to come online.

   - Torture-test updates.

   - Documentation updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
  list: Expand list_first_entry_or_null()
  torture: TOROUT_STRING(): Insert a space between flag and message
  rcuperf: Consistently insert space between flag and message
  rcutorture: Print out barrier error as document says
  torture: Add task state to writer-task stall printk()s
  torture: Convert torture_shutdown() to hrtimer
  rcutorture: Convert to hotplug state machine
  cpu/hotplug: Get rid of CPU_STARTING reference
  rcu: Provide exact CPU-online tracking for RCU
  rcu: Avoid redundant quiescent-state chasing
  rcu: Don't use modular infrastructure in non-modular code
  sched: Make wake_up_nohz_cpu() handle CPUs going offline
  rcu: Use rcu_gp_kthread_wake() to wake up grace period kthreads
  rcu: Use RCU's online-CPU state for expedited IPI retry
  rcu: Exclude RCU-offline CPUs from expedited grace periods
  rcu: Make expedited RCU CPU stall warnings respond to controls
  rcu: Stop disabling expedited RCU CPU stall warnings
  rcu: Drive expedited grace periods from workqueue
  rcu: Consolidate expedited grace period machinery
  documentation: Record reason for rcu_head two-byte alignment
  ...
2016-10-03 10:29:53 -07:00
Linus Torvalds
72ec94560d Power management material for v4.9-rc1
- Add a mechanism for passing hints from the scheduler to cpufreq governors
    via their utilization update callbacks and use it to introduce "IOwait
    boosting" into the schedutil governor and intel_pstate that will make them
    boost performance if the enqueued task was previously waiting on I/O
    (Rafael Wysocki).
 
  - Fix a schedutil governor problem that causes it to overestimate utilization
    if SMT is in use (Steve Muckle).
 
  - Update defconfigs trying to use the schedutil governor as a module which is
    not possible any more (Javier Martinez Canillas).
 
  - Update the intel_pstate's pstate_sample tracepoint to take "IOwait boosting"
    into account (Srinivas Pandruvada).
 
  - Fix a problem in the cpufreq core causing it to mishandle the initialization
    of CPUs registered after the cpufreq driver (Viresh Kumar, Rafael Wysocki).
 
  - Make the cpufreq-dt driver support per-policy governor tunables, clean it
    up and update its Kconfig description (Viresh Kumar).
 
  - Add support for more ARM platforms to the cpufreq-dt driver (Chanwoo Choi,
    Dave Gerlach, Geert Uytterhoeven).
 
  - Make the cpufreq CPPC driver report frequencies in KHz to avoid user space
    compatiblility issues (Al Stone, Hoan Tran).
 
  - Clean up a few cpufreq drivers (st, kirkwood, SCPI) a bit (Colin Ian King,
    Markus Elfring).
 
  - Constify some local structures in the intel_pstate driver (Julia Lawall).
 
  - Add a Documentation/cpu-freq/ entry to MAINTAINERS (Jean Delvare).
 
  - Add support for PM domain removal to the generic power domains (genpd)
    framework, add new DT helper functions to it and make it always enable
    debugfs support if available (Jon Hunter, Tomeu Vizoso).
 
  - Clean up the generic power domains (genpd) framework and make it avoid
    measuring power-on and power-off latencies during system-wide PM transitions
    (Ulf Hansson).
 
  - Add support for the RockChip DFI controller and the rk3399 DMC to the
    devfreq framework (Lin Huang, Axel Lin, Arnd Bergmann).
 
  - Add COMPILE_TEST to the devfreq framework (Krzysztof Kozlowski, Stephen
    Rothwell).
 
  - Fix a minor issue in the exynos-ppmu devfreq driver and fix up devfreq
    Kconfig indentation style (Wei Yongjun, Jisheng Zhang).
 
  - Fix the system suspend interface to make suspend-to-idle work if platform
    suspend operations have not been registered (Sudeep Holla).
 
  - Make it possible to use hibernation with PAGE_POISONING_ZERO enabled
    (Anisse Astier).
 
  - Increas the default timeout of the system suspend/resume watchdog and make it
    depend on EXPERT (Chen Yu).
 
  - Make the operating performance points (OPP) framework avoid using OPPs that
    aren't supported by the platform and fix a build warning in it (Dave Gerlach,
    Arnd Bergmann).
 
  - Fix the ARM cpuidle driver's return value (Christophe Jaillet).
 
  - Make the SmartReflex AVS (Adaptive Voltage Scaling) driver use more common
    logging style (Joe Perches).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJX8Y32AAoJEILEb/54YlRx8e0P/27zu8Lb6Aks1S2Zx9GEW0qr
 DvrO4kklCHqi3DgHlyFOYetf9cxMrUluojVJofnoSDvgAayWyg7VAd4gtOrMGCXG
 pJVJM73itcOUK+DsAVvoWJY3hk15nX77n2aiXPN2GqaMqennlQusdfzTmjCasqpm
 M84j+JwFYlJcfyMCcF5kGWqS7QBjzxhA0CjytUX1i3pL3NqRALZUEpaHwBD1W+4r
 tcF/jYTy3RsghCbuC6HoPxEF9NMOFGxeAXogmu6NvGu8gy0GqtywRSRrs5wA1a0z
 ZDAJ8krrFbzuFPMdjNIE8wtTeziofS5i9piQx3JlIMH3HpNGN86BRXVfzuHzJj11
 6ZMUI/FJy+fYukIXOEeVLtsLHUnMcMux8Jq1UF6N0InahaR9nbsjmGOmXh72+Scx
 7VJ+29l0oVwX6wkw/DjPP3rb1Swd1i3yY0/3uRoJ174mYTjhRGbrbDkIjPiDeuM5
 2Cx7QunscOjFmaNtPyr8niQ+7YhMEpn8VIbGNaX5ABz0fGftfi8nDHqliSNa391Z
 nK6YoKD0O6R0JHE6GavvJTcuMS9qE+HHHOwymWKxEdE9KYk0JUqen3gj1sSTaAZT
 BIPBsn6XlorqNy3dnqtWTHV7Nf0al9huolWvrL90s6g4Bh2BzTzDVydSgNWTMDUi
 G64nP0q1sJTqdoe30uvk
 =NYkv
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "Traditionally, cpufreq is the area with the greatest number of
  changes, but there are fewer of them than last time. There also is
  some activity in the generic power domains and the devfreq frameworks,
  a couple of system suspend and hibernation fixes and some assorted
  changes in other places.

  One new feature is the cpufreq change to allow the scheduler to pass
  hints to the governors' utilization update callbacks and some code
  rework based on that. Another one is the support for domain removal in
  the generic power domains framework. Also it is now possible to use
  hibernation with PAGE_POISONING_ZERO enabled and devfreq supports the
  RockChip DFI controller and the rk3399 DMC.

  The rest of the changes is mostly fixes and cleanups in a number of
  places.

  Specifics:

   - Add a mechanism for passing hints from the scheduler to cpufreq
     governors via their utilization update callbacks and use it to
     introduce "IOwait boosting" into the schedutil governor and
     intel_pstate that will make them boost performance if the enqueued
     task was previously waiting on I/O (Rafael Wysocki).

   - Fix a schedutil governor problem that causes it to overestimate
     utilization if SMT is in use (Steve Muckle).

   - Update defconfigs trying to use the schedutil governor as a module
     which is not possible any more (Javier Martinez Canillas).

   - Update the intel_pstate's pstate_sample tracepoint to take "IOwait
     boosting" into account (Srinivas Pandruvada).

   - Fix a problem in the cpufreq core causing it to mishandle the
     initialization of CPUs registered after the cpufreq driver (Viresh
     Kumar, Rafael Wysocki).

   - Make the cpufreq-dt driver support per-policy governor tunables,
     clean it up and update its Kconfig description (Viresh Kumar).

   - Add support for more ARM platforms to the cpufreq-dt driver
     (Chanwoo Choi, Dave Gerlach, Geert Uytterhoeven).

   - Make the cpufreq CPPC driver report frequencies in KHz to avoid
     user space compatiblility issues (Al Stone, Hoan Tran).

   - Clean up a few cpufreq drivers (st, kirkwood, SCPI) a bit (Colin
     Ian King, Markus Elfring).

   - Constify some local structures in the intel_pstate driver (Julia
     Lawall).

   - Add a Documentation/cpu-freq/ entry to MAINTAINERS (Jean Delvare).

   - Add support for PM domain removal to the generic power domains
     (genpd) framework, add new DT helper functions to it and make it
     always enable debugfs support if available (Jon Hunter, Tomeu
     Vizoso).

   - Clean up the generic power domains (genpd) framework and make it
     avoid measuring power-on and power-off latencies during system-wide
     PM transitions (Ulf Hansson).

   - Add support for the RockChip DFI controller and the rk3399 DMC to
     the devfreq framework (Lin Huang, Axel Lin, Arnd Bergmann).

   - Add COMPILE_TEST to the devfreq framework (Krzysztof Kozlowski,
     Stephen Rothwell).

   - Fix a minor issue in the exynos-ppmu devfreq driver and fix up
     devfreq Kconfig indentation style (Wei Yongjun, Jisheng Zhang).

   - Fix the system suspend interface to make suspend-to-idle work if
     platform suspend operations have not been registered (Sudeep
     Holla).

   - Make it possible to use hibernation with PAGE_POISONING_ZERO
     enabled (Anisse Astier).

   - Increas the default timeout of the system suspend/resume watchdog
     and make it depend on EXPERT (Chen Yu).

   - Make the operating performance points (OPP) framework avoid using
     OPPs that aren't supported by the platform and fix a build warning
     in it (Dave Gerlach, Arnd Bergmann).

   - Fix the ARM cpuidle driver's return value (Christophe Jaillet).

   - Make the SmartReflex AVS (Adaptive Voltage Scaling) driver use more
     common logging style (Joe Perches)"

* tag 'pm-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (58 commits)
  PM / OPP: Don't support OPP if it provides supported-hw but platform does not
  cpufreq: st: add missing \n to end of dev_err message
  cpufreq: kirkwood: add missing \n to end of dev_err messages
  PM / Domains: Rename pm_genpd_sync_poweron|poweroff()
  PM / Domains: Don't measure latency of ->power_on|off() during system PM
  PM / Domains: Remove redundant system PM callbacks
  PM / Domains: Simplify detaching a device from its genpd
  PM / devfreq: rk3399_dmc: Remove explictly regulator_put call in .remove
  PM / devfreq: rockchip: add PM_DEVFREQ_EVENT dependency
  PM / OPP: avoid maybe-uninitialized warning
  PM / Domains: Allow holes in genpd_data.domains array
  cpufreq: CPPC: Avoid overflow when calculating desired_perf
  cpufreq: ti: Use generic platdev driver
  cpufreq: intel_pstate: Add io_boost trace
  partial revert of "PM / devfreq: Add COMPILE_TEST for build coverage"
  cpufreq: intel_pstate: Use IOWAIT flag in Atom algorithm
  cpufreq: schedutil: Add iowait boosting
  cpufreq / sched: SCHED_CPUFREQ_IOWAIT flag to indicate iowait condition
  PM / Domains: Add support for removing nested PM domains by provider
  PM / Domains: Add support for removing PM domains
  ...
2016-10-03 09:33:40 -07:00
Linus Torvalds
7af8a0f808 arm64 updates for 4.9:
- Support for execute-only page permissions
 - Support for hibernate and DEBUG_PAGEALLOC
 - Support for heterogeneous systems with mismatches cache line sizes
 - Errata workarounds (A53 843419 update and QorIQ A-008585 timer bug)
 - arm64 PMU perf updates, including cpumasks for heterogeneous systems
 - Set UTS_MACHINE for building rpm packages
 - Yet another head.S tidy-up
 - Some cleanups and refactoring, particularly in the NUMA code
 - Lots of random, non-critical fixes across the board
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABCgAGBQJX7k31AAoJELescNyEwWM0XX0H/iOaWCfKlWOhvBsStGUCsLrK
 XryTzQT2KjdnLKf3jwP+1ateCuBR5ROurYxoDCX5/7mD63c5KiI338Vbv61a1lE1
 AAwjt1stmQVUg/j+kqnuQwB/0DYg+2C8se3D3q5Iyn7zc19cDZJEGcBHNrvLMufc
 XgHrgHgl/rzBDDlHJXleknDFge/MfhU5/Q1vJMRRb4JYrpAtmIokzCO75CYMRcCT
 ND2QbmppKtsyuFPGUTVbAFzJlP6dGKb3eruYta7/ct5d0pJQxav3u98D2yWGfjdM
 YaYq1EmX5Pol7rWumqLtk0+mA9yCFcKLLc+PrJu20Vx0UkvOq8G8Xt70sHNvZU8=
 =gdPM
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "It's a bit all over the place this time with no "killer feature" to
  speak of.  Support for mismatched cache line sizes should help people
  seeing whacky JIT failures on some SoCs, and the big.LITTLE perf
  updates have been a long time coming, but a lot of the changes here
  are cleanups.

  We stray outside arch/arm64 in a few areas: the arch/arm/ arch_timer
  workaround is acked by Russell, the DT/OF bits are acked by Rob, the
  arch_timer clocksource changes acked by Marc, CPU hotplug by tglx and
  jump_label by Peter (all CC'd).

  Summary:

   - Support for execute-only page permissions
   - Support for hibernate and DEBUG_PAGEALLOC
   - Support for heterogeneous systems with mismatches cache line sizes
   - Errata workarounds (A53 843419 update and QorIQ A-008585 timer bug)
   - arm64 PMU perf updates, including cpumasks for heterogeneous systems
   - Set UTS_MACHINE for building rpm packages
   - Yet another head.S tidy-up
   - Some cleanups and refactoring, particularly in the NUMA code
   - Lots of random, non-critical fixes across the board"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (100 commits)
  arm64: tlbflush.h: add __tlbi() macro
  arm64: Kconfig: remove SMP dependence for NUMA
  arm64: Kconfig: select OF/ACPI_NUMA under NUMA config
  arm64: fix dump_backtrace/unwind_frame with NULL tsk
  arm/arm64: arch_timer: Use archdata to indicate vdso suitability
  arm64: arch_timer: Work around QorIQ Erratum A-008585
  arm64: arch_timer: Add device tree binding for A-008585 erratum
  arm64: Correctly bounds check virt_addr_valid
  arm64: migrate exception table users off module.h and onto extable.h
  arm64: pmu: Hoist pmu platform device name
  arm64: pmu: Probe default hw/cache counters
  arm64: pmu: add fallback probe table
  MAINTAINERS: Update ARM PMU PROFILING AND DEBUGGING entry
  arm64: Improve kprobes test for atomic sequence
  arm64/kvm: use alternative auto-nop
  arm64: use alternative auto-nop
  arm64: alternative: add auto-nop infrastructure
  arm64: lse: convert lse alternatives NOP padding to use __nops
  arm64: barriers: introduce nops and __nops macros for NOP sequences
  arm64: sysreg: replace open-coded mrs_s/msr_s with {read,write}_sysreg_s
  ...
2016-10-03 08:58:35 -07:00
David S. Miller
b50afd203a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three sets of overlapping changes.  Nothing serious.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-02 22:20:41 -04:00
Rafael J. Wysocki
993eb0aeae Merge branches 'pm-devfreq' and 'pm-sleep'
* pm-devfreq:
  PM / devfreq: rk3399_dmc: Remove explictly regulator_put call in .remove
  PM / devfreq: rockchip: add PM_DEVFREQ_EVENT dependency
  partial revert of "PM / devfreq: Add COMPILE_TEST for build coverage"
  PM / devfreq: rockchip: add devfreq driver for rk3399 dmc
  Documentation: bindings: add dt documentation for rk3399 dmc
  PM / devfreq: event: support rockchip dfi controller
  Documentation: bindings: add dt documentation for dfi controller
  PM / devfreq: event: remove duplicate devfreq_event_get_drvdata()
  PM / devfreq: fix Kconfig indent style
  PM / devfreq: Add COMPILE_TEST for build coverage
  PM / devfreq: exynos-ppmu: remove unneeded of_node_put()

* pm-sleep:
  PM / Hibernate: allow hibernation with PAGE_POISONING_ZERO
  PM / sleep: enable suspend-to-idle even without registered suspend_ops
  PM / sleep: Increase default DPM watchdog timeout to 120
2016-10-02 01:43:45 +02:00
Rafael J. Wysocki
7005f6dc69 Merge branch 'pm-cpufreq'
* pm-cpufreq: (24 commits)
  cpufreq: st: add missing \n to end of dev_err message
  cpufreq: kirkwood: add missing \n to end of dev_err messages
  cpufreq: CPPC: Avoid overflow when calculating desired_perf
  cpufreq: ti: Use generic platdev driver
  cpufreq: intel_pstate: Add io_boost trace
  cpufreq: intel_pstate: Use IOWAIT flag in Atom algorithm
  cpufreq: schedutil: Add iowait boosting
  cpufreq / sched: SCHED_CPUFREQ_IOWAIT flag to indicate iowait condition
  cpufreq: CPPC: Force reporting values in KHz to fix user space interface
  cpufreq: create link to policy only for registered CPUs
  intel_pstate: constify local structures
  cpufreq: dt: Support governor tunables per policy
  cpufreq: dt: Update kconfig description
  cpufreq: dt: Remove unused code
  MAINTAINERS: Add Documentation/cpu-freq/
  cpufreq: dt: Add support for r8a7792
  cpufreq / sched: ignore SMT when determining max cpu capacity
  cpufreq: Drop unnecessary check from cpufreq_policy_alloc()
  ARM: multi_v7_defconfig: Don't attempt to enable schedutil governor as module
  ARM: exynos_defconfig: Don't attempt to enable schedutil governor as module
  ...
2016-10-02 01:42:45 +02:00
Rafael J. Wysocki
b6e2511782 Merge branch 'pm-cpufreq-sched' into pm-cpufreq 2016-10-02 01:42:33 +02:00
Eric W. Biederman
d29216842a mnt: Add a per mount namespace limit on the number of mounts
CAI Qian <caiqian@redhat.com> pointed out that the semantics
of shared subtrees make it possible to create an exponentially
increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

Will create create 2^20 or 1048576 mounts, which is a practical problem
as some people have managed to hit this by accident.

As such CVE-2016-6213 was assigned.

Ian Kent <raven@themaw.net> described the situation for autofs users
as follows:

> The number of mounts for direct mount maps is usually not very large because of
> the way they are implemented, large direct mount maps can have performance
> problems. There can be anywhere from a few (likely case a few hundred) to less
> than 10000, plus mounts that have been triggered and not yet expired.
>
> Indirect mounts have one autofs mount at the root plus the number of mounts that
> have been triggered and not yet expired.
>
> The number of autofs indirect map entries can range from a few to the common
> case of several thousand and in rare cases up to between 30000 and 50000. I've
> not heard of people with maps larger than 50000 entries.
>
> The larger the number of map entries the greater the possibility for a large
> number of active mounts so it's not hard to expect cases of a 1000 or somewhat
> more active mounts.

So I am setting the default number of mounts allowed per mount
namespace at 100,000.  This is more than enough for any use case I
know of, but small enough to quickly stop an exponential increase
in mounts.  Which should be perfect to catch misconfigurations and
malfunctioning programs.

For anyone who needs a higher limit this can be changed by writing
to the new /proc/sys/fs/mount-max sysctl.

Tested-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-30 12:46:48 -05:00
Thomas Gleixner
d7e25c66c9 Merge branch 'x86/urgent' into x86/asm
Get the cr4 fixes so we can apply the final cleanup
2016-09-30 12:38:28 +02:00
Frederic Weisbecker
447976ef4f sched/irqtime: Consolidate irqtime flushing code
The code performing irqtime nsecs stats flushing to kcpustat is roughly
the same for hardirq and softirq. So lets consolidate that common code.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1474849761-12678-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:46:41 +02:00
Frederic Weisbecker
19d23dbfeb sched/irqtime: Consolidate accounting synchronization with u64_stats API
The irqtime accounting currently implement its own ad hoc implementation
of u64_stats API. Lets rather consolidate it with the appropriate
library.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1474849761-12678-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:46:40 +02:00
Frederic Weisbecker
2810f611f9 sched/irqtime: Remove needless IRQs disablement on kcpustat update
The callers of the functions performing irqtime kcpustat updates have
IRQS disabled, no need to disable them again.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1474849761-12678-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:46:39 +02:00
Frederic Weisbecker
f9094a6575 sched/irqtime: No need for preempt-safe accessors
We can safely use the preempt-unsafe accessors for irqtime when we
flush its counters to kcpustat as IRQs are disabled at this time.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1474849761-12678-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:46:38 +02:00
Peter Zijlstra
b60205c7c5 sched/fair: Fix min_vruntime tracking
While going through enqueue/dequeue to review the movement of
set_curr_task() I noticed that the (2nd) update_min_vruntime() call in
dequeue_entity() is suspect.

It turns out, its actually wrong because it will consider
cfs_rq->curr, which could be the entry we just normalized. This mixes
different vruntime forms and leads to fail.

The purpose of the second update_min_vruntime() is to move
min_vruntime forward if the entity we just removed is the one that was
holding it back; _except_ for the DEQUEUE_SAVE case, because then we
know its a temporary removal and it will come back.

However, since we do put_prev_task() _after_ dequeue(), cfs_rq->curr
will still be set (and per the above, can be tranformed into a
different unit), so update_min_vruntime() should also consider
curr->on_rq. This also fixes another corner case where the enqueue
(which also does update_curr()->update_min_vruntime()) happens on the
rq->lock break in schedule(), between dequeue and put_prev_task.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 1e87623178 ("sched: Fix ->min_vruntime calculation in dequeue_entity()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:03:29 +02:00
Peter Zijlstra
9148a3a10e sched/debug: Add SCHED_WARN_ON()
Provide SCHED_WARN_ON as wrapper for WARN_ON_ONCE() to avoid
CONFIG_SCHED_DEBUG wrappery.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:03:29 +02:00
Peter Zijlstra
49bd21efe7 sched/core: Fix set_user_nice()
Almost all scheduler functions update state with the following
pattern:

	if (queued)
		dequeue_task(rq, p, DEQUEUE_SAVE);
	if (running)
		put_prev_task(rq, p);

	/* update state */

	if (queued)
		enqueue_task(rq, p, ENQUEUE_RESTORE);
	if (running)
		set_curr_task(rq, p);

set_user_nice() however misses the running part, cure this.

This was found by asserting we never enqueue 'current'.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:03:28 +02:00
Peter Zijlstra
b2bf6c314e sched/fair: Introduce set_curr_task() helper
Now that the ia64 only set_curr_task() symbol is gone, provide a
helper just like put_prev_task().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:03:28 +02:00
Peter Zijlstra
a458ae2ea6 sched/core, ia64: Rename set_curr_task()
Rename the ia64 only set_curr_task() function to free up the name.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:03:27 +02:00