Commit Graph

24024 Commits

Author SHA1 Message Date
Frederic Weisbecker
5613fda9a5 sched/cputime: Convert task/group cputime to nsecs
Now that most cputime readers use the transition API which return the
task cputime in old style cputime_t, we can safely store the cputime in
nsecs. This will eventually make cputime statistics less opaque and more
granular. Back and forth convertions between cputime_t and nsecs in order
to deal with cputime_t random granularity won't be needed anymore.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:49 +01:00
Frederic Weisbecker
a1cecf2ba7 sched/cputime: Introduce special task_cputime_t() API to return old-typed cputime
This API returns a task's cputime in cputime_t in order to ease the
conversion of cputime internals to use nsecs units instead. Blindly
converting all cputime readers to use this API now will later let us
convert more smoothly and step by step all these places to use the
new nsec based cputime.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-7-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:48 +01:00
Frederic Weisbecker
16a6d9be90 sched/cputime: Convert guest time accounting to nsecs (u64)
cputime_t is being obsolete and replaced by nsecs units in order to make
internal timestamps less opaque and more granular.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:48 +01:00
Frederic Weisbecker
7fb1327ee9 sched/cputime: Convert kcpustat to nsecs
Kernel CPU stats are stored in cputime_t which is an architecture
defined type, and hence a bit opaque and requiring accessors and mutators
for any operation.

Converting them to nsecs simplifies the code and is one step toward
the removal of cputime_t in the core code.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:47 +01:00
Frederic Weisbecker
07e5f5e353 time: Introduce jiffies64_to_nsecs()
This will be needed for the cputime_t to nsec conversion.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:45 +01:00
Frederic Weisbecker
93825f2ec7 jiffies: Reuse TICK_NSEC instead of NSEC_PER_JIFFY
NSEC_PER_JIFFY is an ad-hoc redefinition of TICK_NSEC. Let's rather
use a unique and well maintained version.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:45 +01:00
Ingo Molnar
ed5c8c854f Merge branch 'linus' into sched/core, to pick up fixes and refresh the branch
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:12:25 +01:00
Linus Torvalds
a2ca3d6179 It was reported to me that the thread created by the hwlat tracer does
not migrate after the first instance. I found that there was as small
 bug in the logic, and fixed it. It's minor, but should be fixed regardless.
 There's not much impact outside the hwlat tracer.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJYkP7tFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 srEH/iqi0mC4rJF57yLmwGXVAuwSRsxHlkGfyG3RBRnXOhiPLaM9Y6QZDWj8f99D
 DEnVJ7IMpW7tjJB0pDQBCMBi+NX6GvR9Zu4ub8k69UzDlw81CPdMiJ/HfEFsP5XA
 EOiDH/xGsYmbgGyqbU2VTb4lS8CijStht0jhydriLT1ga/W3bOl0/w6TeQW+AwqJ
 0tcksaKwAcH6iN11FUQfWPlirQ/aCTvj8FelgbT7MjYnwxDdbJSfKG+GwJjytLNQ
 Rj/STWD56OZpl91OuEYA50BZVxp3eF870xJXsLjg3mGIYBseW/T6ZTP8TOM+6WoH
 u2J08p8Owdtdo+9gkw+IwEuoFP4=
 =VCLk
 -----END PGP SIGNATURE-----

Merge tag 'trace-4.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "It was reported to me that the thread created by the hwlat tracer does
  not migrate after the first instance. I found that there was as small
  bug in the logic, and fixed it. It's minor, but should be fixed
  regardless. There's not much impact outside the hwlat tracer"

* tag 'trace-4.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix hwlat kthread migration
2017-01-31 16:32:40 -08:00
Linus Torvalds
f1774f46d4 Merge branch 'for-4.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
 "The cgroup creation path was getting the order of operations wrong and
  exposing cgroups which don't have their names set yet to controllers
  which can lead to NULL derefs.

  This contains the fix for the bug"

* 'for-4.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: don't online subsystems before cgroup_name/path() are operational
2017-01-31 13:54:41 -08:00
Christoph Hellwig
57292b58dd block: introduce blk_rq_is_passthrough
This can be used to check for fs vs non-fs requests and basically
removes all knowledge of BLOCK_PC specific from the block layer,
as well as preparing for removing the cmd_type field in struct request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-31 14:00:34 -07:00
Steven Rostedt (VMware)
79c6f448c8 tracing: Fix hwlat kthread migration
The hwlat tracer creates a kernel thread at start of the tracer. It is
pinned to a single CPU and will move to the next CPU after each period of
running. If the user modifies the migration thread's affinity, it will not
change after that happens.

The original code created the thread at the first instance it was called,
but later was changed to destroy the thread after the tracer was finished,
and would not be created until the next instance of the tracer was
established. The code that initialized the affinity was only called on the
initial instantiation of the tracer. After that, it was not initialized, and
the previous affinity did not match the current newly created one, making
it appear that the user modified the thread's affinity when it did not, and
the thread failed to migrate again.

Cc: stable@vger.kernel.org
Fixes: 0330f7aa8e ("tracing: Have hwlat trace migrate across tracing_cpumask CPUs")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-01-31 09:13:49 -05:00
Ingo Molnar
a8709fa4a0 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:

 - Dynticks updates, consolidating open-coded counter accesses into a well-defined API

 - SRCU updates: Simplify algorithm, add formal verification

 - Documentation updates

 - Miscellaneous fixes

 - Torture-test updates

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-31 07:45:42 +01:00
Joerg Roedel
93fa6cf60a Merge branch 'iommu/guest-msi' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into arm/core 2017-01-30 15:58:47 +01:00
Marc Zyngier
08d85f3ea9 irqdomain: Avoid activating interrupts more than once
Since commit f3b0946d62 ("genirq/msi: Make sure PCI MSIs are
activated early"), we can end-up activating a PCI/MSI twice (once
at allocation time, and once at startup time).

This is normally of no consequences, except that there is some
HW out there that may misbehave if activate is used more than once
(the GICv3 ITS, for example, uses the activate callback
to issue the MAPVI command, and the architecture spec says that
"If there is an existing mapping for the EventID-DeviceID
combination, behavior is UNPREDICTABLE").

While this could be worked around in each individual driver, it may
make more sense to tackle the issue at the core level. In order to
avoid getting in that situation, let's have a per-interrupt flag
to remember if we have already activated that interrupt or not.

Fixes: f3b0946d62 ("genirq/msi: Make sure PCI MSIs are activated early")
Reported-and-tested-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1484668848-24361-1-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-01-30 15:18:56 +01:00
Kan Liang
40999312c7 perf/core: Try parent PMU first when initializing a child event
perf has additional overhead when monitoring the task which
frequently generates child tasks.

perf_init_event() is one of the hotspots for the additional overhead:

Currently, to get the PMU, it tries to search the type in pmu_idr at
first. But it is not always successful, especially for the widely used
PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events. So it has to go to the
slow path which go through the whole PMUs list.

It will be a big performance issue, if the PMUs list is long (e.g. server
with many uncore boxes) and the task frequently generates child tasks.

The child event inherits its parent event. So the child event should
try its parent PMU first.

Here is some data from the overhead test on Broadwell server:

  perf record -e $TEST_EVENTS -- ./loop.sh 50000

  loop.sh
    start=$(date +%s%N)
    i=0
    while [ "$i" -le "$1" ]
    do
            date > /dev/null
            i=`expr $i + 1`
    done
    end=$(date +%s%N)
    elapsed=`expr $end - $start`

  Event#	Original elapsed time	Elapsed time with patch		delta
  1		196,573,192,397		189,162,029,998			-3.77%
  2		257,567,753,013		241,620,788,683			-6.19%
  4		398,730,726,971		370,518,938,714			-7.08%
  8		824,983,761,120		740,702,489,329			-10.22%
  16		1,883,411,923,498	1,672,027,508,355		-11.22%

... which shows a nice performance improvement.

Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1484745662-15928-2-git-send-email-kan.liang@intel.com
[ Tidied up the changelog and the code comment. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 12:01:16 +01:00
Alexander Shishkin
487f05e18a perf/core: Optimize event rescheduling on active contexts
When new events are added to an active context, we go and reschedule
all cpu groups and all task groups in order to preserve the priority
(cpu pinned, task pinned, cpu flexible, task flexible), but in
reality we only need to reschedule groups of the same priority as
that of the events being added, and below.

This patch changes the behavior so that only groups that need to be
rescheduled are rescheduled.

Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170119164330.22887-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 12:01:15 +01:00
Alexander Shishkin
fe45bafbd0 perf/core: Don't re-schedule CPU flexible events needlessly
In the sched-in path, we first remove a CPU's flexible events in order to
give priority to the task's pinned events. However, this step can be safely
skipped if the task doesn't have its own pinned events.

This patch implements this skipping.

Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170119164330.22887-2-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 12:01:14 +01:00
David Carrillo-Cisneros
1fd7e41699 perf/core: Remove perf_cpu_context::unique_pmu
cpuctx->unique_pmu was originally introduced as a way to identify cpuctxs
with shared pmus in order to avoid visiting the same cpuctx more than once
in a for_each_pmu loop.

cpuctx->unique_pmu == cpuctx->pmu in non-software task contexts since they
have only one pmu per cpuctx. Since perf_pmu_sched_task() is only called in
hw contexts, this patch replaces cpuctx->unique_pmu by cpuctx->pmu in it.

The change above, together with the previous patch in this series, removed
the remaining uses of cpuctx->unique_pmu, so we remove it altogether.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20170118192454.58008-3-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 12:01:14 +01:00
David Carrillo-Cisneros
058fe1c044 perf/core: Make cgroup switch visit only cpuctxs with cgroup events
This patch follows from a conversation in CQM/CMT's last series about
speeding up the context switch for cgroup events:

  https://patchwork.kernel.org/patch/9478617/

This is a low-hanging fruit optimization. It replaces the iteration over
the "pmus" list in cgroup switch by an iteration over a new list that
contains only cpuctxs with at least one cgroup event.

This is necessary because the number of PMUs have increased over the years
e.g modern x86 server systems have well above 50 PMUs.

The iteration over the full PMU list is unneccessary and can be costly in
heavy cache contention scenarios.

Below are some instrumentation measurements with 10, 50 and 90 percentiles
of the total cost of context switch before and after this optimization for
a simple array read/write microbenchark.

  Contention
    Level    Nr events      Before (us)            After (us)       Median
  L2    L3     types      (10%, 50%, 90%)       (10%, 50%, 90%     Speedup
  --------------------------------------------------------------------------
  Low   Low       1       (1.72, 2.42, 5.85)    (1.35, 1.64, 5.46)     29%
  High  Low       1       (2.08, 4.56, 19.8)    (1720, 2.20, 13.7)     51%
  High  High      1       (2.86, 10.4, 12.7)    (2.54, 4.32, 12.1)     58%

  Low   Low       2       (1.98, 3.20, 6.89)    (1.68, 2.41, 8.89)     24%
  High  Low       2       (2.48, 5.28, 22.4)    (2150, 3.69, 14.6)     30%
  High  High      2       (3.32, 8.09, 13.9)    (2.80, 5.15, 13.7)     36%

where:

  1 event type  = cycles
  2 event types = cycles,intel_cqm/llc_occupancy/

   Contention L2 Low: workset  <  L2 cache size.
                 High:  "     >>  L2   "     " .
   Contention L3 Low: workset of task on all sockets  <  L3 cache size.
                 High:   "     "   "   "   "    "    >>  L3   "     " .

   Median Speedup is (50%ile Before - 50%ile After) /  50%ile Before

Unsurprisingly, the benefits of this optimization decrease with the number
of cpuctxs with a cgroup events, yet, is never detrimental.

Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20170118192454.58008-2-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 12:01:13 +01:00
Ingo Molnar
ae5112a825 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:47:00 +01:00
Sebastian Andrzej Siewior
619bd4a718 sched/rt: Add a missing rescheduling point
Since the change in commit:

  fd7a4bed18 ("sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks")

... we don't reschedule a task under certain circumstances:

Lets say task-A, SCHED_OTHER, is running on CPU0 (and it may run only on
CPU0) and holds a PI lock. This task is removed from the CPU because it
used up its time slice and another SCHED_OTHER task is running. Task-B on
CPU1 runs at RT priority and asks for the lock owned by task-A. This
results in a priority boost for task-A. Task-B goes to sleep until the
lock has been made available. Task-A is already runnable (but not active),
so it receives no wake up.

The reality now is that task-A gets on the CPU once the scheduler decides
to remove the current task despite the fact that a high priority task is
enqueued and waiting. This may take a long time.

The desired behaviour is that CPU0 immediately reschedules after the
priority boost which made task-A the task with the lowest priority.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: fd7a4bed18 ("sched, rt: Convert switched_{from, to}_rt() prio_changed_rt() to balance callbacks")
Link: http://lkml.kernel.org/r/20170124144006.29821-1-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:46:37 +01:00
Mathieu Poirier
4b12db9391 sched/core: Fix &rd->cpudl memory leak
While in the process of initialising a root domain, if function
cpupri_init() fails the memory allocated in cpudl_init() is not
reclaimed.

Adding a new goto target to cleanup the previous initialistion of
the root_domain's dl_bw structure reclaims said memory.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1485292295-21298-2-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:46:37 +01:00
Mathieu Poirier
92c99ac829 sched/core: Fix &rd->rto_mask memory leak
If function cpudl_init() fails the memory allocated for &rd->rto_mask
needs to be freed, something this patch is addressing.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1485292295-21298-1-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:46:36 +01:00
Matt Fleming
4d25b35ea3 sched/fair: Restore previous rq_flags when migrating tasks in hotplug
__migrate_task() can return with a different runqueue locked than the
one we passed as an argument. So that we can repin the lock in
migrate_tasks() (and keep the update_rq_clock() bit) we need to
restore the old rq_flags before repinning.

Note that it wouldn't be correct to change move_queued_task() to repin
because of the change of runqueue and the fact that having an
up-to-date clock on the initial rq doesn't mean the new rq has one
too.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:46:35 +01:00
Peter Zijlstra
1b1d62254d sched/core: Add missing update_rq_clock() call in sched_move_task()
Bug was noticed via this warning:

  WARNING: CPU: 6 PID: 1 at kernel/sched/sched.h:804 detach_task_cfs_rq+0x8e8/0xb80
  rq->clock_update_flags < RQCF_ACT_SKIP
  Modules linked in:
  CPU: 6 PID: 1 Comm: systemd Not tainted 4.10.0-rc5-00140-g0874170baf55-dirty #1
  Hardware name: Supermicro SYS-4048B-TRFT/X10QBi, BIOS 1.0 04/11/2014
  Call Trace:
   dump_stack+0x4d/0x65
   __warn+0xcb/0xf0
   warn_slowpath_fmt+0x5f/0x80
   detach_task_cfs_rq+0x8e8/0xb80
   ? allocate_cgrp_cset_links+0x59/0x80
   task_change_group_fair+0x27/0x150
   sched_change_group+0x48/0xf0
   sched_move_task+0x53/0x150
   cpu_cgroup_attach+0x36/0x70
   cgroup_taskset_migrate+0x175/0x300
   cgroup_migrate+0xab/0xd0
   cgroup_attach_task+0xf0/0x190
   __cgroup_procs_write+0x1ed/0x2f0
   cgroup_procs_write+0x14/0x20
   cgroup_file_write+0x3f/0x100
   kernfs_fop_write+0x104/0x180
   __vfs_write+0x37/0x140
   vfs_write+0xb8/0x1b0
   SyS_write+0x55/0xc0
   do_syscall_64+0x61/0x170
   entry_SYSCALL64_slow_path+0x25/0x25

Reported-by: Ingo Molnar <mingo@kernel.org>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:46:34 +01:00
Peter Zijlstra
49ee576809 sched/core: Optimize pick_next_task() for idle_sched_class
Steve noticed that when we switch from IDLE to SCHED_OTHER we fail to
take the shortcut, even though all runnable tasks are of the fair
class, because prev->sched_class != &fair_sched_class.

Since I reworked the put_prev_task() stuff, we don't really care about
prev->class here, so removing that condition will allow this case.

This increases the likely case from 78% to 98% correct for Steve's
workload.

Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170119174408.GN6485@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:46:34 +01:00
Peter Zijlstra
b9c16a0e1f locking/mutex: Fix lockdep_assert_held() fail
In commit:

  659cf9f582 ("locking/ww_mutex: Optimize ww-mutexes by waking at most one waiter for backoff when acquiring the lock")

I replaced a comment with a lockdep_assert_held(). However it turns out
we hide that lock from lockdep for hysterical raisins, which results
in the assertion always firing.

Remove the old debug code as lockdep will easily spot the abuse it was
meant to catch, which will make the lock visible to lockdep and make
the assertion work as intended.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicolai Haehnle <Nicolai.Haehnle@amd.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 659cf9f582 ("locking/ww_mutex: Optimize ww-mutexes by waking at most one waiter for backoff when acquiring the lock")
Link: http://lkml.kernel.org/r/20170117150609.GB32474@worktop
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:42:59 +01:00
Steven Rostedt (VMware)
4009f4b3a9 locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
Running my likely/unlikely profiler for 3 weeks on two production
machines, I discovered that the unlikely() test in
__rt_mutex_slowlock() checking if state is TASK_INTERRUPTIBLE is hit
100% of the time, making it a very likely case.

The reason is, on a vanilla kernel, the majority case of calling
rt_mutex() is from the futex code. This code is always called as
TASK_INTERRUPTIBLE. In the -rt patch, this code is commonly called when
PREEMPT_RT is enabled with TASK_UNINTERRUPTIBLE. But that's not the
likely scenario.

The rt_mutex() code should be optimized for the common vanilla case,
and that is from a futex, with TASK_INTERRUPTIBLE as the state.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170119113234.1efeedd1@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:42:59 +01:00
Peter Zijlstra
0b3589be9b perf/core: Fix PERF_RECORD_MMAP2 prot/flags for anonymous memory
Andres reported that MMAP2 records for anonymous memory always have
their protection field 0.

Turns out, someone daft put the prot/flags generation code in the file
branch, leaving them unset for anonymous memory.

Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Don Zickus <dzickus@redhat.com
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@gmail.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: anton@ozlabs.org
Cc: namhyung@kernel.org
Cc: stable@vger.kernel.org # v3.16+
Fixes: f972eb63b1 ("perf: Pass protection and flags bits through mmap2 interface")
Link: http://lkml.kernel.org/r/20170126221508.GF6536@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:41:26 +01:00
Peter Zijlstra
a76a82a3e3 perf/core: Fix use-after-free bug
Dmitry reported a KASAN use-after-free on event->group_leader.

It turns out there's a hole in perf_remove_from_context() due to
event_function_call() not calling its function when the task
associated with the event is already dead.

In this case the event will have been detached from the task, but the
grouping will have been retained, such that group operations might
still work properly while there are live child events etc.

This does however mean that we can miss a perf_group_detach() call
when the group decomposes, this in turn can then lead to
use-after-free.

Fix it by explicitly doing the group detach if its still required.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org # v4.5+
Cc: syzkaller <syzkaller@googlegroups.com>
Fixes: 63b6da39bb ("perf: Fix perf_event_exit_task() race")
Link: http://lkml.kernel.org/r/20170126153955.GD6515@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30 11:41:25 +01:00
Thomas Gleixner
9556ad6ad0 Merge branch 'fortglx/4.11/time' of https://git.linaro.org/people/john.stultz/linux into timers/core
- Remove unused functions
 - Document udelay inaccuracy
 - Remove posix timer data from task struct when posix timers are off
2017-01-30 11:22:39 +01:00
Rafael J. Wysocki
858a0d7eb5 Merge back earlier suspend/hibernation changes for v4.11. 2017-01-30 09:00:02 +01:00
David S. Miller
4e8f2fc1a5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two trivial overlapping changes conflicts in MPLS and mlx5.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-28 10:33:06 -05:00
Christoph Hellwig
48b77ad608 block: cleanup tracing
A couple tweaks to the tracing code:

 - trace the request size for all requests
 - trace request sector and nr_sectors only for fs requests, enforced by
   helpers
 - drop SCSI CDB tracing - we have SCSI tracing for this and are going
   to me the CDB out of the generic struct request soon.

With this the tracing code stops to know about BLOCK_PC requests entirely,
it's just FS vs passthrough requests now, where the latter includes any
driver-private requests.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27 15:08:35 -07:00
Nicolas Pitre
b18b6a9cef timers: Omit POSIX timer stuff from task_struct when disabled
When CONFIG_POSIX_TIMERS is disabled, it is preferable to remove related
structures from struct task_struct and struct signal_struct as they
won't contain anything useful and shouldn't be relied upon by mistake.
Code still referencing those structures is also disabled here.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2017-01-27 13:05:26 -08:00
Linus Torvalds
1b1bc42c16 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) GTP fixes from Andreas Schultz (missing genl module alias, clear IP
    DF on transmit).

 2) Netfilter needs to reflect the fwmark when sending resets, from Pau
    Espin Pedrol.

 3) nftable dump OOPS fix from Liping Zhang.

 4) Fix erroneous setting of VIRTIO_NET_HDR_F_DATA_VALID on transmit,
    from Rolf Neugebauer.

 5) Fix build error of ipt_CLUSTERIP when procfs is disabled, from Arnd
    Bergmann.

 6) Fix regression in handling of NETIF_F_SG in harmonize_features(),
    from Eric Dumazet.

 7) Fix RTNL deadlock wrt. lwtunnel module loading, from David Ahern.

 8) tcp_fastopen_create_child() needs to setup tp->max_window, from
    Alexey Kodanev.

 9) Missing kmemdup() failure check in ipv6 segment routing code, from
    Eric Dumazet.

10) Don't execute unix_bind() under the bindlock, otherwise we deadlock
    with splice. From WANG Cong.

11) ip6_tnl_parse_tlv_enc_lim() potentially reallocates the skb buffer,
    therefore callers must reload cached header pointers into that skb.
    Fix from Eric Dumazet.

12) Fix various bugs in legacy IRQ fallback handling in alx driver, from
    Tobias Regnery.

13) Do not allow lwtunnel drivers to be unloaded while they are
    referenced by active instances, from Robert Shearman.

14) Fix truncated PHY LED trigger names, from Geert Uytterhoeven.

15) Fix a few regressions from virtio_net XDP support, from John
    Fastabend and Jakub Kicinski.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (102 commits)
  ISDN: eicon: silence misleading array-bounds warning
  net: phy: micrel: add support for KSZ8795
  gtp: fix cross netns recv on gtp socket
  gtp: clear DF bit on GTP packet tx
  gtp: add genl family modules alias
  tcp: don't annotate mark on control socket from tcp_v6_send_response()
  ravb: unmap descriptors when freeing rings
  virtio_net: reject XDP programs using header adjustment
  virtio_net: use dev_kfree_skb for small buffer XDP receive
  r8152: check rx after napi is enabled
  r8152: re-schedule napi for tx
  r8152: avoid start_xmit to schedule napi when napi is disabled
  r8152: avoid start_xmit to call napi_schedule during autosuspend
  net: dsa: Bring back device detaching in dsa_slave_suspend()
  net: phy: leds: Fix truncated LED trigger names
  net: phy: leds: Break dependency of phy.h on phy_led_triggers.h
  net: phy: leds: Clear phy_num_led_triggers on failure to avoid crash
  net-next: ethernet: mediatek: change the compatible string
  Documentation: devicetree: change the mediatek ethernet compatible string
  bnxt_en: Fix RTNL lock usage on bnxt_get_port_module_status().
  ...
2017-01-27 12:54:16 -08:00
Geliang Tang
47087eeb74 PM / Hibernate: Use rb_entry() instead of container_of()
To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-01-27 11:31:12 +01:00
Linus Torvalds
7d3a0fa52e Power management fixes for v4.10-rc6
- Revert the recent change that caused suspend-to-idle to be used
    as the default suspend method on systems where it is indicated to
    be efficient by the ACPI tables, as that turned out to be premature
    and introduced suspend regressions on some systems with missing
    power management support in device drivers (Rafael Wysocki).
 
  - Fix up the intel_pstate driver to take changes of the global
    limits via sysfs correctly when the performance policy is used
    which has been broken by a recent change in it (Srinivas Pandruvada).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJYiomwAAoJEILEb/54YlRxgGEP/i2Z+MVZbIwifod6wz6Yt0mj
 Rm2uri24qEdJJSWCejDAngySiU+ymNJIYfVx2q5l99x1W4WTwyNWuEdWhGMTBDwG
 tI3eeLSCZcAmPeMb7V/l8lPvJiKGBf8tM162j09zpZpgO4LFsunPX9PA5IWbK55M
 U4hYeho3OlLjT9yiS8Yc9iLAZPrf7MRDBqtM0PeCklJHHyYmberbmSirH/TPm1Sq
 TPHHbBk6d2sFxA5mkUEItm5y7g9Wq9kN/E08a6TA0HthQBnjEaJ9GTQCpSBOEGHF
 MgEu/MOxm1Kou9YvOQMN3B1L9/VOb5JatV8RkMDltEctJseymQejdg8gFMLIKZ5j
 lDAfC/tSpXAXvxUPl/ObYloKJP3HV4ly8urxZ8rqqUWLPq7vK/jo5OwA97Kjp5a0
 /qW0LACoK8B96WOYYaNR1hWullH7+hDItKkbbBSBKKSNPXCgOmzqkGjCqvze4yYl
 yS2PeOgr6cA3D0iAMIFhmiRkaauAv/Dl++yiF7oVEpn8JI0UfpjOKBegPOdAysFw
 ADb5wLAHZ1LCyTn2CnLz1i2F87HxrYrFrKdTwjjBLyEu2aIw9sFuWSvbxuzZ7asf
 u2JaUA+KmvMhkZMUkgX2jhzR3siHLEFOqnfWQh2rnZV2ukNCWxL/YjjcWkoUuLTc
 H+Kx+VxcU6wKJPVWsnsK
 =YF+u
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix two regressions introduced recently, one by reverting the
  problematic commit and one by fixing up the behavior in an overlooked
  case.

  Specifics:

   - Revert the recent change that caused suspend-to-idle to be used as
     the default suspend method on systems where it is indicated to be
     efficient by the ACPI tables, as that turned out to be premature
     and introduced suspend regressions on some systems with missing
     power management support in device drivers (Rafael Wysocki).

   - Fix up the intel_pstate driver to take changes of the global limits
     via sysfs correctly when the performance policy is used which has
     been broken by a recent change in it (Srinivas Pandruvada)"

* tag 'pm-4.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: intel_pstate: Fix sysfs limits enforcement for performance policy
  Revert "PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flag"
2017-01-26 17:14:17 -08:00
Rafael J. Wysocki
ff7e593c9c Merge branches 'pm-sleep' and 'pm-cpufreq'
* pm-sleep:
  Revert "PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flag"

* pm-cpufreq:
  cpufreq: intel_pstate: Fix sysfs limits enforcement for performance policy
2017-01-27 00:08:59 +01:00
Tejun Heo
07cd129455 cgroup: don't online subsystems before cgroup_name/path() are operational
While refactoring cgroup creation, a5bca21520 ("cgroup: factor out
cgroup_create() out of cgroup_mkdir()") incorrectly onlined subsystems
before the new cgroup is associated with it kernfs_node.  This is fine
for cgroup proper but cgroup_name/path() depend on the associated
kernfs_node and if a subsystem makes the new cgroup_subsys_state
visible, which they're allowed to after onlining, it can lead to NULL
dereference.

The current code performs cgroup creation and subsystem onlining in
cgroup_create() and cgroup_mkdir() makes the cgroup and subsystems
visible afterwards.  There's no reason to online the subsystems early
and we can simply drop cgroup_apply_control_enable() call from
cgroup_create() so that the subsystems are onlined and made visible at
the same time.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: a5bca21520 ("cgroup: factor out cgroup_create() out of cgroup_mkdir()") 
Cc: stable@vger.kernel.org # v4.6+
2017-01-26 16:47:28 -05:00
Eric Dumazet
ff9f8a7cf9 sysctl: fix proc_doulongvec_ms_jiffies_minmax()
We perform the conversion between kernel jiffies and ms only when
exporting kernel value to user space.

We need to do the opposite operation when value is written by user.

Only matters when HZ != 1000

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-26 09:21:24 -08:00
Paul E. McKenney
31945aa9f1 Merge branches 'doc.2017.01.15b', 'dyntick.2017.01.23a', 'fixes.2017.01.23a', 'srcu.2017.01.25a' and 'torture.2017.01.15b' into HEAD
doc.2017.01.15b: Documentation updates
dyntick.2017.01.23a: Dyntick tracking consolidation
fixes.2017.01.23a: Miscellaneous fixes
srcu.2017.01.25a: SRCU rewrite, fixes, and verification
torture.2017.01.15b: Torture-test updates
2017-01-25 12:56:05 -08:00
Paul E. McKenney
7f554a3d05 srcu: Reduce probability of SRCU ->unlock_count[] counter overflow
Because there are no memory barriers between the srcu_flip() ->completed
increment and the summation of the read-side ->unlock_count[] counters,
both the compiler and the CPU can reorder the summation with the
->completed increment.  If the updater is preempted long enough during
this process, the read-side counters could overflow, resulting in a
too-short grace period.

This commit therefore adds a memory barrier just after the ->completed
increment, ensuring that if the summation misses an increment of
->unlock_count[] from __srcu_read_unlock(), the next __srcu_read_lock()
will see the new value of ->completed, thus bounding the number of
->unlock_count[] increments that can be missed to NR_CPUS.  The actual
overflow computation is more complex due to the possibility of nesting
of __srcu_read_lock().

Reported-by: Lance Roy <ldr709@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-01-25 12:54:22 -08:00
Paul E. McKenney
d85b62f18d srcu: Force full grace-period ordering
If a process invokes synchronize_srcu(), is delayed just the right amount
of time, and thus does not sleep when waiting for the grace period to
complete, there is no ordering between the end of the grace period and
the code following the synchronize_srcu().  Similarly, there can be a
lack of ordering between the end of the SRCU grace period and callback
invocation.

This commit adds the necessary ordering.

Reported-by: Lance Roy <ldr709@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Further smp_mb() adjustment per email with Lance Roy. ]
2017-01-25 12:54:22 -08:00
Lance Roy
f2c4689640 srcu: Implement more-efficient reader counts
SRCU uses two per-cpu counters: a nesting counter to count the number of
active critical sections, and a sequence counter to ensure that the nesting
counters don't change while they are being added together in
srcu_readers_active_idx_check().

This patch instead uses per-cpu lock and unlock counters. Because both
counters only increase and srcu_readers_active_idx_check() reads the unlock
counter before the lock counter, this achieves the same end without having
to increment two different counters in srcu_read_lock(). This also saves a
smp_mb() in srcu_readers_active_idx_check().

Possible bug: There is no guarantee that the lock counter won't overflow
during srcu_readers_active_idx_check(), as there are no memory barriers
around srcu_flip() (see comment in srcu_readers_active_idx_check() for
details). However, this problem was already present before this patch.

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Lance Roy <ldr709@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-01-25 12:53:20 -08:00
Daniel Borkmann
a67edbf4fb bpf: add initial bpf tracepoints
This work adds a number of tracepoints to paths that are either
considered slow-path or exception-like states, where monitoring or
inspecting them would be desirable.

For bpf(2) syscall, tracepoints have been placed for main commands
when they succeed. In XDP case, tracepoint is for exceptions, that
is, f.e. on abnormal BPF program exit such as unknown or XDP_ABORTED
return code, or when error occurs during XDP_TX action and the packet
could not be forwarded.

Both have been split into separate event headers, and can be further
extended. Worst case, if they unexpectedly should get into our way in
future, they can also removed [1]. Of course, these tracepoints (like
any other) can be analyzed by eBPF itself, etc. Example output:

  # ./perf record -a -e bpf:* sleep 10
  # ./perf script
  sock_example  6197 [005]   283.980322:      bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
  sock_example  6197 [005]   283.980721:       bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
  sock_example  6197 [005]   283.988423:   bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
  sock_example  6197 [005]   283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
  [...]
  sock_example  6197 [005]   288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
       swapper     0 [005]   289.338243:    bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER

  [1] https://lwn.net/Articles/705270/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:17:47 -05:00
Daniel Borkmann
2acae0d5b0 trace: add variant without spacing in trace_print_hex_seq
For upcoming tracepoint support for BPF, we want to dump the program's
tag. Format should be similar to __print_hex(), but without spacing.
Add a __print_hex_str() variant for exactly that purpose that reuses
trace_print_hex_seq().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:17:47 -05:00
Ingo Molnar
47cd95a632 Merge branch 'linus' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-25 15:52:46 +01:00
Linus Torvalds
883af14e67 Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "26 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits)
  MAINTAINERS: add Dan Streetman to zbud maintainers
  MAINTAINERS: add Dan Streetman to zswap maintainers
  mm: do not export ioremap_page_range symbol for external module
  mn10300: fix build error of missing fpu_save()
  romfs: use different way to generate fsid for BLOCK or MTD
  frv: add missing atomic64 operations
  mm, page_alloc: fix premature OOM when racing with cpuset mems update
  mm, page_alloc: move cpuset seqcount checking to slowpath
  mm, page_alloc: fix fast-path race with cpuset update or removal
  mm, page_alloc: fix check for NULL preferred_zone
  kernel/panic.c: add missing \n
  fbdev: color map copying bounds checking
  frv: add atomic64_add_unless()
  mm/mempolicy.c: do not put mempolicy before using its nodemask
  radix-tree: fix private list warnings
  Documentation/filesystems/proc.txt: add VmPin
  mm, memcg: do not retry precharge charges
  proc: add a schedule point in proc_pid_readdir()
  mm: alloc_contig: re-allow CMA to compact FS pages
  mm/slub.c: trace free objects at KERN_INFO
  ...
2017-01-24 16:54:39 -08:00
Jiri Slaby
ff7a28a074 kernel/panic.c: add missing \n
When a system panics, the "Rebooting in X seconds.." message is never
printed because it lacks a new line.  Fix it.

Link: http://lkml.kernel.org/r/20170119114751.2724-1-jslaby@suse.cz
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24 16:26:14 -08:00
Don Zickus
b94f51183b kernel/watchdog: prevent false hardlockup on overloaded system
On an overloaded system, it is possible that a change in the watchdog
threshold can be delayed long enough to trigger a false positive.

This can easily be achieved by having a cpu spinning indefinitely on a
task, while another cpu updates watchdog threshold.

What happens is while trying to park the watchdog threads, the hrtimers
on the other cpus trigger and reprogram themselves with the new slower
watchdog threshold.  Meanwhile, the nmi watchdog is still programmed
with the old faster threshold.

Because the one cpu is blocked, it prevents the thread parking on the
other cpus from completing, which is needed to shutdown the nmi watchdog
and reprogram it correctly.  As a result, a false positive from the nmi
watchdog is reported.

Fix this by setting a park_in_progress flag to block all lockups until
the parking is complete.

Fix provided by Ulrich Obergfell.

[akpm@linux-foundation.org: s/park_in_progress/watchdog_park_in_progress/]
Link: http://lkml.kernel.org/r/1481041033-192236-1-git-send-email-dzickus@redhat.com
Signed-off-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24 16:26:14 -08:00
Linus Torvalds
19ca2c8fec Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace fix from Eric Biederman:
 "This has a single brown bag fix.

  The possible deadlock with dec_pid_namespaces that I had thought was
  fixed earlier turned out only to have been moved. So instead of being
  cleaver this change takes ucounts_lock with irqs disabled. So
  dec_ucount can be used from any context without fear of deadlock.

  The items accounted for dec_ucount and inc_ucount are all
  comparatively heavy weight objects so I don't exepct this will have
  any measurable performance impact"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  userns: Make ucounts lock irq-safe
2017-01-24 12:21:51 -08:00
Daniel Borkmann
3fadc80115 bpf: enable verifier to better track const alu ops
William reported couple of issues in relation to direct packet
access. Typical scheme is to check for data + [off] <= data_end,
where [off] can be either immediate or coming from a tracked
register that contains an immediate, depending on the branch, we
can then access the data. However, in case of calculating [off]
for either the mentioned test itself or for access after the test
in a more "complex" way, then the verifier will stop tracking the
CONST_IMM marked register and will mark it as UNKNOWN_VALUE one.

Adding that UNKNOWN_VALUE typed register to a pkt() marked
register, the verifier then bails out in check_packet_ptr_add()
as it finds the registers imm value below 48. In the first below
example, that is due to evaluate_reg_imm_alu() not handling right
shifts and thus marking the register as UNKNOWN_VALUE via helper
__mark_reg_unknown_value() that resets imm to 0.

In the second case the same happens at the time when r4 is set
to r4 &= r5, where it transitions to UNKNOWN_VALUE from
evaluate_reg_imm_alu(). Later on r4 we shift right by 3 inside
evaluate_reg_alu(), where the register's imm turns into 3. That
is, for registers with type UNKNOWN_VALUE, imm of 0 means that
we don't know what value the register has, and for imm > 0 it
means that the value has [imm] upper zero bits. F.e. when shifting
an UNKNOWN_VALUE register by 3 to the right, no matter what value
it had, we know that the 3 upper most bits must be zero now.
This is to make sure that ALU operations with unknown registers
don't overflow. Meaning, once we know that we have more than 48
upper zero bits, or, in other words cannot go beyond 0xffff offset
with ALU ops, such an addition will track the target register
as a new pkt() register with a new id, but 0 offset and 0 range,
so for that a new data/data_end test will be required. Is the source
register a CONST_IMM one that is to be added to the pkt() register,
or the source instruction is an add instruction with immediate
value, then it will get added if it stays within max 0xffff bounds.
>From there, pkt() type, can be accessed should reg->off + imm be
within the access range of pkt().

  [...]
  from 28 to 30: R0=imm1,min_value=1,max_value=1
    R1=pkt(id=0,off=0,r=22) R2=pkt_end
    R3=imm144,min_value=144,max_value=144
    R4=imm0,min_value=0,max_value=0
    R5=inv48,min_value=2054,max_value=2054 R10=fp
  30: (bf) r5 = r3
  31: (07) r5 += 23
  32: (77) r5 >>= 3
  33: (bf) r6 = r1
  34: (0f) r6 += r5
  cannot add integer value with 0 upper zero bits to ptr_to_packet

  [...]
  from 52 to 80: R0=imm1,min_value=1,max_value=1
    R1=pkt(id=0,off=0,r=34) R2=pkt_end R3=inv
    R4=imm272 R5=inv56,min_value=17,max_value=17
    R6=pkt(id=0,off=26,r=34) R10=fp
  80: (07) r4 += 71
  81: (18) r5 = 0xfffffff8
  83: (5f) r4 &= r5
  84: (77) r4 >>= 3
  85: (0f) r1 += r4
  cannot add integer value with 3 upper zero bits to ptr_to_packet

Thus to get above use-cases working, evaluate_reg_imm_alu() has
been extended for further ALU ops. This is fine, because we only
operate strictly within realm of CONST_IMM types, so here we don't
care about overflows as they will happen in the simulated but also
real execution and interaction with pkt() in check_packet_ptr_add()
will check actual imm value once added to pkt(), but it's irrelevant
before.

With regards to 06c1c04972 ("bpf: allow helpers access to variable
memory") that works on UNKNOWN_VALUE registers, the verifier becomes
now a bit smarter as it can better resolve ALU ops, so we need to
adapt two test cases there, as min/max bound tracking only becomes
necessary when registers were spilled to stack. So while mask was
set before to track upper bound for UNKNOWN_VALUE case, it's now
resolved directly as CONST_IMM, and such contructs are only necessary
when f.e. registers are spilled.

For commit 6b17387307 ("bpf: recognize 64bit immediate loads as
consts") that initially enabled dw load tracking only for nfp jit/
analyzer, I did couple of tests on large, complex programs and we
don't increase complexity badly (my tests were in ~3% range on avg).
I've added a couple of tests similar to affected code above, and
it works fine with verifier now.

Reported-by: William Tu <u9012063@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Gianluca Borello <g.borello@gmail.com>
Cc: William Tu <u9012063@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 14:46:06 -05:00
Daniel Borkmann
d140199af5 bpf, lpm: fix kfree of im_node in trie_update_elem
We need to initialize im_node to NULL, otherwise in case of error path
it gets passed to kfree() as uninitialized pointer.

Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-23 21:17:35 -05:00
Daniel Mack
b95a5c4db0 bpf: add a longest prefix match trie map implementation
This trie implements a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges.

Internally, data is stored in an unbalanced trie of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.

Tries may be created with prefix lengths that are multiples of 8, in
the range from 8 to 2048. The key used for lookup and update operations
is a struct bpf_lpm_trie_key, and the value is a uint64_t.

The code carries more information about the internal implementation.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-23 16:10:38 -05:00
Paul E. McKenney
38d30b336c rcu: Adjust FQS offline checks for exact online-CPU detection
Commit 7ec99de36f ("rcu: Provide exact CPU-online tracking for RCU"),
as its title suggests, got rid of RCU's remaining CPU-hotplug timing
guesswork.  This commit therefore removes the one-jiffy kludge that was
used to paper over this guesswork.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:44:18 -08:00
Paul E. McKenney
3a19b46a5c rcu: Check cond_resched_rcu_qs() state less often to reduce GP overhead
Commit 4a81e8328d ("rcu: Reduce overhead of cond_resched() checks
for RCU") moved quiescent-state generation out of cond_resched()
and commit bde6c3aa99 ("rcu: Provide cond_resched_rcu_qs() to force
quiescent states in long loops") introduced cond_resched_rcu_qs(), and
commit 5cd37193ce ("rcu: Make cond_resched_rcu_qs() apply to normal RCU
flavors") introduced the per-CPU rcu_qs_ctr variable, which is frequently
polled by the RCU core state machine.

This frequent polling can increase grace-period rate, which in turn
increases grace-period overhead, which is visible in some benchmarks
(for example, the "open1" benchmark in Anton Blanchard's "will it scale"
suite).  This commit therefore reduces the rate at which rcu_qs_ctr
is polled by moving that polling into the force-quiescent-state (FQS)
machinery, and by further polling it only after the grace period has
been in effect for at least jiffies_till_sched_qs jiffies.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:44:18 -08:00
Paul E. McKenney
02a5c550b2 rcu: Abstract extended quiescent state determination
This commit is the fourth step towards full abstraction of all accesses
to the ->dynticks counter, implementing previously open-coded checks and
comparisons in new rcu_dynticks_in_eqs() and rcu_dynticks_in_eqs_since()
functions.  This abstraction will ease changes to the ->dynticks counter
operation.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:44:18 -08:00
Paul E. McKenney
2625d469ba rcu: Abstract dynticks extended quiescent state enter/exit operations
This commit is the third step towards full abstraction of all accesses
to the ->dynticks counter, implementing the previously open-coded atomic
add of 1 and entry checks in a new rcu_dynticks_eqs_enter() function, and
the same but with exit checks in a new rcu_dynticks_eqs_exit() function.
This abstraction will ease changes to the ->dynticks counter operation.

Note that this commit gets rid of the smp_mb__before_atomic() and the
smp_mb__after_atomic() calls that were previously present.  The reason
that this is OK from a memory-ordering perspective is that the atomic
operation is now atomic_add_return(), which, as a value-returning atomic,
guarantees full ordering.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fixed RCU_TRACE() statements added by this commit. ]
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:42:43 -08:00
Paul E. McKenney
8dc79888a7 rcu: Add lockdep checks to synchronous expedited primitives
The non-expedited synchronize_*rcu() primitives have lockdep checks, but
their expedited counterparts lack these checks.  This commit therefore
adds these checks to the expedited synchronize_*rcu() primitives.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:37:14 -08:00
Paul E. McKenney
bb4e2c08bb rcu: Eliminate unused expedited_normal counter
Expedited grace periods no longer fall back to normal grace periods
in response to lock contention, given that expedited grace periods
now use the rcu_node tree so as to avoid contention.  This commit
therfore removes the expedited_normal counter.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:37:14 -08:00
Paul E. McKenney
9831ce3bb4 rcu: Fix comment in rcu_organize_nocb_kthreads()
It used to be that the rcuo callback-offload kthreads were spawned
in rcu_organize_nocb_kthreads(), and the comment before the "for"
loop says as much.  However, this spawning has long since moved to
the CPU-hotplug code, so this commit fixes this comment.

Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:37:13 -08:00
Paul E. McKenney
fdbb9b315c rcu: Make rcu_cpu_starting() use its "cpu" argument
The rcu_cpu_starting() function uses this_cpu_ptr() to locate the
incoming CPU's rcu_data structure.  This works for the boot CPU and for
all CPUs onlined after rcu_init() executes (during very early boot).
Currently, this is the full set of CPUs, so all is well.  But if
anyone ever parallelizes boot before rcu_init() time, it will fail.
This commit therefore substitutes the rcu_cpu_starting() function's
this_cpu_pointer() for per_cpu_ptr(), future-proofing the code and
(arguably) improving readability.

This commit inadvertently fixes a latent bug: If there ever had been
more than just the boot CPU online at rcu_init() time, the old code
would not initialize the non-boot CPUs, but rather would repeatedly
initialize the boot CPU.

Reported-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:37:13 -08:00
Paul E. McKenney
09e2db37ec rcu: Add comment headers to expedited-grace-period counter functions
These functions (rcu_exp_gp_seq_start(), rcu_exp_gp_seq_end(),
rcu_exp_gp_seq_snap(), and rcu_exp_gp_seq_done() seemed too obvious
to comment when written, but not so much when being documented.
This commit therefore adds header comments to each of them.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:37:13 -08:00
Paul E. McKenney
630c7ed9ca rcu: Don't wake rcuc/X kthreads on NOCB CPUs
Chris Friesen notice that rcuc/X kthreads were consuming CPU even on
NOCB CPUs.  This makes no sense because the only purpose or these
kthreads is to invoke normal (non-offloaded) callbacks, of which there
will never be any on NOCB CPUs.  This problem was due to a bug in
cpu_has_callbacks_ready_to_invoke(), which should have been checking
->nxttail[RCU_NEXT_TAIL] for NULL, but which was instead (incorrectly)
checking ->nxttail[RCU_DONE_TAIL].  Because ->nxttail[RCU_DONE_TAIL] is
never NULL, the only effect is to cause the rcuc/X kthread to execute
when it should not do so.

This commit therefore checks ->nxttail[RCU_NEXT_TAIL], which is NULL
for NOCB CPUs.

Reported-by: Chris Friesen <chris.friesen@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:37:13 -08:00
Paul E. McKenney
7aa92230c9 rcu: Once again use NMI-based stack traces in stall warnings
This commit is for all intents and purposes a revert of bc1dce514e
("rcu: Don't use NMIs to dump other CPUs' stacks").  The reason to suppose
that this can now safely be reverted is the presence of 42a0bb3f71
("printk/nmi: generic solution for safe printk in NMI"), which is said
to have made NMI-based stack dumps safe.

However, this reversion keeps one nice property of bc1dce514e
("rcu: Don't use NMIs to dump other CPUs' stacks"), namely that
only those CPUs blocking the grace period are dumped.  The new
trigger_single_cpu_backtrace() is used to make this happen, as
suggested by Josh Poimboeuf.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:37:12 -08:00
Paul E. McKenney
b201fa6737 rcu: Remove short-term CPU kicking
Commit 4914950aaa ("rcu: Stop treating in-kernel CPU-bound workloads
as errors") added a (relatively) short-timeout call to resched_cpu().
This was inspired by as issue that was fixed by b7e7ade34e ("sched/core:
Fix remote wakeups").  But given that this issue was fixed, it is time
for the current commit to remove this call to resched_cpu().

Reported-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:37:12 -08:00
Paul E. McKenney
28053bc72c rcu: Add long-term CPU kicking
This commit prepares for the removal of short-term CPU kicking (in a
subsequent commit).  It does so by starting to invoke resched_cpu()
for each holdout at each force-quiescent-state interval that is more
than halfway through the stall-warning interval.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:33:02 -08:00
Tobias Klauser
94060d2235 rcu: Remove unused but set variable
Since commit 7ec99de36f ("rcu: Provide exact CPU-online tracking for
RCU"), the variable mask in rcu_init_percpu_data is set but no longer
used. Remove it to fix the following warning when building with 'W=1':

  kernel/rcu/tree.c: In function ‘rcu_init_percpu_data’:
  kernel/rcu/tree.c:3765:16: warning: variable ‘mask’ set but not used [-Wunused-but-set-variable]

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:32:35 -08:00
Paul E. McKenney
2535db485c rcu: Remove unneeded rcu_process_callbacks() declarations
The declarations of __rcu_process_callbacks() and rcu_process_callbacks()
are not needed, as the definition of both of these functions appear before
any uses.  This commit therefore removes both declarations.

Reported-by: "Ahmed, Iftekhar" <ahmedi@oregonstate.edu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:32:29 -08:00
Byungchul Park
c4402b27f1 rcu: Only dump stalled-tasks stacks if there was a real stall
The print_other_cpu_stall() function currently unconditionally invokes
rcu_print_detail_task_stall().  This is OK because if there was a stall
sufficient to cause print_other_cpu_stall() to be invoked, that stall
is very likely to persist through the entire print_other_cpu_stall()
execution.  However, if the stall did not persist, the variable ndetected
will be zero, and that variable is already tested in an "if" statement.
Therefore, this commit moves the call to rcu_print_detail_task_stall()
under that pre-existing "if" to improve readability, with a very rare
reduction in overhead.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ paulmck: Reworked commit log. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:32:22 -08:00
Mathieu Desnoyers
907565337e Fix: Disable sys_membarrier when nohz_full is enabled
Userspace applications should be allowed to expect the membarrier system
call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
nohz_full CPUs, but synchronize_sched() does not take those into
account.

Given that we do not want unrelated processes to be able to affect
real-time sensitive nohz_full CPUs, simply return ENOSYS when membarrier
is invoked on a kernel with enabled nohz_full CPUs.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: <stable@vger.kernel.org>	[3.10+]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:32:16 -08:00
Paul E. McKenney
4d4f88fa23 lockdep: Make RCU suspicious-access splats use pr_err
This commit switches RCU suspicious-access splats use pr_err()
instead of the current INFO printk()s.  This change makes it easier
to automatically classify splats.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-01-23 11:31:54 -08:00
Nikolay Borisov
880a38547f userns: Make ucounts lock irq-safe
The ucounts_lock is being used to protect various ucounts lifecycle
management functionalities. However, those services can also be invoked
when a pidns is being freed in an RCU callback (e.g. softirq context).
This can lead to deadlocks. There were already efforts trying to
prevent similar deadlocks in add7c65ca4 ("pid: fix lockdep deadlock
warning due to ucount_lock"), however they just moved the context
from hardirq to softrq. Fix this issue once and for all by explictly
making the lock disable irqs altogether.

Dmitry Vyukov <dvyukov@google.com> reported:

> I've got the following deadlock report while running syzkaller fuzzer
> on eec0d3d065bfcdf9cd5f56dd2a36b94d12d32297 of linux-next (on odroid
> device if it matters):
>
> =================================
> [ INFO: inconsistent lock state ]
> 4.10.0-rc3-next-20170112-xc2-dirty #6 Not tainted
> ---------------------------------
> inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
> swapper/2/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
>  (ucounts_lock){+.?...}, at: [<     inline     >] spin_lock
> ./include/linux/spinlock.h:302
>  (ucounts_lock){+.?...}, at: [<ffff2000081678c8>]
> put_ucounts+0x60/0x138 kernel/ucount.c:162
> {SOFTIRQ-ON-W} state was registered at:
> [<ffff2000081c82d8>] mark_lock+0x220/0xb60 kernel/locking/lockdep.c:3054
> [<     inline     >] mark_irqflags kernel/locking/lockdep.c:2941
> [<ffff2000081c97a8>] __lock_acquire+0x388/0x3260 kernel/locking/lockdep.c:3295
> [<ffff2000081cce24>] lock_acquire+0xa4/0x138 kernel/locking/lockdep.c:3753
> [<     inline     >] __raw_spin_lock ./include/linux/spinlock_api_smp.h:144
> [<ffff200009798128>] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
> [<     inline     >] spin_lock ./include/linux/spinlock.h:302
> [<     inline     >] get_ucounts kernel/ucount.c:131
> [<ffff200008167c28>] inc_ucount+0x80/0x6c8 kernel/ucount.c:189
> [<     inline     >] inc_mnt_namespaces fs/namespace.c:2818
> [<ffff200008481850>] alloc_mnt_ns+0x78/0x3a8 fs/namespace.c:2849
> [<ffff200008487298>] create_mnt_ns+0x28/0x200 fs/namespace.c:2959
> [<     inline     >] init_mount_tree fs/namespace.c:3199
> [<ffff200009bd6674>] mnt_init+0x258/0x384 fs/namespace.c:3251
> [<ffff200009bd60bc>] vfs_caches_init+0x6c/0x80 fs/dcache.c:3626
> [<ffff200009bb1114>] start_kernel+0x414/0x460 init/main.c:648
> [<ffff200009bb01e8>] __primary_switched+0x6c/0x70 arch/arm64/kernel/head.S:456
> irq event stamp: 2316924
> hardirqs last  enabled at (2316924): [<     inline     >] rcu_do_batch
> kernel/rcu/tree.c:2911
> hardirqs last  enabled at (2316924): [<     inline     >]
> invoke_rcu_callbacks kernel/rcu/tree.c:3182
> hardirqs last  enabled at (2316924): [<     inline     >]
> __rcu_process_callbacks kernel/rcu/tree.c:3149
> hardirqs last  enabled at (2316924): [<ffff200008210414>]
> rcu_process_callbacks+0x7a4/0xc28 kernel/rcu/tree.c:3166
> hardirqs last disabled at (2316923): [<     inline     >] rcu_do_batch
> kernel/rcu/tree.c:2900
> hardirqs last disabled at (2316923): [<     inline     >]
> invoke_rcu_callbacks kernel/rcu/tree.c:3182
> hardirqs last disabled at (2316923): [<     inline     >]
> __rcu_process_callbacks kernel/rcu/tree.c:3149
> hardirqs last disabled at (2316923): [<ffff20000820fe80>]
> rcu_process_callbacks+0x210/0xc28 kernel/rcu/tree.c:3166
> softirqs last  enabled at (2316912): [<ffff20000811b4c4>]
> _local_bh_enable+0x4c/0x80 kernel/softirq.c:155
> softirqs last disabled at (2316913): [<     inline     >]
> do_softirq_own_stack ./include/linux/interrupt.h:488
> softirqs last disabled at (2316913): [<     inline     >]
> invoke_softirq kernel/softirq.c:371
> softirqs last disabled at (2316913): [<ffff20000811c994>]
> irq_exit+0x264/0x308 kernel/softirq.c:405
>
> other info that might help us debug this:
>  Possible unsafe locking scenario:
>
>        CPU0
>        ----
>   lock(ucounts_lock);
>   <Interrupt>
>     lock(ucounts_lock);
>
>  *** DEADLOCK ***
>
> 1 lock held by swapper/2/0:
>  #0:  (rcu_callback){......}, at: [<     inline     >] __rcu_reclaim
> kernel/rcu/rcu.h:108
>  #0:  (rcu_callback){......}, at: [<     inline     >] rcu_do_batch
> kernel/rcu/tree.c:2919
>  #0:  (rcu_callback){......}, at: [<     inline     >]
> invoke_rcu_callbacks kernel/rcu/tree.c:3182
>  #0:  (rcu_callback){......}, at: [<     inline     >]
> __rcu_process_callbacks kernel/rcu/tree.c:3149
>  #0:  (rcu_callback){......}, at: [<ffff200008210390>]
> rcu_process_callbacks+0x720/0xc28 kernel/rcu/tree.c:3166
>
> stack backtrace:
> CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.10.0-rc3-next-20170112-xc2-dirty #6
> Hardware name: Hardkernel ODROID-C2 (DT)
> Call trace:
> [<ffff20000808fa60>] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:500
> [<ffff20000808fec0>] show_stack+0x20/0x30 arch/arm64/kernel/traps.c:225
> [<ffff2000088a99e0>] dump_stack+0x110/0x168
> [<ffff2000082fa2b4>] print_usage_bug.part.27+0x49c/0x4bc
> kernel/locking/lockdep.c:2387
> [<     inline     >] print_usage_bug kernel/locking/lockdep.c:2357
> [<     inline     >] valid_state kernel/locking/lockdep.c:2400
> [<     inline     >] mark_lock_irq kernel/locking/lockdep.c:2617
> [<ffff2000081c89ec>] mark_lock+0x934/0xb60 kernel/locking/lockdep.c:3065
> [<     inline     >] mark_irqflags kernel/locking/lockdep.c:2923
> [<ffff2000081c9a60>] __lock_acquire+0x640/0x3260 kernel/locking/lockdep.c:3295
> [<ffff2000081cce24>] lock_acquire+0xa4/0x138 kernel/locking/lockdep.c:3753
> [<     inline     >] __raw_spin_lock ./include/linux/spinlock_api_smp.h:144
> [<ffff200009798128>] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
> [<     inline     >] spin_lock ./include/linux/spinlock.h:302
> [<ffff2000081678c8>] put_ucounts+0x60/0x138 kernel/ucount.c:162
> [<ffff200008168364>] dec_ucount+0xf4/0x158 kernel/ucount.c:214
> [<     inline     >] dec_pid_namespaces kernel/pid_namespace.c:89
> [<ffff200008293dc8>] delayed_free_pidns+0x40/0xe0 kernel/pid_namespace.c:156
> [<     inline     >] __rcu_reclaim kernel/rcu/rcu.h:118
> [<     inline     >] rcu_do_batch kernel/rcu/tree.c:2919
> [<     inline     >] invoke_rcu_callbacks kernel/rcu/tree.c:3182
> [<     inline     >] __rcu_process_callbacks kernel/rcu/tree.c:3149
> [<ffff2000082103d8>] rcu_process_callbacks+0x768/0xc28 kernel/rcu/tree.c:3166
> [<ffff2000080821dc>] __do_softirq+0x324/0x6e0 kernel/softirq.c:284
> [<     inline     >] do_softirq_own_stack ./include/linux/interrupt.h:488
> [<     inline     >] invoke_softirq kernel/softirq.c:371
> [<ffff20000811c994>] irq_exit+0x264/0x308 kernel/softirq.c:405
> [<ffff2000081ecc28>] __handle_domain_irq+0xc0/0x150 kernel/irq/irqdesc.c:636
> [<ffff200008081c80>] gic_handle_irq+0x68/0xd8
> Exception stack(0xffff8000648e7dd0 to 0xffff8000648e7f00)
> 7dc0:                                   ffff8000648d4b3c 0000000000000007
> 7de0: 0000000000000000 1ffff0000c91a967 1ffff0000c91a967 1ffff0000c91a967
> 7e00: ffff20000a4b6b68 0000000000000001 0000000000000007 0000000000000001
> 7e20: 1fffe4000149ae90 ffff200009d35000 0000000000000000 0000000000000002
> 7e40: 0000000000000000 0000000000000000 0000000002624a1a 0000000000000000
> 7e60: 0000000000000000 ffff200009cbcd88 000060006d2ed000 0000000000000140
> 7e80: ffff200009cff000 ffff200009cb6000 ffff200009cc2020 ffff200009d2159d
> 7ea0: 0000000000000000 ffff8000648d4380 0000000000000000 ffff8000648e7f00
> 7ec0: ffff20000820a478 ffff8000648e7f00 ffff20000820a47c 0000000010000145
> 7ee0: 0000000000000140 dfff200000000000 ffffffffffffffff ffff20000820a478
> [<ffff2000080837f8>] el1_irq+0xb8/0x130 arch/arm64/kernel/entry.S:486
> [<     inline     >] arch_local_irq_restore
> ./arch/arm64/include/asm/irqflags.h:81
> [<ffff20000820a47c>] rcu_idle_exit+0x64/0xa8 kernel/rcu/tree.c:1030
> [<     inline     >] cpuidle_idle_call kernel/sched/idle.c:200
> [<ffff2000081bcbfc>] do_idle+0x1dc/0x2d0 kernel/sched/idle.c:243
> [<ffff2000081bd1cc>] cpu_startup_entry+0x24/0x28 kernel/sched/idle.c:345
> [<ffff200008099f8c>] secondary_start_kernel+0x2cc/0x358
> arch/arm64/kernel/smp.c:276
> [<000000000279f1a4>] 0x279f1a4

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: add7c65ca4 ("pid: fix lockdep deadlock warning due to ucount_lock")
Fixes: f333c700c6 ("pidns: Add a limit on the number of pid namespaces")
Cc: stable@vger.kernel.org
Link: https://www.spinics.net/lists/kernel/msg2426637.html
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-01-24 06:23:51 +13:00
Eric Auger
c7b41f0af3 irqdomain: irq_domain_check_msi_remap
This new function checks whether all MSI irq domains
implement IRQ remapping. This is useful to understand
whether VFIO passthrough is safe with respect to interrupts.

On ARM typically an MSI controller can sit downstream
to the IOMMU without preventing VFIO passthrough.
As such any assigned device can write into the MSI doorbell.
In case the MSI controller implements IRQ remapping, assigned
devices will not be able to trigger interrupts towards the
host. On the contrary, the assignment must be emphasized as
unsafe with respect to interrupts.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-01-23 15:00:45 +00:00
Eric Auger
88156f0090 genirq/msi: Set IRQ_DOMAIN_FLAG_MSI on MSI domain creation
Now we have a flag value indicating an IRQ domain implements MSI,
let's set it on msi_create_irq_domain().

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-01-23 15:00:45 +00:00
Eric Auger
631a9639ac irqdomain: Add irq domain MSI and MSI_REMAP flags
We introduce two new enum values for the irq domain flag:
- IRQ_DOMAIN_FLAG_MSI indicates the irq domain corresponds to
  an MSI domain
- IRQ_DOMAIN_FLAG_MSI_REMAP indicates the irq domain has MSI
  remapping capabilities.

Those values will be useful to check all MSI irq domains have
MSI remapping support when assessing the safety of IRQ assignment
to a guest.

irq_domain_hierarchical_is_msi_remap() allows to check if an
irq domain or any parent implements MSI remapping.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-01-23 15:00:44 +00:00
Mike Frysinger
b25e67161c seccomp: dump core when using SECCOMP_RET_KILL
The SECCOMP_RET_KILL mode is documented as immediately killing the
process as if a SIGSYS had been sent and not caught (similar to a
SIGKILL).  However, a SIGSYS is documented as triggering a coredump
which does not happen today.

This has the advantage of being able to more easily debug a process
that fails a seccomp filter.  Today, most apps need to recompile and
change their filter in order to get detailed info out, or manually run
things through strace, or enable detailed kernel auditing.  Now we get
coredumps that fit into existing system-wide crash reporting setups.

From a security pov, this shouldn't be a problem.  Unhandled signals
can already be sent externally which trigger a coredump independent of
the status of the seccomp filter.  The act of dumping core itself does
not cause change in execution of the program.

URL: https://crbug.com/676357
Signed-off-by: Mike Frysinger <vapier@chromium.org>
Acked-by: Jorge Lucangeli Obes <jorgelo@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>
2017-01-23 21:42:42 +11:00
Linus Torvalds
24b86839fa Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp/hotplug fix from Thomas Gleixner:
 "Remove an unused variable which is a leftover from the notifier
  removal"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Remove unused but set variable in _cpu_down()
2017-01-22 12:45:47 -08:00
Waiman Long
bcc9a76d5a locking/rwsem: Reinit wake_q after use
In __rwsem_down_write_failed_common(), the same wake_q variable name
is defined twice, with the inner wake_q hiding the one in outer scope.
We can either use different names for the two wake_q's.

Even better, we can use the same wake_q twice, if necessary.

To enable the latter change, we need to define a new helper function
wake_q_init() to enable reinitalization of wake_q after use.

Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1485052415-9611-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-22 09:54:00 +01:00
Gianluca Borello
a5e8c07059 bpf: add bpf_probe_read_str helper
Provide a simple helper with the same semantics of strncpy_from_unsafe():

int bpf_probe_read_str(void *dst, int size, const void *unsafe_addr)

This gives more flexibility to a bpf program. A typical use case is
intercepting a file name during sys_open(). The current approach is:

SEC("kprobe/sys_open")
void bpf_sys_open(struct pt_regs *ctx)
{
	char buf[PATHLEN]; // PATHLEN is defined to 256
	bpf_probe_read(buf, sizeof(buf), ctx->di);

	/* consume buf */
}

This is suboptimal because the size of the string needs to be estimated
at compile time, causing more memory to be copied than often necessary,
and can become more problematic if further processing on buf is done,
for example by pushing it to userspace via bpf_perf_event_output(),
since the real length of the string is unknown and the entire buffer
must be copied (and defining an unrolled strnlen() inside the bpf
program is a very inefficient and unfeasible approach).

With the new helper, the code can easily operate on the actual string
length rather than the buffer size:

SEC("kprobe/sys_open")
void bpf_sys_open(struct pt_regs *ctx)
{
	char buf[PATHLEN]; // PATHLEN is defined to 256
	int res = bpf_probe_read_str(buf, sizeof(buf), ctx->di);

	/* consume buf, for example push it to userspace via
	 * bpf_perf_event_output(), but this time we can use
	 * res (the string length) as event size, after checking
	 * its boundaries.
	 */
}

Another useful use case is when parsing individual process arguments or
individual environment variables navigating current->mm->arg_start and
current->mm->env_start: using this helper and the return value, one can
quickly iterate at the right offset of the memory area.

The code changes simply leverage the already existent
strncpy_from_unsafe() kernel function, which is safe to be called from a
bpf program as it is used in bpf_trace_printk().

Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 12:08:43 -05:00
Rafael J. Wysocki
e326ce013a Revert "PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flag"
Revert commit 08b98d3291 (PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0
flag) as it caused system suspend (in the default configuration) to fail
on Dell XPS13 (9360) with the Kaby Lake processor.

Fixes: 08b98d3291 (PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flag)
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-01-20 03:33:57 +01:00
Peter Zijlstra
acb04058de sched/clock: Fix hotplug crash
Mike reported that he could trigger the WARN_ON_ONCE() in
set_sched_clock_stable() using hotplug.

This exposed a fundamental problem with the interface, we should never
mark the TSC stable if we ever find it to be unstable. Therefore
set_sched_clock_stable() is a broken interface.

The reason it existed is that not having it is a pain, it means all
relevant architecture code needs to call clear_sched_clock_stable()
where appropriate.

Of the three architectures that select HAVE_UNSTABLE_SCHED_CLOCK ia64
and parisc are trivial in that they never called
set_sched_clock_stable(), so add an unconditional call to
clear_sched_clock_stable() to them.

For x86 the story is a lot more involved, and what this patch tries to
do is ensure we preserve the status quo. So even is Cyrix or Transmeta
have usable TSC they never called set_sched_clock_stable() so they now
get an explicit mark unstable.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 9881b024b7 ("sched/clock: Delay switching sched_clock to stable")
Link: http://lkml.kernel.org/r/20170119133633.GB6536@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-20 02:38:46 +01:00
Greg Kroah-Hartman
64e90a8acb Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()
Some usermode helper applications are defined at kernel build time, while
others can be changed at runtime.  To provide a sane way to filter these, add a
new kernel option "STATIC_USERMODEHELPER".  This option routes all
call_usermodehelper() calls through this binary, no matter what the caller
wishes to have called.

The new binary (by default set to /sbin/usermode-helper, but can be changed
through the STATIC_USERMODEHELPER_PATH option) can properly filter the
requested programs to be run by the kernel by looking at the first argument
that is passed to it.  All other options should then be passed onto the proper
program if so desired.

To disable all call_usermodehelper() calls by the kernel, set
STATIC_USERMODEHELPER_PATH to an empty string.

Thanks to Neil Brown for the idea of this feature.

Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19 12:59:45 +01:00
Greg Kroah-Hartman
6d2c5d6c46 kmod: make usermodehelper path a const string
This is in preparation for making it so that usermode helper programs
can't be changed, if desired, by userspace.  We will tackle the mess of
cleaning up the write-ability of argv and env later, that's going to
take more work, for much less gain...

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19 12:45:33 +01:00
Daniel Borkmann
d407bd25a2 bpf: don't trigger OOM killer under pressure with map alloc
This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
that are to be used for map allocations. Using kmalloc() for very large
allocations can cause excessive work within the page allocator, so i) fall
back earlier to vmalloc() when the attempt is considered costly anyway,
and even more importantly ii) don't trigger OOM killer with any of the
allocators.

Since this is based on a user space request, for example, when creating
maps with element pre-allocation, we really want such requests to fail
instead of killing other user space processes.

Also, don't spam the kernel log with warnings should any of the allocations
fail under pressure. Given that, we can make backend selection in
bpf_map_area_alloc() generic, and convert all maps over to use this API
for spots with potentially large allocation requests.

Note, replacing the one kmalloc_array() is fine as overflow checks happen
earlier in htab_map_alloc(), since it must also protect the multiplication
for vmalloc() should kmalloc_array() fail.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 17:12:26 -05:00
Richard Guy Briggs
92c82e8a32 audit: add feature audit_lost reset
Add a method to reset the audit_lost value.

An AUDIT_SET message with the AUDIT_STATUS_LOST flag set by itself
will return a positive value repesenting the current audit_lost value
and reset the counter to zero.  If AUDIT_STATUS_LOST is not the
only flag set, the reset command will be ignored.  The value sent with
the command is ignored.  The return value will be the +ve lost value at
reset time.

An AUDIT_CONFIG_CHANGE message will be queued to the listening audit
daemon.  The message will be a standard CONFIG_CHANGE message with the
fields "lost=0" and "old=" with the latter containing the value of
audit_lost at reset time.

See: https://github.com/linux-audit/audit-kernel/issues/3

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-01-18 14:32:52 -05:00
Linus Torvalds
ca92e6c7e6 Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP hotplug update from Thomas Gleixner:
 "This contains a trivial typo fix and an extension to the core code for
  dynamically allocating states in the prepare stage.

  The extension is necessary right now because we need a proper way to
  unbreak LTTNG, which iscurrently non functional due to the removal of
  the notifiers. Surely it's out of tree, but it's widely used by
  distros.

  The simple solution would have been to reserve a state for LTTNG, but
  I'm not fond about unused crap in the kernel and the dynamic range,
  which we admittedly should have done right away, allows us to remove
  quite some of the hardcoded states, i.e. those which have no ordering
  requirements. So doing the right thing now is better than having an
  smaller intermediate solution which needs to be reworked anyway"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Provide dynamic range for prepare stage
  perf/x86/amd/ibs: Fix typo after cleanup state names in cpu/hotplug
2017-01-18 11:13:41 -08:00
Linus Torvalds
49b550fee8 Merge branch 'rcu-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU fixes from Ingo Molnar:
 "This fixes sporadic ACPI related hangs in synchronize_rcu() that were
  caused by the ACPI code mistakenly relying on an aspect of RCU that
  was neither promised to work nor reliable but which happened to work -
  until in v4.9 we changed the RCU implementation, which made the hangs
  more prominent.

  Since the mis-use of the RCU facility wasn't properly detected and
  prevented either, these fixes make the RCU side work reliably instead
  of working around the problem in the ACPI code.

  Hence the slightly larger diffstat that goes beyond the normal scope
  of RCU fixes in -rc kernels"

* 'rcu-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rcu: Narrow early boot window of illegal synchronous grace periods
  rcu: Remove cond_resched() from Tiny synchronize_sched()
2017-01-18 10:47:11 -08:00
Tobias Klauser
0fec9557fd cpu/hotplug: Remove unused but set variable in _cpu_down()
After the recent removal of the hotplug notifiers the variable 'hasdied' in
_cpu_down() is set but no longer read, leading to the following GCC warning
when building with 'make W=1':

  kernel/cpu.c:767:7: warning: variable ‘hasdied’ set but not used [-Wunused-but-set-variable]

Fix it by removing the variable.

Fixes: 530e9b76ae ("cpu/hotplug: Remove obsolete cpu hotplug register/unregister functions")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20170117143501.20893-1-tklauser@distanz.ch
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-01-18 11:55:09 +01:00
Linus Torvalds
0aa0313f9d Modules fixes for v4.10-rc5
- Fix out-of-tree module breakage when it supplies its own
   definitions of true and false
 
 Signed-off-by: Jessica Yu <jeyu@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYfm7sAAoJEMBFfjjOO8FyrmMP/j1Sa179+uBWWPE0Td7ip6yj
 EgvOtZGcnZfMuHbs5Evn8Fnz5K3of3IriiJNPePuQPu/YnoidltxkWOMXYkCj+Fn
 acW8VRtrh2urec70gRapuTmSpxs1I/XLUdNG+Ozm0FFX+L+k0ydCqEPGuVkwyHNK
 Wn31lVTiqx+zWm5PAJBzD6dEchQ0h2uppHRmZ+mIn3GyvYavIGnMMkdjqEEq9v8w
 UYdw52AJFAGMDO8LoSihX5cFbe0E28A58jkJuJ5AKXglaY6Nvl2xWOxfLhFnxO1m
 7KFuf+q2YO10hoJtdItEmPw2iC8pIgoAUGpZ+4h0iSWxyUC5V4QEmrhe4q9CtOLD
 +dfcd+44UekvWiWL4AQUO6IsUzIo8UqsJYf4Tic4/EjAKZtGTseKjQqCgBv3kJA+
 nN3hJ9gMN4NZWOMLihSn7Ml/whrxchdqlEP520nzGTnWUaLOUPp4XhfNlDaAH58K
 WfxiT0L6w+Cbg3xMCZRxQyqlJWWw8x1CM7B6eScHvN67TulC2enIQYTkv6eDOzQX
 DPz4lvcFisjASFP+i+3ouYL2pfLnm/IUG9K1wieqBvPHEdeZBuGr7+VEHjFmhhmG
 f3kKvYsRgUQF8tKeGtI0uPxnZNw4z4QaYOVKf8bISzpsIHOezdaouOm+KGUctHqO
 DNIWMf34W7fE5AVrGKp6
 =ybIc
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v4.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull modules fix from Jessica Yu:

 - fix out-of-tree module breakage when it supplies its own definitions
   of true and false

* tag 'modules-for-v4.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  taint/module: Fix problems when out-of-kernel driver defines true or false
2017-01-17 14:49:21 -08:00
David S. Miller
580bdf5650 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-01-17 15:19:37 -05:00
Larry Finger
5eb7c0d04f taint/module: Fix problems when out-of-kernel driver defines true or false
Commit 7fd8329ba5 ("taint/module: Clean up global and module taint
flags handling") used the key words true and false as character members
of a new struct. These names cause problems when out-of-kernel modules
such as VirtualBox include their own definitions of true and false.

Fixes: 7fd8329ba5 ("taint/module: Clean up global and module taint flags handling")
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Jessica Yu <jeyu@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jessica Yu <jeyu@redhat.com>
2017-01-17 10:56:45 -08:00
Linus Torvalds
4b19a9e20b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Handle multicast packets properly in fast-RX path of mac80211, from
    Johannes Berg.

 2) Because of a logic bug, the user can't actually force SW
    checksumming on r8152 devices. This makes diagnosis of hw
    checksumming bugs really annoying. Fix from Hayes Wang.

 3) VXLAN route lookup does not take the source and destination ports
    into account, which means IPSEC policies cannot be matched properly.
    Fix from Martynas Pumputis.

 4) Do proper RCU locking in netvsc callbacks, from Stephen Hemminger.

 5) Fix SKB leaks in mlxsw driver, from Arkadi Sharshevsky.

 6) If lwtunnel_fill_encap() fails, we do not abort the netlink message
    construction properly in fib_dump_info(), from David Ahern.

 7) Do not use kernel stack for DMA buffers in atusb driver, from Stefan
    Schmidt.

 8) Openvswitch conntack actions need to maintain a correct checksum,
    fix from Lance Richardson.

 9) ax25_disconnect() is missing a check for ax25->sk being NULL, in
    fact it already checks this, but not in all of the necessary spots.
    Fix from Basil Gunn.

10) Action GET operations in the packet scheduler can erroneously bump
    the reference count of the entry, making it unreleasable. Fix from
    Jamal Hadi Salim. Jamal gives a great set of example command lines
    that trigger this in the commit message.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
  net sched actions: fix refcnt when GETing of action after bind
  net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV
  net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT transitions
  net/mlx4_core: Fix racy CQ (Completion Queue) free
  net: stmmac: don't use netdev_[dbg, info, ..] before net_device is registered
  net/mlx5e: Fix a -Wmaybe-uninitialized warning
  ax25: Fix segfault after sock connection timeout
  bpf: rework prog_digest into prog_tag
  tipc: allocate user memory with GFP_KERNEL flag
  net: phy: dp83867: allow RGMII_TXID/RGMII_RXID interface types
  ip6_tunnel: Account for tunnel header in tunnel MTU
  mld: do not remove mld souce list info when set link down
  be2net: fix MAC addr setting on privileged BE3 VFs
  be2net: don't delete MAC on close on unprivileged BE3 VFs
  be2net: fix status check in be_cmd_pmac_add()
  cpmac: remove hopeless #warning
  ravb: do not use zero-length alignment DMA descriptor
  mlx4: do not call napi_schedule() without care
  openvswitch: maintain correct checksum state in conntrack actions
  tcp: fix tcp_fastopen unaligned access complaints on sparc
  ...
2017-01-17 09:33:10 -08:00
Sebastian Andrzej Siewior
7c6094db59 rcu: update: Make RCU_EXPEDITE_BOOT be the default
RCU_EXPEDITE_BOOT should speed up the boot process by enforcing
synchronize_rcu_expedited() instead of synchronize_rcu() during the boot
process. There should be no reason why one does not want this and there
is no need worry about real time latency at this point.
Therefore make it default.

Note that users wishing to avoid expediting entirely, for example when
bringing up new hardware possibly having flaky IPIs, can use the
rcu_normal boot parameter to override boot-time expediting.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
[ paulmck: Reworded commit log. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-16 16:56:39 -08:00
Paul E. McKenney
8b2f63ab05 rcu: Abstract the dynticks snapshot operation
This commit is the second step towards full abstraction of all accesses to
the ->dynticks counter, implementing the previously open-coded atomic
add of zero in a new rcu_dynticks_snap() function.  This abstraction will
ease changes o the ->dynticks counter operation.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-16 15:47:53 -08:00
Paul E. McKenney
6563de9d6f rcu: Abstract the dynticks momentary-idle operation
This commit is the first step towards full abstraction of all accesses to
the ->dynticks counter, implementing the previously open-coded atomic add
of two in a new rcu_dynticks_momentary_idle() function.  This abstraction
will ease changes to the ->dynticks counter operation.

Note that this commit gets rid of the smp_mb__before_atomic() and the
smp_mb__after_atomic() calls that were previously present.  The reason
that this is OK from a memory-ordering perspective is that the atomic
operation is now atomic_add_return(), which, as a value-returning atomic,
guarantees full ordering.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-16 14:59:54 -08:00
Daniel Borkmann
2d071c643f bpf, trace: make ctx access checks more robust
Make sure that ctx cannot potentially be accessed oob by asserting
explicitly that ctx access size into pt_regs for BPF_PROG_TYPE_KPROBE
programs must be within limits. In case some 32bit archs have pt_regs
not being a multiple of 8, then BPF_DW access could cause such access.

BPF_PROG_TYPE_KPROBE progs don't have a ctx conversion function since
there's no extra mapping needed. kprobe_prog_is_valid_access() didn't
enforce sizeof(long) as the only allowed access size, since LLVM can
generate non BPF_W/BPF_DW access to regs from time to time.

For BPF_PROG_TYPE_TRACEPOINT we don't have a ctx conversion either, so
add a BUILD_BUG_ON() check to make sure that BPF_DW access will not be
a similar issue in future (ctx works on event buffer as opposed to
pt_regs there).

Fixes: 2541517c32 ("tracing, perf: Implement BPF programs attached to kprobes")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 14:41:42 -05:00
Daniel Borkmann
f1f7714ea5 bpf: rework prog_digest into prog_tag
Commit 7bd509e311 ("bpf: add prog_digest and expose it via
fdinfo/netlink") was recently discussed, partially due to
admittedly suboptimal name of "prog_digest" in combination
with sha1 hash usage, thus inevitably and rightfully concerns
about its security in terms of collision resistance were
raised with regards to use-cases.

The intended use cases are for debugging resp. introspection
only for providing a stable "tag" over the instruction sequence
that both kernel and user space can calculate independently.
It's not usable at all for making a security relevant decision.
So collisions where two different instruction sequences generate
the same tag can happen, but ideally at a rather low rate. The
"tag" will be dumped in hex and is short enough to introspect
in tracepoints or kallsyms output along with other data such
as stack trace, etc. Thus, this patch performs a rename into
prog_tag and truncates the tag to a short output (64 bits) to
make it obvious it's not collision-free.

Should in future a hash or facility be needed with a security
relevant focus, then we can think about requirements, constraints,
etc that would fit to that situation. For now, rework the exposed
parts for the current use cases as long as nothing has been
released yet. Tested on x86_64 and s390x.

Fixes: 7bd509e311 ("bpf: add prog_digest and expose it via fdinfo/netlink")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 14:03:31 -05:00
Thomas Gleixner
4205e4786d cpu/hotplug: Provide dynamic range for prepare stage
Mathieu reported that the LTTNG modules are broken as of 4.10-rc1 due to
the removal of the cpu hotplug notifiers.

Usually I don't care much about out of tree modules, but LTTNG is widely
used in distros. There are two ways to solve that:

1) Reserve a hotplug state for LTTNG

2) Add a dynamic range for the prepare states.

While #1 is the simplest solution, #2 is the proper one as we can convert
in tree users, which do not care about ordering, to the dynamic range as
well.

Add a dynamic range which allows LTTNG to request states in the prepare
stage.

Reported-and-tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Sewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701101353010.3401@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-01-16 13:20:05 +01:00
Ingo Molnar
3e4f7a4956 Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into rcu/urgent
Pull an urgent RCU fix from Paul E. McKenney:

 "This series contains a pair of commits that permit RCU synchronous grace
  periods (synchronize_rcu() and friends) to work correctly throughout boot.
  This eliminates the current "dead time" starting when the scheduler spawns
  its first taks and ending when the last of RCU's kthreads is spawned
  (this last happens during early_initcall() time).  Although RCU's
  synchronous grace periods have long been documented as not working
  during this time, prior to 4.9, the expedited grace periods worked by
  accident, and some ACPI code came to rely on this unintentional behavior.
  (Note that this unintentional behavior was -not- reliable.  For example,
  failures from ACPI could occur on !SMP systems and on systems booting
  with the rcu_normal kernel boot parameter.)

  Either way, there is a bug that needs fixing, and the 4.9 switch of RCU's
  expedited grace periods to workqueues could be considered to have caused
  a regression.  This series therefore makes RCU's expedited grace periods
  operate correctly throughout the boot process.  This has been demonstrated
  to fix the problems ACPI was encountering, and has the added longer-term
  benefit of simplifying RCU's behavior."

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-16 07:45:44 +01:00
Linus Torvalds
99421c1cb2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace fixes from Eric Biederman:
 "This tree contains 4 fixes.

  The first is a fix for a race that can causes oopses under the right
  circumstances, and that someone just recently encountered.

  Past that are several small trivial correct fixes. A real issue that
  was blocking development of an out of tree driver, but does not appear
  to have caused any actual problems for in-tree code. A potential
  deadlock that was reported by lockdep. And a deadlock people have
  experienced and took the time to track down caused by a cleanup that
  removed the code to drop a reference count"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  sysctl: Drop reference added by grab_header in proc_sys_readdir
  pid: fix lockdep deadlock warning due to ucount_lock
  libfs: Modify mount_pseudo_xattr to be clear it is not a userspace mount
  mnt: Protect the mountpoint hashtable with mount_lock
2017-01-15 16:09:50 -08:00
Linus Torvalds
a11ce3a4ac Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull NOHZ fix from Ingo Molnar:
 "This fixes an old NOHZ race where we incorrectly calculate the next
  timer interrupt in certain circumstances where hrtimers are pending,
  that can cause hard to reproduce stalled-values artifacts in
  /proc/stat"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  nohz: Fix collision between tick and other hrtimers
2017-01-15 12:00:37 -08:00
Linus Torvalds
79078c53ba Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc race fixes uncovered by fuzzing efforts, a Sparse fix, two PMU
  driver fixes, plus miscellanous tooling fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Reject non sampling events with precise_ip
  perf/x86/intel: Account interrupts for PEBS errors
  perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race
  perf/core: Fix sys_perf_event_open() vs. hotplug
  perf/x86/intel: Use ULL constant to prevent undefined shift behaviour
  perf/x86/intel/uncore: Fix hardcoded socket 0 assumption in the Haswell init code
  perf/x86: Set pmu->module in Intel PMU modules
  perf probe: Fix to probe on gcc generated symbols for offline kernel
  perf probe: Fix --funcs to show correct symbols for offline module
  perf symbols: Robustify reading of build-id from sysfs
  perf tools: Install tools/lib/traceevent plugins with install-bin
  tools lib traceevent: Fix prev/next_prio for deadline tasks
  perf record: Fix --switch-output documentation and comment
  perf record: Make __record_options static
  tools lib subcmd: Add OPT_STRING_OPTARG_SET option
  perf probe: Fix to get correct modname from elf header
  samples/bpf trace_output_user: Remove duplicate sys/ioctl.h include
  samples/bpf sock_example: Avoid getting ethhdr from two includes
  perf sched timehist: Show total scheduling time
2017-01-15 11:37:43 -08:00
Yang Shi
f4dbba5919 locktorture: Fix potential memory leak with rw lock test
When running locktorture module with the below commands with kmemleak enabled:

$ modprobe locktorture torture_type=rw_lock_irq
$ rmmod locktorture

The below kmemleak got caught:

root@10:~# echo scan > /sys/kernel/debug/kmemleak
[  323.197029] kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
root@10:~# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffffffc07592d500 (size 128):
  comm "modprobe", pid 368, jiffies 4294924118 (age 205.824s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 c3 7b 02 00 00 00 00 00  .........{......
    00 00 00 00 00 00 00 00 d7 9b 02 00 00 00 00 00  ................
  backtrace:
    [<ffffff80081e5a88>] create_object+0x110/0x288
    [<ffffff80086c6078>] kmemleak_alloc+0x58/0xa0
    [<ffffff80081d5acc>] __kmalloc+0x234/0x318
    [<ffffff80006fa130>] 0xffffff80006fa130
    [<ffffff8008083ae4>] do_one_initcall+0x44/0x138
    [<ffffff800817e28c>] do_init_module+0x68/0x1cc
    [<ffffff800811c848>] load_module+0x1a68/0x22e0
    [<ffffff800811d340>] SyS_finit_module+0xe0/0xf0
    [<ffffff80080836f0>] el0_svc_naked+0x24/0x28
    [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffffffc07592d480 (size 128):
  comm "modprobe", pid 368, jiffies 4294924118 (age 205.824s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 3b 6f 01 00 00 00 00 00  ........;o......
    00 00 00 00 00 00 00 00 23 6a 01 00 00 00 00 00  ........#j......
  backtrace:
    [<ffffff80081e5a88>] create_object+0x110/0x288
    [<ffffff80086c6078>] kmemleak_alloc+0x58/0xa0
    [<ffffff80081d5acc>] __kmalloc+0x234/0x318
    [<ffffff80006fa22c>] 0xffffff80006fa22c
    [<ffffff8008083ae4>] do_one_initcall+0x44/0x138
    [<ffffff800817e28c>] do_init_module+0x68/0x1cc
    [<ffffff800811c848>] load_module+0x1a68/0x22e0
    [<ffffff800811d340>] SyS_finit_module+0xe0/0xf0
    [<ffffff80080836f0>] el0_svc_naked+0x24/0x28
    [<ffffffffffffffff>] 0xffffffffffffffff

It is because cxt.lwsa and cxt.lrsa don't get freed in module_exit, so free
them in lock_torture_cleanup() and free writer_tasks if reader_tasks is
failed at memory allocation.

Signed-off-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-14 21:36:06 -08:00
Paul E. McKenney
52d7e48b86 rcu: Narrow early boot window of illegal synchronous grace periods
The current preemptible RCU implementation goes through three phases
during bootup.  In the first phase, there is only one CPU that is running
with preemption disabled, so that a no-op is a synchronous grace period.
In the second mid-boot phase, the scheduler is running, but RCU has
not yet gotten its kthreads spawned (and, for expedited grace periods,
workqueues are not yet running.  During this time, any attempt to do
a synchronous grace period will hang the system (or complain bitterly,
depending).  In the third and final phase, RCU is fully operational and
everything works normally.

This has been OK for some time, but there has recently been some
synchronous grace periods showing up during the second mid-boot phase.
This code worked "by accident" for awhile, but started failing as soon
as expedited RCU grace periods switched over to workqueues in commit
8b355e3bc1 ("rcu: Drive expedited grace periods from workqueue").
Note that the code was buggy even before this commit, as it was subject
to failure on real-time systems that forced all expedited grace periods
to run as normal grace periods (for example, using the rcu_normal ksysfs
parameter).  The callchain from the failure case is as follows:

early_amd_iommu_init()
|-> acpi_put_table(ivrs_base);
|-> acpi_tb_put_table(table_desc);
|-> acpi_tb_invalidate_table(table_desc);
|-> acpi_tb_release_table(...)
|-> acpi_os_unmap_memory
|-> acpi_os_unmap_iomem
|-> acpi_os_map_cleanup
|-> synchronize_rcu_expedited

The kernel showing this callchain was built with CONFIG_PREEMPT_RCU=y,
which caused the code to try using workqueues before they were
initialized, which did not go well.

This commit therefore reworks RCU to permit synchronous grace periods
to proceed during this mid-boot phase.  This commit is therefore a
fix to a regression introduced in v4.9, and is therefore being put
forward post-merge-window in v4.10.

This commit sets a flag from the existing rcu_scheduler_starting()
function which causes all synchronous grace periods to take the expedited
path.  The expedited path now checks this flag, using the requesting task
to drive the expedited grace period forward during the mid-boot phase.
Finally, this flag is updated by a core_initcall() function named
rcu_exp_runtime_mode(), which causes the runtime codepaths to be used.

Note that this arrangement assumes that tasks are not sent POSIX signals
(or anything similar) from the time that the first task is spawned
through core_initcall() time.

Fixes: 8b355e3bc1 ("rcu: Drive expedited grace periods from workqueue")
Reported-by: "Zheng, Lv" <lv.zheng@intel.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Stan Kain <stan.kain@gmail.com>
Tested-by: Ivan <waffolz@hotmail.com>
Tested-by: Emanuel Castelo <emanuel.castelo@gmail.com>
Tested-by: Bruno Pesavento <bpesavento@infinito.it>
Tested-by: Borislav Petkov <bp@suse.de>
Tested-by: Frederic Bezies <fredbezies@gmail.com>
Cc: <stable@vger.kernel.org> # 4.9.0-
2017-01-14 21:23:48 -08:00
Paul E. McKenney
f466ae66fa rcu: Remove cond_resched() from Tiny synchronize_sched()
It is now legal to invoke synchronize_sched() at early boot, which causes
Tiny RCU's synchronize_sched() to emit spurious splats.  This commit
therefore removes the cond_resched() from Tiny RCU's synchronize_sched().

Fixes: 8b355e3bc1 ("rcu: Drive expedited grace periods from workqueue")
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> # 4.9.0-
2017-01-14 21:22:20 -08:00
Peter Zijlstra
1e24edca05 locking/atomic, kref: Add KREF_INIT()
Since we need to change the implementation, stop exposing internals.

Provide KREF_INIT() to allow static initialization of struct kref.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:18 +01:00
Chris Wilson
2a0c112828 locking/ww_mutex: Add kselftests for ww_mutex stress
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Nicolai Hähnle <nhaehnle@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161201114711.28697-8-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:17 +01:00
Chris Wilson
d1b42b800e locking/ww_mutex: Add kselftests for resolving ww_mutex cyclic deadlocks
Check that ww_mutexes can detect cyclic deadlocks (generalised ABBA
cycles) and resolve them by lock reordering.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Nicolai Hähnle <nhaehnle@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161201114711.28697-7-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:16 +01:00
Chris Wilson
70207686e4 locking/ww_mutex: Add kselftests for ww_mutex ABBA deadlock detection
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Nicolai Hähnle <nhaehnle@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161201114711.28697-6-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:16 +01:00
Chris Wilson
c22fb3807f locking/ww_mutex: Add kselftests for ww_mutex AA deadlock detection
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Nicolai Hähnle <nhaehnle@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161201114711.28697-5-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:15 +01:00
Chris Wilson
f2a5fec173 locking/ww_mutex: Begin kselftests for ww_mutex
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Nicolai Hähnle <nhaehnle@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161201114711.28697-4-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:14 +01:00
Chris Wilson
0186a6cbdc locking/ww_mutex: Add ww_mutex to locktorture test
Although ww_mutexes degenerate into mutexes, it would be useful to
torture the deadlock handling between multiple ww_mutexes in addition to
torturing the regular mutexes.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Nicolai Hähnle <nhaehnle@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161201114711.28697-3-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:14 +01:00
Tejun Heo
1460cb65a1 locking/mutex, sched/wait: Add mutex_lock_io()
We sometimes end up propagating IO blocking through mutexes; however,
because there currently is no way of annotating mutex sleeps as
iowait, there are cases where iowait and /proc/stat:procs_blocked
report misleading numbers obscuring the actual state of the system.

This patch adds mutex_lock_io() so that mutex sleeps can be marked as
iowait in those cases.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adilger.kernel@dilger.ca
Cc: jack@suse.com
Cc: kernel-team@fb.com
Cc: mingbo@fb.com
Cc: tytso@mit.edu
Link: http://lkml.kernel.org/r/1477673892-28940-4-git-send-email-tj@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:05 +01:00
Tejun Heo
10ab56434f sched/core: Separate out io_schedule_prepare() and io_schedule_finish()
Now that IO schedule accounting is done inside __schedule(),
io_schedule() can be split into three steps - prep, schedule, and
finish - where the schedule part doesn't need any special annotation.
This allows marking a sleep as iowait by simply wrapping an existing
blocking function with io_schedule_prepare() and io_schedule_finish().

Because task_struct->in_iowait is single bit, the caller of
io_schedule_prepare() needs to record and the pass its state to
io_schedule_finish() to be safe regarding nesting.  While this isn't
the prettiest, these functions are mostly gonna be used by core
functions and we don't want to use more space for ->in_iowait.

While at it, as it's simple to do now, reimplement io_schedule()
without unnecessarily going through io_schedule_timeout().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adilger.kernel@dilger.ca
Cc: jack@suse.com
Cc: kernel-team@fb.com
Cc: mingbo@fb.com
Cc: tytso@mit.edu
Link: http://lkml.kernel.org/r/1477673892-28940-3-git-send-email-tj@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:04 +01:00
Tejun Heo
e33a9bba85 sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler
For an interface to support blocking for IOs, it must call
io_schedule() instead of schedule().  This makes it tedious to add IO
blocking to existing interfaces as the switching between schedule()
and io_schedule() is often buried deep.

As we already have a way to mark the task as IO scheduling, this can
be made easier by separating out io_schedule() into multiple steps so
that IO schedule preparation can be performed before invoking a
blocking interface and the actual accounting happens inside the
scheduler.

io_schedule_timeout() does the following three things prior to calling
schedule_timeout().

 1. Mark the task as scheduling for IO.
 2. Flush out plugged IOs.
 3. Account the IO scheduling.

done close to the actual scheduling.  This patch moves #3 into the
scheduler so that later patches can separate out preparation and
finish steps from io_schedule().

Patch-originally-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adilger.kernel@dilger.ca
Cc: akpm@linux-foundation.org
Cc: axboe@kernel.dk
Cc: jack@suse.com
Cc: kernel-team@fb.com
Cc: mingbo@fb.com
Cc: tytso@mit.edu
Link: http://lkml.kernel.org/r/20161207204841.GA22296@htj.duckdns.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:03 +01:00
Dietmar Eggemann
b8fd842369 sched/fair: Explain why MIN_SHARES isn't scaled in calc_cfs_shares()
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e9a4d858-bcf3-36b9-e3a9-449953e34569@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:02 +01:00
Vincent Guittot
89ee048f3c sched/core: Fix group_entity's share update
The update of the share of a cfs_rq is done when its load_avg is updated
but before the group_entity's load_avg has been updated for the past time
slot. This generates wrong load_avg accounting which can be significant
when small tasks are involved in the scheduling.

Let take the example of a task a that is dequeued of its task group A:
   root
  (cfs_rq)
    \
    (se)
     A
    (cfs_rq)
      \
      (se)
       a

Task "a" was the only task in task group A which becomes idle when a is
dequeued.

We have the sequence:

- dequeue_entity a->se
    - update_load_avg(a->se)
    - dequeue_entity_load_avg(A->cfs_rq, a->se)
    - update_cfs_shares(A->cfs_rq)
	A->cfs_rq->load.weight == 0
        A->se->load.weight is updated with the new share (0 in this case)
- dequeue_entity A->se
    - update_load_avg(A->se) but its weight is now null so the last time
      slot (up to a tick) will be accounted with a weight of 0 instead of
      its real weight during the time slot. The last time slot will be
      accounted as an idle one whereas it was a running one.

If the running time of task a is short enough that no tick happens when it
runs, all running time of group entity A->se will be accounted as idle
time.

Instead, we should update the share of a cfs_rq (in fact the weight of its
group entity) only after having updated the load_avg of the group_entity.

update_cfs_shares() now takes the sched_entity as a parameter instead of the
cfs_rq, and the weight of the group_entity is updated only once its load_avg
has been synced with current time.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/1482335426-7664-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:02 +01:00
Peter Zijlstra
da9647e076 sched/completions: Fix complete_all() semantics
Documentation/scheduler/completion.txt says this about complete_all():

  "calls complete_all() to signal all current and future waiters."

Which doesn't strictly match the current semantics. Currently
complete_all() is equivalent to UINT_MAX/2 complete() invocations,
which is distinctly less than 'all current and future waiters'
(enumerable vs innumerable), although it has worked in practice.

However, Dmitry had a weird case where it might matter, so change
completions to use saturation semantics for complete()/complete_all().
Once done hits UINT_MAX (and complete_all() sets it there) it will
never again be decremented.

Requested-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: der.herr@hofr.at
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:01 +01:00
Tommaso Cucinotta
59f8c29892 sched/deadline: Show leftover runtime and abs deadline in /proc/*/sched
This patch allows for reading the current (leftover) runtime and
absolute deadline of a SCHED_DEADLINE task through /proc/*/sched
(entries dl.runtime and dl.deadline), while debugging/testing.

Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Reviewed-by: Luca Abeni <luca.abeni@unitn.it>
Acked-by: Daniel Bistrot de Oliveira <danielbristot@gmail.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1477473437-10346-2-git-send-email-tommaso.cucinotta@sssup.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:00 +01:00
Peter Zijlstra
5680d8094f sched/clock: Provide better clock continuity
When switching between the unstable and stable variants it is
currently possible that clock discontinuities occur.

And while these will mostly be 'small', attempt to do better.

As observed on my IVB-EP, the sched_clock() is ~1.5s ahead of the
ktime_get_ns() based timeline at the point of switchover
(sched_clock_init_late()) after SMP bringup.

Equally, when the TSC is later found to be unstable -- typically
because SMM tries to hide its SMI latencies by mucking with the TSC --
we want to avoid large jumps.

Since the clocksource watchdog reports the issue after the fact we
cannot exactly fix up time, but since SMI latencies are typically
small (~10ns range), the discontinuity is mainly due to drift between
sched_clock() and ktime_get_ns() (which on my desktop is ~79s over
24days).

I dislike this patch because it adds overhead to the good case in
favour of dealing with badness. But given the widespread failure of
TSC stability this is worth it.

Note that in case the TSC makes drastic jumps after SMP bringup we're
still hosed. There's just not much we can do in that case without
stupid overhead.

If we were to somehow expose tsc_clocksource_reliable (which is hard
because this code is also used on ia64 and parisc) we could avoid some
of the newly introduced overhead.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:00 +01:00
Peter Zijlstra
9881b024b7 sched/clock: Delay switching sched_clock to stable
Currently we switch to the stable sched_clock if we guess the TSC is
usable, and then switch back to the unstable path if it turns out TSC
isn't stable during SMP bringup after all.

Delay switching to the stable path until after SMP bringup is
complete. This way we'll avoid switching during the time we detect the
worst of the TSC offences.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:59 +01:00
Peter Zijlstra
555570d744 sched/clock: Update static_key usage
sched_clock was still using the deprecated static_key interface.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:57 +01:00
Thomas Gleixner
12907fbb1a sched/clock, clocksource: Add optional cs::mark_unstable() method
PeterZ reported that we'd fail to mark the TSC unstable when the
clocksource watchdog finds it unsuitable.

Allow a clocksource to run a custom action when its being marked
unstable and hook up the TSC unstable code.

Reported-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:43 +01:00
Matt Fleming
cb42c9a3eb sched/core: Add debugging code to catch missing update_rq_clock() calls
There's no diagnostic checks for figuring out when we've accidentally
missed update_rq_clock() calls. Let's add some by piggybacking on the
rq_*pin_lock() wrappers.

The idea behind the diagnostic checks is that upon pining rq lock the
rq clock should be updated, via update_rq_clock(), before anybody
reads the clock with rq_clock() or rq_clock_task().

The exception to this rule is when updates have explicitly been
disabled with the rq_clock_skip_update() optimisation.

There are some functions that only unpin the rq lock in order to grab
some other lock and avoid deadlock. In that case we don't need to
update the clock again and the previous diagnostic state can be
carried over in rq_repin_lock() by saving the state in the rq_flags
context.

Since this patch adds a new clock update flag and some already exist
in rq::clock_skip_update, that field has now been renamed. An attempt
has been made to keep the flag manipulation code small and fast since
it's used in the heart of the __schedule() fast path.

For the !CONFIG_SCHED_DEBUG case the only object code change (other
than addresses) is the following change to reset RQCF_ACT_SKIP inside
of __schedule(),

  -       c7 83 38 09 00 00 00    movl   $0x0,0x938(%rbx)
  -       00 00 00
  +       83 a3 38 09 00 00 fc    andl   $0xfffffffc,0x938(%rbx)

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-8-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:35 +01:00
Peter Zijlstra
2fb8d36787 sched/core: Add missing update_rq_clock() call in set_user_nice()
Address this rq-clock update bug:

  WARNING: CPU: 30 PID: 195 at ../kernel/sched/sched.h:797 set_next_entity()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    set_next_entity()
    ? _raw_spin_lock()
    set_curr_task_fair()
    set_user_nice.part.85()
    set_user_nice()
    create_worker()
    worker_thread()
    kthread()
    ret_from_fork()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:34 +01:00
Peter Zijlstra
3bed5e2166 sched/core: Add missing update_rq_clock() call for task_hot()
Add the update_rq_clock() call at the top of the callstack instead of
at the bottom where we find it missing, this to aid later effort to
minimize the number of update_rq_lock() calls.

  WARNING: CPU: 30 PID: 194 at ../kernel/sched/sched.h:797 assert_clock_updated()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    assert_clock_updated.isra.63.part.64()
    can_migrate_task()
    load_balance()
    pick_next_task_fair()
    __schedule()
    schedule()
    worker_thread()
    kthread()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:34 +01:00
Peter Zijlstra
80f5c1b84b sched/core: Add missing update_rq_clock() in detach_task_cfs_rq()
Instead of adding the update_rq_clock() all the way at the bottom of
the callstack, add one at the top, this to aid later effort to
minimize update_rq_lock() calls.

  WARNING: CPU: 0 PID: 1 at ../kernel/sched/sched.h:797 detach_task_cfs_rq()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    detach_task_cfs_rq()
    switched_from_fair()
    __sched_setscheduler()
    _sched_setscheduler()
    sched_set_stop_task()
    cpu_stop_create()
    __smpboot_create_thread.part.2()
    smpboot_register_percpu_thread_cpumask()
    cpu_stop_init()
    do_one_initcall()
    ? print_cpu_info()
    kernel_init_freeable()
    ? rest_init()
    kernel_init()
    ret_from_fork()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:33 +01:00
Peter Zijlstra
4126bad671 sched/core: Add missing update_rq_clock() in post_init_entity_util_avg()
Address this rq-clock update bug:

  WARNING: CPU: 0 PID: 0 at ../kernel/sched/sched.h:797 post_init_entity_util_avg()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    __warn()
    post_init_entity_util_avg()
    wake_up_new_task()
    _do_fork()
    kernel_thread()
    rest_init()
    start_kernel()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:32 +01:00
Matt Fleming
46f69fa337 sched/fair: Push rq lock pin/unpin into idle_balance()
Future patches will emit warnings if rq_clock() is called before
update_rq_clock() inside a rq_pin_lock()/rq_unpin_lock() pair.

Since there is only one caller of idle_balance() we can push the
unpin/repin there.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-7-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:32 +01:00
Matt Fleming
92509b732b sched/core: Reset RQCF_ACT_SKIP before unpinning rq->lock
rq_clock() is called from sched_info_{depart,arrive}() after resetting
RQCF_ACT_SKIP but prior to a call to update_rq_clock().

In preparation for pending patches that check whether the rq clock has
been updated inside of a pin context before rq_clock() is called, move
the reset of rq->clock_skip_update immediately before unpinning the rq
lock.

This will avoid the new warnings which check if update_rq_clock() is
being actively skipped.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-6-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:31 +01:00
Matt Fleming
d8ac897137 sched/core: Add wrappers for lockdep_(un)pin_lock()
In preparation for adding diagnostic checks to catch missing calls to
update_rq_clock(), provide wrappers for (re)pinning and unpinning
rq->lock.

Because the pending diagnostic checks allow state to be maintained in
rq_flags across pin contexts, swap the 'struct pin_cookie' arguments
for 'struct rq_flags *'.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-5-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:30 +01:00
Nicolai Hähnle
977625a693 locking/mutex: Initialize mutex_waiter::ww_ctx with poison when debugging
Help catch cases where mutex_lock is used directly on w/w mutexes, which
otherwise result in the w/w tasks reading uninitialized data.

Signed-off-by: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-12-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:53 +01:00
Nicolai Hähnle
c516df978d locking/ww_mutex: Optimize ww-mutexes by yielding to other waiters from optimistic spin
Lock stealing is less beneficial for w/w mutexes since we may just end up
backing off if we stole from a thread with an earlier acquire stamp that
already holds another w/w mutex that we also need. So don't spin
optimistically unless we are sure that there is no other waiter that might
cause us to back off.

Median timings taken of a contention-heavy GPU workload:

Before:

  real    0m52.946s
  user    0m7.272s
  sys     1m55.964s

After:

  real    0m53.086s
  user    0m7.360s
  sys     1m46.204s

This particular workload still spends 20%-25% of CPU in mutex_spin_on_owner
according to perf, but my attempts to further reduce this spinning based on
various heuristics all lead to an increase in measured wall time despite
the decrease in sys time.

Signed-off-by: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-11-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:52 +01:00
Nicolai Hähnle
25f13b4040 locking/ww_mutex: Re-check ww->ctx in the inner optimistic spin loop
In the following scenario, thread #1 should back off its attempt to lock
ww1 and unlock ww2 (assuming the acquire context stamps are ordered
accordingly).

    Thread #0               Thread #1
    ---------               ---------
                            successfully lock ww2
    set ww1->base.owner
                            attempt to lock ww1
                            confirm ww1->ctx == NULL
                            enter mutex_spin_on_owner
    set ww1->ctx

What was likely to happen previously is:

    attempt to lock ww2
    refuse to spin because
      ww2->ctx != NULL
    schedule()
                            detect thread #0 is off CPU
                            stop optimistic spin
                            return -EDEADLK
                            unlock ww2
                            wakeup thread #0
    lock ww2

Now, we are more likely to see:

                            detect ww1->ctx != NULL
                            stop optimistic spin
                            return -EDEADLK
                            unlock ww2
    successfully lock ww2

... because thread #1 will stop its optimistic spin as soon as possible.

The whole scenario is quite unlikely, since it requires thread #1 to get
between thread #0 setting the owner and setting the ctx. But since we're
idling here anyway, the additional check is basically free.

Found by inspection.

Signed-off-by: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-10-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:51 +01:00
Peter Zijlstra
427b18207a locking/mutex: Improve inlining
Instead of inlining __mutex_lock_common() 5 times, once for each
{state,ww} variant. Reduce this to two, ww and !ww.

Then add __always_inline to mutex_optimistic_spin(), so that that will
get inlined all 4 remaining times, for all {waiter,ww} variants.

   text    data     bss     dec     hex filename

   6301       0       0    6301    189d defconfig-build/kernel/locking/mutex.o
   4053       0       0    4053     fd5 defconfig-build/kernel/locking/mutex.o
   4257       0       0    4257    10a1 defconfig-build/kernel/locking/mutex.o

This reduces total text size and better separates the ww and !ww mutex
code generation.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:48 +01:00
Nicolai Hähnle
659cf9f582 locking/ww_mutex: Optimize ww-mutexes by waking at most one waiter for backoff when acquiring the lock
The wait list is sorted by stamp order, and the only waiting task that may
have to back off is the first waiter with a context.

The regular slow path does not have to wake any other tasks at all, since
all other waiters that would have to back off were either woken up when
the waiter was added to the list, or detected the condition before they
added themselves.

Median timings taken of a contention-heavy GPU workload:

Without this series:

  real    0m59.900s
  user    0m7.516s
  sys     2m16.076s

With changes up to and including this patch:

  real    0m52.946s
  user    0m7.272s
  sys     1m55.964s

Signed-off-by: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-9-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:44 +01:00
Nicolai Hähnle
200b187440 locking/ww_mutex: Notify waiters that have to back off while adding tasks to wait list
While adding our task as a waiter, detect if another task should back off
because of us.

With this patch, we establish the invariant that the wait list contains
at most one (sleeping) waiter with ww_ctx->acquired > 0, and this waiter
will be the first waiter with a context.

Since only waiters with ww_ctx->acquired > 0 have to back off, this allows
us to be much more economical with wakeups.

Signed-off-by: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-8-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:43 +01:00
Nicolai Hähnle
6baa5c60a9 locking/ww_mutex: Add waiters in stamp order
Add regular waiters in stamp order. Keep adding waiters that have no
context in FIFO order and take care not to starve them.

While adding our task as a waiter, back off if we detect that there is
a waiter with a lower stamp in front of us.

Make sure to call lock_contended even when we back off early.

For w/w mutexes, being first in the wait list is only stable when
taking the lock without a context. Therefore, the purpose of the first
flag is split into two: 'first' remains to indicate whether we want to
spin optimistically, while 'handoff' indicates that we should be
prepared to accept a handoff.

For w/w locking with a context, we always accept handoffs after the
first schedule(), to handle the following sequence of events:

 1. Task #0 unlocks and hands off to Task #2 which is first in line

 2. Task #1 adds itself in front of Task #2

 3. Task #2 wakes up and must accept the handoff even though it is no
    longer first in line

Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: =?UTF-8?q?Nicolai=20H=C3=A4hnle?= <Nicolai.Haehnle@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-7-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:42 +01:00
Nicolai Hähnle
c5470b22d1 locking/ww_mutex: Remove the __ww_mutex_lock*() inline wrappers
Keep the documentation in the header file since there is no good place
for it in mutex.c: there are two rather different implementations with
different EXPORT_SYMBOLs for each function.

Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: =?UTF-8?q?Nicolai=20H=C3=A4hnle?= <Nicolai.Haehnle@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-6-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:41 +01:00
Nicolai Hähnle
ea9e0fb8fe locking/ww_mutex: Set use_ww_ctx even when locking without a context
We will add a new field to struct mutex_waiter.  This field must be
initialized for all waiters if any waiter uses the ww_use_ctx path.

So there is a trade-off: Keep ww_mutex locking without a context on
the faster non-use_ww_ctx path, at the cost of adding the
initialization to all mutex locks (including non-ww_mutexes), or avoid
the additional cost for non-ww_mutex locks, at the cost of adding
additional checks to the use_ww_ctx path.

We take the latter choice.  It may be worth eliminating the users of
ww_mutex_lock(lock, NULL), but there are a lot of them.

Signed-off-by: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-5-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:40 +01:00
Nicolai Hähnle
3822da3ed0 locking/ww_mutex: Extract stamp comparison to __ww_mutex_stamp_after()
The function will be re-used in subsequent patches.

Signed-off-by: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <dev@mblankhorst.nl>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1482346000-9927-4-git-send-email-nhaehnle@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:39 +01:00
Peter Zijlstra
e274795ea7 locking/mutex: Fix mutex handoff
While reviewing the ww_mutex patches, I noticed that it was still
possible to (incorrectly) succeed for (incorrect) code like:

	mutex_lock(&a);
	mutex_lock(&a);

This was possible if the second mutex_lock() would block (as expected)
but then receive a spurious wakeup. At that point it would find itself
at the front of the queue, request a handoff and instantly claim
ownership and continue, since owner would point to itself.

Avoid this scenario and simplify the code by introducing a third low
bit to signal handoff pickup. So once we request handoff, unlock
clears the handoff bit and sets the pickup bit along with the new
owner.

This also removes the need for the .handoff argument to
__mutex_trylock(), since that becomes superfluous with PICKUP.

In order to guarantee enough low bits, ensure task_struct alignment is
at least L1_CACHE_BYTES (which seems a good ideal regardless).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 9d659ae14b ("locking/mutex: Add lock handoff to avoid starvation")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:38 +01:00
Davidlohr Bueso
52b94129f2 locking/percpu-rwsem: Replace waitqueue with rcuwait
The use of any kind of wait queue is an overkill for pcpu-rwsems.
While one option would be to use the less heavy simple (swait)
flavor, this is still too much for what pcpu-rwsems needs. For one,
we do not care about any sort of queuing in that the only (rare) time
writers (and readers, for that matter) are queued is when trying to
acquire the regular contended rw_sem. There cannot be any further
queuing as writers are serialized by the rw_sem in the first place.

Given that percpu_down_write() must not be called after exit_notify(),
we can replace the bulky waitqueue with rcuwait such that a writer
can wait for its turn to take the lock. As such, we can avoid the
queue handling and locking overhead.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Link: http://lkml.kernel.org/r/1484148146-14210-3-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:35 +01:00
Davidlohr Bueso
8f95c90ceb sched/wait, RCU: Introduce rcuwait machinery
rcuwait provides support for (single) RCU-safe task wait/wake functionality,
with the caveat that it must not be called after exit_notify(), such that
we avoid racing with rcu delayed_put_task_struct callbacks, task_struct
being rcu unaware in this context -- for which we similarly have
task_rcu_dereference() magic, but with different return semantics, which
can conflict with the wakeup side.

The interfaces are quite straightforward:

  rcuwait_wait_event()
  rcuwait_wake_up()

More details are in the comments, but it's perhaps worth mentioning at least,
that users must provide proper serialization when waiting on a condition, and
avoid corrupting a concurrent waiter. Also care must be taken between the task
and the condition for when calling the wakeup -- we cannot miss wakeups. When
porting users, this is for example, a given when using waitqueues in that
everything is done under the q->lock. As such, it can remove sources of non
preemptable unbounded work for realtime.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Link: http://lkml.kernel.org/r/1484148146-14210-2-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:33 +01:00
Davidlohr Bueso
642fa448ae sched/core: Remove set_task_state()
This is a nasty interface and setting the state of a foreign task must
not be done. As of the following commit:

  be628be095 ("bcache: Make gc wakeup sane, remove set_task_state()")

... everyone in the kernel calls set_task_state() with current, allowing
the helper to be removed.

However, as the comment indicates, it is still around for those archs
where computing current is more expensive than using a pointer, at least
in theory. An important arch that is affected is arm64, however this has
been addressed now [1] and performance is up to par making no difference
with either calls.

Of all the callers, if any, it's the locking bits that would care most
about this -- ie: we end up passing a tsk pointer to a lot of the lock
slowpath, and setting ->state on that. The following numbers are based
on two tests: a custom ad-hoc microbenchmark that just measures
latencies (for ~65 million calls) between get_task_state() vs
get_current_state().

Secondly for a higher overview, an unlink microbenchmark was used,
which pounds on a single file with open, close,unlink combos with
increasing thread counts (up to 4x ncpus). While the workload is quite
unrealistic, it does contend a lot on the inode mutex or now rwsem.

[1] https://lkml.kernel.org/r/1483468021-8237-1-git-send-email-mark.rutland@arm.com

== 1. x86-64 ==

Avg runtime set_task_state():    601 msecs
Avg runtime set_current_state(): 552 msecs

                                            vanilla                 dirty
Hmean    unlink1-processes-2      36089.26 (  0.00%)    38977.33 (  8.00%)
Hmean    unlink1-processes-5      28555.01 (  0.00%)    29832.55 (  4.28%)
Hmean    unlink1-processes-8      37323.75 (  0.00%)    44974.57 ( 20.50%)
Hmean    unlink1-processes-12     43571.88 (  0.00%)    44283.01 (  1.63%)
Hmean    unlink1-processes-21     34431.52 (  0.00%)    38284.45 ( 11.19%)
Hmean    unlink1-processes-30     34813.26 (  0.00%)    37975.17 (  9.08%)
Hmean    unlink1-processes-48     37048.90 (  0.00%)    39862.78 (  7.59%)
Hmean    unlink1-processes-79     35630.01 (  0.00%)    36855.30 (  3.44%)
Hmean    unlink1-processes-110    36115.85 (  0.00%)    39843.91 ( 10.32%)
Hmean    unlink1-processes-141    32546.96 (  0.00%)    35418.52 (  8.82%)
Hmean    unlink1-processes-172    34674.79 (  0.00%)    36899.21 (  6.42%)
Hmean    unlink1-processes-203    37303.11 (  0.00%)    36393.04 ( -2.44%)
Hmean    unlink1-processes-224    35712.13 (  0.00%)    36685.96 (  2.73%)

== 2. ppc64le ==

Avg runtime set_task_state():  938 msecs
Avg runtime set_current_state: 940 msecs

                                            vanilla                 dirty
Hmean    unlink1-processes-2      19269.19 (  0.00%)    30704.50 ( 59.35%)
Hmean    unlink1-processes-5      20106.15 (  0.00%)    21804.15 (  8.45%)
Hmean    unlink1-processes-8      17496.97 (  0.00%)    17243.28 ( -1.45%)
Hmean    unlink1-processes-12     14224.15 (  0.00%)    17240.21 ( 21.20%)
Hmean    unlink1-processes-21     14155.66 (  0.00%)    15681.23 ( 10.78%)
Hmean    unlink1-processes-30     14450.70 (  0.00%)    15995.83 ( 10.69%)
Hmean    unlink1-processes-48     16945.57 (  0.00%)    16370.42 ( -3.39%)
Hmean    unlink1-processes-79     15788.39 (  0.00%)    14639.27 ( -7.28%)
Hmean    unlink1-processes-110    14268.48 (  0.00%)    14377.40 (  0.76%)
Hmean    unlink1-processes-141    14023.65 (  0.00%)    16271.69 ( 16.03%)
Hmean    unlink1-processes-172    13417.62 (  0.00%)    16067.55 ( 19.75%)
Hmean    unlink1-processes-203    15293.08 (  0.00%)    15440.40 (  0.96%)
Hmean    unlink1-processes-234    13719.32 (  0.00%)    16190.74 ( 18.01%)
Hmean    unlink1-processes-265    16400.97 (  0.00%)    16115.22 ( -1.74%)
Hmean    unlink1-processes-296    14388.60 (  0.00%)    16216.13 ( 12.70%)
Hmean    unlink1-processes-320    15771.85 (  0.00%)    15905.96 (  0.85%)

x86-64 (known to be fast for get_current()/this_cpu_read_stable() caching)
and ppc64 (with paca) show similar improvements in the unlink microbenches.
The small delta for ppc64 (2ms), does not represent the gains on the unlink
runs. In the case of x86, there was a decent amount of variation in the
latency runs, but always within a 20 to 50ms increase), ppc was more constant.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: mark.rutland@arm.com
Link: http://lkml.kernel.org/r/1483479794-14013-5-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:16 +01:00
Davidlohr Bueso
d269a8b8c5 kernel/locking: Compute 'current' directly
This patch effectively replaces the tsk pointer dereference
(which is obviously == current), to directly use get_current()
macro. This is to make the removal of setting foreign task
states smoother and painfully obvious. Performance win on some
archs such as x86-64 and ppc64. On a microbenchmark that calls
set_task_state() vs set_current_state() and an inode rwsem
pounding benchmark doing unlink:

== 1. x86-64 ==

Avg runtime set_task_state():    601 msecs
Avg runtime set_current_state(): 552 msecs

                                            vanilla                 dirty
Hmean    unlink1-processes-2      36089.26 (  0.00%)    38977.33 (  8.00%)
Hmean    unlink1-processes-5      28555.01 (  0.00%)    29832.55 (  4.28%)
Hmean    unlink1-processes-8      37323.75 (  0.00%)    44974.57 ( 20.50%)
Hmean    unlink1-processes-12     43571.88 (  0.00%)    44283.01 (  1.63%)
Hmean    unlink1-processes-21     34431.52 (  0.00%)    38284.45 ( 11.19%)
Hmean    unlink1-processes-30     34813.26 (  0.00%)    37975.17 (  9.08%)
Hmean    unlink1-processes-48     37048.90 (  0.00%)    39862.78 (  7.59%)
Hmean    unlink1-processes-79     35630.01 (  0.00%)    36855.30 (  3.44%)
Hmean    unlink1-processes-110    36115.85 (  0.00%)    39843.91 ( 10.32%)
Hmean    unlink1-processes-141    32546.96 (  0.00%)    35418.52 (  8.82%)
Hmean    unlink1-processes-172    34674.79 (  0.00%)    36899.21 (  6.42%)
Hmean    unlink1-processes-203    37303.11 (  0.00%)    36393.04 ( -2.44%)
Hmean    unlink1-processes-224    35712.13 (  0.00%)    36685.96 (  2.73%)

== 2. ppc64le ==

Avg runtime set_task_state():  938 msecs
Avg runtime set_current_state: 940 msecs

                                            vanilla                 dirty
Hmean    unlink1-processes-2      19269.19 (  0.00%)    30704.50 ( 59.35%)
Hmean    unlink1-processes-5      20106.15 (  0.00%)    21804.15 (  8.45%)
Hmean    unlink1-processes-8      17496.97 (  0.00%)    17243.28 ( -1.45%)
Hmean    unlink1-processes-12     14224.15 (  0.00%)    17240.21 ( 21.20%)
Hmean    unlink1-processes-21     14155.66 (  0.00%)    15681.23 ( 10.78%)
Hmean    unlink1-processes-30     14450.70 (  0.00%)    15995.83 ( 10.69%)
Hmean    unlink1-processes-48     16945.57 (  0.00%)    16370.42 ( -3.39%)
Hmean    unlink1-processes-79     15788.39 (  0.00%)    14639.27 ( -7.28%)
Hmean    unlink1-processes-110    14268.48 (  0.00%)    14377.40 (  0.76%)
Hmean    unlink1-processes-141    14023.65 (  0.00%)    16271.69 ( 16.03%)
Hmean    unlink1-processes-172    13417.62 (  0.00%)    16067.55 ( 19.75%)
Hmean    unlink1-processes-203    15293.08 (  0.00%)    15440.40 (  0.96%)
Hmean    unlink1-processes-234    13719.32 (  0.00%)    16190.74 ( 18.01%)
Hmean    unlink1-processes-265    16400.97 (  0.00%)    16115.22 ( -1.74%)
Hmean    unlink1-processes-296    14388.60 (  0.00%)    16216.13 ( 12.70%)
Hmean    unlink1-processes-320    15771.85 (  0.00%)    15905.96 (  0.85%)

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: mark.rutland@arm.com
Link: http://lkml.kernel.org/r/1483479794-14013-4-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:14 +01:00
Davidlohr Bueso
0039962a14 kernel/exit: Compute 'current' directly
This patch effectively replaces the tsk pointer dereference (which is
obviously == current), to directly use get_current() macro. In this
case, do_exit() always passes current to exit_mm(), hence we can
simply get rid of the argument. This is also a performance win on some
archs such as x86-64 and ppc64 -- arm64 is no longer an issue.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: mark.rutland@arm.com
Link: http://lkml.kernel.org/r/1483479794-14013-2-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:14:11 +01:00
Jiri Olsa
475113d937 perf/x86/intel: Account interrupts for PEBS errors
It's possible to set up PEBS events to get only errors and not
any data, like on SNB-X (model 45) and IVB-EP (model 62)
via 2 perf commands running simultaneously:

    taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10

This leads to a soft lock up, because the error path of the
intel_pmu_drain_pebs_nhm() does not account event->hw.interrupt
for error PEBS interrupts, so in case you're getting ONLY
errors you don't have a way to stop the event when it's over
the max_samples_per_tick limit:

  NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
  ...
  RIP: 0010:[<ffffffff81159232>]  [<ffffffff81159232>] smp_call_function_single+0xe2/0x140
  ...
  Call Trace:
   ? trace_hardirqs_on_caller+0xf5/0x1b0
   ? perf_cgroup_attach+0x70/0x70
   perf_install_in_context+0x199/0x1b0
   ? ctx_resched+0x90/0x90
   SYSC_perf_event_open+0x641/0xf90
   SyS_perf_event_open+0x9/0x10
   do_syscall_64+0x6c/0x1f0
   entry_SYSCALL64_slow_path+0x25/0x25

Add perf_event_account_interrupt() which does the interrupt
and frequency checks and call it from intel_pmu_drain_pebs_nhm()'s
error path.

We keep the pending_kill and pending_wakeup logic only in the
__perf_event_overflow() path, because they make sense only if
there's any data to deliver.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:06:49 +01:00