27099 Commits

Author SHA1 Message Date
Jiri Olsa
705feaf321 hw_breakpoint: Add perf_event_attr fields check in __modify_user_hw_breakpoint()
And rename it to modify_user_hw_breakpoint_check().

We are about to use modify_user_hw_breakpoint_check() to modify user space
breakpoints, so we must strictly check that only the fields we are allowed
to change have changed. As Peter explained:

 "Suppose someone does:

        attr = malloc(sizeof(*attr)); // uninitialized memory
        attr->type = BP;
        attr->bp_addr = new_addr;
        attr->bp_type = bp_type;
        attr->bp_len = bp_len;
        ioctl(fd, PERF_IOC_MOD_ATTR, &attr);

  And feeds absolute shite for the rest of the fields.
  Then we later want to extend IOC_MOD_ATTR to allow changing
  attr::sample_type but we can't, because that would break the
  above application."

I'm making this check optional because we already export
modify_user_hw_breakpoint() and with this check we could
break existing users.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Milind Chabbi <chabbi.milind@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <onestero@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20180312134548.31532-6-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13 06:56:08 +01:00
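The optional strictness check described in 705feaf321 above can be sketched in user space roughly as follows. This is a minimal illustration, not the kernel code: struct toy_attr and toy_modify_check() are hypothetical names, and the toy struct uses only 64-bit members so that memcmp() is not affected by padding.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct perf_event_attr; not the real layout. */
struct toy_attr {
	uint64_t type;
	uint64_t bp_addr;
	uint64_t bp_type;
	uint64_t bp_len;
	uint64_t sample_type;	/* a field the modify path must not change */
};

/*
 * Return 0 if 'new_attr' differs from 'cur' only in the three breakpoint
 * fields, -EINVAL otherwise. When 'check' is false the comparison is
 * skipped, mirroring the optional check described in the commit message.
 */
int toy_modify_check(const struct toy_attr *cur,
		     const struct toy_attr *new_attr, bool check)
{
	struct toy_attr expected = *cur;

	/* Overlay only the fields the caller is allowed to change. */
	expected.bp_addr = new_attr->bp_addr;
	expected.bp_type = new_attr->bp_type;
	expected.bp_len = new_attr->bp_len;

	/* Any remaining difference means a non-modifiable field changed. */
	if (check && memcmp(&expected, new_attr, sizeof(expected)) != 0)
		return -EINVAL;
	return 0;
}
```

With the check enabled, a caller that passes garbage in the remaining fields is rejected up front, which keeps those fields free for future extensions.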
Jiri Olsa
18ff57b220 hw_breakpoint: Factor out __modify_user_hw_breakpoint() function
Move out all the functionality except the event disabling/enabling
calls, because we want to call different disabling/enabling
functions in a following change.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Milind Chabbi <chabbi.milind@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <onestero@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20180312134548.31532-5-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13 06:56:08 +01:00
Jiri Olsa
ea6a9d530c hw_breakpoint: Add modify_bp_slot() function
Add the modify_bp_slot() function to keep slot numbers
correct when changing the breakpoint type.

Use the existing __release_bp_slot()/__reserve_bp_slot()
call sequence to update the slot counts.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Milind Chabbi <chabbi.milind@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <onestero@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20180312134548.31532-4-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13 06:56:07 +01:00
Jiri Olsa
1ad9ff7dea hw_breakpoint: Pass bp_type argument to __reserve_bp_slot|__release_bp_slot()
Pass a bp_type argument to the __reserve_bp_slot() and __release_bp_slot()
functions, so we can pass a bp_type other than the one defined in
bp->attr.bp_type. This will be handy in a following change that fixes
breakpoint slot counts during breakpoint modification.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Milind Chabbi <chabbi.milind@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <onestero@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20180312134548.31532-3-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13 06:56:07 +01:00
Jiri Olsa
cbd9d9f114 hw_breakpoint: Pass bp_type directly as find_slot_idx() argument
Pass bp_type directly as a find_slot_idx() argument,
so we don't need the whole event to get the
breakpoint slot type. It will be used in the following
changes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Milind Chabbi <chabbi.milind@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <onestero@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20180312134548.31532-2-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13 06:56:07 +01:00
Masami Hiramatsu
b6b76dd62c error-injection: Fix to prohibit jump optimization
Since a jump-optimized kprobe cannot change the execution path,
the kprobe for error-injection must not be optimized. To prohibit
optimization, set a dummy post-handler, as officially stated in
Documentation/kprobes.txt.

Fixes: 4b1a29a7f5 ("error-injection: Support fault injection framework")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-12 16:16:00 +01:00
leilei.lin
33801b9474 perf/core: Fix installing cgroup events on CPU
There are two problems when installing cgroup events on CPUs: firstly,
list_update_cgroup_event() only tries to set cpuctx->cgrp for the
first event; if that mismatches on @cgrp, we'll not try again for later
additions.

Secondly, when we install a cgroup event into an active context, only
issue an event reprogram when the event matches the current cgroup
context. This avoids a pointless event reprogramming.

Signed-off-by: leilei.lin <leilei.lin@alibaba-inc.com>
[ Improved the changelog and comments. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: brendan.d.gregg@gmail.com
Cc: eranian@gmail.com
Cc: linux-kernel@vger.kernel.org
Cc: yang_oliver@hotmail.com
Link: http://lkml.kernel.org/r/20180306093637.28247-1-linxiulei@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:51 +01:00
Peter Zijlstra
8d5bce0c37 perf/core: Optimize perf_rotate_context() event scheduling
The event schedule order (as per perf_event_sched_in()) is:

 - cpu  pinned
 - task pinned
 - cpu  flexible
 - task flexible

But perf_rotate_context() will unschedule cpu-flexible even if it
doesn't need a rotation.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:50 +01:00
Peter Zijlstra
8703a7cfe1 perf/core: Fix tree based event rotation
Similar to how first programming cpu=-1 and then cpu=# is wrong, so is
rotating both. It was especially wrong when we were still programming
the PMU in this same order, because in that scenario we might never
actually end up running cpu=# events at all.

Cure this by using the active_list to pick the rotation event; since
at programming we already select the left-most event.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:50 +01:00
Peter Zijlstra
6e6804d2fa perf/core: Simplify perf_event_groups_for_each()
The last argument is, and always must be, the same.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:50 +01:00
Peter Zijlstra
6668128a9e perf/core: Optimize ctx_sched_out()
When an event group contains more events than can be scheduled on the
hardware, iterating the full event group for ctx_sched_out is a waste
of time.

Keep track of the events that got programmed on the hardware, such
that we can iterate this smaller list in order to schedule them out.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:50 +01:00
Peter Zijlstra
8343aae661 perf/core: Remove perf_event::group_entry
Now that all the grouping is done with RB trees, we no longer need
group_entry and can replace the whole thing with sibling_list.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:49 +01:00
Peter Zijlstra
1cac7b1ae3 perf/core: Fix event schedule order
Scheduling in events with cpu=-1 before events with cpu=# changes
semantics and is undesirable in that it would prioritize these events.

Given that groups->index is across all groups we actually have an
inter-group ordering, meaning we can merge-sort two groups, which is
just what we need to preserve semantics.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:49 +01:00
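The merge step that 1cac7b1ae3 above relies on can be sketched as follows. This is an illustration under stated assumptions, not the kernel implementation: struct toy_event and toy_merge_by_index() are hypothetical names, and plain arrays stand in for the two per-CPU groups (cpu=-1 and cpu=#), each already ordered by a global group index.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified event: only the global index matters here. */
struct toy_event {
	int cpu;		/* -1 means "any CPU" */
	uint64_t group_index;	/* unique and increasing across all groups */
};

/*
 * Merge two arrays that are each sorted by group_index into 'out', so
 * that events are visited in global group_index order regardless of
 * which group they came from. 'out' must have room for na + nb entries.
 * Returns the number of events written.
 */
size_t toy_merge_by_index(const struct toy_event *a, size_t na,
			  const struct toy_event *b, size_t nb,
			  struct toy_event *out)
{
	size_t i = 0, j = 0, n = 0;

	/* Always take whichever head has the smaller global index. */
	while (i < na && j < nb) {
		if (a[i].group_index < b[j].group_index)
			out[n++] = a[i++];
		else
			out[n++] = b[j++];
	}
	/* Drain whichever input still has events left. */
	while (i < na)
		out[n++] = a[i++];
	while (j < nb)
		out[n++] = b[j++];
	return n;
}
```

Because the index is shared across groups, neither the cpu=-1 events nor the cpu=# events are systematically favored; the original insertion order decides.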
Peter Zijlstra
161c85fab7 perf/core: Cleanup the rb-tree code
Trivial comment and code fixups.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:49 +01:00
Alexey Budankov
8e1a2031e4 perf/core: Use RB trees for pinned/flexible groups
Change event groups into RB trees sorted by CPU and then by a 64-bit
index, so that the multiplexing hrtimer interrupt handler can skip
to the current CPU's list and ignore groups allocated for the
other CPUs.

A new API for manipulating event groups in the trees is implemented,
and the current implementation is adapted to use it.

The pinned_group_sched_in() and flexible_group_sched_in() functions are
introduced to consolidate the code that enables a whole group from the
pinned and flexible groups, respectively.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/372f9c8b-0cfe-4240-e44d-83d863d40813@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:49 +01:00
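The ordering that 8e1a2031e4 above describes (by CPU first, then by a 64-bit index) can be sketched with a comparison function. This is only an illustration: struct toy_group, toy_group_cmp() and toy_sort_groups() are hypothetical names, and a sorted array via qsort() stands in for the kernel's RB tree.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified group key. */
struct toy_group {
	int cpu;	/* primary key */
	uint64_t index;	/* secondary key: insertion order */
};

/* Order by CPU first, then by index; usable as a qsort() comparator. */
int toy_group_cmp(const void *pa, const void *pb)
{
	const struct toy_group *a = pa;
	const struct toy_group *b = pb;

	if (a->cpu != b->cpu)
		return a->cpu < b->cpu ? -1 : 1;
	if (a->index != b->index)
		return a->index < b->index ? -1 : 1;
	return 0;
}

/* Sort 'n' groups in place using the (cpu, index) ordering. */
void toy_sort_groups(struct toy_group *groups, size_t n)
{
	qsort(groups, n, sizeof(*groups), toy_group_cmp);
}
```

With this ordering, all groups for a given CPU are contiguous, so a lookup can jump to the first group of the current CPU and stop once the CPU changes.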
Peter Zijlstra
9e5b127d6f perf/core: Fix perf_output_read_group()
Mark reported his arm64 perf fuzzer runs sometimes splat like:

  armv8pmu_read_counter+0x1e8/0x2d8
  armpmu_event_update+0x8c/0x188
  armpmu_read+0xc/0x18
  perf_output_read+0x550/0x11e8
  perf_event_read_event+0x1d0/0x248
  perf_event_exit_task+0x468/0xbb8
  do_exit+0x690/0x1310
  do_group_exit+0xd0/0x2b0
  get_signal+0x2e8/0x17a8
  do_signal+0x144/0x4f8
  do_notify_resume+0x148/0x1e8
  work_pending+0x8/0x14

which asserts that we only call pmu::read() on ACTIVE events.

The above callchain does:

  perf_event_exit_task()
    perf_event_exit_task_context()
      task_ctx_sched_out() // INACTIVE
      perf_event_exit_event()
        perf_event_set_state(EXIT) // EXIT
        sync_child_event()
          perf_event_read_event()
            perf_output_read()
              perf_output_read_group()
                leader->pmu->read()

Which results in doing a pmu::read() on an !ACTIVE event.

I _think_ this is 'new' since we added attr.inherit_stat, which added
the perf_event_read_event() to the exit path; without that,
perf_event_read_output() would only trigger from samples, and for
@event to trigger a sample, its leader _must_ be ACTIVE too.

Still, adding this check makes it consistent with the @sub case for
the siblings.

Reported-and-Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:48 +01:00
Ingo Molnar
9884afa2fd Linux 4.16-rc5
Merge tag 'v4.16-rc5' into locking/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 12:14:57 +01:00
Linus Torvalds
8ad4424350 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Thomas Gleixner:
 "Another set of perf updates:

   - Fix a Skylake Uncore event format declaration

   - Prevent perf pipe mode from crashing, which was caused by a missing
     buffer allocation

   - Make the perf top popup message which tells the user that it uses
     fallback mode on older kernels a debug message.

   - Make perf context rescheduling work correctly

   - Robustify the jump error drawing in perf browser mode so it does
     not try to create references to NULL initialized offset entries

   - Make trigger_on() robust so it does not enable the trigger before
     everything is set up correctly to handle it

   - Make perf auxtrace respect the --no-itrace option so it does not
     try to queue AUX data for decoding.

   - Prevent having a different number of field separators in CSV output
     lines when a counter is not supported.

   - Make the perf kallsyms man page usage behave like it does for all
     other perf commands.

   - Synchronize the kernel headers"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix ctx_event_type in ctx_resched()
  perf tools: Fix trigger class trigger_on()
  perf auxtrace: Prevent decoding when --no-itrace
  perf stat: Fix CSV output format for non-supported counters
  tools headers: Sync x86's cpufeatures.h
  tools headers: Sync copy of kvm UAPI headers
  perf record: Fix crash in pipe mode
  perf annotate browser: Be more robust when drawing jump arrows
  perf top: Fix annoying fallback message on older kernels
  perf kallsyms: Fix the usage on the man page
  perf/x86/intel/uncore: Fix Skylake UPI event format
2018-03-11 14:49:49 -07:00
Linus Torvalds
02bf0ef028 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Thomas Gleixner:
 "rt_mutex_futex_unlock() grew a new irq-off call site, but the function
  assumes that it's always called from irq-enabled context.

  Use (un)lock_irqsafe() to handle the new call site correctly"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rtmutex: Make rt_mutex_futex_unlock() safe for irq-off callsites
2018-03-11 14:46:54 -07:00
Ingo Molnar
c4fb5f3700 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 - Miscellaneous fixes, perhaps most notably removing obsolete
   code whose only purpose in life was to gather information for
   the now-removed RCU debugfs facility.  Other notable changes
   include removing NO_HZ_FULL_ALL in favor of the nohz_full kernel
   boot parameter, minor optimizations for expedited grace periods,
   some added tracing, creating an RCU-specific workqueue using Tejun's
   new WQ_MEM_RECLAIM flag, and several cleanups to code and comments.

 - SRCU cleanups and optimizations.

 - Torture-test updates, perhaps most notably the adding of ARMv8
   support, but also including numerous cleanups and usability fixes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-11 10:42:16 +01:00
Ingo Molnar
d88f1f1fdb Merge branch 'linus' into locking/core, to pick up fixes and dependencies
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-10 10:19:28 +01:00
Kees Cook
0862ca422b bug: use %pB in BUG and stack protector failure
The BUG and stack protector reports were still using a raw %p.  This
changes it to %pB for more meaningful output.

Link: http://lkml.kernel.org/r/20180301225704.GA34198@beast
Fixes: ad67b74d24 ("printk: hash addresses printed with %p")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Richard Weinberger <richard.weinberger@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-09 16:40:01 -08:00
Boqun Feng
6b0ef92fee rtmutex: Make rt_mutex_futex_unlock() safe for irq-off callsites
When running rcutorture with TREE03 config, CONFIG_PROVE_LOCKING=y, and
kernel cmdline argument "rcutorture.gp_exp=1", lockdep reports a
HARDIRQ-safe->HARDIRQ-unsafe deadlock:

 ================================
 WARNING: inconsistent lock state
 4.16.0-rc4+ #1 Not tainted
 --------------------------------
 inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
 takes:
 __schedule+0xbe/0xaf0
 {IN-HARDIRQ-W} state was registered at:
   _raw_spin_lock+0x2a/0x40
   scheduler_tick+0x47/0xf0
...
 other info that might help us debug this:
  Possible unsafe locking scenario:
        CPU0
        ----
   lock(&rq->lock);
   <Interrupt>
     lock(&rq->lock);
  *** DEADLOCK ***
 1 lock held by rcu_torture_rea/724:
 rcu_torture_read_lock+0x0/0x70
 stack backtrace:
 CPU: 2 PID: 724 Comm: rcu_torture_rea Not tainted 4.16.0-rc4+ #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
 Call Trace:
  lock_acquire+0x90/0x200
  ? __schedule+0xbe/0xaf0
  _raw_spin_lock+0x2a/0x40
  ? __schedule+0xbe/0xaf0
  __schedule+0xbe/0xaf0
  preempt_schedule_irq+0x2f/0x60
  retint_kernel+0x1b/0x2d
 RIP: 0010:rcu_read_unlock_special+0x0/0x680
  ? rcu_torture_read_unlock+0x60/0x60
  __rcu_read_unlock+0x64/0x70
  rcu_torture_read_unlock+0x17/0x60
  rcu_torture_reader+0x275/0x450
  ? rcutorture_booster_init+0x110/0x110
  ? rcu_torture_stall+0x230/0x230
  ? kthread+0x10e/0x130
  kthread+0x10e/0x130
  ? kthread_create_worker_on_cpu+0x70/0x70
  ? call_usermodehelper_exec_async+0x11a/0x150
  ret_from_fork+0x3a/0x50

This happens with the following event sequence:

	preempt_schedule_irq();
	  local_irq_enable();
	  __schedule():
	    local_irq_disable(); // irq off
	    ...
	    rcu_note_context_switch():
	      rcu_note_preempt_context_switch():
	        rcu_read_unlock_special():
	          local_irq_save(flags);
	          ...
		  raw_spin_unlock_irqrestore(...,flags); // irq remains off
	          rt_mutex_futex_unlock():
	            raw_spin_lock_irq();
	            ...
	            raw_spin_unlock_irq(); // accidentally set irq on

	    <return to __schedule()>
	    rq_lock():
	      raw_spin_lock(); // acquiring rq->lock with irq on

which means rq->lock becomes a HARDIRQ-unsafe lock, which can cause
deadlocks in scheduler code.

This problem was introduced by commit 02a7c234e5 ("rcu: Suppress
lockdep false-positive ->boost_mtx complaints"). That commit added a
caller of rt_mutex_futex_unlock() that runs with irqs off.

To fix this, replace the *lock_irq() in rt_mutex_futex_unlock() with
*lock_irq{save,restore}() to make it safe to call rt_mutex_futex_unlock()
with irq off.

Fixes: 02a7c234e5 ("rcu: Suppress lockdep false-positive ->boost_mtx complaints")
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Link: https://lkml.kernel.org/r/20180309065630.8283-1-boqun.feng@gmail.com
2018-03-09 11:06:16 +01:00
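The difference between the *_irq() and *_irq{save,restore}() variants that 6b0ef92fee above turns on can be modeled in user space with a single flag. This is a toy model only: the toy_* names are hypothetical, no real locking or interrupt masking takes place, and a bool stands in for the saved flags word.

```c
#include <stdbool.h>

/* Simulated per-CPU interrupt-enable state; purely illustrative. */
static bool toy_irqs_enabled = true;

void toy_set_irqs(bool on) { toy_irqs_enabled = on; }
bool toy_irqs_on(void) { return toy_irqs_enabled; }

/*
 * Models the raw_spin_lock_irq()/raw_spin_unlock_irq() pairing:
 * interrupts are unconditionally enabled on the way out, even if the
 * caller had them disabled on entry.
 */
void toy_unlock_path_irq(void)
{
	toy_irqs_enabled = false;	/* lock: disable interrupts */
	/* ... critical section ... */
	toy_irqs_enabled = true;	/* unlock: always enable */
}

/*
 * Models the irqsave/irqrestore pairing: the previous state is saved
 * on entry and restored on exit, so an irq-off caller stays irq-off.
 */
void toy_unlock_path_irqsave(void)
{
	bool flags = toy_irqs_enabled;	/* save the caller's state */

	toy_irqs_enabled = false;	/* lock: disable interrupts */
	/* ... critical section ... */
	toy_irqs_enabled = flags;	/* unlock: restore saved state */
}
```

Calling toy_unlock_path_irq() with the flag cleared leaves it set afterwards, which is the "accidentally set irq on" step in the sequence above; toy_unlock_path_irqsave() leaves the flag as it found it.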
Quentin Monnet
6d8cb045cd bpf: comment why dots in filenames under BPF virtual FS are not allowed
When pinning a file under the BPF virtual file system (traditionally
/sys/fs/bpf), using a dot in the name of the location to pin at is not
allowed. For example, trying to pin at "/sys/fs/bpf/foo.bar" will be
rejected with -EPERM.

This check was introduced at the same time as the BPF file system
itself, with commit b2197755b2 ("bpf: add support for persistent
maps/progs"). At that time, it was checked in a function called
"bpf_dname_reserved()", which made clear that using a dot was reserved
for future extensions.

This function disappeared and the check was moved elsewhere with commit
0c93b7d85d ("bpf: reject invalid names right in ->lookup()"), and the
meaning of the dot ban was lost.

The present commit simply adds a comment in the source to explain to the
reader that the usage of dots is reserved for future usage.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-09 10:30:30 +01:00
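The name rule that 6d8cb045cd above documents can be sketched as a small user space helper. This is an illustration only: toy_check_pin_name() is a hypothetical name, and it takes just the final path component rather than a kernel dentry.

```c
#include <errno.h>
#include <string.h>

/*
 * Return -EPERM if the final path component contains a dot, 0 otherwise.
 * Dots are treated as reserved for future extensions, as the commit
 * message explains.
 */
int toy_check_pin_name(const char *name)
{
	if (strchr(name, '.'))
		return -EPERM;
	return 0;
}
```

Under this rule a name such as "foo.bar" is rejected while "foo" is accepted.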
Song Liu
bd903afeb5 perf/core: Fix ctx_event_type in ctx_resched()
In ctx_resched(), EVENT_FLEXIBLE should be sched_out when EVENT_PINNED is
added. However, ctx_resched() calculates ctx_event_type before checking
this condition. As a result, pinned events will NOT get higher priority
than flexible events.

The following shows this issue on an Intel CPU (where ref-cycles can
only use one hardware counter).

  1. First start:
       perf stat -C 0 -e ref-cycles  -I 1000
  2. Then, in the second console, run:
       perf stat -C 0 -e ref-cycles:D -I 1000

The second perf uses pinned events, which are expected to have higher
priority. However, because of the failure in ctx_resched(), it is never
run.

This patch fixes this by calculating ctx_event_type after re-evaluating
event_type.

Reported-by: Ephraim Park <ephiepark@fb.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <jolsa@redhat.com>
Cc: <kernel-team@fb.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 487f05e18a ("perf/core: Optimize event rescheduling on active contexts")
Link: http://lkml.kernel.org/r/20180306055504.3283731-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 08:03:02 +01:00
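The ordering problem that bd903afeb5 above fixes can be sketched with two small functions. This is a simplified model, not the kernel code: the TOY_EVENT_* values and function names are assumptions made for illustration, and only the computation of the context mask is shown.

```c
/* Hypothetical flag values; the kernel's own constants may differ. */
enum toy_event_type {
	TOY_EVENT_FLEXIBLE = 0x1,
	TOY_EVENT_PINNED = 0x2,
	TOY_EVENT_ALL = TOY_EVENT_FLEXIBLE | TOY_EVENT_PINNED,
};

/*
 * Order before the fix: the context mask is derived first, so the
 * later widening of event_type never reaches it.
 */
unsigned int toy_ctx_type_before_fix(unsigned int event_type)
{
	unsigned int ctx_event_type = event_type & TOY_EVENT_ALL;

	if (event_type & TOY_EVENT_PINNED)
		event_type |= TOY_EVENT_FLEXIBLE;	/* too late to matter */
	(void)event_type;

	return ctx_event_type;
}

/*
 * Order after the fix: widen event_type first, then derive the
 * context mask from the re-evaluated value.
 */
unsigned int toy_ctx_type_after_fix(unsigned int event_type)
{
	if (event_type & TOY_EVENT_PINNED)
		event_type |= TOY_EVENT_FLEXIBLE;

	return event_type & TOY_EVENT_ALL;
}
```

For a pinned-only request, the first function yields a mask without the flexible bit, so flexible events are not scheduled out to make room; the second includes both bits.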
gaurav jindal
d17067e448 sched/completions: Use bool in try_wait_for_completion()
Since the return type of the function is bool, the internal
'ret' variable should be bool too.

Signed-off-by: Gaurav Jindal <gauravjindal1104@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180221125407.GA14292@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 08:00:18 +01:00
Vincent Guittot
31e77c93e4 sched/fair: Update blocked load when newly idle
When NEWLY_IDLE load balance is not triggered, we might need to update the
blocked load anyway. We can kick an ilb so an idle CPU will take care of
updating blocked load or we can try to update them locally before entering
idle. In the latter case, we reuse part of the nohz_idle_balance.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: brendan.jackman@arm.com
Cc: dietmar.eggemann@arm.com
Cc: morten.rasmussen@foss.arm.com
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1518622006-16089-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:28 +01:00
Peter Zijlstra
47ea54121e sched/fair: Move idle_balance()
We're going to want to call nohz_idle_balance() or parts thereof from
idle_balance(). Since we already have a forward declaration of
idle_balance() move it down such that it's below nohz_idle_balance()
avoiding the need for a forward declaration for that.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:25 +01:00
Peter Zijlstra
dd707247ab sched/nohz: Merge CONFIG_NO_HZ_COMMON blocks
Now that we have two back-to-back NO_HZ_COMMON blocks, merge them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:24 +01:00
Peter Zijlstra
af3fe03c56 sched/fair: Move rebalance_domains()
This pure code movement results in two #ifdef CONFIG_NO_HZ_COMMON
sections landing next to each other.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:23 +01:00
Peter Zijlstra
63928384fa sched/nohz: Optimize nohz_idle_balance()
Avoid calling update_blocked_averages() when the CPU does not in fact have
any blocked load, by re-using/extending update_nohz_stats().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:22 +01:00
Vincent Guittot
1936c53ce8 sched/fair: Reduce the periodic update duration
Instead of using cfs_rq_is_decayed(), which monitors all *_avg
and *_sum, we create cfs_rq_has_blocked(), which only takes care of
util_avg and load_avg. We are only interested in these 2 values, which
decay faster than the *_sum, so we can stop the periodic update earlier.
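
A minimal user-space sketch of the difference between the two checks. The struct and function names are hypothetical stand-ins, not the kernel's definitions; only the idea (test the two averages instead of every average and sum) is taken from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for the per-runqueue signals. */
struct avg_sketch {
	unsigned long load_avg;
	unsigned long util_avg;
	unsigned long long load_sum;
	unsigned long long util_sum;
};

/* Broad check: every *_avg and *_sum must have reached zero. */
static bool sketch_is_decayed(const struct avg_sketch *a)
{
	assert(a != NULL);
	return !a->load_avg && !a->util_avg && !a->load_sum && !a->util_sum;
}

/* Narrow check: only load_avg and util_avg count as blocked load. */
static bool sketch_has_blocked(const struct avg_sketch *a)
{
	assert(a != NULL);
	return a->load_avg || a->util_avg;
}
```

With both averages at zero but a residual sum left over, sketch_is_decayed() still reports "not decayed" while sketch_has_blocked() already reports "nothing blocked", which is the point at which the periodic update can stop.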

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: brendan.jackman@arm.com
Cc: dietmar.eggemann@arm.com
Cc: morten.rasmussen@foss.arm.com
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1518517879-2280-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:22 +01:00
Vincent Guittot
f643ea2207 sched/nohz: Stop NOHZ stats when decayed
Stop the periodic update of blocked load when all idle CPUs have fully
decayed. We introduce a new nohz.has_blocked that reflects whether some idle
CPUs have blocked load that has to be periodically updated. nohz.has_blocked
is set every time an idle CPU can have blocked load, and it is then cleared
when no more blocked load has been detected during an update. We don't need
atomic operations, but only need to make sure of the right ordering when
updating nohz.idle_cpus_mask and nohz.has_blocked.
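
The ordering requirement can be sketched in user space with C11 atomics. This is an illustration under assumptions, not the kernel code: idle_mask and has_blocked stand in for nohz.idle_cpus_mask and nohz.has_blocked, the sketch_* names and the halving "decay" are invented, and blocked_load[] is a plain array that is only safe here because the sketch is exercised single-threaded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS 8

static atomic_uint idle_mask;	/* stand-in for nohz.idle_cpus_mask */
static atomic_int has_blocked;	/* stand-in for nohz.has_blocked */
static unsigned long blocked_load[SKETCH_NR_CPUS];

/* A CPU going idle: publish it in the mask first, then raise the flag. */
static void sketch_enter_idle(int cpu, unsigned long load)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	blocked_load[cpu] = load;
	atomic_fetch_or_explicit(&idle_mask, 1u << cpu, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&has_blocked, 1, memory_order_relaxed);
}

/* Updater: clear the flag first, then scan the mask; re-raise if load remains. */
static bool sketch_update_blocked(void)
{
	bool remaining = false;
	unsigned int mask;

	atomic_store_explicit(&has_blocked, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	mask = atomic_load_explicit(&idle_mask, memory_order_relaxed);

	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		if (!(mask & (1u << cpu)))
			continue;
		blocked_load[cpu] /= 2;	/* crude decay step, illustration only */
		if (blocked_load[cpu])
			remaining = true;
	}
	if (remaining)
		atomic_store_explicit(&has_blocked, 1, memory_order_relaxed);
	return remaining;
}
```

Because the updater clears the flag before reading the mask and the idle path sets the mask before raising the flag, a CPU that goes idle concurrently is either seen in the scan or leaves the flag set for the next update.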

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: brendan.jackman@arm.com
Cc: dietmar.eggemann@arm.com
Cc: morten.rasmussen@foss.arm.com
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1518517879-2280-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:21 +01:00
Peter Zijlstra
ea14b57e8a sched/cpufreq: Provide migration hint
It was suggested that a migration hint might be useful for the
CPU-freq governors.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:20 +01:00
Peter Zijlstra
00357f5ec5 sched/nohz: Clean up nohz enter/exit
The primary observation is that nohz enter/exit is always from the
current CPU, therefore NOHZ_TICK_STOPPED does not in fact need to be
an atomic.

Secondary is that we appear to have 2 nearly identical hooks in the
nohz enter code, set_cpu_sd_state_idle() and
nohz_balance_enter_idle(). Fold the whole set_cpu_sd_state thing into
nohz_balance_{enter,exit}_idle.

Removes an atomic op from both enter and exit paths.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:19 +01:00
Peter Zijlstra
e022e0d38a sched/fair: Update blocked load from NEWIDLE
Since we already iterate CPUs looking for work on NEWIDLE, use this
iteration to age the blocked load. If the domain for which this is
done completely spans the idle set, we can push the ILB based aging
forward.

Suggested-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:19 +01:00
Peter Zijlstra
a4064fb614 sched/fair: Add NOHZ stats balancing
Teach the idle balancer about the need to update statistics which have
a different periodicity from regular balancing.

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:18 +01:00
Peter Zijlstra
4550487a99 sched/fair: Restructure nohz_balance_kick()
The current:

	if (nohz_kick_needed())
		nohz_balancer_kick()

is pointless complexity, fold them into a single call and avoid the
various conditions at the call site.

When we introduce multiple different needs to kick the ilb, the above
construct also becomes a problem.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:17 +01:00
Peter Zijlstra
b7031a02ec sched/fair: Add NOHZ_STATS_KICK
Split the NOHZ idle balancer into doing two separate actions:

 - update blocked load statistic

 - actually load-balance

Since the latter requires the former, ensure this happens. For now
always tag both bits at the same time.

Prepares for a future where we can toggle only the STATS bit.
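
A rough user-space sketch of the two-bit split using C11 atomics. The SKETCH_* names and helper functions are hypothetical; the only behavior taken from the text above is that a balance kick always carries the stats bit with it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_BALANCE_KICK (1u << 0)	/* actually load-balance */
#define SKETCH_STATS_KICK   (1u << 1)	/* update blocked load statistics */
#define SKETCH_KICK_MASK    (SKETCH_BALANCE_KICK | SKETCH_STATS_KICK)

/*
 * Request work from the idle balancer. Returns true if no kick was
 * pending before, i.e. this caller is the one who must notify the CPU.
 */
static bool sketch_kick(atomic_uint *flags, unsigned int bits)
{
	unsigned int old;

	assert((bits & ~SKETCH_KICK_MASK) == 0);
	/* Balancing requires fresh statistics, so tag both bits together. */
	if (bits & SKETCH_BALANCE_KICK)
		bits |= SKETCH_STATS_KICK;
	old = atomic_fetch_or(flags, bits);
	return !(old & SKETCH_KICK_MASK);
}

/* Target side: atomically collect and clear whatever was requested. */
static unsigned int sketch_take_kicks(atomic_uint *flags)
{
	return atomic_fetch_and(flags, ~SKETCH_KICK_MASK) & SKETCH_KICK_MASK;
}
```

Keeping both requests in one atomic word lets a later change set only the stats bit without touching the consumer side.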

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:16 +01:00
Peter Zijlstra
a22e47a4e3 sched/core: Convert nohz_flags to atomic_t
Using atomic_t allows us to use the more flexible bitops provided
there. Also it's smaller.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:16 +01:00
Peter Zijlstra
8f111bc357 cpufreq/schedutil: Rewrite CPUFREQ_RT support
Instead of trying to duplicate scheduler state to track if an RT task
is running, directly use the scheduler runqueue state for it.

This vastly simplifies things and fixes a number of bugs related to
sugov and the scheduler getting out of sync wrt this state.

As a consequence we now also update the remote cfs/dl state when
iterating the shared mask.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:15 +01:00
Peter Zijlstra
4042d003a0 cpufreq/schedutil: Remove unused CPUFREQ_DL
Bitrot...

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:14 +01:00
Norbert Manthey
13a453c241 sched/fair: Add ';' after label attributes
Due to using GCC defines for configuration, some labels might be unused in
certain configurations. While adding a __maybe_unused to the label is
fine in general, the line has to be terminated with ';'. This is also
reflected in the GCC documentation, but GCC parsed the previous variant
without an error message.

This has been spotted while compiling with goto-cc, the compiler for the
CPROVER tool suite.
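
For illustration, a small stand-alone function showing the accepted form: the attribute follows the label and is terminated by ';'. The function, label, and SKETCH_USE_FAST_PATH macro are made up for this sketch; the kernel spells the attribute __maybe_unused, which is written out here as the GCC attribute it is assumed to expand to.

```c
#include <assert.h>

/* The label is only jumped to when SKETCH_USE_FAST_PATH is defined. */
static int sketch_pick(int value)
{
	assert(value >= 0);
#ifdef SKETCH_USE_FAST_PATH
	if (value == 0)
		goto done;
#endif
	value += 1;

done: __attribute__((unused));	/* label attribute, then the terminating ';' */
	return value;
}
```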

Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
Signed-off-by: Michael Tautschnig <tautschn@amazon.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1519717660-16157-1-git-send-email-nmanthey@amazon.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:59:13 +01:00
Ingo Molnar
fc4c5a3828 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09 07:32:20 +01:00
Leon Yu
3f553b308b module: propagate error in modules_open()
Otherwise the kernel can oops later in seq_release() by dereferencing a NULL
file->private_data, which is only set if seq_open() succeeds.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: seq_release+0xc/0x30
Call Trace:
 close_pdeo+0x37/0xd0
 proc_reg_release+0x5d/0x60
 __fput+0x9d/0x1d0
 ____fput+0x9/0x10
 task_work_run+0x75/0x90
 do_exit+0x252/0xa00
 do_group_exit+0x36/0xb0
 SyS_exit_group+0xf/0x10
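
The failure mode can be sketched in user space. Everything below is a hypothetical stand-in (sketch_file, sketch_seq_open, sketch_modules_open), not the kernel's code; it only shows an open routine that hands the helper's error back to its caller instead of reporting success.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sketch_file { void *private_data; };

/* Stand-in for seq_open(): sets private_data only on success. */
static int sketch_seq_open(struct sketch_file *f, int simulate_failure)
{
	static int state;

	if (simulate_failure)
		return -ENOMEM;
	f->private_data = &state;
	return 0;
}

static int sketch_modules_open(struct sketch_file *f, int simulate_failure)
{
	int err = sketch_seq_open(f, simulate_failure);

	if (!err) {
		/* Only on success is it safe to use f->private_data. */
		assert(f->private_data != NULL);
	}
	return err;	/* propagate the failure to the caller */
}
```

If the error were swallowed, the caller would treat the file as open and a later release path would dereference the still-NULL private_data.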

Fixes: 516fb7f2e7 ("/proc/module: use the same logic as /proc/kallsyms for address exposure")
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org # 4.15+
Signed-off-by: Leon Yu <chianglungyu@gmail.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2018-03-08 21:58:51 +01:00
Linus Torvalds
e67548254b MIPS fixes for 4.16-rc5
A miscellaneous pile of MIPS fixes for 4.16:
  - Move put_compat_sigset() to evade hardened usercopy warnings (4.16)
  - Select ARCH_HAVE_PC_{SERIO,PARPORT} for Loongson64 platforms (4.16)
  - Fix kzalloc() failure handling in ath25 (3.19) and Octeon (4.0)
  - Fix disabling of IPIs during BMIPS suspend (3.19)

Merge tag 'mips_fixes_4.16_4' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips

Pull MIPS fixes from James Hogan:
 "A miscellaneous pile of MIPS fixes for 4.16:

   - move put_compat_sigset() to evade hardened usercopy warnings (4.16)

   - select ARCH_HAVE_PC_{SERIO,PARPORT} for Loongson64 platforms (4.16)

   - fix kzalloc() failure handling in ath25 (3.19) and Octeon (4.0)

   - fix disabling of IPIs during BMIPS suspend (3.19)"

* tag 'mips_fixes_4.16_4' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips:
  MIPS: BMIPS: Do not mask IPIs during suspend
  MIPS: Loongson64: Select ARCH_MIGHT_HAVE_PC_SERIO
  MIPS: Loongson64: Select ARCH_MIGHT_HAVE_PC_PARPORT
  signals: Move put_compat_sigset to compat.h to silence hardened usercopy
  MIPS: OCTEON: irq: Check for null return on kzalloc allocation
  MIPS: ath25: Check for kzalloc allocation failure
2018-03-08 10:03:12 -08:00
Borislav Petkov
5ad7510537 panic: Add closing panic marker parenthesis
Otherwise it looks unbalanced.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Link: https://lkml.kernel.org/r/20180306094920.16917-2-bp@alien8.de
2018-03-08 12:01:10 +01:00
Teng Qin
95da0cdb72 bpf: add support to read sample address in bpf program
This commit adds new field "addr" to bpf_perf_event_data which could be
read and used by bpf programs attached to perf events. The value of the
field is copied from bpf_perf_event_data_kern.addr and contains the
address value recorded by specifying sample_type with PERF_SAMPLE_ADDR
when calling perf_event_open.

Signed-off-by: Teng Qin <qinteng@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-08 02:22:34 +01:00
Oliver O'Halloran
167f5594b5 kernel/memremap: Remove stale devres_free() call
devm_memremap_pages() was re-worked in e8d5134833 "memremap: change
devm_memremap_pages interface to use struct dev_pagemap" to take a
caller allocated struct dev_pagemap as a function parameter. A call to
devres_free() was left in the error cleanup path which results in a
kernel panic if the remap fails for some reason. Remove it to fix the
panic and let devm_memremap_pages() fail gracefully.

Fixes: e8d5134833 ("memremap: change devm_memremap_pages interface...")
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-03-06 10:58:54 -08:00
Ingo Molnar
8af31363cd Linux 4.16-rc4
Merge tag 'v4.16-rc4' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-06 07:30:22 +01:00
David S. Miller
0f3e9c97eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All of the conflicts were cases of overlapping changes.

In net/core/devlink.c, we have to take care that the
resource size_params have become a struct member rather
than a pointer to such an object.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-06 01:20:46 -05:00
Linus Torvalds
547046141f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Use an appropriate TSQ pacing shift in mac80211, from Toke
    Høiland-Jørgensen.

 2) Just like ipv4's ip_route_me_harder(), we have to use skb_to_full_sk
    in ip6_route_me_harder, from Eric Dumazet.

 3) Fix several shutdown races and similar other problems in l2tp, from
    James Chapman.

 4) Handle missing XDP flush properly in tuntap, for real this time.
    From Jason Wang.

 5) Out-of-bounds access in powerpc ebpf tailcalls, from Daniel
    Borkmann.

 6) Fix phy_resume() locking, from Andrew Lunn.

 7) IFLA_MTU values are ignored on newlink for some tunnel types, fix
    from Xin Long.

 8) Revert F-RTO middle box workarounds, they only handle one dimension
    of the problem. From Yuchung Cheng.

 9) Fix socket refcounting in RDS, from Ka-Cheong Poon.

10) Don't allow ppp unit registration to an unregistered channel, from
    Guillaume Nault.

11) Various hv_netvsc fixes from Stephen Hemminger.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (98 commits)
  hv_netvsc: propagate rx filters to VF
  hv_netvsc: filter multicast/broadcast
  hv_netvsc: defer queue selection to VF
  hv_netvsc: use napi_schedule_irqoff
  hv_netvsc: fix race in napi poll when rescheduling
  hv_netvsc: cancel subchannel setup before halting device
  hv_netvsc: fix error unwind handling if vmbus_open fails
  hv_netvsc: only wake transmit queue if link is up
  hv_netvsc: avoid retry on send during shutdown
  virtio-net: re enable XDP_REDIRECT for mergeable buffer
  ppp: prevent unregistered channels from connecting to PPP units
  tc-testing: skbmod: fix match value of ethertype
  mlxsw: spectrum_switchdev: Check success of FDB add operation
  net: make skb_gso_*_seglen functions private
  net: xfrm: use skb_gso_validate_network_len() to check gso sizes
  net: sched: tbf: handle GSO_BY_FRAGS case in enqueue
  net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len
  rds: Incorrect reference counting in TCP socket creation
  net: ethtool: don't ignore return from driver get_fecparam method
  vrf: check forwarding on the original netdevice when generating ICMP dest unreachable
  ...
2018-03-05 11:29:24 -08:00
Linus Torvalds
4c4ce3022d Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "A small set of fixes from the timer department:

   - Add a missing timer wheel clock forward when migrating timers off an
     unplugged CPU to prevent operating on a stale clock base and
     missing timer deadlines.

   - Use the proper shift count to extract data from a register value to
     prevent evaluating unrelated bits

   - Make the error return check in the FSL timer driver work correctly.
     Checking an unsigned variable for less than zero does not really
     work well.

   - Clarify the confusing comments in the ARC timer code"
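
The unsigned-comparison pitfall mentioned in the list above, as a stand-alone sketch. The helper and its values are hypothetical and are not taken from the FSL driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper that reports failure as a negative errno. */
static int sketch_get_freq(bool fail)
{
	return fail ? -EINVAL : 1000;
}

/* Broken pattern: the negative value wraps to a huge unsigned number. */
static bool sketch_failed_unsigned(bool fail)
{
	unsigned long freq = sketch_get_freq(fail);

	return freq < 0;	/* can never be true for an unsigned type */
}

/* Working pattern: keep the result signed until it has been checked. */
static bool sketch_failed_signed(bool fail)
{
	long freq = sketch_get_freq(fail);

	return freq < 0;
}
```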

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers: Forward timer base before migrating timers
  clocksource/drivers/arc_timer: Update some comments
  clocksource/drivers/mips-gic-timer: Use correct shift count to extract data
  clocksource/drivers/fsl_ftm_timer: Fix error return checking
2018-03-04 11:34:49 -08:00
Ingo Molnar
14a7405b2e sched/core: Undefine tracepoint creation at the end of core.c
Make it easier to concatenate all the scheduler .c files for single-module
compilation.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-04 12:39:34 +01:00
Ingo Molnar
02d8ec9456 sched/deadline, rt: Rename queue_push_tasks/queue_pull_task to create separate namespace
There are similarly named functions in both of these modules:

  kernel/sched/deadline.c:static inline void queue_push_tasks(struct rq *rq)
  kernel/sched/deadline.c:static inline void queue_pull_task(struct rq *rq)
  kernel/sched/deadline.c:static inline void queue_push_tasks(struct rq *rq)
  kernel/sched/deadline.c:static inline void queue_pull_task(struct rq *rq)
  kernel/sched/deadline.c:	queue_push_tasks(rq);
  kernel/sched/deadline.c:	queue_pull_task(rq);
  kernel/sched/deadline.c:			queue_push_tasks(rq);
  kernel/sched/deadline.c:			queue_pull_task(rq);
  kernel/sched/rt.c:static inline void queue_push_tasks(struct rq *rq)
  kernel/sched/rt.c:static inline void queue_pull_task(struct rq *rq)
  kernel/sched/rt.c:static inline void queue_push_tasks(struct rq *rq)
  kernel/sched/rt.c:	queue_push_tasks(rq);
  kernel/sched/rt.c:	queue_pull_task(rq);
  kernel/sched/rt.c:			queue_push_tasks(rq);
  kernel/sched/rt.c:			queue_pull_task(rq);

... which makes it harder to grep for them. Prefix them with
deadline_ and rt_, respectively.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-04 12:39:34 +01:00
Ingo Molnar
a92057e14b sched/idle: Merge kernel/sched/idle.c and kernel/sched/idle_task.c
Merge these two small .c modules as they implement two aspects
of idle task handling.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-04 12:39:33 +01:00
Ingo Molnar
325ea10c08 sched/headers: Simplify and clean up header usage in the scheduler
Do the following cleanups and simplifications:

 - sched/sched.h already includes <asm/paravirt.h>, so no need to
   include it in sched/core.c again.

 - order the <linux/sched/*.h> headers alphabetically

 - add all <linux/sched/*.h> headers to kernel/sched/sched.h

 - remove all unnecessary includes from the .c files that
   are already included in kernel/sched/sched.h.

Finally, make all scheduler .c files use a single common header:

  #include "sched.h"

... which now contains a union of the relied upon headers.

This makes the various .c files easier to read and easier to handle.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-04 12:39:29 +01:00
Linus Torvalds
20f14172cb Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
 "A 4.16 regression fix, three fixes for -stable, and a cleanup fix:

   - During the merge window support for the new ACPI NVDIMM Platform
     Capabilities structure disabled support for "deep flush", a
     force-unit-access like mechanism for persistent memory. Restore
     that mechanism.

   - VFIO like RDMA is yet one more memory registration / pinning
     interface that is incompatible with Filesystem-DAX. Disable long
     term pins of Filesystem-DAX mappings via VFIO.

   - The Filesystem-DAX detection to prevent long term pins mistakenly
     also disabled Device-DAX pins which are not subject to the same
     block-map collision concerns.

   - Similar to the setup path, softlockup warnings can trigger in the
     shutdown path for large persistent memory namespaces. Teach
     for_each_device_pfn() to perform cond_resched() in all cases.

   - Boaz noticed that the might_sleep() in dax_direct_access() is stale
     as of the v4.15 kernel.

  These have received a build success notification from the 0day robot,
  and the longterm pin fixes have appeared in -next. However, I recently
  rebased the tree to remove some other fixes that need to be reworked
  after review feedback.

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  memremap: fix softlockup reports at teardown
  libnvdimm: re-enable deep flush for pmem devices via fsync()
  vfio: disable filesystem-dax page pinning
  dax: fix vma_is_fsdax() helper
  dax: ->direct_access does not sleep anymore
2018-03-03 14:32:00 -08:00
Ingo Molnar
97fb7a0a89 sched: Clean up and harmonize the coding style of the scheduler code base
A good number of small style inconsistencies have accumulated
in the scheduler core, so do a pass over them to harmonize
all these details:

 - fix speling in comments,

 - use curly braces for multi-line statements,

 - remove unnecessary parentheses from integer literals,

 - capitalize consistently,

 - remove stray newlines,

 - add comments where necessary,

 - remove invalid/unnecessary comments,

 - align structure definitions and other data types vertically,

 - add missing newlines for increased readability,

 - fix vertical tabulation where it's misaligned,

 - harmonize preprocessor conditional block labeling
   and vertical alignment,

 - remove line-breaks where they uglify the code,

 - add newline after local variable definitions,

No change in functionality:

  md5:
     1191fa0a890cfa8132156d2959d7e9e2  built-in.o.before.asm
     1191fa0a890cfa8132156d2959d7e9e2  built-in.o.after.asm

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-03 15:50:21 +01:00
Mario Leinweber
c2e513821d sched/deadline: Clean up various coding style details
- Fixed style error: Missing space before the open parenthesis
- Fixed style warnings: 2x Missing blank line after declaration

One warning left: else after return
 (I don't feel comfortable fixing that without side effects)

Signed-off-by: Mario Leinweber <marioleinweber@web.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20180302182007.28691-1-marioleinweber@web.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-03 15:50:20 +01:00
Dan Williams
949b93250a memremap: fix softlockup reports at teardown
The cond_resched() currently in the setup path needs to be duplicated in
the teardown path. Rather than require each instance of
for_each_device_pfn() to open code the same sequence, embed it in the
helper.
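
A user-space sketch of embedding the reschedule point in the iteration helper, so that every loop using the macro gets it for free. The sketch_* names, the counter, and the 1024-frame interval are assumptions made for illustration.

```c
#include <assert.h>

static unsigned long sketch_yields;	/* counts simulated reschedule points */

/* Stand-in for cond_resched(): just record that we would have yielded. */
static void sketch_cond_resched(void)
{
	sketch_yields++;
}

/* Advance the iterator, yielding periodically (interval is an assumption). */
static unsigned long sketch_pfn_next(unsigned long pfn)
{
	if (pfn % 1024 == 0)
		sketch_cond_resched();
	return pfn + 1;
}

#define sketch_for_each_pfn(pfn, first, end) \
	for ((pfn) = (first); (pfn) < (end); (pfn) = sketch_pfn_next(pfn))

/* Any walker, setup or teardown, now inherits the yield from the macro. */
static unsigned long sketch_walk(unsigned long first, unsigned long end)
{
	unsigned long pfn, visited = 0;

	assert(first <= end);
	sketch_for_each_pfn(pfn, first, end)
		visited++;
	return visited;
}
```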

Link: https://github.com/intel/ixpdimm_sw/issues/11
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org>
Fixes: 7138970383 ("mm, zone_device: Replace {get, put}_zone_device_page()...")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-03-02 19:34:50 -08:00
Matt Redfearn
fde9fc766e signals: Move put_compat_sigset to compat.h to silence hardened usercopy
Since commit afcc90f862 ("usercopy: WARN() on slab cache usercopy
region violations"), MIPS systems booting with a compat root filesystem
emit a warning when copying compat siginfo to userspace:

WARNING: CPU: 0 PID: 953 at mm/usercopy.c:81 usercopy_warn+0x98/0xe8
Bad or missing usercopy whitelist? Kernel memory exposure attempt
detected from SLAB object 'task_struct' (offset 1432, size 16)!
Modules linked in:
CPU: 0 PID: 953 Comm: S01logging Not tainted 4.16.0-rc2 #10
Stack : ffffffff808c0000 0000000000000000 0000000000000001 65ac85163f3bdc4a
	65ac85163f3bdc4a 0000000000000000 90000000ff667ab8 ffffffff808c0000
	00000000000003f8 ffffffff808d0000 00000000000000d1 0000000000000000
	000000000000003c 0000000000000000 ffffffff808c8ca8 ffffffff808d0000
	ffffffff808d0000 ffffffff80810000 fffffc0000000000 ffffffff80785c30
	0000000000000009 0000000000000051 90000000ff667eb0 90000000ff667db0
	000000007fe0d938 0000000000000018 ffffffff80449958 0000000020052798
	ffffffff808c0000 90000000ff664000 90000000ff667ab0 00000000100c0000
	ffffffff80698810 0000000000000000 0000000000000000 0000000000000000
	0000000000000000 0000000000000000 ffffffff8010d02c 65ac85163f3bdc4a
	...
Call Trace:
[<ffffffff8010d02c>] show_stack+0x9c/0x130
[<ffffffff80698810>] dump_stack+0x90/0xd0
[<ffffffff80137b78>] __warn+0x100/0x118
[<ffffffff80137bdc>] warn_slowpath_fmt+0x4c/0x70
[<ffffffff8021e4a8>] usercopy_warn+0x98/0xe8
[<ffffffff8021e68c>] __check_object_size+0xfc/0x250
[<ffffffff801bbfb8>] put_compat_sigset+0x30/0x88
[<ffffffff8011af24>] setup_rt_frame_n32+0xc4/0x160
[<ffffffff8010b8b4>] do_signal+0x19c/0x230
[<ffffffff8010c408>] do_notify_resume+0x60/0x78
[<ffffffff80106f50>] work_notifysig+0x10/0x18
---[ end trace 88fffbf69147f48a ]---

Commit 5905429ad8 ("fork: Provide usercopy whitelisting for
task_struct") noted that:

"While the blocked and saved_sigmask fields of task_struct are copied to
userspace (via sigmask_to_save() and setup_rt_frame()), it is always
copied with a static length (i.e. sizeof(sigset_t))."

However, this is not true in the case of compat signals, whose sigset
is copied by put_compat_sigset and receives size as an argument.

At most call sites, put_compat_sigset is copying a sigset from the
current task_struct. This triggers a warning when
CONFIG_HARDENED_USERCOPY is active. However, by marking this function as
static inline, the warning can be avoided because in all of these cases
the size is constant at compile time, which is allowed. The only site
where this is not the case is handling the rt_sigpending syscall, but
there the copy is being made from a stack local variable so does not
trigger the warning.

Move put_compat_sigset to compat.h, and mark it static inline. This
fixes the WARN on MIPS.

Fixes: afcc90f862 ("usercopy: WARN() on slab cache usercopy region violations")
Signed-off-by: Matt Redfearn <matt.redfearn@mips.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Dmitry V . Levin" <ldv@altlinux.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/18639/
Signed-off-by: James Hogan <jhogan@kernel.org>
2018-03-02 21:31:55 +00:00
David S. Miller
a5f7b0eeb2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-02-28

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Add schedule points and reduce the number of loop iterations
   the test_bpf kernel module is performing in order to not hog
   the CPU for too long, from Eric.

2) Fix an out of bounds access in tail calls in the ppc64 BPF
   JIT compiler, from Daniel.

3) Fix a crash on arm64 on unaligned BPF xadd operations that
   could be triggered via interpreter and JIT, from Daniel.

Please note that once you merge net into net-next at some point, there
is a minor merge conflict in test_verifier.c since test cases had
been added at the end in both trees. Resolution is trivial: keep all
the test cases from both trees.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-01 21:42:07 -05:00
Linus Torvalds
7bec4a9646 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk fix from Petr Mladek:
 "Make sure that we wake up userspace loggers. This fixes a race
  introduced by the console waiter logic during this merge window"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  printk: Wake klogd when passing console_lock owner
2018-03-01 10:06:39 -08:00
Lingutla Chandrasekhar
c52232a49e timers: Forward timer base before migrating timers
On CPU hotunplug the enqueued timers of the unplugged CPU are migrated to a
live CPU. This happens from the control thread which initiated the unplug.

If the CPU on which the control thread runs came out from a longer idle
period then the base clock of that CPU might be stale because the control
thread runs prior to any event which forwards the clock.

In such a case the timers from the unplugged CPU are queued on the live CPU
based on the stale clock which can cause large delays due to increased
granularity of the outer timer wheels which are far away from base->clk.

But there is a worse problem than that. The following sequence of events
illustrates it:

 - CPU0 timer1 is queued expires = 59969 and base->clk = 59131.

   The timer is queued at wheel level 2, with resulting expiry time = 60032
   (due to level granularity).

 - CPU1 enters idle @60007, with next timer expiry @60020.

 - CPU0 is hot-unplugged @60009

 - CPU1 exits idle and runs the control thread which migrates the
   timers from CPU0

   timer1 is now queued in level 0 for immediate handling in the next
   softirq because the requested expiry time 59969 is before CPU1 base->clk
   60007

 - CPU1 runs code which forwards the base clock which succeeds because the
   next expiring timer, which was collected at idle entry time, is still set
   to 60020.

   So it forwards beyond 60007 and therefore fails to expire the migrated
   timer1. That timer gets expired when the wheel wraps around again, which
   takes between 63 and 630ms depending on the HZ setting.

Address both problems by invoking forward_timer_base() for the control CPU's
timer base. All other places, which might run into a similar problem
(mod_timer()/add_timer_on()) already invoke forward_timer_base() to avoid
that.
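
The level-2 rounding in the example above can be reproduced with a small
user-space model of the wheel arithmetic. This is only a sketch: the constants
(a 3-bit clock shift per level, 64 buckets per level) are assumed to match the
timer wheel this commit targets, the function names are hypothetical, and
wrap-around handling is omitted.

```c
#include <assert.h>

/* Assumed wheel geometry: 64 buckets per level, each level 8x coarser. */
#define LVL_CLK_SHIFT 3
#define LVL_BITS      6
#define LVL_SIZE      (1UL << LVL_BITS)
#define LVL_DEPTH     9

/* First delta (in jiffies) that no longer fits into level n - 1. */
static unsigned long lvl_start(unsigned int n)
{
	return (LVL_SIZE - 1) << ((n - 1) * LVL_CLK_SHIFT);
}

/* Hypothetical helper: pick the wheel level for a timer 'delta' ticks away. */
static unsigned int wheel_level(unsigned long delta)
{
	unsigned int lvl;

	for (lvl = 0; lvl < LVL_DEPTH - 1; lvl++) {
		if (delta < lvl_start(lvl + 1))
			return lvl;
	}
	return LVL_DEPTH - 1;
}

/* Hypothetical helper: expiry after rounding up to the level granularity. */
static unsigned long wheel_expiry(unsigned long expires, unsigned int lvl)
{
	unsigned int shift = lvl * LVL_CLK_SHIFT;

	assert(lvl < LVL_DEPTH);
	/* Add one granule before truncating so the timer never fires early. */
	return ((expires + (1UL << shift)) >> shift) << shift;
}
```

With the numbers from the sequence above, wheel_level(59969 - 59131) selects
level 2 and wheel_expiry(59969, 2) yields 60032.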

[ tglx: Massaged comment and changelog ]

Fixes: a683f390b9 ("timers: Forward the wheel clock whenever possible")
Co-developed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: linux-arm-msm@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180118115022.6368-1-clingutla@codeaurora.org
2018-02-28 23:34:33 +01:00
Borislav Petkov
04860d48a8 locking/lockdep: Show unadorned pointers
Show unadorned pointers in lockdep reports - lockdep is a debugging
facility and hashing pointers there doesn't make a whole lotta sense.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20180226134926.23069-1-bp@alien8.de
2018-02-28 15:25:44 +01:00
Petr Mladek
c14376de3a printk: Wake klogd when passing console_lock owner
wake_klogd is a local variable in console_unlock(). The information
is lost when console_lock ownership is handed over using the busy wait
added by commit dbdda842fe ("printk: Add console owner and waiter
logic to load balance console writes"). The following race is
possible:

CPU0				CPU1
console_unlock()

  for (;;)
     /* calling console for last message */

				printk()
				  log_store()
				    log_next_seq++;

     /* see new message */
     if (seen_seq != log_next_seq) {
	wake_klogd = true;
	seen_seq = log_next_seq;
     }

     console_lock_spinning_enable();

				  if (console_trylock_spinning())
				     /* spinning */

     if (console_lock_spinning_disable_and_check()) {
	printk_safe_exit_irqrestore(flags);
	return;

				  console_unlock()
				    if (seen_seq != log_next_seq) {
				    /* already seen */
				    /* nothing to do */

Result: Nobody would wake up klogd.

One solution would be to make a global variable from wake_klogd.
But then we would need to manipulate it under a lock or so.

This patch wakes klogd also when console_lock is passed to the
spinning waiter. It looks like the right way to go. Also userspace
should have a chance to see and store any "flood" of messages.

Note that the very late klogd wake up was a historic solution.
It made sense on single CPU systems or when sys_syslog() operations
were synchronized using the big kernel lock like in v2.1.113.
But it is questionable these days.

Fixes: dbdda842fe ("printk: Add console owner and waiter logic to load balance console writes")
Link: http://lkml.kernel.org/r/20180226155734.dzwg3aovqnwtvkoy@pathway.suse.cz
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Cc: Tejun Heo <tj@kernel.org>
Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-02-27 10:25:50 +01:00
Linus Torvalds
85a2d939c0 Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
 "Yet another pile of melted spectrum related changes:

   - sanitize the array_index_nospec protection mechanism: Remove the
     overengineered array_index_nospec_mask_check() magic and allow
     const-qualified types as index to avoid temporary storage in a
     non-const local variable.

   - make the microcode loader more robust by properly propagating error
     codes. Provide information about new feature bits after microcode
     was updated so administrators can act upon it.

   - optimizations of the entry ASM code which reduce code footprint and
     make the code simpler and faster.

   - fix the {pmd,pud}_{set,clear}_flags() implementations to work
     properly on paravirt kernels by removing the address translation
     operations.

   - revert the harmful vmexit_fill_RSB() optimization

   - use IBRS around firmware calls

   - teach objtool about retpolines and add annotations for indirect
     jumps and calls.

   - explicitly disable jumplabel patching in __init code and handle
     patching failures properly instead of silently ignoring them.

   - remove indirect paravirt calls for writing the speculation control
     MSR as these calls obviously provide the same attack vector
     that is being mitigated.

   - a few small fixes which address build issues with recent compiler
     and assembler versions"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  KVM/VMX: Optimize vmx_vcpu_run() and svm_vcpu_run() by marking the RDMSR path as unlikely()
  KVM/x86: Remove indirect MSR op calls from SPEC_CTRL
  objtool, retpolines: Integrate objtool with retpoline support more closely
  x86/entry/64: Simplify ENCODE_FRAME_POINTER
  extable: Make init_kernel_text() global
  jump_label: Warn on failed jump_label patching attempt
  jump_label: Explicitly disable jump labels in __init code
  x86/entry/64: Open-code switch_to_thread_stack()
  x86/entry/64: Move ASM_CLAC to interrupt_entry()
  x86/entry/64: Remove 'interrupt' macro
  x86/entry/64: Move the switch_to_thread_stack() call to interrupt_entry()
  x86/entry/64: Move ENTER_IRQ_STACK from interrupt macro to interrupt_entry
  x86/entry/64: Move PUSH_AND_CLEAR_REGS from interrupt macro to helper function
  x86/speculation: Move firmware_restrict_branch_speculation_*() from C to CPP
  objtool: Add module specific retpoline rules
  objtool: Add retpoline validation
  objtool: Use existing global variables for options
  x86/mm/sme, objtool: Annotate indirect call in sme_encrypt_execute()
  x86/boot, objtool: Annotate indirect jump in secondary_startup_64()
  x86/paravirt, objtool: Annotate indirect calls
  ...
2018-02-26 09:34:21 -08:00
David S. Miller
ba6056a41c Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-02-26

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Various improvements for BPF kselftests: i) skip unprivileged tests
   when kernel.unprivileged_bpf_disabled sysctl knob is set, ii) count
   the number of skipped tests from unprivileged, iii) when a test case
   had an unexpected error then print the actual but also the unexpected
   one for better comparison, from Joe.

2) Add a sample program for collecting CPU state statistics with regards
   to how long the CPU resides in cstate and pstate levels. Based on
   cpu_idle and cpu_frequency trace points, from Leo.

3) Various x64 BPF JIT optimizations to further shrink the generated
   image size in order to make it more icache friendly. When tested on
   the Cilium generated programs, image size reduced by approx 4-5% in
   best case mainly due to how LLVM emits unsigned 32 bit constants,
   from Daniel.

4) Improvements and fixes on the BPF sockmap sample programs: i) fix
   the sockmap's Makefile to include nlattr.o for libbpf, ii) detach
   the sock ops programs from the cgroup before exit, from Prashant.

5) Avoid including xdp.h in filter.h by just forward declaring the
   struct xdp_rxq_info in filter.h, from Jesper.

6) Fix the BPF kselftests Makefile for cgroup_helpers.c by only declaring
   it a dependency for test_dev_cgroup.c but not every other test case
   where it is not needed, from Jesper.

7) Adjust rlimit RLIMIT_MEMLOCK for test_tcpbpf_user selftest since the
   default is insufficient for creating the 'global_map' used in the
   corresponding BPF program, from Yonghong.

8) Likewise, for the xdp_redirect sample, Tushar ran into the same when
   invoking xdp_redirect and xdp_monitor at the same time, therefore
   in order to have the sample generically work bump the limit here,
   too. Fix from Tushar.

9) Avoid an unnecessary NULL check in BPF_CGROUP_RUN_PROG_INET_SOCK()
   since sk is always guaranteed to be non-NULL, from Yafang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-26 10:37:24 -05:00
Linus Torvalds
c23a757591 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
 "A small set of fixes:

   - UAPI data type correction for hyperv

   - correct the cpu cores field in /proc/cpuinfo on CPU hotplug

   - return proper error code in the resctrl file system failure path to
     avoid silent subsequent failures

   - correct a subtle accounting issue in the new vector allocation code
     which went unnoticed for a while and caused suspend/resume
     failures"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/topology: Update the 'cpu cores' field in /proc/cpuinfo correctly across CPU hotplug operations
  x86/topology: Fix function name in documentation
  x86/intel_rdt: Fix incorrect returned value when creating rdgroup sub-directory in resctrl file system
  x86/apic/vector: Handle vector release on CPU unplug correctly
  genirq/matrix: Handle CPU offlining proper
  x86/headers/UAPI: Use __u64 instead of u64 in <uapi/asm/hyperv.h>
2018-02-25 16:58:55 -08:00
David S. Miller
f74290fdb3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-24 00:04:20 -05:00
Paul E. McKenney
338c46403f Merge branches 'fixes.2018.02.23a', 'srcu.2018.02.20a' and 'torture.2018.02.20a' into HEAD
fixes.2018.02.23a: Miscellaneous fixes
srcu.2018.02.20a: SRCU updates
torture.2018.02.20a: Torture-test updates
2018-02-23 15:15:41 -08:00
Paul E. McKenney
ad7c946b35 rcu: Create RCU-specific workqueues with rescuers
RCU's expedited grace periods can participate in out-of-memory deadlocks
due to all available system_wq kthreads being blocked and there not being
memory available to create more.  This commit prevents such deadlocks
by allocating an RCU-specific workqueue_struct at early boot time, and
providing it with a rescuer to ensure forward progress.  This uses the
shiny new init_rescuer() function provided by Tejun (but indirectly).

This commit also causes SRCU to use this new RCU-specific
workqueue_struct.  Note that SRCU's use of workqueues never blocks them
waiting for readers, so this should be safe from a forward-progress
viewpoint.  Note that this moves SRCU from system_power_efficient_wq
to a normal workqueue.  In the unlikely event that this results in
measurable degradation, a separate power-efficient workqueue will be
created for SRCU.

Reported-by: Prateek Sood <prsood@codeaurora.org>
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
2018-02-23 15:14:40 -08:00
Linus Torvalds
9cb9c07d6b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix TTL offset calculation in mac80211 mesh code, from Peter Oh.

 2) Fix races with procfs in ipt_CLUSTERIP, from Cong Wang.

 3) Memory leak fix in lpm_trie BPF map code, from Yonghong Song.

 4) Need to use GFP_ATOMIC in BPF cpumap allocations, from Jason Wang.

 5) Fix potential deadlocks in netfilter getsockopt() code paths, from
    Paolo Abeni.

 6) Netfilter stackpointer size checks really are needed to validate
    user input, from Florian Westphal.

 7) Missing timer init in x_tables, from Paolo Abeni.

 8) Don't use WQ_MEM_RECLAIM in mac80211 hwsim, from Johannes Berg.

 9) When an ibmvnic device is brought down then back up again, it can be
    sent queue entries from a previous session, handle this properly
    instead of crashing. From Thomas Falcon.

10) Fix TCP checksum on LRO buffers in mlx5e, from Gal Pressman.

11) When we are dumping filters in cls_api, the output SKB is empty, and
    the filter we are dumping is too large for the space in the SKB, we
    should return -EMSGSIZE like other netlink dump operations do.
    Otherwise userland has no signal that it needs to increase the size
    of its read buffer. From Roman Kapl.

12) Several XDP fixes for virtio_net, from Jesper Dangaard Brouer.

13) Module refcount leak in netlink when a dump start fails, from Jason
    Donenfeld.

14) Handle sub-optimal GSO sizes better in TCP BBR congestion control,
    from Eric Dumazet.

15) Releasing bpf per-cpu arraymaps can take a long time, add a
    conditional scheduling point. From Eric Dumazet.

16) Implement retpolines for tail calls in x64 and arm64 bpf JITs. From
    Daniel Borkmann.

17) Fix page leak in gianfar driver, from Andy Spencer.

18) Missed clearing of estimator scratch buffer, from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits)
  net_sched: gen_estimator: fix broken estimators based on percpu stats
  gianfar: simplify FCS handling and fix memory leak
  ipv6 sit: work around bogus gcc-8 -Wrestrict warning
  macvlan: fix use-after-free in macvlan_common_newlink()
  bpf, arm64: fix out of bounds access in tail call
  bpf, x64: implement retpoline for tail call
  rxrpc: Fix send in rxrpc_send_data_packet()
  net: aquantia: Fix error handling in aq_pci_probe()
  bpf: fix rcu lockdep warning for lpm_trie map_free callback
  bpf: add schedule points in percpu arrays management
  regulatory: add NUL to request alpha2
  ibmvnic: Fix early release of login buffer
  net/smc9194: Remove bogus CONFIG_MAC reference
  net: ipv4: Set addr_type in hash_keys for forwarded case
  tcp_bbr: better deal with suboptimal GSO
  smsc75xx: fix smsc75xx_set_features()
  netlink: put module reference if dump start fails
  selftests/bpf/test_maps: exit child process without error in ENOMEM case
  selftests/bpf: update gitignore with test_libbpf_open
  selftests/bpf: tcpbpf_kern: use in6_* macros from glibc
  ..
2018-02-23 15:14:17 -08:00
Linus Torvalds
2eb02aa94f Merge branch 'fixes-v4.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem fixes from James Morris:

 - keys fixes via David Howells:
      "A collection of fixes for Linux keyrings, mostly thanks to Eric
       Biggers:

        - Fix some PKCS#7 verification issues.

        - Fix handling of unsupported crypto in X.509.

        - Fix too-large allocation in big_key"

 - Seccomp updates via Kees Cook:
      "These are fixes for the get_metadata interface that landed during
       -rc1. While the new selftest is strictly not a bug fix, I think
       it's in the same spirit of avoiding bugs"

 - an IMA build fix from Randy Dunlap

* 'fixes-v4.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  integrity/security: fix digsig.c build error with header file
  KEYS: Use individual pages in big_key for crypto buffers
  X.509: fix NULL dereference when restricting key with unsupported_sig
  X.509: fix BUG_ON() when hash algorithm is unsupported
  PKCS#7: fix direct verification of SignerInfo signature
  PKCS#7: fix certificate blacklisting
  PKCS#7: fix certificate chain verification
  seccomp: add a selftest for get_metadata
  ptrace, seccomp: tweak get_metadata behavior slightly
  seccomp, ptrace: switch get_metadata types to arch independent
2018-02-23 15:04:24 -08:00
Daniel Borkmann
ca36960211 bpf: allow xadd only on aligned memory
The requirements around atomic_add() / atomic64_add() resp. their
JIT implementations differ across architectures. E.g. while x86_64
seems just fine with BPF's xadd on unaligned memory, on arm64 it
triggers via interpreter but also JIT the following crash:

  [  830.864985] Unable to handle kernel paging request at virtual address ffff8097d7ed6703
  [...]
  [  830.916161] Internal error: Oops: 96000021 [#1] SMP
  [  830.984755] CPU: 37 PID: 2788 Comm: test_verifier Not tainted 4.16.0-rc2+ #8
  [  830.991790] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.29 07/17/2017
  [  830.998998] pstate: 80400005 (Nzcv daif +PAN -UAO)
  [  831.003793] pc : __ll_sc_atomic_add+0x4/0x18
  [  831.008055] lr : ___bpf_prog_run+0x1198/0x1588
  [  831.012485] sp : ffff00001ccabc20
  [  831.015786] x29: ffff00001ccabc20 x28: ffff8017d56a0f00
  [  831.021087] x27: 0000000000000001 x26: 0000000000000000
  [  831.026387] x25: 000000c168d9db98 x24: 0000000000000000
  [  831.031686] x23: ffff000008203878 x22: ffff000009488000
  [  831.036986] x21: ffff000008b14e28 x20: ffff00001ccabcb0
  [  831.042286] x19: ffff0000097b5080 x18: 0000000000000a03
  [  831.047585] x17: 0000000000000000 x16: 0000000000000000
  [  831.052885] x15: 0000ffffaeca8000 x14: 0000000000000000
  [  831.058184] x13: 0000000000000000 x12: 0000000000000000
  [  831.063484] x11: 0000000000000001 x10: 0000000000000000
  [  831.068783] x9 : 0000000000000000 x8 : 0000000000000000
  [  831.074083] x7 : 0000000000000000 x6 : 000580d428000000
  [  831.079383] x5 : 0000000000000018 x4 : 0000000000000000
  [  831.084682] x3 : ffff00001ccabcb0 x2 : 0000000000000001
  [  831.089982] x1 : ffff8097d7ed6703 x0 : 0000000000000001
  [  831.095282] Process test_verifier (pid: 2788, stack limit = 0x0000000018370044)
  [  831.102577] Call trace:
  [  831.105012]  __ll_sc_atomic_add+0x4/0x18
  [  831.108923]  __bpf_prog_run32+0x4c/0x70
  [  831.112748]  bpf_test_run+0x78/0xf8
  [  831.116224]  bpf_prog_test_run_xdp+0xb4/0x120
  [  831.120567]  SyS_bpf+0x77c/0x1110
  [  831.123873]  el0_svc_naked+0x30/0x34
  [  831.127437] Code: 97fffe97 17ffffec 00000000 f9800031 (885f7c31)

The reason is that the memory is required to be aligned. In
case of BPF, we always enforce alignment in terms of stack access,
but not when accessing map values or packet data when the underlying
arch (e.g. arm64) has CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS set.

xadd on packet data that is local to us anyway is just wrong, so
forbid this case entirely. In fact, the only place where xadd makes sense
is map values; xadd on stack is wrong as well, but it's been
around for much longer. Specifically enforce strict alignment in case
of xadd, so that we handle this case generically and avoid such crashes
in the first place.
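
The strict alignment rule for xadd amounts to requiring that the access
address is a multiple of the operand size. A minimal user-space sketch of such
a check follows; xadd_is_aligned() is a hypothetical name and this is not the
verifier's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if an atomic add of 'size' bytes at 'addr'
 * is naturally aligned. 'size' must be a power of two (4 or 8 for xadd).
 */
static bool xadd_is_aligned(uint64_t addr, uint32_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return (addr & (uint64_t)(size - 1)) == 0;
}
```

The faulting address in the oops above, ffff8097d7ed6703, ends in 3, so a
4-byte xadd on it fails this check.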

Fixes: 17a5267067 ("bpf: verifier (add verifier core)")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-02-23 14:33:39 -08:00
Linus Torvalds
8961ca441b exynos, meson, ipuv3, secondary gpu, cirrus, edid quirk fixes

Merge tag 'drm-fixes-for-v4.16-rc3' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:
 "A bunch of fixes for rc3:

  Exynos:
   - fixes for using monotonic timestamps
   - register definitions
   - removal of unused file

  ipu-v3:
   - minor changes
   - make some register arrays const+static
   - fix some leaks

  meson:
   - fix for vsync

  atomic:
   - fix for memory leak

  EDID parser:
   - add quirks for some more non-desktop devices
   - 6-bit panel fix.

  drm_mm:
   - fix a bug in the core drm mm hole handling

  cirrus:
   - fix lut loading regression

  Lastly there is a deadlock fix around runtime suspend for secondary
  GPUs.

  There was a deadlock between one thread trying to wait for a workqueue
  job to finish in the runtime suspend path, and the workqueue job it
  was waiting for in turn waiting for a runtime_get_sync to return.

  The fix avoids it by not doing the runtime sync in the workqueue as
  then we always wait for all those tasks to complete before we runtime
  suspend"

* tag 'drm-fixes-for-v4.16-rc3' of git://people.freedesktop.org/~airlied/linux: (25 commits)
  drm/tve200: fix kernel-doc documentation comment include
  drm/edid: quirk Sony PlayStation VR headset as non-desktop
  drm/edid: quirk Windows Mixed Reality headsets as non-desktop
  drm/edid: quirk Oculus Rift headsets as non-desktop
  drm/meson: fix vsync buffer update
  drm: Handle unexpected holes in color-eviction
  drm: exynos: Use proper macro definition for HDMI_I2S_PIN_SEL_1
  drm/exynos: remove exynos_drm_rotator.h
  drm/exynos: g2d: Delete an error message for a failed memory allocation in two functions
  drm/exynos: fix comparison to bitshift when dealing with a mask
  drm/exynos: g2d: use monotonic timestamps
  drm/edid: Add 6 bpc quirk for CPT panel in Asus UX303LA
  gpu: ipu-csi: add 10/12-bit grayscale support to mbus_code_to_bus_cfg
  gpu: ipu-cpmem: add 16-bit grayscale support to ipu_cpmem_set_image
  gpu: ipu-v3: prg: fix device node leak in ipu_prg_lookup_by_phandle
  gpu: ipu-v3: pre: fix device node leak in ipu_pre_lookup_by_phandle
  drm/amdgpu: Fix deadlock on runtime suspend
  drm/radeon: Fix deadlock on runtime suspend
  drm/nouveau: Fix deadlock on runtime suspend
  drm: Allow determining if current task is output poll worker
  ...
2018-02-23 10:31:31 -08:00
Thomas Gleixner
651ca2c004 genirq/matrix: Handle CPU offlining proper
At CPU hotunplug the corresponding per cpu matrix allocator is shut down and
the allocated interrupt bits are discarded under the assumption that all
allocated bits have been either migrated away or shut down through the
managed interrupts mechanism.

This is not true because interrupts which are not started up might have a
vector allocated on the outgoing CPU. When the interrupt is started up
later or completely shutdown and freed then the allocated vector is handed
back, triggering warnings or causing accounting issues which result in
suspend failures and other issues.

Change the CPU hotplug mechanism of the matrix allocator so that the
remaining allocations at unplug time are preserved and global accounting at
hotplug is correctly readjusted to take the dormant vectors into account.

Fixes: 2f75d9e1c9 ("genirq: Implement bitmap matrix allocator")
Reported-by: Yuriy Vostrikov <delamonpansie@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Yuriy Vostrikov <delamonpansie@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180222112316.849980972@linutronix.de
2018-02-22 22:05:43 +01:00
Yonghong Song
6c5f61023c bpf: fix rcu lockdep warning for lpm_trie map_free callback
Commit 9a3efb6b66 ("bpf: fix memory leak in lpm_trie map_free callback function")
fixed a memory leak and removed unnecessary locks in map_free callback function.
Unfortunately, it introduced a lockdep warning. When lockdep checking is turned on,
running tools/testing/selftests/bpf/test_lpm_map will have:

  [   98.294321] =============================
  [   98.294807] WARNING: suspicious RCU usage
  [   98.295359] 4.16.0-rc2+ #193 Not tainted
  [   98.295907] -----------------------------
  [   98.296486] /home/yhs/work/bpf/kernel/bpf/lpm_trie.c:572 suspicious rcu_dereference_check() usage!
  [   98.297657]
  [   98.297657] other info that might help us debug this:
  [   98.297657]
  [   98.298663]
  [   98.298663] rcu_scheduler_active = 2, debug_locks = 1
  [   98.299536] 2 locks held by kworker/2:1/54:
  [   98.300152]  #0:  ((wq_completion)"events"){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0
  [   98.301381]  #1:  ((work_completion)(&map->work)){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0

Since actual trie tree removal happens only after no other
accesses to the tree are possible, replacing
  rcu_dereference_protected(*slot, lockdep_is_held(&trie->lock))
with
  rcu_dereference_protected(*slot, 1)
fixed the issue.

Fixes: 9a3efb6b66 ("bpf: fix memory leak in lpm_trie map_free callback function")
Reported-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-02-22 21:29:12 +01:00
Eric Dumazet
32fff239de bpf: add schedule points in percpu arrays management
syzbot managed to trigger RCU detected stalls in
bpf_array_free_percpu().

It takes time to allocate a huge percpu map, but even more time to free
it.

Since we run in process context, use cond_resched() to yield cpu if
needed.

Fixes: a10423b87a ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-02-22 21:27:06 +01:00
James Morris
645ae5c51e - Fix seccomp GET_METADATA to deal with field sizes correctly (Tycho Andersen)
- Add selftest to make sure GET_METADATA doesn't regress (Tycho Andersen)

Merge tag 'seccomp-v4.16-rc3' of https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux into fixes-v4.16-rc3

- Fix seccomp GET_METADATA to deal with field sizes correctly (Tycho Andersen)
- Add selftest to make sure GET_METADATA doesn't regress (Tycho Andersen)
2018-02-22 10:50:24 -08:00
Linus Torvalds
238ca35707 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "16 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: don't defer struct page initialization for Xen pv guests
  lib/Kconfig.debug: enable RUNTIME_TESTING_MENU
  vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems
  selftests/memfd: add run_fuse_test.sh to TEST_FILES
  bug.h: work around GCC PR82365 in BUG()
  mm/swap.c: make functions and their kernel-doc agree (again)
  mm/zpool.c: zpool_evictable: fix mismatch in parameter name and kernel-doc
  ida: do zeroing in ida_pre_get()
  mm, swap, frontswap: fix THP swap if frontswap enabled
  certs/blacklist_nohashes.c: fix const confusion in certs blacklist
  kernel/relay.c: limit kmalloc size to KMALLOC_MAX_SIZE
  mm, mlock, vmscan: no more skipping pagevecs
  mm: memcontrol: fix NR_WRITEBACK leak in memcg and system stats
  Kbuild: always define endianess in kconfig.h
  include/linux/sched/mm.h: re-inline mmdrop()
  tools: fix cross-compile var clobbering
2018-02-22 10:45:46 -08:00
Luck, Tony
bef3efbeb8 efivarfs: Limit the rate for non-root to read files
Each read from a file in efivarfs results in two calls to EFI
(one to get the file size, another to get the actual data).

On X86 these EFI calls result in broadcast system management
interrupts (SMI) which affect performance of the whole system.
A malicious user can loop performing reads from efivarfs bringing
the system to its knees.

Linus suggested per-user rate limit to solve this.

So we add a ratelimit structure to "user_struct" and initialize
it for the root user for no limit. When allocating user_struct for
other users we set the limit to 100 per second. This could be used
for other places that want to limit the rate of some detrimental
user action.

In efivarfs if the limit is exceeded when reading, we take an
interruptible nap for 50ms and check the rate limit again.
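
The per-user limit described above behaves like a fixed-window counter: up to
100 reads are admitted per one-second window, and a reader that exceeds it
sleeps and retries. The following user-space sketch models only the counting.
The struct and function names are hypothetical, time is passed in explicitly
as milliseconds and assumed to be monotonic, and this is not the kernel's
ratelimit implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-user limiter state. */
struct read_limit {
	uint64_t window_ms;    /* length of one window, e.g. 1000 */
	uint32_t burst;        /* reads admitted per window, e.g. 100 */
	uint64_t window_start; /* start of the current window */
	uint32_t used;         /* reads admitted in the current window */
	bool     started;      /* false until the first read */
};

/* Returns true if a read at time now_ms is admitted. */
static bool read_allowed(struct read_limit *rl, uint64_t now_ms)
{
	assert(rl->window_ms > 0);

	/* Open a fresh window on first use or once the old one has elapsed. */
	if (!rl->started || now_ms - rl->window_start >= rl->window_ms) {
		rl->started = true;
		rl->window_start = now_ms;
		rl->used = 0;
	}
	if (rl->used >= rl->burst)
		return false; /* caller would nap (50 ms in the commit) and retry */
	rl->used++;
	return true;
}

/* Counts how many of 'attempts' back-to-back reads at now_ms are admitted. */
static unsigned int count_allowed(struct read_limit *rl, uint64_t now_ms,
				  unsigned int attempts)
{
	unsigned int i, ok = 0;

	for (i = 0; i < attempts; i++)
		ok += read_allowed(rl, now_ms) ? 1 : 0;
	return ok;
}
```

In this model a reader that issues 150 reads at once gets 100 of them through,
is still refused after a 50 ms nap, and is admitted again once the one-second
window has elapsed.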

Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-22 10:21:02 -08:00
Tycho Andersen
63bb0045b9 ptrace, seccomp: tweak get_metadata behavior slightly
Previously if users passed a small size for the input structure size, they
would get odd behavior. It doesn't make sense to pass a structure
smaller than at least filter_off size, so let's just give -EINVAL in this
case.
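
A sketch of the size validation this describes, written as a user-space
function so it can be exercised in isolation. The struct layout (two 64-bit
fields, filter_off first) is an assumption about the metadata structure, and
metadata_copy_size() is a hypothetical name rather than the kernel's function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of the metadata structure; filter_off comes first. */
struct metadata {
	uint64_t filter_off;
	uint64_t flags;
};

/*
 * Hypothetical helper: given the size the caller passed in, return how many
 * bytes to copy back, or -EINVAL if the buffer cannot even hold filter_off.
 */
static long metadata_copy_size(unsigned long user_size)
{
	/* The sketch relies on filter_off being the first field. */
	assert(offsetof(struct metadata, filter_off) == 0);

	/* Never copy more than the structure known on this side. */
	if (user_size > sizeof(struct metadata))
		user_size = sizeof(struct metadata);

	if (user_size < sizeof(((struct metadata *)0)->filter_off))
		return -EINVAL;

	return (long)user_size;
}
```

A 4-byte buffer is rejected with -EINVAL, an 8-byte buffer receives only
filter_off, and anything larger than the structure is clamped to its size.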

This changes userspace visible behavior, but was only introduced in commit
26500475ac ("ptrace, seccomp: add support for retrieving seccomp
metadata") in 4.16-rc2, so should be safe to change if merged before then.

Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-02-21 16:56:03 -08:00
David Rientjes
88913bd8ea kernel/relay.c: limit kmalloc size to KMALLOC_MAX_SIZE
chan->n_subbufs is set by the user and relay_create_buf() does a kmalloc()
of chan->n_subbufs * sizeof(size_t *).

kmalloc_slab() will generate a warning when this fails if
chan->n_subbufs * sizeof(size_t *) > KMALLOC_MAX_SIZE.

Limit chan->n_subbufs to the maximum allowed kmalloc() size.
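
The bound can be checked without risking multiplication overflow by dividing
the allocation limit by the element size. A minimal sketch, assuming a
caller-supplied limit in place of KMALLOC_MAX_SIZE (whose value depends on the
kernel configuration); subbuf_count_ok() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: true if an array of n_subbufs pointers fits in
 * max_alloc bytes. Dividing instead of multiplying avoids overflow.
 */
static bool subbuf_count_ok(size_t n_subbufs, size_t max_alloc)
{
	assert(max_alloc >= sizeof(size_t *));
	return n_subbufs <= max_alloc / sizeof(size_t *);
}
```

A caller would refuse to create the buffer when this returns false instead of
passing the oversized request on to the allocator.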

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802061216100.122576@chino.kir.corp.google.com
Fixes: f6302f1bcd ("relay: prevent integer overflow in relay_open()")
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-21 15:35:43 -08:00
Andrew Morton
d34bc48f82 include/linux/sched/mm.h: re-inline mmdrop()
As Peter points out, doing a CALL+RET for just the decrement is a bit silly.

Fixes: d70f2a14b7 ("include/linux/sched/mm.h: uninline mmdrop_async(), etc")
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-21 15:35:42 -08:00
Dave Airlie
dfe8db2237 Fixes for 4.16. It contains fixes for deadlock on runtime suspend on a few
drivers, a memory leak on non-blocking commits, a crash on color-eviction.
There are also meson and edid fixes, plus a fix for a doc warning.

Merge tag 'drm-misc-fixes-2018-02-21' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

* tag 'drm-misc-fixes-2018-02-21' of git://anongit.freedesktop.org/drm/drm-misc:
  drm/tve200: fix kernel-doc documentation comment include
  drm/meson: fix vsync buffer update
  drm: Handle unexpected holes in color-eviction
  drm/edid: Add 6 bpc quirk for CPT panel in Asus UX303LA
  drm/amdgpu: Fix deadlock on runtime suspend
  drm/radeon: Fix deadlock on runtime suspend
  drm/nouveau: Fix deadlock on runtime suspend
  drm: Allow determining if current task is output poll worker
  workqueue: Allow retrieval of current task's work struct
  drm/atomic: Fix memleak on ERESTARTSYS during non-blocking commits
2018-02-22 08:39:26 +10:00
David S. Miller
bf006d18b7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-02-20

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a memory leak in LPM trie's map_free() callback function, where
   the trie structure itself was not freed since initial implementation.
   Also a synchronize_rcu() was needed in order to wait for outstanding
   programs accessing the trie to complete, from Yonghong.

2) Fix sock_map_alloc()'s error path in order to correctly propagate
   the -EINVAL error in case of too large allocation requests. This
   was just recently introduced when fixing close hooks via ULP layer,
   fix from Eric.

3) Do not use GFP_ATOMIC in __cpu_map_entry_alloc(). Reason is that this
   will not work with the recent __ptr_ring_init_queue_alloc() conversion
   to kvmalloc_array(), where in case of fallback to vmalloc() that GFP
   flag is invalid, from Jason.

4) Fix two recent syzkaller warnings: i) fix bpf_prog_array_copy_to_user()
   when a prog query with a big number of ids was performed where we'd
   otherwise trigger a warning from allocator side, ii) fix a missing
   mlock precharge on arraymaps, from Daniel.

5) Two fixes for bpftool in order to avoid breaking JSON output when used
   in batch mode, from Quentin.

6) Move a pr_debug() in libbpf in order to avoid having an otherwise
   uninitialized variable in bpf_program__reloc_text(), from Jeremy.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21 15:37:37 -05:00
Tejun Heo
d1897c9538 cgroup: fix rule checking for threaded mode switching
A domain cgroup isn't allowed to be turned threaded if its subtree is
populated or domain controllers are enabled.  cgroup_enable_threaded()
depended on cgroup_can_be_thread_root() test to enforce this rule.  A
parent which has populated domain descendants or has domain
controllers enabled can't become a thread root, so the above rules are
enforced automatically.

However, for the root cgroup which can host mixed domain and threaded
children, cgroup_can_be_thread_root() doesn't check any of those
conditions, and thus first-level cgroups end up escaping those rules.

This patch fixes the bug by adding explicit checks for those rules in
cgroup_enable_threaded().

Reported-by: Michael Kerrisk (man-pages) <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 8cfd8147df ("cgroup: implement cgroup v2 thread support")
Cc: stable@vger.kernel.org # v4.14+
2018-02-21 11:39:22 -08:00
Josh Poimboeuf
9fbcc57aa1 extable: Make init_kernel_text() global
Convert init_kernel_text() to a global function and use it in a few
places instead of manually comparing _sinittext and _einittext.

Note that kallsyms.h has a very similar function called
is_kernel_inittext(), but its end check is inclusive.  I'm not sure
whether that's intentional behavior, so I didn't touch it.
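
The difference between the two end checks can be shown with a pair of range helpers. Both functions are hypothetical illustrations; they only contrast a half-open range with the inclusive end check that the note attributes to is_kernel_inittext().

```c
#include <stdbool.h>
#include <stdint.h>

/* Half-open range [start, end): the end address itself is outside. */
static bool in_range_exclusive(uintptr_t addr, uintptr_t start, uintptr_t end)
{
	return addr >= start && addr < end;
}

/* Closed range [start, end]: the end address itself is inside. */
static bool in_range_inclusive(uintptr_t addr, uintptr_t start, uintptr_t end)
{
	return addr >= start && addr <= end;
}
```

The two helpers disagree only for an address exactly equal to the end marker.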

Suggested-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/4335d02be8d45ca7d265d2f174251d0b7ee6c5fd.1519051220.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 16:54:06 +01:00
Josh Poimboeuf
dc1dd184c2 jump_label: Warn on failed jump_label patching attempt
Currently when the jump label code encounters an address which isn't
recognized by kernel_text_address(), it just silently fails.

This can be dangerous because jump labels are used in a variety of
places, and are generally expected to work.  Convert the silent failure
to a warning.

This won't warn about attempted writes to tracepoints in __init code
after initmem has been freed, as those are already guarded by the
entry->code check.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/de3a271c93807adb7ed48f4e946b4f9156617680.1519051220.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 16:54:06 +01:00
Josh Poimboeuf
3335224470 jump_label: Explicitly disable jump labels in __init code
After initmem has been freed, any jump labels in __init code are
prevented from being written to by the kernel_text_address() check in
__jump_label_update().  However, this check is quite broad.  If
kernel_text_address() were to return false for any other reason, the
jump label write would fail silently with no warning.

For jump labels in module init code, entry->code is set to zero to
indicate that the entry is disabled.  Do the same thing for core kernel
init code.  This makes the behavior more consistent, and will also make
it more straightforward to detect non-init jump label write failures in
the next patch.
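
A simplified model of the scheme, assuming a flat array of entries. `struct entry_sketch`, `invalidate_init_entries()` and `update_entries()` are hypothetical names, and the address ranges stand in for the kernel's real text checks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a jump label entry. */
struct entry_sketch {
	uintptr_t code; /* patch site address; 0 means "disabled" */
	bool enabled;   /* models the patched state of the site */
};

/* Disable every entry whose patch site lies in [init_start, init_end). */
static void invalidate_init_entries(struct entry_sketch *e, size_t n,
				    uintptr_t init_start, uintptr_t init_end)
{
	for (size_t i = 0; i < n; i++)
		if (e[i].code >= init_start && e[i].code < init_end)
			e[i].code = 0;
}

/*
 * Update all live entries. Disabled entries are skipped quietly; a live
 * entry outside [text_start, text_end) is counted as a warning instead
 * of being ignored. Returns the number of such warnings.
 */
static size_t update_entries(struct entry_sketch *e, size_t n,
			     uintptr_t text_start, uintptr_t text_end,
			     bool enable)
{
	size_t warnings = 0;

	for (size_t i = 0; i < n; i++) {
		if (!e[i].code)
			continue;
		if (e[i].code < text_start || e[i].code >= text_end) {
			warnings++;
			continue;
		}
		e[i].enabled = enable;
	}
	return warnings;
}
```

Because init entries are zeroed explicitly, any remaining address that fails the text check is unexpected and can be reported rather than dropped silently.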

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/c52825c73f3a174e8398b6898284ec20d4deb126.1519051220.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 16:54:05 +01:00
Frederic Weisbecker
dcdedb2415 sched/nohz: Remove the 1 Hz tick code
Now that the 1Hz tick is offloaded to workqueues, we can safely remove
the residual code that used to handle it locally.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-7-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 09:49:09 +01:00
Frederic Weisbecker
d84b31313e sched/isolation: Offload residual 1Hz scheduler tick
When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
keep the scheduler stats alive. However this residual tick is a burden
for bare metal tasks that can't stand any interruption at all, or want
to minimize them.

The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
outsource these scheduler ticks to the global workqueue so that a
housekeeping CPU handles those remotely. The sched_class::task_tick()
implementations have been audited and look safe to be called remotely
as the target runqueue and its current task are passed as parameters
and don't seem to be accessed locally.

Note that in the case of using isolcpus, it's still up to the user to
affine the global workqueues to the housekeeping CPUs through
/sys/devices/virtual/workqueue/cpumask or domains isolation
"isolcpus=nohz,domain".

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-6-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 09:49:09 +01:00
Frederic Weisbecker
1bda3f8087 sched/isolation: Isolate workqueues when "nohz_full=" is set
As we prepare for offloading the residual 1Hz scheduler ticks to
workqueues, let's affine those to housekeepers so that they don't
interrupt the CPUs that don't want to be disturbed.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-5-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 09:49:08 +01:00
Frederic Weisbecker
22ab8bc02a nohz: Allow to check if remote CPU tick is stopped
This check is racy but provides a good heuristic to determine whether
a CPU may need a remote tick or not.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-4-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 09:49:08 +01:00
Frederic Weisbecker
a364298359 nohz: Convert tick_nohz_tick_stopped() to bool
It makes this function more self-explanatory about what it does and how
to use it.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-3-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 09:49:08 +01:00
Frederic Weisbecker
77a021be38 sched/core: Rename init_rq_hrtick() to hrtick_rq_init()
Do that rename in order to normalize the hrtick namespace.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-2-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 09:49:07 +01:00
Mel Gorman
7347fc87df sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()
If wake_affine() pulls a task to another node for any reason and the node is
no longer preferred then temporarily stop automatic NUMA balancing pulling
the task back. Otherwise, tasks with a strong waker/wakee relationship
may constantly fight automatic NUMA balancing over where a task should
be placed.

Once again netperf is interesting here. The performance barely changes
but automatic NUMA balancing is interesting:

 Hmean     send-64         354.67 (   0.00%)      352.15 (  -0.71%)
 Hmean     send-128        702.91 (   0.00%)      693.84 (  -1.29%)
 Hmean     send-256       1350.07 (   0.00%)     1344.19 (  -0.44%)
 Hmean     send-1024      5124.38 (   0.00%)     4941.24 (  -3.57%)
 Hmean     send-2048      9687.44 (   0.00%)     9624.45 (  -0.65%)
 Hmean     send-3312     14577.64 (   0.00%)    14514.35 (  -0.43%)
 Hmean     send-4096     16393.62 (   0.00%)    16488.30 (   0.58%)
 Hmean     send-8192     26877.26 (   0.00%)    26431.63 (  -1.66%)
 Hmean     send-16384    38683.43 (   0.00%)    38264.91 (  -1.08%)
 Hmean     recv-64         354.67 (   0.00%)      352.15 (  -0.71%)
 Hmean     recv-128        702.91 (   0.00%)      693.84 (  -1.29%)
 Hmean     recv-256       1350.07 (   0.00%)     1344.19 (  -0.44%)
 Hmean     recv-1024      5124.38 (   0.00%)     4941.24 (  -3.57%)
 Hmean     recv-2048      9687.43 (   0.00%)     9624.45 (  -0.65%)
 Hmean     recv-3312     14577.59 (   0.00%)    14514.35 (  -0.43%)
 Hmean     recv-4096     16393.55 (   0.00%)    16488.20 (   0.58%)
 Hmean     recv-8192     26876.96 (   0.00%)    26431.29 (  -1.66%)
 Hmean     recv-16384    38682.41 (   0.00%)    38263.94 (  -1.08%)

 NUMA alloc hit                 1465986     1423090
 NUMA alloc miss                      0           0
 NUMA interleave hit                  0           0
 NUMA alloc local               1465897     1423003
 NUMA base PTE updates             1473        1420
 NUMA huge PMD updates                0           0
 NUMA page range updates           1473        1420
 NUMA hint faults                  1383        1312
 NUMA hint local faults             451         124
 NUMA hint local percent             32           9

There is a slight degradation in performance but there are slightly fewer
NUMA faults. There is a large drop in the percentage of local faults but
the bulk of migrations for netperf are in small shared libraries so it's
reflecting the fact that automatic NUMA balancing has backed off. This is
a case where, despite wake_affine() and automatic NUMA balancing fighting
for placement, there is a marginal benefit to rescheduling to local
data quickly. However, it should be noted that wake_affine() and automatic
NUMA balancing fighting each other constantly is undesirable.
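
For reading the "NUMA hint local percent" row: the reported values are consistent with local faults divided by total hint faults, truncated to an integer. The helper below is a hypothetical sketch of that arithmetic, not the reporting tool's code.

```c
/*
 * Percentage of hint faults that were local, truncated toward zero.
 * Returns 0 when there were no hint faults at all.
 */
static unsigned long local_percent(unsigned long local_faults,
				   unsigned long hint_faults)
{
	if (!hint_faults)
		return 0;
	return local_faults * 100 / hint_faults;
}
```

For the netperf run above, 451 of 1383 gives 32 and 124 of 1312 gives 9, matching the table.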

However, the benefit in other cases is large. This is the result for NAS
with the D class sizing on a 4-socket machine:

 nas-mpi
                           4.15.0                 4.15.0
                     sdnuma-v1r23       delayretry-v1r23
 Time cg.D      557.00 (   0.00%)      431.82 (  22.47%)
 Time ep.D       77.83 (   0.00%)       79.01 (  -1.52%)
 Time is.D       26.46 (   0.00%)       26.64 (  -0.68%)
 Time lu.D      727.14 (   0.00%)      597.94 (  17.77%)
 Time mg.D      191.35 (   0.00%)      146.85 (  23.26%)

                  4.15.0            4.15.0
            sdnuma-v1r23  delayretry-v1r23
 User           75665.20          70413.30
 System         20321.59           8861.67
 Elapsed          766.13            634.92

 Minor Faults                  16528502     7127941
 Major Faults                      4553        5068
 NUMA alloc local               6963197     6749135
 NUMA base PTE updates        366409093   107491434
 NUMA huge PMD updates           687556      198880
 NUMA page range updates      718437765   209317994
 NUMA hint faults              13643410     4601187
 NUMA hint local faults         9212593     3063996
 NUMA hint local percent             67          66

Note the massive reduction in system CPU usage even though the percentage
of local faults is barely affected. There is a massive reduction in the
number of PTE updates showing that automatic NUMA balancing has backed off.
A critical observation is also that there is a massive reduction in minor
faults which is due to far fewer NUMA hinting faults being trapped.
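
The parenthesised percentages in the "Time" rows read as the relative reduction against the baseline column, so positive means faster. A small sketch of that arithmetic, with `time_gain_percent()` as a hypothetical helper name:

```c
/*
 * Relative improvement of a lower-is-better metric, in percent.
 * Positive means the patched run took less time than the baseline.
 */
static double time_gain_percent(double baseline, double patched)
{
	return (baseline - patched) / baseline * 100.0;
}
```

For cg.D above, (557.00 - 431.82) / 557.00 is about 22.47%, and for ep.D the result is about -1.52%, a small regression.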

There were questions on NAS OMP and how it behaved related to threads
being bound to CPUs. First, there are more gains than losses with this
patch applied and a reduction in system CPU usage:

nas-omp
                      4.16.0-rc1             4.16.0-rc1
                     sdnuma-v2r1        delayretry-v2r1
Time bt.D      436.71 (   0.00%)      430.05 (   1.53%)
Time cg.D      201.02 (   0.00%)      180.87 (  10.02%)
Time ep.D       32.84 (   0.00%)       32.68 (   0.49%)
Time is.D        9.63 (   0.00%)        9.64 (  -0.10%)
Time lu.D      331.20 (   0.00%)      304.80 (   7.97%)
Time mg.D       54.87 (   0.00%)       52.72 (   3.92%)
Time sp.D     1108.78 (   0.00%)      917.10 (  17.29%)
Time ua.D      378.81 (   0.00%)      398.83 (  -5.28%)

             4.16.0-rc1       4.16.0-rc1
            sdnuma-v2r1  delayretry-v2r1
User          305633.08        296751.91
System           451.75           357.80
Elapsed         2595.73          2368.13

However, it does not close the gap between binding and being unbound. There
is negligible difference between the performance of the baseline and a
patched kernel when threads are bound so it is not presented here:

                      4.16.0-rc1             4.16.0-rc1
                 delayretry-bind     delayretry-unbound
Time bt.D      385.02 (   0.00%)      430.05 ( -11.70%)
Time cg.D      144.02 (   0.00%)      180.87 ( -25.59%)
Time ep.D       32.85 (   0.00%)       32.68 (   0.52%)
Time is.D       10.52 (   0.00%)        9.64 (   8.37%)
Time lu.D      285.31 (   0.00%)      304.80 (  -6.83%)
Time mg.D       43.21 (   0.00%)       52.72 ( -22.01%)
Time sp.D      820.24 (   0.00%)      917.10 ( -11.81%)
Time ua.D      337.09 (   0.00%)      398.83 ( -18.32%)

               4.16.0-rc1          4.16.0-rc1
          delayretry-bind  delayretry-unbound
User            277731.25           296751.91
System             261.29              357.80
Elapsed           2100.55             2368.13

Unfortunately, while performance is improved by the patch, there is still
quite a long way to go before it's equivalent to hard binding.

Other workloads like hackbench, tbench, dbench and schbench are barely
affected. dbench shows a mix of gains and losses depending on the machine
although in general, the results are more stable.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-7-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 08:49:45 +01:00
Mel Gorman
2c83362734 sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on
find_idlest_group() compares a local group with each other group to select
the one that is most idle. When comparing groups in different NUMA domains,
a very slight imbalance is enough to select a remote NUMA node even if the
runnable load on both groups is 0 or close to 0. This ignores the cost of
remote accesses entirely and is a problem when selecting the CPU for a
newly forked task to run on.  This is problematic when a forking server
is almost guaranteed to run on a remote node incurring numerous remote
accesses and potentially causing automatic NUMA balancing to try to migrate
the task back or migrate the data to another node. Similar weirdness is
observed if a basic shell command pipes output to another as each process
in the pipeline is likely to start on different nodes and then get adjusted
later by wake_affine().

This patch adds imbalance to remote domains when considering whether to
select CPUs from remote domains. If the local domain is selected, imbalance
will still be used to try to select a CPU from a lower scheduler domain's group
instead of stacking tasks on the same CPU.

A variety of workloads and machines were tested and as expected, there is no
difference on UMA. The difference on NUMA can be dramatic. This is a comparison
of elapsed times running the git regression test suite. It's fork-intensive with
short-lived processes:

                                  4.15.0                 4.15.0
                            noexit-v1r23           sdnuma-v1r23
 Elapsed min          1706.06 (   0.00%)     1435.94 (  15.83%)
 Elapsed mean         1709.53 (   0.00%)     1436.98 (  15.94%)
 Elapsed stddev          2.16 (   0.00%)        1.01 (  53.38%)
 Elapsed coeffvar        0.13 (   0.00%)        0.07 (  44.54%)
 Elapsed max          1711.59 (   0.00%)     1438.01 (  15.98%)

               4.15.0      4.15.0
         noexit-v1r23 sdnuma-v1r23
 User         5434.12     5188.41
 System       4878.77     3467.09
 Elapsed     10259.06     8624.21

That shows a considerable reduction in elapsed times. It's important to
note that automatic NUMA balancing does not affect this load as processes
are too short-lived.

There is also a noticeable impact on hackbench such as this example using
processes and pipes:

 hackbench-process-pipes
                               4.15.0                 4.15.0
                         noexit-v1r23           sdnuma-v1r23
 Amean     1        1.0973 (   0.00%)      0.9393 (  14.40%)
 Amean     4        1.3427 (   0.00%)      1.3730 (  -2.26%)
 Amean     7        1.4233 (   0.00%)      1.6670 ( -17.12%)
 Amean     12       3.0250 (   0.00%)      3.3013 (  -9.13%)
 Amean     21       9.0860 (   0.00%)      9.5343 (  -4.93%)
 Amean     30      14.6547 (   0.00%)     13.2433 (   9.63%)
 Amean     48      22.5447 (   0.00%)     20.4303 (   9.38%)
 Amean     79      29.2010 (   0.00%)     26.7853 (   8.27%)
 Amean     110     36.7443 (   0.00%)     35.8453 (   2.45%)
 Amean     141     45.8533 (   0.00%)     42.6223 (   7.05%)
 Amean     172     55.1317 (   0.00%)     50.6473 (   8.13%)
 Amean     203     64.4420 (   0.00%)     58.3957 (   9.38%)
 Amean     234     73.2293 (   0.00%)     67.1047 (   8.36%)
 Amean     265     80.5220 (   0.00%)     75.7330 (   5.95%)
 Amean     296     88.7567 (   0.00%)     82.1533 (   7.44%)

It's not a universal win as there are occasions when spreading wide and
quickly is a benefit but it's more of a win than it is a loss. For other
workloads, there is little difference but netperf is interesting. Without
the patch, the server and client start on different nodes but quickly get
migrated due to wake_affine(). Hence, the difference in overall performance
is marginal but detectable:

                                      4.15.0                 4.15.0
                                noexit-v1r23           sdnuma-v1r23
 Hmean     send-64         349.09 (   0.00%)      354.67 (   1.60%)
 Hmean     send-128        699.16 (   0.00%)      702.91 (   0.54%)
 Hmean     send-256       1316.34 (   0.00%)     1350.07 (   2.56%)
 Hmean     send-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
 Hmean     send-2048      9705.19 (   0.00%)     9687.44 (  -0.18%)
 Hmean     send-3312     14359.48 (   0.00%)    14577.64 (   1.52%)
 Hmean     send-4096     16324.20 (   0.00%)    16393.62 (   0.43%)
 Hmean     send-8192     26112.61 (   0.00%)    26877.26 (   2.93%)
 Hmean     send-16384    37208.44 (   0.00%)    38683.43 (   3.96%)
 Hmean     recv-64         349.09 (   0.00%)      354.67 (   1.60%)
 Hmean     recv-128        699.16 (   0.00%)      702.91 (   0.54%)
 Hmean     recv-256       1316.34 (   0.00%)     1350.07 (   2.56%)
 Hmean     recv-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
 Hmean     recv-2048      9705.16 (   0.00%)     9687.43 (  -0.18%)
 Hmean     recv-3312     14359.42 (   0.00%)    14577.59 (   1.52%)
 Hmean     recv-4096     16323.98 (   0.00%)    16393.55 (   0.43%)
 Hmean     recv-8192     26111.85 (   0.00%)    26876.96 (   2.93%)
 Hmean     recv-16384    37206.99 (   0.00%)    38682.41 (   3.97%)

However, what is very interesting is how automatic NUMA balancing behaves.
Each netperf instance runs long enough for balancing to activate:

 NUMA base PTE updates             4620        1473
 NUMA huge PMD updates                0           0
 NUMA page range updates           4620        1473
 NUMA hint faults                  4301        1383
 NUMA hint local faults            1309         451
 NUMA hint local percent             30          32
 NUMA pages migrated               1335         491
 AutoNUMA cost                      21%          6%

There is an unfortunate number of remote faults although tracing indicated
that the vast majority are in shared libraries. However, the tendency to
start tasks on the same node if there is capacity means that there were
far fewer PTE updates and faults incurred overall.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-6-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 08:49:43 +01:00
Peter Zijlstra
24d0c1d6e6 sched/fair: Do not migrate due to a sync wakeup on exit
When a task exits, it notifies the parent that it has exited. This is a
sync wakeup and the exiting task may pull the parent towards the waker's
CPU. For simple workloads like using a shell, it was observed that the
shell is pulled across nodes by exiting processes. This is daft as the
parent may be long-lived and properly placed. This patch special cases a
sync wakeup on exit to avoid pulling tasks across nodes. Testing on a range
of workloads and machines showed very little difference in performance
although there was a small 3% boost on some machines running a
shell-script-intensive workload (git regression test suite).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-5-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 08:49:42 +01:00
Mel Gorman
082f764a2f sched/fair: Do not migrate on wake_affine_weight() if weights are equal
wake_affine_weight() will consider migrating a task to, or near, the current
CPU if there is a load imbalance. If the CPUs share LLC then either CPU
is valid as a search-for-idle-sibling target and equally appropriate for
stacking two tasks on one CPU if an idle sibling is unavailable. If they do
not share cache then a cross-node migration potentially impacts locality
so while they are equal from a CPU capacity point of view, they are not
equal in terms of memory locality. In either case, it's more appropriate
to migrate only if there is a difference in their effective load.

This patch modifies wake_affine_weight() to only consider migrating a task
if there is a load imbalance for normal wakeups but will allow potential
stacking if the loads are equal and it's a sync wakeup.
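
The decision rule can be sketched as a single comparison. This is an illustration under assumptions: `should_pull_to_waker()` is a hypothetical helper and the two arguments stand in for the effective loads of the waking CPU and the task's previous CPU.

```c
#include <stdbool.h>

/*
 * Decide whether to move the wakee to (or near) the waking CPU.
 * Normal wakeup: only when the waking CPU is strictly less loaded.
 * Sync wakeup:   equal loads are also accepted, which allows stacking.
 */
static bool should_pull_to_waker(unsigned long this_eff_load,
				 unsigned long prev_eff_load, bool sync)
{
	if (sync)
		return this_eff_load <= prev_eff_load;
	return this_eff_load < prev_eff_load;
}
```

With equal loads, a normal wakeup leaves the task where it was while a sync wakeup may still pull it.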

For the most part, the difference in performance is marginal. For example,
on a 4-socket server running netperf UDP_STREAM on localhost the differences
are as follows:

                                      4.15.0                 4.15.0
                                       16rc0          noequal-v1r23
 Hmean     send-64         355.47 (   0.00%)      349.50 (  -1.68%)
 Hmean     send-128        697.98 (   0.00%)      693.35 (  -0.66%)
 Hmean     send-256       1328.02 (   0.00%)     1318.77 (  -0.70%)
 Hmean     send-1024      5051.83 (   0.00%)     5051.11 (  -0.01%)
 Hmean     send-2048      9637.02 (   0.00%)     9601.34 (  -0.37%)
 Hmean     send-3312     14355.37 (   0.00%)    14414.51 (   0.41%)
 Hmean     send-4096     16464.97 (   0.00%)    16301.37 (  -0.99%)
 Hmean     send-8192     26722.42 (   0.00%)    26428.95 (  -1.10%)
 Hmean     send-16384    38137.81 (   0.00%)    38046.11 (  -0.24%)
 Hmean     recv-64         355.47 (   0.00%)      349.50 (  -1.68%)
 Hmean     recv-128        697.98 (   0.00%)      693.35 (  -0.66%)
 Hmean     recv-256       1328.02 (   0.00%)     1318.77 (  -0.70%)
 Hmean     recv-1024      5051.83 (   0.00%)     5051.11 (  -0.01%)
 Hmean     recv-2048      9636.95 (   0.00%)     9601.30 (  -0.37%)
 Hmean     recv-3312     14355.32 (   0.00%)    14414.48 (   0.41%)
 Hmean     recv-4096     16464.74 (   0.00%)    16301.16 (  -0.99%)
 Hmean     recv-8192     26721.63 (   0.00%)    26428.17 (  -1.10%)
 Hmean     recv-16384    38136.00 (   0.00%)    38044.88 (  -0.24%)
 Stddev    send-64           7.30 (   0.00%)        4.75 (  34.96%)
 Stddev    send-128         15.15 (   0.00%)       22.38 ( -47.66%)
 Stddev    send-256         13.99 (   0.00%)       19.14 ( -36.81%)
 Stddev    send-1024       105.73 (   0.00%)       67.38 (  36.27%)
 Stddev    send-2048       294.57 (   0.00%)      223.88 (  24.00%)
 Stddev    send-3312       302.28 (   0.00%)      271.74 (  10.10%)
 Stddev    send-4096       195.92 (   0.00%)      121.10 (  38.19%)
 Stddev    send-8192       399.71 (   0.00%)      563.77 ( -41.04%)
 Stddev    send-16384     1163.47 (   0.00%)     1103.68 (   5.14%)
 Stddev    recv-64           7.30 (   0.00%)        4.75 (  34.96%)
 Stddev    recv-128         15.15 (   0.00%)       22.38 ( -47.66%)
 Stddev    recv-256         13.99 (   0.00%)       19.14 ( -36.81%)
 Stddev    recv-1024       105.73 (   0.00%)       67.38 (  36.27%)
 Stddev    recv-2048       294.59 (   0.00%)      223.89 (  24.00%)
 Stddev    recv-3312       302.24 (   0.00%)      271.75 (  10.09%)
 Stddev    recv-4096       196.03 (   0.00%)      121.14 (  38.20%)
 Stddev    recv-8192       399.86 (   0.00%)      563.65 ( -40.96%)
 Stddev    recv-16384     1163.79 (   0.00%)     1103.86 (   5.15%)

The difference in overall performance is marginal but note that most
measurements are less variable. There were similar observations for other
netperf comparisons. hackbench with sockets or pipes, using processes or
threads, showed minor differences with some reduction in migrations. tbench
showed only marginal differences that were within the noise. dbench,
regardless of filesystem, showed minor differences all of which are
within noise. Multiple machines, both UMA and NUMA were tested without
any regressions showing up.

The biggest risk with a patch like this is affecting wakeup latencies.
However, the schbench load from Facebook which is very sensitive to wakeup
latency showed a mixed result with mostly improvements in wakeup latency:

                                      4.15.0                 4.15.0
                                       16rc0          noequal-v1r23
 Lat 50.00th-qrtle-1        38.00 (   0.00%)       38.00 (   0.00%)
 Lat 75.00th-qrtle-1        49.00 (   0.00%)       41.00 (  16.33%)
 Lat 90.00th-qrtle-1        52.00 (   0.00%)       50.00 (   3.85%)
 Lat 95.00th-qrtle-1        54.00 (   0.00%)       51.00 (   5.56%)
 Lat 99.00th-qrtle-1        63.00 (   0.00%)       60.00 (   4.76%)
 Lat 99.50th-qrtle-1        66.00 (   0.00%)       61.00 (   7.58%)
 Lat 99.90th-qrtle-1        78.00 (   0.00%)       65.00 (  16.67%)
 Lat 50.00th-qrtle-2        38.00 (   0.00%)       38.00 (   0.00%)
 Lat 75.00th-qrtle-2        42.00 (   0.00%)       43.00 (  -2.38%)
 Lat 90.00th-qrtle-2        46.00 (   0.00%)       48.00 (  -4.35%)
 Lat 95.00th-qrtle-2        49.00 (   0.00%)       50.00 (  -2.04%)
 Lat 99.00th-qrtle-2        55.00 (   0.00%)       57.00 (  -3.64%)
 Lat 99.50th-qrtle-2        58.00 (   0.00%)       60.00 (  -3.45%)
 Lat 99.90th-qrtle-2        65.00 (   0.00%)       68.00 (  -4.62%)
 Lat 50.00th-qrtle-4        41.00 (   0.00%)       41.00 (   0.00%)
 Lat 75.00th-qrtle-4        45.00 (   0.00%)       46.00 (  -2.22%)
 Lat 90.00th-qrtle-4        50.00 (   0.00%)       50.00 (   0.00%)
 Lat 95.00th-qrtle-4        54.00 (   0.00%)       53.00 (   1.85%)
 Lat 99.00th-qrtle-4        61.00 (   0.00%)       61.00 (   0.00%)
 Lat 99.50th-qrtle-4        65.00 (   0.00%)       64.00 (   1.54%)
 Lat 99.90th-qrtle-4        76.00 (   0.00%)       82.00 (  -7.89%)
 Lat 50.00th-qrtle-8        48.00 (   0.00%)       46.00 (   4.17%)
 Lat 75.00th-qrtle-8        55.00 (   0.00%)       54.00 (   1.82%)
 Lat 90.00th-qrtle-8        60.00 (   0.00%)       59.00 (   1.67%)
 Lat 95.00th-qrtle-8        63.00 (   0.00%)       63.00 (   0.00%)
 Lat 99.00th-qrtle-8        71.00 (   0.00%)       69.00 (   2.82%)
 Lat 99.50th-qrtle-8        74.00 (   0.00%)       73.00 (   1.35%)
 Lat 99.90th-qrtle-8        98.00 (   0.00%)       90.00 (   8.16%)
 Lat 50.00th-qrtle-16       56.00 (   0.00%)       55.00 (   1.79%)
 Lat 75.00th-qrtle-16       68.00 (   0.00%)       67.00 (   1.47%)
 Lat 90.00th-qrtle-16       77.00 (   0.00%)       78.00 (  -1.30%)
 Lat 95.00th-qrtle-16       82.00 (   0.00%)       84.00 (  -2.44%)
 Lat 99.00th-qrtle-16       90.00 (   0.00%)       93.00 (  -3.33%)
 Lat 99.50th-qrtle-16       93.00 (   0.00%)       97.00 (  -4.30%)
 Lat 99.90th-qrtle-16      110.00 (   0.00%)      110.00 (   0.00%)
 Lat 50.00th-qrtle-32       68.00 (   0.00%)       62.00 (   8.82%)
 Lat 75.00th-qrtle-32       90.00 (   0.00%)       83.00 (   7.78%)
 Lat 90.00th-qrtle-32      110.00 (   0.00%)      100.00 (   9.09%)
 Lat 95.00th-qrtle-32      122.00 (   0.00%)      111.00 (   9.02%)
 Lat 99.00th-qrtle-32      145.00 (   0.00%)      133.00 (   8.28%)
 Lat 99.50th-qrtle-32      154.00 (   0.00%)      143.00 (   7.14%)
 Lat 99.90th-qrtle-32     2316.00 (   0.00%)      515.00 (  77.76%)
 Lat 50.00th-qrtle-35       69.00 (   0.00%)       72.00 (  -4.35%)
 Lat 75.00th-qrtle-35       92.00 (   0.00%)       95.00 (  -3.26%)
 Lat 90.00th-qrtle-35      111.00 (   0.00%)      114.00 (  -2.70%)
 Lat 95.00th-qrtle-35      122.00 (   0.00%)      124.00 (  -1.64%)
 Lat 99.00th-qrtle-35      142.00 (   0.00%)      144.00 (  -1.41%)
 Lat 99.50th-qrtle-35      150.00 (   0.00%)      154.00 (  -2.67%)
 Lat 99.90th-qrtle-35     6104.00 (   0.00%)     5640.00 (   7.60%)
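
As a reading aid for the two tables above: the bracketed percentages appear
to be signed so that a positive value is an improvement for the patched
kernel. For throughput (Hmean) higher is better, while for Stddev and Lat
lower is better. This convention is inferred from the figures themselves,
and the helper names below are hypothetical. A minimal sketch in C:

```c
/*
 * Relative change of the patched kernel against the baseline, in percent,
 * for metrics where a higher value is better (e.g. the Hmean rows).
 */
static double gain_pct(double baseline, double patched)
{
	return (patched - baseline) / baseline * 100.0;
}

/*
 * Same comparison for metrics where a lower value is better (e.g. the
 * Stddev and Lat rows), so a drop in the metric gives a positive result.
 */
static double reduction_pct(double baseline, double patched)
{
	return (baseline - patched) / baseline * 100.0;
}
```

For instance, gain_pct(355.47, 349.50) rounds to the -1.68% shown for
send-64, and reduction_pct(49.0, 41.0) rounds to the 16.33% shown for the
75th percentile with one message thread.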

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-4-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 08:49:08 +01:00
Mel Gorman
eeb6039863 sched/fair: Defer calculation of 'prev_eff_load' in wake_affine_weight() until needed
On sync wakeups, the previous CPU effective load may not be used so delay
the calculation until it's needed.
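
The pattern can be sketched outside the kernel as follows. This is an
illustration only, not the code from this patch: effective_load() and
prefer_this_cpu() are hypothetical stand-ins, and the call counter exists
purely to make the saved work visible.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a costly per-CPU load computation. */
static int load_calls;

static long effective_load(long raw_load, long weight)
{
	load_calls++;		/* count invocations so the saving is visible */
	return raw_load * weight;
}

/*
 * Return true if the waking CPU should be preferred.  The previous CPU's
 * effective load is computed only once the early exit has been ruled out.
 */
static bool prefer_this_cpu(bool sync, long this_load, long prev_load,
			    long current_load, long weight)
{
	long this_eff_load = this_load;
	long prev_eff_load;

	if (sync) {
		/* The waker is about to sleep, so discount its own load. */
		if (current_load > this_eff_load)
			return true;	/* prev_eff_load was never needed */
		this_eff_load -= current_load;
	}
	this_eff_load = effective_load(this_eff_load, weight);

	/* Deferred: computed only when the comparison is actually made. */
	prev_eff_load = effective_load(prev_load, weight);

	return this_eff_load < prev_eff_load;
}
```

On the early-exit path no effective load is computed at all; on the other
paths both values are computed exactly once.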

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 08:49:07 +01:00
Mel Gorman
7ebb66a12f sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine
The only caller of wake_affine() knows the CPU ID. Pass it in instead of
rechecking it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 08:49:07 +01:00
Ingo Molnar
ed02934395 Linux 4.16-rc2
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlqKKI0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGRNAH/0v3+nuJ0oiHE1Cl
 fH89F9Ma17j8oTo28byRPi7X5XJfJAqANhHa209rguvnC27y3ew/l9k93HoxG12i
 ttvyKFDQulQbytfJZXw8lhUyYGXVsTpyNaihPe/NtqPdIxNgfrXsUN9EIEtcnuS2
 SiAj51jUySDRNR4ST6TOx4ulDm1zLrmA28WHOBNOTvDi4jTQMt1TsngHfF5AySBB
 lD4RTRDDwWDWtdMI7euYSq019TiDXCxmwQ94vZjrqmjmSQcl/yCK/JzEV33SZslg
 4WqGIllxONvP/UlwxZLaJ+RrslqxNgDVqQKwJdfYhGaWvpgPFtS1s86zW6IgyXny
 02jJfD0=
 =DLWn
 -----END PGP SIGNATURE-----

Merge tag 'v4.16-rc2' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21 08:48:35 +01:00
Paul E. McKenney
85ba6bfe8b torture: Provide more sensible nreader/nwriter defaults for rcuperf
The default values for nreader and nwriter are apparently not all that
user-friendly, resulting in people doing scalability tests that ran all
runs at large scale.  This commit therefore makes both the nreaders and
nwriters module parameters default to the number of CPUs, and adds a comment to
rcuperf.c stating that the number of CPUs should be specified using the
nr_cpus kernel boot parameter.  This commit also eliminates the redundant
rcuperf scripting specification of default values for these parameters.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:22:01 -08:00
Paul E. McKenney
db0c1a8aba rcutorture: Record which grace-period primitives are tested
The rcu_torture_writer() function adapts to requested testing from module
parameters as well as the function pointers in the structure referenced
by cur_ops.  However, as long as the module parameters do not conflict
with the function pointers, this adaptation is silent.  This silence can
result in confusion as to exactly what was tested, which could in turn
result in untested RCU code making its way into mainline.

This commit therefore makes rcu_torture_writer() announce exactly which
portions of RCU's API it ends up testing.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:21:58 -08:00
Paul E. McKenney
f7c0e6ad4b rcutorture: Re-enable testing of dynamic expediting
During boot, normal grace periods are processed as expedited.  When
rcutorture is built into the kernel, it starts during boot and thus
detects that normal grace periods are unconditionally expedited.
Therefore, rcutorture concludes that there is no point in trying
to dynamically enable expediting, so it disables this aspect of testing,
which is a bit of an overreaction to the temporary boot-time expediting.

This commit therefore rechecks forced expediting throughout the test,
enabling dynamic expediting if normal grace periods are processed
normally at any point.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:21:57 -08:00
Paul E. McKenney
eb0339934f rcutorture: Avoid fake-writer use of undefined primitives
Currently the rcu_torture_fakewriter() function invokes cur_ops->sync()
and cur_ops->exp_sync() without first checking to see if they are in
fact non-NULL.  This results in kernel NULL pointer dereferences when
testing RCU implementations that choose not to provide the full set of
primitives.  Given that it is perfectly reasonable to have specialized
RCU implementations that provide only a subset of the RCU API, this is
a bug in rcutorture.

This commit therefore makes rcu_torture_fakewriter() check function
pointers before invoking them, thus allowing it to test subsetted
RCU implementations.
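
A user-space sketch of the check, for illustration only. The struct and
function names are hypothetical and merely mirror the shape of an
operations table in which some entries may be left NULL:

```c
#include <stddef.h>

/* Hypothetical operations table; a subsetted implementation leaves entries NULL. */
struct torture_ops {
	void (*sync)(void);
	void (*exp_sync)(void);
};

static int sync_calls;

static void fake_sync(void)
{
	sync_calls++;
}

/*
 * Invoke whichever primitives the implementation provides and return how
 * many were actually called.  Missing (NULL) entries are skipped instead
 * of being dereferenced.
 */
static int run_fake_writer(const struct torture_ops *ops)
{
	int invoked = 0;

	if (ops->sync) {
		ops->sync();
		invoked++;
	}
	if (ops->exp_sync) {
		ops->exp_sync();
		invoked++;
	}
	return invoked;
}
```

An implementation that fills in only ->sync is then exercised through that
one primitive rather than crashing on the missing one.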

Reported-by: Lihao Liang <lianglihao@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:21:56 -08:00
Paul E. McKenney
e0d31a34c6 rcutorture: Abstract function and module names
This commit moves to __func__ for function names and to KBUILD_MODNAME
for module names, all in the name of better resilience to change.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:21:56 -08:00
Paul E. McKenney
68a675d433 rcutorture: Replace multi-instance kzalloc() with kcalloc()
This commit replaces array-allocation calls to kzalloc() with
equivalent calls to kcalloc().
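
The commit message does not spell out the motivation. One commonly cited
benefit of array allocators is that the element count and element size are
passed separately, so the multiplication can be checked for overflow. A
user-space analogue, with the hypothetical helper alloc_array() standing in
for the kernel call:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * User-space analogue of an array allocator: zeroed memory for n elements
 * of the given size, or NULL if n * size would overflow size_t.
 */
static void *alloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* the multiplication would wrap */
	return calloc(n, size);
}
```

By contrast, an open-coded malloc(n * size) would silently allocate a
too-small buffer if the product wrapped around.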

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:21:55 -08:00
Paul E. McKenney
6308f34775 rcu: Remove SRCU throttling
The code in srcu_gp_end() inserts a delay every 0x3ff grace periods in
order to prevent SRCU grace-period work from consuming an entire CPU
when there is a long sequence of expedited SRCU grace-period requests.
However, all of SRCU's grace-period work is carried out in workqueues,
which are in turn within kthreads, which are automatically throttled as
needed by the scheduler.  In particular, if there is plenty of idle time,
there is no point in throttling.

This commit therefore removes the expedited SRCU grace-period throttling.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:21:13 -08:00
Byungchul Park
a72da917f1 srcu: Remove dead code in srcu_gp_end()
Compilers will of course optimize out dead code, but remove it anyway
for better readability.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:21:12 -08:00
Ildar Ismagilov
8ddbd8832d srcu: Reduce scans of srcu_data in counter wrap check
Currently, given a multi-level srcu_node tree, SRCU can scan the full
set of srcu_data structures at each level when cleaning up after a grace
period.  This, though harmless otherwise, represents pointless overhead.
This commit therefore eliminates this overhead by scanning the srcu_data
structures only when traversing the leaf srcu_node structures.

Signed-off-by: Ildar Ismagilov <devix84@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:21:12 -08:00
Ildar Ismagilov
a35d13ec36 srcu: Prevent sdp->srcu_gp_seq_needed_exp counter wrap
SRCU checks each srcu_data structure's grace-period number for counter
wrap four times per cycle by default.  This frequency guarantees that
normal comparisons will detect potential wrap.  However, the expedited
grace-period number is not checked.  The consequences are not too horrible
(a failure to expedite a grace period when requested), but it would be
good to avoid such things.  This commit therefore adds this check to
the expedited grace-period number.
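
To see why a stale counter is a problem, consider a wrap-tolerant
comparison of free-running unsigned counters. The sketch below is
illustrative only; seq_before() is a hypothetical name, though it follows
the usual modular-arithmetic idiom for such comparisons:

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Wrap-tolerant "a is before b" for free-running unsigned counters: true
 * when b is ahead of a by no more than half the counter range.
 */
static bool seq_before(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 < a - b;
}
```

The comparison stays correct across a wrap (a value just below ULONG_MAX is
still "before" a small value just past zero), but it gives the wrong answer
once the two values drift more than half the range apart. A counter that is
never refreshed can drift that far, hence the periodic check.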

Signed-off-by: Ildar Ismagilov <devix84@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:21:11 -08:00
Paul E. McKenney
cb4081cd4e srcu: Abstract function name
This commit moves to __func__ for function names in the name of better
resilience to change.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:21:11 -08:00
Paul E. McKenney
65963d2461 rcu: Make expedited RCU CPU selection avoid unnecessary stores
This commit reworks the first loop in sync_rcu_exp_select_cpus()
to avoid doing unnecessary stores to other CPUs' rcu_data
structures.  This speeds up that first loop by roughly a factor of
two on an old x86 system.  In the case where the system is mostly
idle, this loop incurs a large fraction of the overhead of the
synchronize_rcu_expedited().  There is less benefit on busy systems
because the overhead of the smp_call_function_single() in the second
loop dominates in that case.

However, it is not unusual to do configuration changes involving
RCU grace periods (both expedited and normal) while the system is
mostly idle, so this optimization is worth doing.

While we are in the area, this commit also adds parentheses to arguments
used by the for_each_leaf_node_possible_cpu() macro.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:12:29 -08:00
Paul E. McKenney
7f5d42d051 rcu: Trace expedited GP delays due to transitioning CPUs
If a CPU is transitioning to or from offline state, an expedited
grace period may undergo a timed wait.  This timed wait can unduly
delay grace periods, so this commit adds a trace statement to make
it visible.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:12:28 -08:00
Paul E. McKenney
9a414201ae rcu: Add more tracing of expedited grace periods
This commit adds more tracing of expedited grace periods to enable
improved debugging of slowdowns.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:12:27 -08:00
Ildar Ismagilov
274afd6bfa rcu: Fix misprint in srcu_funnel_exp_start
The srcu_funnel_exp_start() function checks to see if the srcu_struct
structure's expedited grace period counter needs updating to reflect a
newly arrived request for an expedited SRCU grace period.  Unfortunately,
the check is backwards, so this commit reverses the sense of the test.

Signed-off-by: Ildar Ismagilov <devix84@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:12:27 -08:00
Matthew Wilcox
a32e01ee68 rcu: Use wrapper for lockdep asserts
Commits c0b334c5bf and ea9b0c8a26 introduced new sparse warnings
by accessing rcu_node->lock directly and ignoring the __private
marker.  Introduce a new wrapper and use it.  Also fix a similar problem
in srcutree.c introduced by a3883df393.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:12:26 -08:00
Liu, Changcheng
65518db86b rcu: Remove redundant nxttail index macro define
RCU's nxttail has been optimized to be a rcu_segcblist, which is
a multi-tailed linked list with macros defined for the indexes for
each tail.  The indexes have been defined in linux/rcu_segcblist.h,
so this commit removes the redundant definitions in kernel/rcu/tree.h.

Signed-off-by: Liu Changcheng <changcheng.liu@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:10:31 -08:00
Paul E. McKenney
bfbd767d4d rcu: Consolidate rcu.h #ifdefs
The kernel/rcu/rcu.h file has a pair of consecutive #ifdefs on
CONFIG_TINY_RCU, so this commit consolidates them, thus saving a few
lines of code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:10:30 -08:00
Paul E. McKenney
d07aee2c03 rcu: More clearly identify grace-period kthread stack dump
It is not always obvious that the stack dump from a starved grace-period
kthread isn't instead that of a CPU stalling the current grace period.
This commit therefore adds a pr_err() flagging these dumps.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:10:29 -08:00
Paul E. McKenney
d62df57370 rcu: Remove obsolete force-quiescent-state statistics for debugfs
The debugfs interface displayed statistics on RCU-pending checks but
this interface has since been removed.  This commit therefore removes the
no-longer-used rcu_state structure's ->n_force_qs_lh and ->n_force_qs_ngp
fields along with their updates.  (Though the ->n_force_qs_ngp field
was actually not used at all, embarrassingly enough.)

If this information proves necessary in the future, the corresponding
event traces will be added.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:10:29 -08:00
Paul E. McKenney
01c495f72a rcu: Remove obsolete __rcu_pending() statistics for debugfs
The debugfs interface displayed statistics on RCU-pending checks
but this interface has since been removed.  This commit therefore
removes the no-longer-used rcu_data structure's ->n_rcu_pending,
->n_rp_core_needs_qs, ->n_rp_report_qs, ->n_rp_cb_ready,
->n_rp_cpu_needs_gp, ->n_rp_gp_completed, ->n_rp_gp_started,
->n_rp_nocb_defer_wakeup, and ->n_rp_need_nothing fields along with
their updates.

If this information proves necessary in the future, the corresponding
event traces will be added.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:10:28 -08:00
Paul E. McKenney
62df63e048 rcu: Remove obsolete callback-invocation statistics for debugfs
The debugfs interface displayed statistics on RCU callback invocation but
this interface has since been removed.  This commit therefore removes the
no-longer-used rcu_data structure's ->n_cbs_invoked and ->n_nocbs_invoked
fields along with their updates.

If this information proves necessary in the future, the corresponding
event traces will be added.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:10:27 -08:00
Paul E. McKenney
bec06785fe rcu: Remove obsolete boost statistics for debugfs
The debugfs interface displayed statistics on RCU priority boosting,
but this interface has since been removed.  This commit therefore
removes the no-longer-used rcu_data structure's ->n_tasks_boosted,
->n_exp_boosts, and ->n_normal_boosts fields and their updates.

If this information proves necessary in the future, the corresponding
event traces will be added.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:10:27 -08:00
Tejun Heo
3caa973b7a rcu: Call touch_nmi_watchdog() while printing stall warnings
When an RCU stall warning triggers, it can print out a lot of messages
while holding spinlocks.  If the console device is slow (e.g. an
actual or IPMI serial console), it may end up triggering the NMI hard
lockup watchdog like the following.

*** CPU printking while holding RCU spinlock

  PID: 4149739  TASK: ffff881a46baa880  CPU: 13  COMMAND: "CPUThreadPool8"
   #0 [ffff881fff945e48] crash_nmi_callback at ffffffff8103f7d0
   #1 [ffff881fff945e58] nmi_handle at ffffffff81020653
   #2 [ffff881fff945eb0] default_do_nmi at ffffffff81020c36
   #3 [ffff881fff945ed0] do_nmi at ffffffff81020d32
   #4 [ffff881fff945ef0] end_repeat_nmi at ffffffff81956a7e
      [exception RIP: io_serial_in+21]
      RIP: ffffffff81630e55  RSP: ffff881fff943b88  RFLAGS: 00000002
      RAX: 000000000000ca00  RBX: ffffffff8230e188  RCX: 0000000000000000
      RDX: 00000000000002fd  RSI: 0000000000000005  RDI: ffffffff8230e188
      RBP: ffff881fff943bb0   R8: 0000000000000000   R9: ffffffff820cb3c4
      R10: 0000000000000019  R11: 0000000000002000  R12: 00000000000026e1
      R13: 0000000000000020  R14: ffffffff820cd398  R15: 0000000000000035
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
  --- <NMI exception stack> ---
   #5 [ffff881fff943b88] io_serial_in at ffffffff81630e55
   #6 [ffff881fff943b90] wait_for_xmitr at ffffffff8163175c
   #7 [ffff881fff943bb8] serial8250_console_putchar at ffffffff816317dc
   #8 [ffff881fff943bd8] uart_console_write at ffffffff8162ac00
   #9 [ffff881fff943c08] serial8250_console_write at ffffffff81634691
  #10 [ffff881fff943c80] univ8250_console_write at ffffffff8162f7c2
  #11 [ffff881fff943c90] console_unlock at ffffffff810dfc55
  #12 [ffff881fff943cf0] vprintk_emit at ffffffff810dffb5
  #13 [ffff881fff943d50] vprintk_default at ffffffff810e01bf
  #14 [ffff881fff943d60] vprintk_func at ffffffff810e1127
  #15 [ffff881fff943d70] printk at ffffffff8119a8a4
  #16 [ffff881fff943dd0] print_cpu_stall_info at ffffffff810eb78c
  #17 [ffff881fff943e88] rcu_check_callbacks at ffffffff810ef133
  #18 [ffff881fff943ee8] update_process_times at ffffffff810f3497
  #19 [ffff881fff943f10] tick_sched_timer at ffffffff81103037
  #20 [ffff881fff943f38] __hrtimer_run_queues at ffffffff810f3f38
  #21 [ffff881fff943f88] hrtimer_interrupt at ffffffff810f442b

*** CPU triggering the hardlockup watchdog

  PID: 4149709  TASK: ffff88010f88c380  CPU: 26  COMMAND: "CPUThreadPool35"
   #0 [ffff883fff1059d0] machine_kexec at ffffffff8104a874
   #1 [ffff883fff105a30] __crash_kexec at ffffffff811116cc
   #2 [ffff883fff105af0] __crash_kexec at ffffffff81111795
   #3 [ffff883fff105b08] panic at ffffffff8119a6ae
   #4 [ffff883fff105b98] watchdog_overflow_callback at ffffffff81135dbd
   #5 [ffff883fff105bb0] __perf_event_overflow at ffffffff81186866
   #6 [ffff883fff105be8] perf_event_overflow at ffffffff81192bc4
   #7 [ffff883fff105bf8] intel_pmu_handle_irq at ffffffff8100b265
   #8 [ffff883fff105df8] perf_event_nmi_handler at ffffffff8100489f
   #9 [ffff883fff105e58] nmi_handle at ffffffff81020653
  #10 [ffff883fff105eb0] default_do_nmi at ffffffff81020b94
  #11 [ffff883fff105ed0] do_nmi at ffffffff81020d32
  #12 [ffff883fff105ef0] end_repeat_nmi at ffffffff81956a7e
      [exception RIP: queued_spin_lock_slowpath+248]
      RIP: ffffffff810da958  RSP: ffff883fff103e68  RFLAGS: 00000046
      RAX: 0000000000000000  RBX: 0000000000000046  RCX: 00000000006d0000
      RDX: ffff883fff49a950  RSI: 0000000000d10101  RDI: ffffffff81e54300
      RBP: ffff883fff103e80   R8: ffff883fff11a950   R9: 0000000000000000
      R10: 000000000e5873ba  R11: 000000000000010f  R12: ffffffff81e54300
      R13: 0000000000000000  R14: ffff88010f88c380  R15: ffffffff81e54300
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  --- <NMI exception stack> ---
  #13 [ffff883fff103e68] queued_spin_lock_slowpath at ffffffff810da958
  #14 [ffff883fff103e70] _raw_spin_lock_irqsave at ffffffff8195550b
  #15 [ffff883fff103e88] rcu_check_callbacks at ffffffff810eed18
  #16 [ffff883fff103ee8] update_process_times at ffffffff810f3497
  #17 [ffff883fff103f10] tick_sched_timer at ffffffff81103037
  #18 [ffff883fff103f38] __hrtimer_run_queues at ffffffff810f3f38
  #19 [ffff883fff103f88] hrtimer_interrupt at ffffffff810f442b
  --- <IRQ stack> ---

Avoid spuriously triggering the NMI hardlockup watchdog by touching it
from the print functions.  show_state_filter() shares the same problem
and solution.

v2: Relocate the comment to where it belongs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:10:26 -08:00
Paul E. McKenney
3016611eed rcu: Fix CPU offload boot message when no CPUs are offloaded
In CONFIG_RCU_NOCB_CPU=y kernels, if the boot parameters indicate that
none of the CPUs should in fact be offloaded, the following somewhat
obtuse message appears:

	Offload RCU callbacks from CPUs: .

This commit therefore makes the message at least grammatically correct
in this case:

	Offload RCU callbacks from CPUs: (none)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-20 16:10:19 -08:00
David S. Miller
f5c0c6f429 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-19 18:46:11 -05:00
Linus Torvalds
9ca2c16f3b Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Thomas Gleixner:
 "Perf tool updates and kprobe fixes:

   - perf_mmap overwrite mode fixes/overhaul, prep work to get 'perf
     top' using it, making it bearable to use it in large core count
     systems such as Knights Landing/Mill Intel systems (Kan Liang)

   - s/390 now uses syscall.tbl, just like x86-64 to generate the
     syscall table id -> string tables used by 'perf trace' (Hendrik
     Brueckner)

   - Use strtoull() instead of home grown function (Andy Shevchenko)

   - Synchronize kernel ABI headers, v4.16-rc1 (Ingo Molnar)

   - Document missing 'perf data --force' option (Sangwon Hong)

   - Add perf vendor JSON metrics for ARM Cortex-A53 Processor (William
     Cohen)

   - Improve error handling and error propagation of ftrace based
     kprobes so failures when installing kprobes are not silently
     ignored and create dysfunctional tracepoints"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  kprobes: Propagate error from disarm_kprobe_ftrace()
  kprobes: Propagate error from arm_kprobe_ftrace()
  Revert "tools include s390: Grab a copy of arch/s390/include/uapi/asm/unistd.h"
  perf s390: Rework system call table creation by using syscall.tbl
  perf s390: Grab a copy of arch/s390/kernel/syscall/syscall.tbl
  tools/headers: Synchronize kernel ABI headers, v4.16-rc1
  perf test: Fix test trace+probe_libc_inet_pton.sh for s390x
  perf data: Document missing --force option
  perf tools: Substitute yet another strtoull()
  perf top: Check the latency of perf_top__mmap_read()
  perf top: Switch default mode to overwrite mode
  perf top: Remove lost events checking
  perf hists browser: Add parameter to disable lost event warning
  perf top: Add overwrite fall back
  perf evsel: Expose the perf_missing_features struct
  perf top: Check per-event overwrite term
  perf mmap: Discard legacy interface for mmap read
  perf test: Update mmap read functions for backward-ring-buffer test
  perf mmap: Introduce perf_mmap__read_event()
  perf mmap: Introduce perf_mmap__read_done()
  ...
2018-02-18 12:38:40 -08:00
Linus Torvalds
2d6c4e40ab Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "A small set of updates mostly for irq chip drivers:

   - MIPS GIC fix for spurious, masked interrupts

   - fix for a subtle IPI bug in GICv3

   - do not probe GICv3 ITSs that are marked as disabled

   - multi-MSI support for GICv2m

   - various small cleanups"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqdomain: Re-use DEFINE_SHOW_ATTRIBUTE() macro
  irqchip/bcm: Remove hashed address printing
  irqchip/gic-v2m: Add PCI Multi-MSI support
  irqchip/gic-v3: Ignore disabled ITS nodes
  irqchip/gic-v3: Use wmb() instead of smb_wmb() in gic_raise_softirq()
  irqchip/gic-v3: Change pr_debug message to pr_devel
  irqchip/mips-gic: Avoid spuriously handling masked interrupts
2018-02-18 12:22:04 -08:00
Ingo Molnar
7057bb975d Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-17 11:39:28 +01:00
Lukas Wunner
27d4ee0307 workqueue: Allow retrieval of current task's work struct
Introduce a helper to retrieve the current task's work struct if it is
a workqueue worker.

This allows us to fix a long-standing deadlock in several DRM drivers
wherein the ->runtime_suspend callback waits for a specific worker to
finish and that worker in turn calls a function which waits for runtime
suspend to finish.  That function is invoked from multiple call sites
and waiting for runtime suspend to finish is the correct thing to do
except if it's executing in the context of the worker.

Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lyude Paul <lyude@redhat.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Link: https://patchwork.freedesktop.org/patch/msgid/2d8f603074131eb87e588d2b803a71765bd3a2fd.1518338788.git.lukas@wunner.de
2018-02-16 22:24:25 +01:00
Andy Shevchenko
0b24a0bbe2 irqdomain: Re-use DEFINE_SHOW_ATTRIBUTE() macro
...instead of open coding file operations followed by custom ->open()
callbacks per each attribute.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-02-16 14:22:34 +00:00
Jessica Yu
297f9233b5 kprobes: Propagate error from disarm_kprobe_ftrace()
Improve error handling when disarming ftrace-based kprobes. Like with
arm_kprobe_ftrace(), propagate any errors from disarm_kprobe_ftrace() so
that we do not disable/unregister kprobes that are still armed. In other
words, unregister_kprobe() and disable_kprobe() should not report success
if the kprobe could not be disarmed.

disarm_all_kprobes() keeps its current behavior and attempts to
disarm all kprobes. It returns the last encountered error and gives a
warning if not all probes could be disarmed.
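
The single-probe rule can be sketched as follows. This is a simplified
user-space illustration with hypothetical names and state, not the kprobes
implementation:

```c
#include <errno.h>

/* Hypothetical probe state; disarm_err simulates a failing disarm step. */
struct probe {
	int armed;
	int enabled;
	int disarm_err;	/* 0 on success, negative errno on failure */
};

static int disarm_probe(struct probe *p)
{
	if (p->disarm_err)
		return p->disarm_err;
	p->armed = 0;
	return 0;
}

/*
 * Disable a probe.  If disarming fails the probe is still live, so keep
 * it marked enabled and hand the error back to the caller.
 */
static int disable_probe(struct probe *p)
{
	int err = disarm_probe(p);

	if (err)
		return err;
	p->enabled = 0;
	return 0;
}
```

The caller never sees success while the probe can still fire.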

This patch is based on Petr Mladek's original patchset (patches 2 and 3)
back in 2015, which improved kprobes error handling, found here:

   https://lkml.org/lkml/2015/2/26/452

However, further work on this had been paused since then and the patches
were not upstreamed.

Based-on-patches-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/20180109235124.30886-3-jeyu@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-16 09:12:58 +01:00
Jessica Yu
12310e3437 kprobes: Propagate error from arm_kprobe_ftrace()
Improve error handling when arming ftrace-based kprobes. Specifically, if
we fail to arm a ftrace-based kprobe, register_kprobe()/enable_kprobe()
should report an error instead of success. Previously, this has led to
confusing situations where register_kprobe() would return 0 indicating
success, but the kprobe would not be functional if ftrace registration
during the kprobe arming process had failed. We should therefore take any
errors returned by ftrace into account and propagate this error so that we
do not register/enable kprobes that cannot be armed. This can happen if,
for example, register_ftrace_function() finds an IPMODIFY conflict (since
kprobe_ftrace_ops has this flag set) and returns an error. Such a conflict
is possible since livepatches also set the IPMODIFY flag for their ftrace_ops.

arm_all_kprobes() keeps its current behavior and attempts to arm all
kprobes. It returns the last encountered error and gives a warning if
not all probes could be armed.
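
As a rough illustration of the behavior described above, here is a minimal
userspace sketch. The names struct probe, arm_probe() and arm_all_probes()
are hypothetical stand-ins and not the kernel's kprobes code; the sketch only
shows "try every probe, warn on failures, return the last error".

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a probe; not the kernel's struct kprobe. */
struct probe {
	int arm_result;	/* 0 on success, negative errno on failure */
	int armed;	/* set to 1 once the probe has been armed */
};

/* Arm one probe and report the underlying error instead of hiding it. */
static int arm_probe(struct probe *p)
{
	if (p->arm_result != 0)
		return p->arm_result;
	p->armed = 1;
	return 0;
}

/* Try every probe; return the last error seen, or 0 if all were armed. */
static int arm_all_probes(struct probe *probes, size_t n)
{
	size_t i, failed = 0;
	int err, ret = 0;

	for (i = 0; i < n; i++) {
		err = arm_probe(&probes[i]);
		if (err) {
			failed++;
			ret = err;	/* remember the most recent failure */
		}
	}

	if (failed)
		fprintf(stderr, "warning: could not arm %zu of %zu probes\n",
			failed, n);
	return ret;
}
```

A caller that arms a single probe would simply return arm_probe()'s result,
so that registration fails when arming fails.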

This patch is based on Petr Mladek's original patchset (patches 2 and 3)
back in 2015, which improved kprobes error handling, found here:

   https://lkml.org/lkml/2015/2/26/452

However, further work on this had been paused since then and the patches
were not upstreamed.

Based-on-patches-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/20180109235124.30886-2-jeyu@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-16 09:12:52 +01:00
Daniel Borkmann
9c2d63b843 bpf: fix mlock precharge on arraymaps
syzkaller recently triggered OOM during percpu map allocation;
while there is work in progress by Dennis Zhou to add __GFP_NORETRY
semantics for the percpu allocator under pressure, there also seems to be a
missing bpf_map_precharge_memlock() check in array map allocation.

Given that today the actual bpf_map_charge_memlock() happens after
find_and_alloc_map() in the syscall path, bpf_map_precharge_memlock()
is there to bail out early, before we go and do the map setup work,
when we find that we would hit the limits anyway. Therefore add this for
array maps as well.
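
The precharge idea can be sketched in userspace as follows. This is only an
illustration under assumptions: struct mem_account, precharge() and
array_map_alloc() are hypothetical names, and the page counts are made up;
it is not the kernel implementation.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-user accounting state. */
struct mem_account {
	uint64_t locked_pages;	/* pages already charged */
	uint64_t limit_pages;	/* memlock limit, in pages */
};

/* Counts how often the (expensive) setup step was reached. */
static int setup_calls;

/* Cheap early check: would charging 'pages' exceed the limit? */
static int precharge(const struct mem_account *acct, uint64_t pages)
{
	/* Written this way to avoid overflow in locked_pages + pages. */
	if (acct->locked_pages > acct->limit_pages ||
	    pages > acct->limit_pages - acct->locked_pages)
		return -EPERM;
	return 0;
}

/* Bail out before any setup work if the limit would be hit anyway. */
static int array_map_alloc(struct mem_account *acct, uint64_t pages)
{
	int err = precharge(acct, pages);

	if (err)
		return err;

	setup_calls++;			/* stands in for the map setup work */
	acct->locked_pages += pages;	/* the actual charge comes afterwards */
	return 0;
}
```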

Fixes: 6c90598174 ("bpf: pre-allocate hash map elements")
Fixes: a10423b87a ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
Reported-by: syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-02-15 21:34:33 -08:00
Paul E. McKenney
a7c8655b07 sched/isolation: Eliminate NO_HZ_FULL_ALL
Commit 6f1982fedd ("sched/isolation: Handle the nohz_full= parameter")
broke CONFIG_NO_HZ_FULL_ALL=y kernels.  This breakage is due to the code
under CONFIG_NO_HZ_FULL_ALL failing to invoke the shiny new housekeeping
functions.  This means that rcutorture scenario TREE04 now emits RCU CPU
stall warnings due to the RCU grace-period kthreads not being awakened
at a time of their choosing, or perhaps even not at all:

[   27.731422] rcu_bh kthread starved for 21001 jiffies! g18446744073709551369 c18446744073709551368 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=3
[   27.731423] rcu_bh          I14936     9      2 0x80080000
[   27.731435] Call Trace:
[   27.731440]  __schedule+0x31a/0x6d0
[   27.731442]  schedule+0x31/0x80
[   27.731446]  schedule_timeout+0x15a/0x320
[   27.731453]  ? call_timer_fn+0x130/0x130
[   27.731457]  rcu_gp_kthread+0x66c/0xea0
[   27.731458]  ? rcu_gp_kthread+0x66c/0xea0

Because no one has complained about CONFIG_NO_HZ_FULL_ALL=y being broken,
I hypothesize that no one is in fact using it, other than rcutorture.
This commit therefore eliminates CONFIG_NO_HZ_FULL_ALL and updates
rcutorture's config files to instead use the nohz_full= kernel parameter
to put the desired CPUs into nohz_full mode.

Fixes: 6f1982fedd ("sched/isolation: Handle the nohz_full= parameter")

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Jonathan Corbet <corbet@lwn.net>
2018-02-15 15:40:37 -08:00
Lihao Liang
398953e62c rcu: Remove unnecessary spinlock in rcu_boot_init_percpu_data()
Since rcu_boot_init_percpu_data() is only called at boot time,
there is no data race and the spinlock is not needed.

Signed-off-by: Lihao Liang <lianglihao@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-02-15 15:40:36 -08:00
Linus Torvalds
1388c80438 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Misc fixes:

   - fix rq->lock lockdep annotation bug

   - fix/improve update_curr_rt() and update_curr_dl() accounting

   - update documentation

   - remove unused macro"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/cpufreq: Remove unused SUGOV_KTHREAD_PRIORITY macro
  sched/core: Fix DEBUG_SPINLOCK annotation for rq->lock
  sched/rt: Make update_curr_rt() more accurate
  sched/deadline: Make update_curr_dl() more accurate
  membarrier-sync-core: Document architecture support
2018-02-15 09:28:47 -08:00
Linus Torvalds
e9e3b3002f Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
 "This contains two qspinlock fixes and three documentation and comment
  fixes"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/semaphore: Update the file path in documentation
  locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()
  locking/qspinlock: Ensure node->count is updated before initialising node
  locking/qspinlock: Ensure node is initialised before updating prev->next
  Documentation/locking/mutex-design: Update to reflect latest changes
2018-02-15 09:05:26 -08:00
Joe Stringer
544bdebc6f bpf: Remove unused callee_saved array
This array appears to be completely unused, remove it.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-02-15 10:22:55 +01:00
Daniel Borkmann
9c481b908b bpf: fix bpf_prog_array_copy_to_user warning from perf event prog query
syzkaller tried to perform a prog query in perf_event_query_prog_array()
where struct perf_event_query_bpf had an ids_len of 1,073,741,353 and
thus causing a warning due to failed kcalloc() allocation out of the
bpf_prog_array_copy_to_user() helper. Given we cannot attach more than
64 programs to a perf event, there's no point in allowing huge ids_len.
Therefore, allow a buffer that would fit the maximum number of ids and
also add a __GFP_NOWARN to the temporary ids buffer.
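
A minimal userspace sketch of bounding a caller-supplied length before
allocating is shown below. MAX_PROGS and alloc_ids() are hypothetical names;
the limit of 64 comes from the text above, while the error codes and the
rejection of a zero length are assumptions of this sketch.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* The text above states that at most 64 programs can be attached. */
#define MAX_PROGS 64u

/*
 * Allocate a zeroed ids buffer for a caller-supplied length.
 * Oversized requests are rejected up front instead of being passed
 * to the allocator. Rejecting 0 is a choice made for this sketch.
 */
static int alloc_ids(uint32_t ids_len, uint32_t **out)
{
	uint32_t *ids;

	if (ids_len == 0)
		return -EINVAL;
	if (ids_len > MAX_PROGS)
		return -E2BIG;

	ids = calloc(ids_len, sizeof(*ids));
	if (!ids)
		return -ENOMEM;

	*out = ids;
	return 0;
}
```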

Fixes: f371b304f1 ("bpf/tracing: allow user space to query prog array on the same tp")
Fixes: 0911287ce3 ("bpf: fix bpf_prog_array_copy_to_user() issues")
Reported-by: syzbot+cab5816b0edbabf598b3@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-02-14 08:59:37 -08:00
Jason Wang
7fc17e909e bpf: cpumap: use GFP_KERNEL instead of GFP_ATOMIC in __cpu_map_entry_alloc()
There are several implications after commit 0bf7800f17 ("ptr_ring:
try vmalloc() when kmalloc() fails") from the use of vmalloc(), since
it cannot be called with GFP_ATOMIC and mandates GFP_KERNEL. This leads
to a WARN, since cpumap tries to call it with GFP_ATOMIC. Fortunately,
entry allocation of cpumap can only be done through the syscall path,
which means GFP_ATOMIC is not necessary, so fix this by replacing
GFP_ATOMIC with GFP_KERNEL.

Reported-by: syzbot+1a240cdb1f4cc88819df@syzkaller.appspotmail.com
Fixes: 0bf7800f17 ("ptr_ring: try vmalloc() when kmalloc() fails")
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: akpm@linux-foundation.org
Cc: dhowells@redhat.com
Cc: hannes@cmpxchg.org
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-02-14 15:34:27 +01:00
Eric Dumazet
952fad8e32 bpf: fix sock_map_alloc() error path
If a user program provides silly parameters, we want
a map_alloc() handler to return an error, not a NULL pointer,
otherwise we crash later in find_and_alloc_map().
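
The "return an error, not a NULL pointer" convention can be sketched in
userspace like this. The helpers err_ptr(), is_err() and ptr_err() are
lower-case approximations written for this example, modelled on the
kernel's error-pointer idiom; struct map and map_alloc() are hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Decode the errno value from an error pointer. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer lies in the top MAX_ERRNO addresses. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical map object. */
struct map {
	unsigned int max_entries;
};

/* Never returns NULL: either a valid map or an encoded error. */
static struct map *map_alloc(unsigned int max_entries)
{
	struct map *m;

	if (max_entries == 0)
		return err_ptr(-EINVAL);

	m = malloc(sizeof(*m));
	if (!m)
		return err_ptr(-ENOMEM);

	m->max_entries = max_entries;
	return m;
}
```

A caller then only needs one check, is_err(), and can forward ptr_err() as
its own return value.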

Fixes: 1aa12bdf1b ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-02-13 19:19:15 -08:00
Yonghong Song
9a3efb6b66 bpf: fix memory leak in lpm_trie map_free callback function
There is a memory leak happening in lpm_trie map_free callback
function trie_free. The trie structure itself does not get freed.

Also, the trie_free function did not call synchronize_rcu before freeing
various data structures. This is incorrect, as some rcu_read_lock
region(s) for lookup, update, delete or get_next_key may not have
completed yet. The fix is to add synchronize_rcu at the beginning of
trie_free. The useless spin_lock is removed from this function as well.

Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Reported-by: Mathieu Malaterre <malat@debian.org>
Reported-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-02-13 19:15:16 -08:00
Kirill Tkhai
906f63ec1d net: Convert audit_net_ops
This patch starts to convert pernet_subsys, registered
from postcore initcalls.

audit_net_init() creates a netlink socket, while audit_net_exit()
destroys it. The rest of the pernet_list is not interested
in the socket, so we make audit_net_ops async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 10:36:06 -05:00
Will Deacon
11dc13224c locking/qspinlock: Ensure node->count is updated before initialising node
When queuing on the qspinlock, the count field for the current CPU's head
node is incremented. This needn't be atomic because locking in e.g. IRQ
context is balanced and so an IRQ will return with node->count as it
found it.

However, the compiler could in theory reorder the initialisation of
node[idx] before the increment of the head node->count, causing an
IRQ to overwrite the initialised node and potentially corrupt the lock
state.

Avoid the potential for this harmful compiler reordering by placing a
barrier() between the increment of the head node->count and the subsequent
node initialisation.
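
The ordering constraint can be sketched in userspace as below. This is an
illustration only: the barrier() macro here is a GCC/Clang-style compiler
barrier defined for the example, and struct node, nodes[], claim_node() and
release_node() are hypothetical simplifications, not the qspinlock code.

```c
#include <stddef.h>

/* Compiler-only barrier: forbids reordering of memory accesses across it. */
#define barrier() __asm__ __volatile__("" ::: "memory")

#define MAX_NODES 4

/* Hypothetical queue node; 'count' is only meaningful in nodes[0]. */
struct node {
	int locked;
	struct node *next;
	int count;
};

static struct node nodes[MAX_NODES];

/* Claim the next free node; assumes fewer than MAX_NODES are in use. */
static struct node *claim_node(void)
{
	struct node *n;
	int idx;

	idx = nodes[0].count++;

	/*
	 * Make sure the increment is emitted before the node is
	 * initialised, so a nested claimer (e.g. from an interrupt)
	 * picks a different index and does not overwrite this node.
	 */
	barrier();

	n = &nodes[idx];
	n->locked = 0;
	n->next = NULL;
	return n;
}

/* Release the most recently claimed node. */
static void release_node(void)
{
	nodes[0].count--;
}
```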

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1518528177-19169-3-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-13 14:50:14 +01:00